Scientists Use DNA To Store Digital Data

Our planet now has about 10 trillion gigabytes of digital data, and every day humans generate emails, photos, tweets, and many other digital files that add another 2.5 million gigabytes of data. 

Much of this data is stored in massive facilities known as “exabyte data centers” (an exabyte is 1 billion gigabytes), which can be the size of a football field and cost about $1 billion to build and maintain.

Storing all the data of the world in a cup of DNA

Many scientists believe that an alternative solution lies in the molecule that contains our genetic information, deoxyribonucleic acid, or DNA, which can store huge amounts of information at a very high density.

In this context, Mark Bathe, a professor of biological engineering at the Massachusetts Institute of Technology (MIT), says that “a coffee mug full of DNA could theoretically store all of the world’s data,” according to the Institute’s news platform, which reported on the research.

Bathe explains, “We need new solutions to store these huge amounts of data that the world is producing and accumulating, especially archival data. DNA is a thousand times denser than flash memory, and another exciting property is that once you make the DNA polymer, it doesn’t consume any energy. You can write the DNA and then store it forever.”

Scientists have already shown that they can encode images and pages of text as DNA. However, an easy way to pick out the desired file from a mixture of many pieces of DNA is also needed, and this has long been a difficult problem. Bathe and his colleagues have now demonstrated one way to do it: they encapsulate each data file in a 6-micrometer silica particle, which is labeled with short DNA sequences that reveal its contents.

Using this method, the researchers showed that they could precisely pull out individual images stored as DNA sequences from a set of 20 images. Given the number of possible labels that could be used, this approach could scale up to 10^20 files.

Stable storage

Digital storage systems encode text, images, or any other type of information as a series of 0s and 1s (bits), and this same information can be encoded in DNA using the four nucleotides that make up the genetic code: A, T, G, and C.

For example, G and C could be used to represent 0 while A and T represent 1.
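The mapping in this example can be sketched in a few lines of Python. This is only an illustration of the one-bit-per-nucleotide idea described above, not the encoding used by the MIT team. The function names are hypothetical, and choosing at random between the two bases that stand for the same bit is an assumption of this sketch.

```python
import random

# Mapping from the article's example: G or C stands for 0, A or T stands for 1.
BIT_TO_BASES = {"0": "GC", "1": "AT"}
BASE_TO_BIT = {"G": "0", "C": "0", "A": "1", "T": "1"}


def encode(data: bytes, rng: random.Random) -> str:
    """Turn bytes into a nucleotide string, one nucleotide per bit."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Assumption: either base of the pair may be used, so pick one at random.
    return "".join(rng.choice(BIT_TO_BASES[bit]) for bit in bits)


def decode(strand: str) -> bytes:
    """Recover the original bytes from a nucleotide string."""
    bits = "".join(BASE_TO_BIT[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"Hi", random.Random(0))
    print(strand)          # 16 nucleotides, one per bit
    print(decode(strand))  # the original two bytes
```

Because each bit has two possible bases, many different strands decode to the same data. Decoding depends only on whether a base belongs to the G/C group or the A/T group.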

DNA has many other features that make it desirable as a storage medium: it is very stable and fairly easy (albeit expensive) to synthesize and sequence. Because of its high density, it also saves a lot of space. Each nucleotide occupies only about 1 cubic nanometer, so an exabyte of data stored as DNA could fit in the palm of your hand instead of filling a facility the size of a football field.

One of the major obstacles to this type of storage is its high cost: writing one petabyte (one million gigabytes) of data currently costs about $1 trillion.
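To put that figure in perspective, the short calculation below converts it to a cost per gigabyte and per exabyte. It uses only the numbers quoted in this article (about $1 trillion per petabyte, one million gigabytes per petabyte, one billion gigabytes per exabyte). The variable names are illustrative, and the result is a rough order of magnitude rather than a price quote.

```python
# Figures quoted in the article.
COST_PER_PETABYTE_USD = 10**12   # about $1 trillion to write one petabyte
GIGABYTES_PER_PETABYTE = 10**6   # one petabyte = one million gigabytes
GIGABYTES_PER_EXABYTE = 10**9    # one exabyte = one billion gigabytes

# Derived values.
cost_per_gigabyte_usd = COST_PER_PETABYTE_USD // GIGABYTES_PER_PETABYTE
petabytes_per_exabyte = GIGABYTES_PER_EXABYTE // GIGABYTES_PER_PETABYTE
cost_per_exabyte_usd = COST_PER_PETABYTE_USD * petabytes_per_exabyte

print(f"Cost per gigabyte: ${cost_per_gigabyte_usd:,}")  # $1,000,000
print(f"Cost per exabyte:  ${cost_per_exabyte_usd:,}")   # $1,000,000,000,000,000
```

At roughly $1 million per gigabyte, writing costs would need to fall by many orders of magnitude before DNA could compete with existing archival media.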

To become a competitor to magnetic tape, which is often used to store archival data today, the cost would have to drop dramatically, and Bathe expects this to happen within a decade or two at the latest.

The main obstacle researchers faced

Aside from the cost, the main obstacle the research team has faced in using DNA to store data is the difficulty of finding the file you want among all the others.

“Assuming that the technologies for writing DNA get to a point where it’s cost-effective to write an exabyte or zettabyte of data in DNA, then what? You’re going to have a pile of DNA, which is a gazillion files, images or movies and other stuff, and you need to find the one picture or movie you’re looking for,” Bathe says. “It’s like trying to find a needle in a haystack.”

Currently, DNA files are conventionally retrieved using PCR (polymerase chain reaction). Each DNA data file includes a sequence that binds to a particular PCR primer. To pull out a specific file, that primer is added to the sample to find and amplify the desired sequence. However, one drawback to this approach is that there can be crosstalk between the primer and off-target DNA sequences, leading unwanted files to be pulled out. Also, the PCR retrieval process requires enzymes and ends up consuming most of the DNA that was in the pool.

What is the solution to this dilemma?

As an alternative approach, the MIT team has developed a new retrieval technique that involves encapsulating each DNA file in a small silica particle. Each capsule is labeled with single-stranded DNA “barcodes” that correspond to the contents of the file, so the barcodes act as a name for the file inside the capsule.

To demonstrate this approach in a cost-effective manner, the researchers encoded 20 different images into pieces of DNA about 3,000 nucleotides long, which is equivalent to about 100 bytes. (They also showed that the capsules could fit DNA files up to a gigabyte in size.)

To retrieve a file, the researchers add primers that match the barcodes they are looking for. The primers are labeled with fluorescent or magnetic particles, which makes it easy to pull out any matching capsules and confirm that they hold the desired file. That file can then be removed and opened while the rest of the DNA is left intact and returned to storage. The search process accepts combinations of terms such as “president,” “America,” and “the eighteenth century,” which would return President George Washington, much like typing those words into the Google search engine.
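The logic of this kind of search can be illustrated in software. The sketch below is an analogy only: it treats each capsule as a file name with a set of barcode labels and returns the capsules that carry every requested label. The file names, labels, and the retrieve function are hypothetical and are not taken from the study.

```python
# Hypothetical capsules: each file name maps to the barcode labels on its capsule.
CAPSULES = {
    "washington.jpg": {"president", "america", "18th century"},
    "lincoln.jpg": {"president", "america", "19th century"},
    "mozart.jpg": {"composer", "austria", "18th century"},
}


def retrieve(capsules: dict[str, set[str]], required: set[str]) -> list[str]:
    """Return the names of capsules that carry every required label."""
    return sorted(
        name for name, labels in capsules.items()
        if required <= labels  # subset test: all required labels are present
    )


if __name__ == "__main__":
    print(retrieve(CAPSULES, {"president", "18th century"}))
    print(retrieve(CAPSULES, {"18th century"}))
```

In this toy data set, asking for both “president” and “18th century” matches only the Washington file, while asking for “18th century” alone also matches the Mozart file. Adding labels narrows the search.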

For their barcodes, the researchers used single-stranded DNA sequences from a library of 100,000 sequences, each about 25 nucleotides long, developed by Stephen Elledge, a professor of genetics and medicine at Harvard Medical School. If you put two of these labels on each file, you can uniquely label 10^10 (10 billion) different files, and with four labels on each, you can uniquely label 10^20 files.
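These figures follow from simple counting: with 100,000 possible barcodes in each label position, the number of distinct labels is 100,000 raised to the number of barcodes per file. The snippet below reproduces that arithmetic. It assumes each position can hold any barcode from the library, which is the counting that matches the numbers quoted above. The function name is illustrative.

```python
LIBRARY_SIZE = 100_000  # number of barcode sequences in the library


def unique_labels(barcodes_per_file: int, library_size: int = LIBRARY_SIZE) -> int:
    """Count distinct labels when each position can hold any barcode in the library."""
    return library_size ** barcodes_per_file


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(f"{k} barcode(s) per file: {unique_labels(k):.0e} possible labels")
```

Each added barcode multiplies the number of addressable files by 100,000, which is why a modest library can in principle cover an enormous collection.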

A giant leap in search technology

George Church, professor of genetics at Harvard Medical School, describes this technology as “a giant leap in knowledge management and research technology.”

“The rapid progress in writing, copying, reading, and low-energy archival data storage in DNA form has left poorly explored opportunities for precise retrieval of data files from huge (10^21 byte, zetta-scale) databases,” says Church, who was not involved in the study.

“The new study spectacularly addresses this using a completely independent outer layer of DNA and leveraging different properties of DNA (hybridization rather than sequencing), and moreover, using existing instruments and chemistries,” he added.

It may take some time for the cost of this method of storing digital data to come down, but Bathe expects that to happen within a decade or two.

The research team was led by Professor Mark Bathe and includes James Banal, a senior postdoc at MIT; Tyson Shepherd, a former MIT research associate; and Joseph Berleant, an MIT graduate student.