They say that on the Internet, nothing ever goes away.
This is true for popular content that’s endlessly shared and remixed. But this kind of viral candy is only the tip of a really big iceberg. Beneath the surface of memes and naked celebrities lurk endless petabytes of data too boring for immortality. Wedding pictures, college essays, home videos, old emails: most of our data is in this category, and it can be disturbingly ephemeral.
One of the major lessons from physics is that thermodynamics hates your guts. Computer data is no exception. Flash memory loses its charge in under a decade. Even under ideal conditions, magnetic hard drives won’t last longer than about ten years. CDs, under ideal conditions, last about ten years as well. Magnetic tape, the gold standard for long-term data storage in industry, stops being readable after thirty to fifty years.
How Digital Data Dies
This poses a problem, because it makes data storage take effort. Anything that’s not interesting enough to actively preserve from hard drive to hard drive, cloud service to cloud service, simply ceases to exist. 99% of our data is simply being thrown away, into landfills and failed Internet companies. Even for the data we do care about, the prognosis isn’t good.
Consider the problems posed by data compression. In order to save storage space and bandwidth, we often use file formats (like .jpg and .mp4) which compress their contents in some way. The compression algorithms used come in two general types: lossless and lossy.
- Lossless formats eliminate redundancy, identifying chunks of the file that repeat and replacing them with shorter descriptions. This allows you to reconstruct the original file perfectly later, but can only compress the data so much.
- Lossy formats are much more powerful, but come with major tradeoffs. Lossy formats work by discarding some of the information about the original file, in order to be able to encode the file in less space. These algorithms can’t precisely reconstruct the original file, but they’re tuned such that the information that gets dropped tends to be information that people don’t notice. These algorithms can get a spectacular reduction in file size with only a small drop in visual quality, and are used for nearly all audio, video, and pictures.
This is generally a good thing: it allows us to download much higher-quality content much faster than would be possible if we were stuck using lossless formats. However, there’s a dark side to lossy formats, and it looks like this:
When you re-encode a file into a lossy format, data is lost. Converting from one lossy format to another compounds the damage, because each encoder discards information from a copy that is already missing some. The above video was generated by repeatedly converting between two lossy formats many hundreds of times. By the end, the man speaking has degraded into a nightmarish mess of color and noise. This process is called generation decay.
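To see why repeated lossy re-encoding hurts while lossless copying doesn’t, here is a minimal Python sketch. It is a toy model, not a real codec: `lossy_generation` is a hypothetical stand-in that smooths away fine detail and rounds the result, and the test signal is made up for illustration.

```python
import zlib


def lossy_generation(samples):
    """One toy lossy re-encode: blur away fine detail, then round to whole levels."""
    n = len(samples)
    return [
        round(0.25 * samples[i - 1] + 0.5 * samples[i] + 0.25 * samples[(i + 1) % n])
        for i in range(n)
    ]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sample lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


# Made-up test signal: a jagged sawtooth with values in 0..255.
original = [(i * 37) % 256 for i in range(1000)]

# Lossless: compress and decompress 20 times; the bytes never change.
raw = bytes(original)
copy = raw
for _ in range(20):
    copy = zlib.decompress(zlib.compress(copy))
lossless_intact = copy == raw

# Lossy: re-encode 20 times and track how far we drift from the original.
current = original
errors = []
for _ in range(20):
    current = lossy_generation(current)
    errors.append(rms_error(original, current))

print("lossless copy intact:", lossless_intact)
print("error after generation 1: ", round(errors[0], 1))
print("error after generation 20:", round(errors[-1], 1))
```

In this toy model the lossless copy survives every round trip untouched, while the lossy copy ends up much further from the original after twenty generations than after one. Real codecs are far smarter about what they discard, but the losses pile up in the same way.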
As files travel around the internet, being copied and backed up and remixed and re-encoded, this data loss adds up, and files can become heavily degraded. As we get better at lossy encodings and less-efficient file formats fall out of favor, original versions can be lost forever.
Hopefully, movie studios care enough to keep a losslessly encoded version of Cool Hand Luke and Twelve Angry Men safe somewhere, so that we’ll always have high-quality versions of those files. However, this certainly isn’t true of most media. Your digital baby photos and home videos will slowly decay as you transcode them from obsolete formats into new ones.
The same goes double for online content. The originals of most YouTube videos likely no longer exist. When YouTube ceases to exist and those videos are migrated to a new platform, all of them will take a quality hit from the re-encoding process. A few generations of video-sharing platforms down the road, and even those videos that remain popular enough to be copied from platform to platform will be unacceptably degraded.
Vint Cerf, Google’s Chief Internet Evangelist, has talked at length about the dangers of throwing away all this information as cavalierly as we do. During one interview, Cerf described how in 2005, historian Doris Goodwin wrote a book on Abraham Lincoln, and studied his habits by visiting libraries across the country, digging up his old letters, and reconstructing the conversations they embody. Cerf notes that today, “those letters would be emails and the chances of finding them will be vanishingly small 100 years from now.”
This kind of data decay will pose a huge problem for future historians. The twenty-first century may well become a gaping hole in the historical record — a digital dark age.
Can We Do Better?
One solution to this problem is to develop archival storage that can last for much longer with less maintenance, so that it’s easier to archive information for the very long term. A number of smart people are working on this problem, and we’ve rounded up the best available data on their technologies.
So let’s say you want to back up a file for a really long time. How should you do it?
Solution: Magnetic Tape
If you only need to store your data for a few decades at a time, your best bet is probably good, old-fashioned magnetic tape (of the kind used by IT departments all over the world). Stored underground in a cold, dry, magnetically-shielded environment, with a healthy degree of redundancy, magnetic tape is relatively stable compared to conventional CDs or hard drives, and only about three times as expensive as low-end hard drives (about $3.0 per gigabyte).
Solution: Archive-quality optical disks
Conventional CDs are a terrible way to store data: the aluminum or silver backing starts to oxidize as soon as you open the package, and low build quality can cause other issues. Don’t expect them to last longer than a few years – hours, if you accidentally leave them in the sun. However, some CDs and DVDs are made with a gold backing and a much higher build quality. Gold doesn’t oxidize, which means that these disks can last a long, long time. It’s hard to know exactly how long, because we haven’t had them for very long, but we can get a good estimate by taking the disks, being really mean to them, and then trying to recover the data: this is called an accelerated aging test.
Based on these tests, manufacturers claim lifespans in the one-to-three-century range. For maximum data density, you can pick up archival Blu-ray discs for about 2.5 gigabytes per dollar, with a projected lifespan of 200 years. Accelerated aging tests aren’t a sure thing, but it’s probably safe to count on these discs for a century or so. As a bonus, unlike magnetic tape, they require no special equipment to read and write, so startup costs are minimal.
Okay, forget that “century” nonsense, let’s get serious. To give you an idea of the timescale, one thousand years ago, Earl Eric Haakonsson outlawed berserkers in Norway for the first time. That’s these guys etched on a bronze plate discovered in the 20th century:
Until recently, there weren’t many good industrial options for this kind of timescale. Now, however, an exciting option has emerged: the ‘M-disc.’ These are archival DVDs made out of a thick layer of a “stone-like” mineral composite which is designed to be etched by special burners (though they can be read by normal DVD drives). These are absurdly robust, and expected to survive for at least a thousand years. That’s an ambitious claim, but the company has some solid research (including a study by the US Department of Defense) to back it up.
These discs are even reasonably cheap, at 5.7 gigabytes per dollar, though you’ll also need a special burner. If you’re seriously interested in storing a lot of data for a long time, M-discs are the clear winner.
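For a rough sense of what those gigabytes-per-dollar figures mean in practice, here is a small Python sketch using the two optical-disc prices quoted in this article. The 1,000 GB archive size is an assumption chosen for illustration, and the sums cover blank media only, not the burner or any redundant copies.

```python
ARCHIVE_GB = 1000  # assumed archive size (about one terabyte)

# Prices as quoted in this article, in gigabytes per dollar.
gb_per_dollar = {
    "archival Blu-ray": 2.5,
    "M-disc": 5.7,
}


def media_cost(archive_gb, gigabytes_per_dollar):
    """Dollars of blank media needed to hold archive_gb gigabytes."""
    return archive_gb / gigabytes_per_dollar


costs = {name: media_cost(ARCHIVE_GB, rate) for name, rate in gb_per_dollar.items()}

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:>16}: ${cost:,.2f}")
```

At these prices, a terabyte works out to about $400 on archival Blu-ray and roughly $175 on M-disc.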
Solution: Engraving extremely stable metals
This is where we start to stray from the beaten path a little. As of right now, there are no digitally-readable formats that can survive anywhere near ten thousand years. That means that any data archived for this duration is going to be very difficult to recover. In some ways, this is okay: it’s not like DVD readers are going to be around in ten thousand years anyway.
So how do you store data for that long? The only materials that can survive those kinds of timescales are chemically stable metals and gemstones. This approach has already been used in practice for the Voyager Golden Records: gold-plated copper disks, engraved with information representing audio and images, which were launched aboard the two Voyager probes. The probes are on their way out of the solar system, carrying a lasting record of humanity for aliens to someday find.
A modern take on the issue is nano-lithography. A company called Norsam has adapted lithography techniques originally developed for engraving semiconductors, and can use them to etch fine patterns onto surfaces like diamond or nickel. The resolution is decent (about 165 gigabytes per 12-centimeter disk), and the disks are practically indestructible. Stored safely, they should last for many thousands of years, and can survive EMPs, most fires, and the collapse of human civilization. Pricing information isn’t easily available, but “expensive” is a really good guess.
One early application of this technology has been the creation of modern “Rosetta Stone” plates made out of titanium, to be stored in safe places around the world. They contain thousands of pages of text, translated between many languages, to provide a reference for future historians if some modern languages are lost. As a side benefit, the disks also look incredibly cool:
More Than 100,000 Years
Let’s be clear here: if you’re shopping for computer storage and nano-engraved titanium is just too short-lived for you, then your planning horizon terrifies me. One hundred thousand years ago, early man first began to venture off the African continent to Europe. If you really care about making sure that your digital data survives that far into the future, then you have departed the ken of mere mortals, and probably also sanity and good sense.
Which is not to say that you don’t have options.
Solution: Fossilized DNA
One of the perks of the biotech revolution is that there are plenty of companies that will create custom DNA for you, online, from a sequence of bases that you provide, for a modest fee. Each position in the sequence holds one of four bases (A, C, G, or T), so it can store two bits. The data can then be read back at a later date by sequencing the DNA, using a variety of techniques. This allows DNA to serve as a kind of exotic data storage. Now, on their own, your custom DNA strands are pretty short-lived, and will chemically break down at room temperature in a few years. There are a few ways to extend their lifespan.
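To make the two-bits-per-base idea concrete, here is a minimal Python sketch that turns bytes into a string of bases and back. The mapping (A=00, C=01, G=10, T=11) and the function names are assumptions made for illustration, not any lab’s or vendor’s actual scheme.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand written by bytes_to_dna back into bytes."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4 bases")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


message = b"Hi"
strand = bytes_to_dna(message)
print(strand)                # CAGACGGC
print(dna_to_bytes(strand))  # b'Hi'
```

Four bases per byte is the theoretical best for this kind of direct mapping. Practical schemes tend to spend some of that capacity on extra constraints, such as avoiding long runs of the same base.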
You could splice your data into the DNA of a long-lived organism, like the Great Basin Bristlecone pine (which is known to live more than five thousand years). Because these trees can reproduce, your primary concern then becomes keeping them safe from the numerous large-scale fires, meteor impacts, and volcanic eruptions that are going to happen in the future. You might be able to get your data to survive for a few tens of thousands of years by planting several forests of archival trees in safe, remote places; but – of course – you’re not interested in such small potatoes.
In order to really get your money’s worth out of DNA storage, you need to chemically fix the DNA to protect it against chemical change and radioactive breakdown. Researchers have found a way to embed DNA into molten glass in order to create a “synthetic fossil” that will protect the DNA for extremely long periods of time. The process is based on natural fossilization, and was developed after the revelation that it is often possible to extract intact DNA from fossils millions of years old. With proper use of error-correcting codes and redundancy, there’s no reason you couldn’t preserve many gigabytes of information for single-digit millions of years.
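As a rough illustration of how redundancy helps, here is a Python sketch of the simplest possible scheme: keep three copies of a strand and take a majority vote at each position. This is an assumed toy example, not the method the researchers used. It only repairs substituted bases, not inserted or deleted ones, and real systems would more likely rely on stronger codes such as Reed-Solomon.

```python
from collections import Counter


def corrupt(strand: str, position: int, base: str) -> str:
    """Return a copy of strand with one base replaced (simulated damage)."""
    return strand[:position] + base + strand[position + 1:]


def majority_vote(copies):
    """Rebuild a strand by taking the most common base at each position."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


original = "CAGACGGC"

# Three stored copies; two of them pick up damage in different places.
stored = [
    corrupt(original, 0, "T"),
    corrupt(original, 5, "A"),
    original,
]

recovered = majority_vote(stored)
print(recovered == original)  # True
```

As long as no position is damaged in more than one of the three copies, the vote recovers the original. The price is tripling the amount of DNA you have to pay for.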
In terms of cost-effectiveness: if you’re worried about price, this storage method isn’t for you. This is not a commercial process by any stretch. You are going to be spending at least hundreds of thousands of dollars to have the DNA fabricated and preserved. This is not an undertaking for the faint of heart. Still, it’s an option, and if you really want to make sure that the most important data on the Internet is still available long after humanity is dead and gone, it is within your power to do so.
Are you concerned about the digital dark age? What data do you want to preserve for future generations? The discussion starts in the comments!
Image Credits: broken usb drive Via Shutterstock, “Berzerkers,” by Wikimedia, “Cutaway,” by M-Disc, “Rosetta,” by the Long Now Foundation, “Rainbow CD,” by Wikimedia, “Magnetic Tape,” by Wikimedia, “Time Capsule,” by Wikimedia, “Voyager Record,” by Wikimedia, “Fossil,” by Wikimedia