Deduplication: What It Is, And Why It’s Cool

Most modern data protection concepts would be familiar to an IT administrator who time-traveled here from 50 years ago. That, however, is changing, and deduplication – sometimes called “dedupe” – is one of the concepts that has led the traditionally sleepy backup sector into a bright new future.

Business data is often filled with redundancy.  Depending on the type of file, each can have 90% or more of its contents in common with at least one other file already stored.  A deduplication program – or “engine” – compares the contents of new files to others already stored and strips out redundant blocks or chunks, leaving in place small pointers that show where this redundant data has already been stored.  This can vastly reduce the amount of space newly-written data occupies.
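To make this concrete, here is a minimal sketch of what a dedupe engine might do on write, assuming fixed-size chunks and SHA-256 fingerprints (real engines typically use variable, content-defined chunking and far more sophisticated indexing); the chunk_store and dedupe_write names are purely illustrative:

    import hashlib

    CHUNK_SIZE = 4096   # fixed-size chunks for simplicity; real engines often vary chunk boundaries
    chunk_store = {}    # hash -> chunk bytes: each unique chunk is stored exactly once

    def dedupe_write(data: bytes) -> list[str]:
        """Split incoming data into chunks, store each unique chunk once,
        and return the list of chunk hashes that serves as the 'pointer' recipe."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:   # only genuinely new chunks consume space
                chunk_store[digest] = chunk
            recipe.append(digest)           # redundant chunks become small pointers
        return recipe

A second file that shares 90% of its chunks with one already stored adds only the 10% of chunks that are new, plus a small list of pointers.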

So What’s The Big Deal?

Business data backup has been dependent on tape for long-term storage since the 1960s.  The primary virtue of tape is that it is cheap, generally a fraction of the cost-per-GB of disk storage. Tape, however, also has significant drawbacks, mainly that it is slow and often unreliable.

Deduplication, by dramatically increasing the amount of data that can be stored on disk, reduces the cost-per-GB of disk storage to the point where it becomes a viable, faster, more reliable alternative to tape.  Using dedupe, backup to disk – sometimes called disk-to-disk or D2D backup – can increase the security and availability of business data, reduce downtime, and reduce the cost of data storage.  This is why dedupe is so cool.


The Chocolate and Vanilla of Dedupe

There are two primary flavors of deduplication:

  • Inline:  A process whereby the dedupe engine runs as the data is being written to disk, deduplicating the data on-the-fly as it is stored.
  • Post-process:  Just as it sounds, post-process dedupe runs the engine after the data has been written to disk.

There are pros & cons to each flavor of deduplication.  Inline dedupe can be slow because incoming data is being compared to existing data as it is being written, a very processor-intensive activity.  Post-process is great for speed, but can require more disk capacity since the data must be initially written in its full format, then deduped later.
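As a rough illustration of the difference, and assuming the same kind of hypothetical chunk store sketched earlier, inline dedupe pays the hashing cost on the write path, while post-process lands the full-size backup first and dedupes it in a later background pass:

    import hashlib

    chunk_store = {}     # unique chunks, keyed by hash
    staging_area = []    # raw, full-size backups waiting for the post-process engine

    def dedupe(data: bytes, size: int = 4096) -> list[str]:
        """Chunk, hash, and store unique chunks; return the pointer recipe."""
        recipe = []
        for i in range(0, len(data), size):
            chunk = data[i:i + size]
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def inline_backup(data: bytes) -> list[str]:
        # CPU-intensive on the write path, but never lands full-size data on disk.
        return dedupe(data)

    def post_process_backup(data: bytes) -> None:
        # Fast write, but the full-size copy occupies disk until the engine runs.
        staging_area.append(data)

    def run_post_process_engine() -> list[list[str]]:
        # Background pass: dedupe everything staged, then release the raw copies.
        recipes = [dedupe(raw) for raw in staging_area]
        staging_area.clear()
        return recipes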

Your Mileage May Vary

Deduplication ratios vary widely, depending on two primary factors:

  • Type of file stored.  Microsoft Office documents, for example, tend to shrink dramatically when deduped, sometimes by 90% or more.  Pictures, video, and PDF files, on the other hand, are natively stored in compressed formats that do not dedupe well at all.
  • Amount of data stored.  In backup, one of the key concepts is retention, defined simply as the number of past copies of backups that are stored and available for recovery.  Retention can be anywhere from a few dozen to thousands of copies of past backups.  As a general rule the greater your retention, the more redundancy there will be in your data.  A deduplication engine will therefore strip out more and more of the new data coming in as it compares it to the data already stored.
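A quick back-of-the-envelope calculation shows why retention matters so much. The numbers below are purely hypothetical (1 TB nightly backups, 30 copies retained, roughly 5% new data from one backup to the next):

    # Hypothetical figures, for illustration only.
    backup_size_gb = 1000      # each backup is 1 TB before dedupe
    retention_copies = 30      # number of past backups kept
    change_rate = 0.05         # ~5% of chunks are new from one backup to the next

    raw_capacity = backup_size_gb * retention_copies
    # The first backup is stored in full; later backups mostly point at existing chunks.
    deduped_capacity = backup_size_gb + backup_size_gb * change_rate * (retention_copies - 1)

    print(f"Without dedupe: {raw_capacity:,.0f} GB")      # 30,000 GB
    print(f"With dedupe:    {deduped_capacity:,.0f} GB")  # 2,450 GB
    print(f"Reduction ratio: {raw_capacity / deduped_capacity:.1f}:1")  # about 12.2:1

The longer the retention, the bigger that ratio gets, which is exactly the effect described above.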

“Rehydration”

Sometimes it is necessary to access a deduplicated file or data set, such as when data must be recovered from a backup because the primary copy has been lost or corrupted.  When deduplicated data is accessed, the dedupe engine must perform essentially the same operation in reverse – using the pointers it originally inserted into the file to fetch the redundant chunks of data and reassemble the file into its original state.  This process has acquired a cute nickname – “rehydration” – which provides a neat mental picture of the process.
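Continuing the hypothetical sketch from earlier, rehydration is simply the write path run backwards: follow each pointer to its stored chunk and stitch the chunks back together.

    # chunk_store is the same hypothetical hash -> chunk mapping populated by dedupe_write() above.
    chunk_store: dict[str, bytes] = {}

    def rehydrate(recipe: list[str]) -> bytes:
        """Follow each pointer (chunk hash) back to the stored chunk
        and reassemble the original byte stream."""
        return b"".join(chunk_store[digest] for digest in recipe)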

Is it expensive?

The short answer is: not anymore.  In its early days, disk storage systems that featured deduplication were crazy expensive and only cost-effective for very large businesses with many terabytes of data.  Nowadays, deduplication is available as a built-in feature in backup software and appliances that are cost-effective for businesses of almost all sizes.

OK, How Do I Get Me Some of This?

Deduplication is available on three basic types of platforms:

  • Backup “target” systems.  A backup target is simply a disk array that is used as a place to store backups.  Generic disk systems can be used, but some disk systems are purpose-built as backup targets.  A best-of-breed leader in such systems is ExaGrid, which includes a post-process deduplication engine as part of its built-in software.
  • Backup software and/or appliances.  Deduplication is available as a built-in feature of many of the technology leaders in backup, such as Unitrends, Veeam and PHD Virtual.  It is also available as an extra-cost option for some of the less advanced or legacy backup applications.
  • Primary storage.  In a relatively new twist, advanced primary storage vendors such as NexGen Storage have begun capitalizing on the massive performance available in Flash memory and multi-core controller processors to bring deduplication to primary storage, often resulting in big capacity gains and actual increases in performance.  This brings dedupe out from behind backup and right onto the basic storage platform – a major advance.

Now That We’ve Skimmed the Surface

This post is intended as a very high-level overview of the basic concepts behind deduplication and its inherent benefits.  Dedupe can be a complex subject and should be considered much more thoroughly if you are looking for a new backup and/or primary storage system.

If you’d like a deeper dive into how dedupe can potentially save you money while increasing the performance, density and reliability of your backup and storage, contact us for a tailored solution.