Thursday, September 27, 2012

Backup Strategy while using De-duplication

Ever since the rise of deduplication, backup designs have become somewhat sloppy. Many backup deduplication appliance vendors encourage the use of full backups since that practice makes their deduplication ratios look better. It also simplifies the communication between the backup software and the appliance hardware, a simplification that is needed because the two are not integrated. Compare these designs to those of the pre-deduplication era, when backups were carefully architected to balance the amount of data that needed to be transferred against the available backup window and the available bandwidth. Does good backup design still have a place in the modern era of backup technology, or has deduplication eliminated the need for it?

The Similarities of Incremental Backup and Deduplication

Deduplication achieves its efficiency by comparing segments of new data being received with segments of data that are already stored. If a new segment matches a stored one, it is not stored a second time; instead, a link is established to the original segment. Incremental backups reduce the data footprint as well. They do this by sending only the data that has changed since the last backup. Although similarities within the changed files still leave some redundancy, incremental backups reduce the amount of backup storage required by storing only changed files.
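The segment comparison described above can be sketched in a few lines of Python. This is only a toy illustration, not any vendor's implementation: the DedupStore class, the fixed-size segments and the use of SHA-256 digests as segment identifiers are all assumptions made for the example. Real products often use variable-size segments and their own indexing schemes.

```python
import hashlib


class DedupStore:
    """Toy target-side deduplication store (hypothetical, fixed-size segments)."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.segments = {}  # digest -> segment bytes, each unique segment stored once
        self.backups = {}   # backup name -> ordered list of digests (the "links")

    def write(self, name, data):
        """Store a backup and return how many new bytes were actually stored."""
        new_bytes = 0
        recipe = []
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            digest = hashlib.sha256(segment).hexdigest()
            if digest not in self.segments:
                # Segment not seen before: store it.
                self.segments[digest] = segment
                new_bytes += len(segment)
            # Seen or not, the backup only records a link to the segment.
            recipe.append(digest)
        self.backups[name] = recipe
        return new_bytes

    def read(self, name):
        """Rebuild a backup by following its links to the stored segments."""
        return b"".join(self.segments[d] for d in self.backups[name])


store = DedupStore(segment_size=4)
print(store.write("monday", b"AAAABBBBCCCC"))   # 12: everything is new
print(store.write("tuesday", b"AAAABBBBDDDD"))  # 4: only the last segment is new
```

Note that the second backup still had to be handed to the store in full before the redundancy was found, which is the network cost discussed in the following sections.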

Both incremental backup and deduplication aim for the same result: a reduction in the amount of storage needed by the backup process. Repeated full backups store a net new image every time the same backup job is run. The problem is that most of the data hasn’t changed between backup iterations, so much of the data in each ‘full’ is redundant. Deduplication remedies this problem by identifying the redundant data segments and establishing pointers to the original data set. It does, however, still send all the data across the network, and it requires the deduplication engine to process and compare the entire set of inbound data. Incremental backups remedy the problem by sending only the files that have actually changed, but they do not identify redundancies within those files or within the data that is already on the backup target.

The Differences between Incremental Backup and Deduplication

There are two key differences between incremental backups and deduplicated full backups. The first is the amount of data transferred across the network to the backup server or appliance. With incremental backups only the files that have changed since the last backup need to be sent, which is a fraction of the data in a full backup. Many deduplication methods are ‘target-side’, meaning the deduplication engine resides on the storage system. Since the data reduction benefit of deduplication doesn’t occur until the backup has been processed by the deduplication engine, all the data in that full backup must be sent across the network to the backup target.

There is also a technology called source-side deduplication, in which the deduplication engine resides in the software agent that runs on the servers being backed up (the clients). By performing data comparisons on the client and sending only unique segments to the backup server, the source-side method reduces the amount of data that must cross the network. But it has an impact on the client while the deduplication comparisons are being made. The backup agent on the host must "check in" with the main repository to confirm each segment’s redundancy. This can happen millions of times in each backup job and can significantly reduce the processor performance available on that host. There are also challenges with integrating source-side deduplication into application API sets to make sure that a clean backup is performed.
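To make the "check in" cost concrete, here is a minimal Python sketch of a source-side agent. The Repository class, the has and put calls and the fixed segment size are hypothetical names and choices made for illustration; they do not describe any particular product's protocol, and a real agent may batch or cache these lookups.

```python
import hashlib


class Repository:
    """Hypothetical backup server index that counts how often clients check in."""

    def __init__(self):
        self.segments = {}  # digest -> segment bytes
        self.lookups = 0    # number of "is this segment already stored?" queries

    def has(self, digest):
        self.lookups += 1
        return digest in self.segments

    def put(self, digest, segment):
        self.segments[digest] = segment


def source_side_backup(data, repo, segment_size=4096):
    """Send only segments the repository lacks; return the bytes sent."""
    bytes_sent = 0
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        digest = hashlib.sha256(segment).hexdigest()  # CPU work on the client
        if not repo.has(digest):                      # one check-in per segment
            repo.put(digest, segment)
            bytes_sent += len(segment)
    return bytes_sent


repo = Repository()
print(source_side_backup(b"AAAABBBBCCCC", repo, segment_size=4))  # 12 bytes sent
print(source_side_backup(b"AAAABBBBDDDD", repo, segment_size=4))  # 4 bytes sent
print(repo.lookups)                                               # 6 check-ins
```

The network savings are real, but every segment costs the client a hash computation and a lookup. With the assumed 4,096-byte segments, a single 200GB file would need roughly 49 million such lookups in one job.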

Incremental backups are a way to reduce network traffic without the burden of source-side deduplication. Combined with target-side deduplication, they could offer the best balance of performance and minimal client impact.

The second difference is the size of the backup database. The role of the database is to simplify the restore process by providing a browsable, point-and-click restore. Even if repeated full backups are being stored efficiently with deduplication, each of the files being backed up is logged into the backup application’s database on every backup iteration. As any backup administrator will tell you, the backup metadata store can become very large and unwieldy. Every so often something needs to be purged from that database to keep its size in check and to prevent corruption or performance issues. Depending on the backup application, this can have varying effects on the backup administrator’s ability to recover data. Typically all the data can still be recovered; it just becomes much more difficult to find the needed files since they’re no longer in the primary backup database.

Incremental backups don't cause this problem because only the files that change are being backed up. This keeps new entries in the database to a minimum, which means entries can be kept longer before purging and a point-and-click restore remains available for a longer period of time.

Breaking The Myth of Incremental Backup - Slow Recoveries

Despite those issues, full backups have always been the most appealing way to back up data. If you could do a full backup and still meet your backup window, you often did it. There is a certain comfort in knowing that all the data has been protected and is stored together. This was especially true in the era of tape-only backups. If there was a disk failure, the backup administrator did not want to wait for the tape library to mount tapes 20 or 30 times so that the backup application could pull all the various versions of the different files. With disk-based backup, though, the problem of waiting for tape mounts during an incremental restore goes away. There is no media handling; the backup application can move to another incremental backup set as quickly as you can change directories.

Deduplication and Incremental Backup - Better Together

In reality this should not be an ‘either/or’ discussion between incremental backups and deduplication. The ideal situation is an incremental backup combined with deduplication. As stated earlier, even though an incremental backs up only changed files, it does back up the whole file. If that file happens to be a 200GB database and the design is to run a month’s worth of incremental backups, one each night, you’ll create 30 copies of that database. For a database, a full backup and an incremental are basically the same thing (as long as you are not using the incremental backup capabilities of the database itself), since even a small change in the database will trigger an incremental, which backs up the entire file. However, when those copies are received by a backup server with target-side deduplication, the redundancy between the copies is found and eliminated, saving space on the disk backup target.
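The arithmetic behind the database example can be worked through in a short Python sketch. The 200GB size and the 30 nightly runs come from the example above; the 1% nightly change rate is purely an assumption for illustration, as is the simplification that deduplication stores exactly the changed portion each night.

```python
def incremental_month(file_gb, nights, daily_change_rate):
    """Estimate GB sent and GB stored when a whole file is re-sent every night.

    Assumes the file changes a little every night, so each incremental
    re-sends the entire file, and that target-side deduplication keeps
    only the changed portion after the first night.
    """
    sent_gb = file_gb * nights
    stored_gb = file_gb + (nights - 1) * file_gb * daily_change_rate
    return sent_gb, stored_gb


sent, stored = incremental_month(file_gb=200, nights=30, daily_change_rate=0.01)
print(f"sent over the network: {sent} GB")
print(f"stored after deduplication: about {stored:.0f} GB")
```

Under these assumptions about 6,000GB crosses the network during the month, while roughly 258GB lands on the deduplicated target. The storage problem is solved by deduplication; the transfer of the database itself is not, unless the database's own incremental capabilities are used.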

The rest of the environment is going to be made up mostly of files that are unchanging. With either form of deduplication you are performing the extra work of identifying redundancies in files that we already know are 80% or more likely to be redundant. Why not just eliminate them altogether with an incremental? Doing a system-wide incremental to deduplicated disk solves the rest of the backup problem by eliminating the need to send these files to the deduplication device or to ever run a comparison on them. Deduplication then eliminates the redundancy between the incremental jobs. In the previous example, the 30 copies of the database would see significant reduction, since databases don’t change much between iterations. Also, if the backup jobs are directed to a disk-based, deduplication-aware backup appliance, then restores from incremental backup jobs would be just as fast as restores from a full backup.

Better Than Source Side Deduplication?

Target-side deduplication combined with incremental backups could be an even better solution than source-side deduplication. While source-side deduplication sounds good in theory, in practice it can be problematic. Aside from the issues mentioned earlier about redundancy verification between the client and backup server, source-side deduplication has trouble backing up large single files, databases in particular, because of this verification time. By contrast, the method for determining whether a file has changed is already built into most modern operating systems and may not require the addition of a complex agent. Additionally, the combination of incremental backup with deduplication delivers much of the network bandwidth gain and all of the backup storage efficiency gain.
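As a rough illustration of how little machinery the changed-file check needs, the sketch below selects files for an incremental run using only the modification time that the operating system already records. The changed_since function name is made up for this example, and using the modification time alone is an assumption; backup applications may also rely on archive bits, change journals or other signals.

```python
import os


def changed_since(root, last_backup_time):
    """Return paths under root modified after last_backup_time (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # The file system already tracks this timestamp; no segment
            # hashing or repository lookup is needed to make the decision.
            if os.path.getmtime(path) > last_backup_time:
                changed.append(path)
    return sorted(changed)
```

Calling changed_since with the data directory and the timestamp of the previous run would produce the list of files to send to the deduplicating target; everything else never leaves the client.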

Deduplication Where You Need It Most

Beyond saving disk storage capacity, deduplication is most beneficial when the backup needs to be replicated to a cloud storage area or to a remote office. Then every bit of redundant data that can be eliminated is worth the extra processing time, since WAN bandwidth is so precious. Deduplication is the key enabler of off-site vaulting of backups.


The key is to integrate the deduplication process with the incremental backup process so that the backup application has full knowledge of how data is being transferred to it and how data is being stored. Products like the Unitrends Recovery Appliance, which integrate incremental backups, full backups, deduplication and deduplicated replication, understand not only where data is stored but also which location is best to recover that data from in the event of a lost file or a down server. Integration is a key requirement for these functions to deliver on their promise and not become a separate process to manage and worry about.

