Data Deduplication Addresses Key Storage Concerns

That data deduplication is one of the hottest storage-related technologies should come as no surprise. These systems promise streamlined storage, reduced costs and improved performance.

Given the explosion of information and the desire to provide more cost-effective storage, what IT executive wouldn’t want to consider deduplication?

Deduplication technology identifies blocks of data, often variable-length, across various files and file types. Each unique block is stored on disk only once; when an incoming block duplicates one that’s already stored, it is replaced with a “data pointer” to the existing copy rather than stored again. As a result, organizations can store far more data in the same amount of physical storage space.
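To make the mechanics concrete, here is a minimal sketch in Python, assuming a simple fixed-size-block store with SHA-256 hashes as the block identifiers; production systems typically use variable-length, content-defined chunking and more sophisticated indexing, so treat this as illustrative only, not any vendor’s implementation.

```python
import hashlib

class DedupStore:
    """Minimal block-level deduplication sketch (illustrative only)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}      # hash -> unique block data stored once
        self.pointers = []    # ordered "data pointers" describing the stream

    def write(self, data: bytes):
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:    # unique block: store it once
                self.blocks[digest] = block
            self.pointers.append(digest)     # duplicate or not, keep a pointer

    def read(self) -> bytes:
        # Reassemble the original stream by following the pointers.
        return b"".join(self.blocks[h] for h in self.pointers)


store = DedupStore()
payload = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # redundant "A" blocks
store.write(payload)
print(len(store.blocks), "unique blocks stored")     # 2 unique blocks, not 4
assert store.read() == payload
```

In this toy example, four incoming blocks collapse to two unique blocks on disk, while the pointer list preserves enough information to reconstruct the original stream.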

An Enterprise Strategy Group (ESG) report published in January 2008 found that 64 percent of deduplication users have experienced a 10:1 or greater capacity reduction ratio.

Not surprisingly, a fall 2007 ESG survey regarding data protection indicates that use of data deduplication products is expected to increase significantly. More than one-third of the respondents said they plan to use file-level data deduplication in the future, and 25 percent expect to add sub-file deduplication capabilities.

The increasing reliance on disk storage and the growing volume of data that needs to be protected are two factors that make deduplication so enticing, says Lauren Whitehouse, an analyst at ESG. “The economics of applying disk in the backup process further improve with deduplication,” she says. “By implementing data deduplication technology that identifies and eliminates data redundancy, the amount of data that is transferred and stored is reduced.”

But it’s not as simple as selecting a system and letting the technology do its thing. “The biggest challenge for many organizations is understanding their requirements and selecting a vendor/solution that meets its needs,” Whitehouse says. “There are a lot of vendors and approaches to wade through, which also means there are many choices to find an optimal solution.”

Whitehouse says one of the key concerns is whether to go with a software or hardware approach. Typically, backup software performs deduplication at the source, identifying duplicates before data is transferred across a network. Among the benefits are integration with the backup software; more intelligence about the data set; less data transferred across the network, which is especially important in server virtualization environments where there’s a lot of redundant data; the ability to use any type of disk; and fewer issues with scalability. The drawback is that not all backup vendors offer deduplication.
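As a rough sketch of that source-side approach, the hypothetical routine below hashes blocks at the client and consults the backup target’s index before sending anything, so only previously unseen blocks cross the network. The function and variable names are invented for illustration and don’t correspond to any vendor’s product.

```python
import hashlib

def chunk(data: bytes, size: int = 4096):
    """Split a stream into fixed-size blocks (simplified for illustration)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def source_side_backup(data: bytes, server_index: set) -> list:
    """Hypothetical source-side dedup: send only blocks the target lacks."""
    to_send = []
    for block in chunk(data):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in server_index:   # only new blocks cross the network
            to_send.append((digest, block))
            server_index.add(digest)
    return to_send

server_index = set()                      # hashes already held by the backup target
monday = b"report" * 1000
tuesday = b"report" * 1000 + b"new data"  # mostly unchanged from Monday

print(len(source_side_backup(monday, server_index)), "blocks sent on Monday")
print(len(source_side_backup(tuesday, server_index)), "blocks sent on Tuesday")
```

Because Tuesday’s data largely repeats Monday’s, only the changed block is transferred, which is the network savings the source-side model is after.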

Deduplication is also a feature of many storage systems and hardware appliances. The benefits are that the deduplication hardware is optimized for the process, works with any backup software and can be implemented quickly, Whitehouse says. One drawback is that not all solutions scale, causing issues when capacity thresholds are hit.

Some products perform deduplication at the file level, others at a block or byte level. Whitehouse says the differences between the approaches have to do with computational time, accuracy, level of duplication detected, index size and scalability. File-level deduplication checks file attributes and eliminates redundant copies of files stored on backup media. Although this method delivers less capacity reduction than other methods, it’s simple and fast.
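The difference between file-level and sub-file deduplication can be shown with a small, hypothetical comparison: hashing whole files treats a file changed by a single byte as entirely new, while hashing fixed-size blocks lets the unchanged blocks be shared. Real sub-file implementations vary in how they segment data; this is only a sketch.

```python
import hashlib

def file_level_unique(files):
    """File-level dedup: one hash per file; any change makes the whole file 'new'."""
    return {hashlib.sha256(f).hexdigest() for f in files}

def block_level_unique(files, block_size=1024):
    """Sub-file dedup: hash fixed-size blocks, so unchanged blocks are shared."""
    hashes = set()
    for f in files:
        for i in range(0, len(f), block_size):
            hashes.add(hashlib.sha256(f[i:i + block_size]).hexdigest())
    return hashes

original = b"x" * 10_000
edited = b"x" * 9_999 + b"y"   # one byte changed at the end

print(len(file_level_unique([original, edited])), "whole files stored")    # 2: the edit forces a full second copy
print(len(block_level_unique([original, edited])), "unique blocks stored") # 3: only the changed tail block is new
```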

The general rule, Whitehouse says, is that the more granular the segment being inspected, the more redundancy that can be detected. Also, the smaller the segment being inspected, the more segments that need to be examined and compared, which could take longer.
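A small, contrived experiment illustrates both sides of that tradeoff. The records below share common filler but carry unique headers; finer segments expose more of the redundancy (fewer unique bytes end up stored), while also producing more segments to hash, compare and index. The record layout and sizes are invented for illustration.

```python
import hashlib

RECORD_ID_LEN = 16
RECORD_LEN = 4096

def dedup_stats(data: bytes, segment_size: int):
    """Return (segments examined, unique bytes stored) at a given granularity."""
    seen, unique_bytes, count = set(), 0, 0
    for i in range(0, len(data), segment_size):
        segment = data[i:i + segment_size]
        count += 1
        digest = hashlib.sha256(segment).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_bytes += len(segment)
    return count, unique_bytes

# Ten records that share the same filler but carry a unique 16-byte header.
records = [(b"%016d" % n) + b"F" * (RECORD_LEN - RECORD_ID_LEN) for n in range(10)]
data = b"".join(records)

for size in (4096, 1024, 256):
    examined, stored = dedup_stats(data, size)
    print(f"segment={size:4d}  segments examined={examined:4d}  unique bytes stored={stored:6d}")
```

At 4,096-byte segments nothing deduplicates, because every segment contains a unique header; at 256 bytes most of the data collapses to a single repeated filler segment, but sixteen times as many segments must be hashed and tracked in the index.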

Another consideration is how deduplication is applied in an environment. “If there are multiple sources being backed up, does [deduplication] occur across them, in addition to within them?” Whitehouse says. “This could occur in cases such as multiple remote and branch-office backup consolidation, or multiple systems replicating to a [disaster recovery] site.”
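A hypothetical sketch of that cross-source case: several sources deduplicate against one shared index, so a block already contributed by one branch office isn’t stored or transferred again when another branch backs up the same content. The names below are invented for illustration.

```python
import hashlib

def backup(source_name, data, global_index, block_size=4096):
    """Back up one source against an index shared by all sources."""
    new_blocks = 0
    for i in range(0, len(data), block_size):
        digest = hashlib.sha256(data[i:i + block_size]).hexdigest()
        if digest not in global_index:
            global_index[digest] = source_name   # first source to contribute this block
            new_blocks += 1
    return new_blocks

global_index = {}   # shared by every branch office replicating to the DR site
branch_a = b"corporate OS image " * 2000 + b"branch A files"
branch_b = b"corporate OS image " * 2000 + b"branch B files"

print(backup("branch_a", branch_a, global_index), "new blocks from branch A")
print(backup("branch_b", branch_b, global_index), "new blocks from branch B")  # far fewer
```

Branch B contributes only the blocks that differ from Branch A’s backup, which is why deduplicating across sources, not just within them, matters for consolidation and disaster recovery replication.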

One of the biggest risks in selecting a deduplication solution is long-term viability. The most popular approach today is to install a disk-to-disk hardware system, but many of these solutions don’t scale well: when a company hits a capacity threshold, it must upgrade the system or add another one. And multiple appliances in one environment that can’t be centrally managed create new management challenges of their own.

Determining which approach and vendor is best depends on specific needs and a company’s tolerance for drawbacks. Deduplication requirements for remote and branch office consolidation might be different from those for data center backup. “Organizations need to map priorities and requirements and vet solutions based on these criteria,” Whitehouse adds.