This article gives an overview of deduplication and reviews the approaches available on the market. You will also learn about the specifics and efficiency of backup deduplication.
What is Deduplication
Deduplication (dedupe) is a data reduction technique that lowers the volume of stored data. Unlike compression methods such as ZIP, which re-encode data more compactly, it works by eliminating redundant copies of stored data.
Typical corporate storage is shared by many users and systems that often work with the same data, so many files end up stored in several copies. Deduplication keeps only one copy of the data, no matter how many users or systems reference it.
There are two main dedupe approaches available on the storage market:
- File-level deduplication - works with filesystem objects by checking if the same objects (files) are already stored.
- Block-level deduplication uses the same approach as file-level dedupe, but uses data blocks (chunks) as “objects”.
There is also byte-level deduplication, which operates on the individual bytes that make up a data chunk, but its overhead is too high for effective use in real storage systems.
Dedupe not only saves on storage costs but also speeds up cross-site communication, because multiple copies of the same data are not sent across WAN connections.
File-level dedupe avoids storing several copies of the same file: the extra copies are replaced with a “link” to the original file. To check whether a file is already in storage, the system uses the object's “fingerprint”, a value that is expected to be unique and is compared against the list of already stored files. Fingerprinting is usually based on hashing or on file attributes, depending on the particular dedupe solution.
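The fingerprint check described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular product: it assumes SHA-256 as the fingerprint, and the `FileDedupStore` class, its method names, and the file paths are all hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest used as the file's fingerprint (an assumption)."""
    return hashlib.sha256(data).hexdigest()


class FileDedupStore:
    """Toy in-memory store: one copy per unique content, links for the rest."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}  # fingerprint -> file content
        self.links: dict[str, str] = {}      # file path -> fingerprint

    def put(self, path: str, data: bytes) -> bool:
        """Store a file; return True if new content was actually written."""
        fp = fingerprint(data)
        is_new = fp not in self.objects
        if is_new:
            self.objects[fp] = data
        self.links[path] = fp  # every path keeps a link, duplicate or not
        return is_new

    def get(self, path: str) -> bytes:
        """Follow the link and return the stored content."""
        return self.objects[self.links[path]]


store = FileDedupStore()
template = b"Quarterly report template"
store.put("/home/alice/template.docx", template)          # new content, stored
store.put("/home/bob/template.docx", template)            # duplicate, link only
store.put("/home/bob/template_v2.docx", template + b"!")  # one byte added, stored in full
print(len(store.links), "files,", len(store.objects), "stored objects")
```

Note that the third file differs by a single byte, yet it is stored as a completely new object. This is the main limitation of working at the file level.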
File-level dedupe is much easier to maintain, but it yields smaller storage savings than block-level dedupe. At the file level, the system treats any change, however small, as the creation of a new file. That is why you cannot effectively deduplicate files that users modify often.
This dedupe type is often software-based and acts as an intermediary between your storage drive and applications. It is also one of the fastest and simplest dedupe techniques, since its indexes are small and take less time to compute.
File-level deduplication can typically reduce storage space at ratios as high as 5:1. The most significant savings are typical for shared storage (NAS systems, shared folders, archives), since it often holds multiple copies of the same files. File types also influence dedupe efficiency: images and audio files tend to be unique and benefit little from deduplication, while documents, templates, and internal system files are common across a wide range of systems, so a good dedupe ratio is possible here.
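To put a ratio such as 5:1 into perspective, it helps to convert it into the share of space saved. The helper below is a small sketch of that arithmetic; the function names are hypothetical, and the ratio is assumed to mean "logical data size : physically stored size".

```python
def dedupe_ratio(logical_bytes: int, stored_bytes: int) -> float:
    """Ratio N in 'N:1': data size before dedupe divided by size actually stored."""
    return logical_bytes / stored_bytes


def savings_percent(ratio: float) -> float:
    """Convert a dedupe ratio N:1 into the percentage of space saved."""
    if ratio < 1:
        raise ValueError("a dedupe ratio cannot be below 1:1")
    return 100 * (ratio - 1) / ratio


print(savings_percent(5))   # 80.0
print(savings_percent(10))  # 90.0
print(savings_percent(20))  # 95.0
```

The returns diminish as the ratio grows: going from 5:1 to 20:1 quadruples the ratio but adds only 15 percentage points of savings.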
Block-level dedupe goes deeper and checks the uniqueness of the blocks that make up each file. If you modify a file, the system stores only the changed chunks: every chunk has its own identifier (typically generated with a hash algorithm) that the system checks against the metadata of already stored chunks.
The block-level dedupe savings ratio can be as high as 20:1.
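Here is a minimal sketch of block-level dedupe with fixed-size chunks. It assumes SHA-256 as the chunk identifier and uses an 8-byte chunk size purely so the example stays readable; the `BlockDedupStore` class and file names are hypothetical and do not describe any vendor's implementation.

```python
import hashlib

CHUNK_SIZE = 8  # bytes; tiny for illustration only, real systems use KB-sized blocks


class BlockDedupStore:
    """Toy in-memory store that keeps each unique chunk once."""

    def __init__(self, chunk_size: int = CHUNK_SIZE) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}       # chunk hash -> chunk data
        self.recipes: dict[str, list[str]] = {}  # file name -> ordered chunk hashes

    def put(self, name: str, data: bytes) -> int:
        """Store a file; return the number of new chunks written."""
        new_chunks = 0
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # unseen content: store it
                self.chunks[digest] = chunk
                new_chunks += 1
            recipe.append(digest)          # the recipe always records the chunk
        self.recipes[name] = recipe
        return new_chunks

    def get(self, name: str) -> bytes:
        """Rebuild a file from its chunk recipe."""
        return b"".join(self.chunks[d] for d in self.recipes[name])


store = BlockDedupStore()
original = b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD"  # four 8-byte chunks
modified = b"AAAAAAAABBBBBBBBXXXXXXXXDDDDDDDD"  # third chunk changed in place
first = store.put("report_v1", original)
second = store.put("report_v2", modified)
print(first, second, len(store.chunks))  # 4 1 5
```

The second version adds only one chunk to the store. Keep in mind that this sketch handles in-place changes only: inserting bytes in the middle of a file shifts all later fixed-size chunk boundaries, so every chunk after the insertion point would look new.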
Such an approach allows you to save even more space but requires more computation because the number of objects to be processed is much larger. You can typically influence the “ratio-performance” balance by setting block (chunk) size:
- Smaller blocks are more likely to be duplicated, so you can save more space. But keep in mind that the index metadata will grow much larger and processing speed will be lower.
- Larger blocks are unique more often, so you save less space, but dedupe storage that operates with larger blocks processes data faster.
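The trade-off in the list above can be observed with a small experiment. The sketch below deduplicates two versions of a synthetic 64 KB file that differ in four scattered bytes, using three chunk sizes. The data, the chosen sizes, and the `dedupe_stats` helper are assumptions made for illustration, so treat the numbers as a demonstration of the trend rather than a benchmark.

```python
import hashlib
import random


def dedupe_stats(files: list[bytes], chunk_size: int) -> tuple[int, int]:
    """Return (index_entries, stored_bytes) after fixed-size chunk dedupe."""
    unique: dict[bytes, int] = {}  # chunk digest -> chunk length
    for data in files:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return len(unique), sum(unique.values())


rng = random.Random(42)  # fixed seed so the run is repeatable
v1 = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = bytearray(v1)
for position in (1_000, 20_000, 40_000, 60_000):
    edited[position] ^= 0xFF  # flip one byte at four scattered positions
v2 = bytes(edited)

total = len(v1) + len(v2)
for size in (1024, 4096, 16384):
    entries, stored = dedupe_stats([v1, v2], size)
    print(f"chunk={size:>5} B  index entries={entries:>2}  stored={stored} B of {total} B")
```

With 1 KB chunks the store holds 68 index entries and about 68 KB of data. With 16 KB chunks it holds only 8 entries, but every chunk of the second version is touched by an edit, so nothing is saved at all.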
The available block sizes depend on the particular dedupe solution. Choosing them is a kind of vendor “know-how”, usually the result of laboratory experiments. In most of the available solutions, the block size ranges from 8 to 64 KB.
Backup Data Specificity
When using a backup system with a cloud-based backend, you may want to implement deduplication for your cloud storage (less data stored, less money spent). But many storage providers either do not offer a native deduplication option or offer it only at an additional price. In that case, you can use independent deduplication software that uploads only deduplicated data to the cloud.
That is why we implemented the CloudBerry Dedup (CBD) server, a special deduplication solution optimized for CloudBerry Backup data. The program works in conjunction with CloudBerry Backup and processes all the data received from client workstations and servers. CBD processes the collected data after the backup has completed, so your backup windows are not affected. The Dedup server processes backup data on a block level and moves only unique blocks to the “waiting for upload” area, allowing you to save up to 90% of cloud storage space.
Note: CloudBerry Dedup server is in BETA state as of the date of this publication.
You may think that CloudBerry Backup already has the same technology in its “block-level backup”, but this is not entirely true. Block-level backup only tracks the data changed since the last backup; it does not check whether a changed block is unique. If you make the same change to a particular file twice, with a backup run after each change, CloudBerry Backup will copy the same changed blocks twice. So it saves some space by copying only modified data, but it cannot check whether this data has already been stored.
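The difference can be shown with a toy model. The sketch below is not CloudBerry Backup's or CloudBerry Dedup's actual code: it is a simplified illustration with hypothetical names, a 4-byte block size, and a made-up file that flips between two states across four backup runs.

```python
import hashlib

BLOCK = 4  # bytes per block; tiny for illustration


def split(data: bytes) -> list[bytes]:
    """Cut data into fixed-size blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def changed_blocks(previous: bytes, current: bytes) -> list[bytes]:
    """Changed-block tracking: blocks that differ from the previous backup run."""
    prev_blocks = split(previous)
    return [
        block
        for index, block in enumerate(split(current))
        if index >= len(prev_blocks) or block != prev_blocks[index]
    ]


# A file that flips between two states across backup runs (hypothetical data).
versions = [b"AAAABBBB", b"AAAAXXXX", b"AAAABBBB", b"AAAAXXXX"]

uploaded_incremental: list[bytes] = []  # what changed-block tracking alone sends
uploaded_deduped: list[bytes] = []      # what is sent after a uniqueness check
seen: set[bytes] = set()                # digests of blocks already stored

previous = b""  # nothing backed up yet, so the first run is a full backup
for current in versions:
    delta = changed_blocks(previous, current)
    uploaded_incremental.extend(delta)  # every changed block is sent again
    for block in delta:
        digest = hashlib.sha256(block).digest()
        if digest not in seen:          # dedupe sends only never-seen content
            seen.add(digest)
            uploaded_deduped.append(block)
    previous = current

print(len(uploaded_incremental), len(uploaded_deduped))  # 5 3
```

Changed-block tracking alone sends five blocks because the recurring states are uploaded each time they reappear, while the uniqueness check reduces the upload to the three distinct blocks.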
Since file-level and block-level deduplication techniques have their own pros and cons, let’s summarise general use suggestions:
- Use file-level deduplication for slow storage systems or when you need to store a lot of similar files (a shared folder with users’ files, for example).
- Use block-level deduplication for files that change often, since it saves only the changed data blocks. A file-level dedupe solution will copy the entire file after any small change in that file.
- Use block-level dedupe for long-term archives and backup data. You will spend more time restoring such data, but the storage savings (especially for cloud storage systems) are much more significant. For backups, you can keep the hottest backup data on regular storage and pass long-term archives through a deduplication appliance. Such an approach allows you to combine the fastest recovery with lower storage costs.
Note: always ensure that deduplication metadata is safe. The metadata is the only record of which stored objects make up each file, so if you lose it, you will lose all deduplicated data.
Deduplication allows you to lower the volume of stored data and to optimize storage spending. But you need to choose and implement a particular deduplication technology carefully, taking into account your data characteristics. For example, use block-level deduplication for backups and any files that are changed often.
We have developed a deduplication solution for CloudBerry Backup that is highly optimized for this particular use case. You can try our data deduplication software, CloudBerry Dedupe Server, for FREE and check whether it meets your business needs.
If you have questions left - do not hesitate to contact us!