Windows Server 2012 introduces native deduplication functionality. My first reaction on hearing about this feature was to wonder whether it was viable for SQL Server database backup files. Certainly I would expect volatile databases to have less opportunity for space-saving deduplication, but for the more static databases I can certainly appreciate the potential for deduplication space savings.
I will be describing my informal, preliminary testing process and observations in this article. While this is a promising new Windows feature for other file types and characteristics, I do think there are some potential pitfalls that you need to be aware of when it comes to deduplication specifically for SQL Server backup files. I’ll explain why I think this towards the end of this article.
First of all, to configure Windows deduplication functionality, I added it as a server role via the “Add Roles and Features Wizard” (this functionality is nested under File And Storage Services \ File and iSCSI Services \ Data Duplication).
Adding this feature also meant that I was installing the parent features in the hierarchy.
It was also noteworthy that for my test virtual machine, installing this feature didn’t require a reboot.
On my test VM, I provisioned a dedicated “T:” drive that I would use for storing local SQL Server database backups, isolated from files on the other drives. Then to enable data duplication for the T: drive, within Server Manager I right-clicked the volume and selected Configure Data Deduplication.
In the Deduplication Settings dialog box, I checked the Enable data duplication option.
I kept the other defaults unchanged (including the 5 day deduplicate files setting) and didn’t change the deduplication schedule because I planned on manually kicking off the deduplication process using Powershell.
Now for a quick test in SQL Server 2012, I created ten separate but identical full database backups of the Credit database (without using native backup compression). Now of course, depending on the database we would normally expect incremental changes over time that would translate to reduced duplication opportunities, however I wanted to measure the “perfect world” scenario in order to understand overall potential. This would help me understand the high watermark for potential deduplication savings:
BACKUP DATABASE [Credit] TO DISK = N'T:\credit_1.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_2.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_3.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_4.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_5.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_6.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_7.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_8.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_9.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION; BACKUP DATABASE [Credit] TO DISK = N'T:\credit_10.bak' WITH NOFORMAT, NOINIT, NO_COMPRESSION;
Each file was 160,864 KB. I then executed the following Powershell cmdlet from the deduplication Cmdlet library to manually kick off a deduplication job on the T: drive:
Start-DedupJob T: -Type Optimization
Then to periodically check the status of the job, I executed:
Get-DedupJob
The job took about two minutes. I could see a progress percentage and once the job was finished, Get-DedupJob no longer returned a result set.
I then checked the overall status using:
Get-DedupStatus
This returned:
FreeSpace | SavedSpace | OptimizedFiles | InPolicyFiles | Volume |
--------- | --------- | --------- | --------- | --------- |
39.55 GB | 1.52 GB | 11 | 11 | T: |
What was interesting was the “11” files being reported, even though there were only 10 files on the T: drive. I then remembered that my original Credit backup was in my Recycle Bin and I went ahead and deleted it permanently. This didn’t cause the status information to be immediately updated, so I executed the Start-DedupJob cmdlet with the GarbageCollection type:
Start-DedupJob T: -Type GarbageCollection
Checking the status, I saw revised file counts to the actual 10 files on the T: drive. The new space savings was as follows:
FreeSpace | SavedSpace | OptimizedFiles | InPolicyFiles | Volume |
--------- | --------- | --------- | --------- | --------- |
39.55 GB | 1.47 GB | 10 | 10 | T: |
The ten files took an original 1.76 GB and with deduplication enabled, used 0.2998 GB (307 MBs) instead (an ~ 82% savings):
Before Deduplication | After Deduplication |
|
|
What if my backup files were compressed via SQL Server backup compression? Would I still get as pronounced amount of saved space?
I deleted the existing backup files and reran the ten full database backups using the COMPRESSION backup option instead of NO_COMPRESSION. This time 10 separate backup files were created sized identically at 54,077 KB (versus the original 160,864 KB) on the T: drive.
As before, I manually started the deduplication task via the Optimization type and after a minute I checked the status via Get-DedupStatus:
FreeSpace | SavedSpace | OptimizedFiles | InPolicyFiles | Volume |
--------- | --------- | --------- | --------- | --------- |
39.55 GB | 473 MB | 10 | 10 | T: |
The ten files took an original 769 MB and the saved space amounted to 473 MB – so a 61% space savings. Not as high as the 82% but still an excellent space savings for identical files.
What about database backups with Transparent Data Encryption enabled? I tested this as well (with NO_COMPRESSION
given that TDE was enabled for the Credit database). Each identical TDE database backup file was 160,864 KB in size and the deduplication rate was 77% across the 10 files versus 82% for the non-TDE database). So there was still significant space saving.
This is definitely a promising new feature, although one’s mileage will vary based on the redundancy within the files you wish to deduplicate. New transaction log files have inherently different activity – as would the associated full database backups when a significant portion of data is modified between generations. In a production system, I expect much lower space savings for volatile databases – but for those relatively static databases that still get frequent backups, I would still expect some good space savings.
So should you actually use this technology with SQL Server backup files? In my opinion, there are a few areas that you should watch out for:
- There is an increased risk of single-point-of-failure. Your deduplicated files point back to single, shared data chunks. If disk corruption occurs for that chunk, it impacts all referencing files (in our discussion, multiple SQL Server full backup files). This is discussed in the Reliability and Risk Mitigations section of the following post by Microsoft’s Scott M. Johnson -> Introduction to Data Deduplication in Windows Server 2012. This same article talks about redundancy for popular data chunks when referenced 100 times, but that still leaves plenty of exposure from a SQL Server backup file perspective. File replication and/or copying of backup files to another location that has deduplication may help build in some redundancy to reduce the single-point-of-failure risk exposure, so if you do still plan on pursuing this feature with SQL Server backups I’d recommend exploring that redundancy further (I’d recommend multiple copies of backup files anyhow, with a combination of local and remote copies).
- If you move the entire volume itself (for example on a VM, detaching the virtual disk and re-attaching to a new VM) with SQL Server backups to another server that doesn’t have data duplication enabled, you will only be able to access files that have not been deduplicated. This is a key gap from a SQL Server recoverability plan perspective. I did test the detaching of an entire deduplicated volume containing SQL Server full backups to a new server where the feature was disabled. The RESTORE operations from the backups on that migrated volume failed. After I enabled the deduplication feature for the server and volume, I was then able to use the SQL Server full database backup files again. I also tested copying a single deduplicated full backup file to another file system and I was able to restore it to a separate SQL Server 2012 instance. The act of copying this file off the server allowed it to be “rehydrated”, so-to-speak.
- Another reason to be cautious about using this functionality with SQL Server is its impact on concurrent SQL Server activity. Based on the specifications of the feature, the background process can be light-weight (low hash index memory footprint, lower CPU usage, running during quiet periods), however nothing is for free and if you’re already on a constrained system, the background de-duplication of volumes and files may tip your SQL Server performance over the edge. It remains to be seen what impact we see on real mission-critical SQL Server database servers.
Microsoft has a deployment page here that helps walk through some of the considerations –> Plans to Deploy Data Deduplication. Space savings can be pretty significant, but as you noticed I tested this for SQL Server full database backup files and not SQL Server database files themselves, which are not supported according to the deployment guide. That deployment guide states “Deduplication is not supported for files that are open and constantly changing for extended periods of time or that have high I/O requirements, for example, running virtual machines on a Hyper-V host, live SQL Server databases, or active VDI sessions.” This non-support of database files makes sense to me, and I think there are still significant benefits to this Windows Server 2012 feature, just not for SQL Server purposes.