Clumio’s Rapid Recovery is amazing, and you should know more about it.
You might not have heard of Clumio before. Clumio is an upstart SaaS-based backup solution for both cloud and on-premises environments that stores backups in the cloud instead of on a storage platform in your own datacenter. I’ve been exploring their offerings, since I’m a data nerd and always intrigued by these sorts of things. I gave their Rapid Recovery architecture a solid once-over using Availability Groups, SQL Server’s flagship availability architecture.
The physical environment that I performed the tests on consists of:
- Two HPE DL380 Gen9 servers
- VMware vSphere 6.7, latest update
- Pure Storage //M20 all-flash SAN
- 10GbE iSCSI connected storage
- VAAI is active in the vSphere architecture
The SQL Server testbed VMs were configured as follows:
- Four total virtual machines
- Two SQL Server 2019 Enterprise edition and two SQL Server 2016 Enterprise edition VMs
- vHardware compatibility level 15
- 4 vCPUs, 16GB RAM
- Six hard disks, spread amongst multiple VMware Paravirtual SCSI controllers, for a total of 500GB of consumed space per VM
- Windows Server 2019 Datacenter operating system
- A fileshare witness for the Windows Server Failover Cluster was configured on a third VM
Two SQL Server Availability Group pairs were configured, one on SQL Server 2016 and one on SQL Server 2019. The database VMs were then set up with Clumio to replicate their backups to the cloud, and my Internet connection is fast enough that the backups finished replicating to the cloud within a day.
Now, let’s get a stream of end-user traffic to change some data. I used the HammerDB synthetic database benchmarking tool to generate a workload in a database called ‘tpcc’. I built an initial database at 400GB on each primary instance of the Availability Group with the benchmarking utility so we had some pseudo-real data to work with. Once constructed, I set up a pair of users on a standard workload on a 24-hour timer to continuously insert a stream of data changes into this database, all while Clumio was backing up the servers on a periodic basis.
After a day, I shut down the VMs, and went to the Clumio web-managed interface to restore these VMs into new VMs for database validation. I first instructed Clumio to restore the first VM in each Availability Group replica pair, then the second.
Each restore took 4.5 minutes to bring back the entire virtual machine and have it running and active in VMware. Four and a half minutes to restore a 500GB VM of active database data? That’s absolutely incredible, even for an on-premises solution. Given that this is a cloud-native solution, it would be almost unbelievable if I had not witnessed it myself. I’m floored at just how quickly this performed.
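To put that number in perspective, here is a quick back-of-the-envelope calculation in Python. It is only arithmetic on the figures quoted above (500GB of consumed space per VM, 4.5 minutes per restore), and it assumes decimal gigabytes. The function name is my own, and nothing here describes how Rapid Recovery actually moves data.

```python
def effective_restore_rate(size_gb: float, minutes: float) -> float:
    """Return the apparent restore rate in GB/s (decimal gigabytes)."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return size_gb / (minutes * 60)


VM_SIZE_GB = 500        # consumed space per VM, from the testbed description
RESTORE_MINUTES = 4.5   # observed time until the VM was running in VMware

rate_gb_s = effective_restore_rate(VM_SIZE_GB, RESTORE_MINUTES)
rate_gbit_s = rate_gb_s * 8  # bytes to bits, for comparison with link speeds

print(f"{rate_gb_s:.2f} GB/s, about {rate_gbit_s:.1f} Gbit/s")
```

That works out to roughly 1.85 GB/s of apparent rate. Treat it as time-to-running for a VM of that size rather than as a measured transfer speed.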
But did SQL Server come up? Short answer – yes! Restoring a SQL Server Availability Group solution, especially given the Windows Server Failover Cluster (WSFC) configuration underneath, is a delicate but straightforward process.
I first restored the first of the two VMs. The presence of the file share witness allowed the WSFC to come up properly without issue. The first of the two AG replicas also turned right on successfully with no errors in the error logs. The second VM was then restored, and because the restoration point was from the same point in time as the first replica, the databases on the secondary replica were successfully able to come back online and re-synchronize with the primary replica.
It. Just. Worked.
If the second VM had been restored from a different point in time, the transaction log positions on the two replicas could have been mismatched, and the secondary AG database copies would have needed to be re-synchronized. This re-synchronization is no different than if the database servers had been restored through a more traditional backup and restore process, and it is quite normal for DBAs to re-sync as needed in these sorts of scenarios.
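For anyone who wants to verify the replica state after a restore like this, here is a minimal sketch of the kind of check I mean. The T-SQL uses the standard `sys.dm_hadr_database_replica_states` and `sys.availability_replicas` views. The Python helper, the server names, and the sample rows are hypothetical and only for illustration; in practice you would feed it rows fetched from the primary with whatever SQL Server driver you prefer.

```python
from typing import Iterable, List, Mapping, Tuple

# T-SQL to run on the primary replica; it returns one row per database per replica.
AG_STATE_QUERY = """
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.last_hardened_lsn
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id;
"""

# Assumption: these two states are treated as "fine" here. SYNCHRONIZING is
# normal for asynchronous-commit replicas and for replicas still catching up.
HEALTHY_STATES = {"SYNCHRONIZED", "SYNCHRONIZING"}


def replicas_needing_attention(
    rows: Iterable[Mapping[str, object]],
) -> List[Tuple[str, str]]:
    """Return (server, database) pairs whose state is not in HEALTHY_STATES."""
    return [
        (str(row["replica_server_name"]), str(row["database_name"]))
        for row in rows
        if row["synchronization_state_desc"] not in HEALTHY_STATES
    ]


# Hypothetical sample rows, shaped like the query output above.
sample_rows = [
    {"replica_server_name": "SQL2019-A", "database_name": "tpcc",
     "synchronization_state_desc": "SYNCHRONIZED", "last_hardened_lsn": 1000},
    {"replica_server_name": "SQL2019-B", "database_name": "tpcc",
     "synchronization_state_desc": "NOT SYNCHRONIZING", "last_hardened_lsn": 900},
]

flagged = replicas_needing_attention(sample_rows)
print(flagged)
```

With the sample rows above, only the second replica is flagged, which is the situation where a DBA would re-sync the secondary copy.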
The process was repeated both on the SQL Server 2019 and 2016 servers, and worked as advertised each time.
During the initial backup streams, there were never any issues, such as dropped transactions or application-level errors, while taking database transaction log backups.
The speed of the restoration process exceeded anything that I could have envisioned. The magic performed in the Rapid Recovery process made this restoration quick without the need for additional fast storage in your own datacenter. I’m thrilled to have explored this, and am quite eager to continue to work more with the technology!