Database professionals of the world – I have a question. Has your organization defined service level agreements (SLAs) for your data estate? I’m talking specifically about the Recovery Point Objective (RPO) and Recovery Time Objective (RTO), and about having them defined not as an arbitrary number of nines, but in minutes or hours. If these aren’t defined from above, your business continuity plan is doomed to fail.
In basic terms, RPO is how much data, usually measured in time, your business is willing or able to lose if a system failure occurs. RTO is the amount of time that business-critical systems can be offline before the outage becomes a disaster for the organization. These metrics should be defined for both planned and unplanned outages. A planned outage could be a critical system needing routine maintenance, such as software updates or operating system patching. Planned outages are sometimes not factored into system designs, so core systems go unmaintained, which leads to security issues or platform instability. Unplanned outages can be as small and limited in scope as an operating system freezing and needing a restart, or as large as a full site outage caused by a centralized storage failure or a natural disaster.
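To make RPO concrete, here is a minimal sketch, assuming SQL Server and that the msdb backup history has not been purged, that estimates the current worst-case RPO per database as the time elapsed since the last completed backup:

```sql
-- Rough current worst-case RPO per user database: minutes since the most
-- recent full, differential, or log backup finished, per msdb history.
SELECT
    d.name AS database_name,
    MAX(b.backup_finish_date) AS last_backup_finish,
    DATEDIFF(MINUTE, MAX(b.backup_finish_date), GETDATE()) AS minutes_of_potential_data_loss
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
   AND b.type IN ('D', 'I', 'L')   -- full, differential, log
WHERE d.database_id > 4            -- skip system databases
GROUP BY d.name
ORDER BY minutes_of_potential_data_loss DESC;
```

If that number is already larger than anything the business would tolerate, you have a head start on the SLA conversation.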
When you design a business continuity strategy, these SLAs must be defined not by a given database availability feature that you might want to use, but by the C-level in your business. The business must be on board with these metrics from the top down. If the business hasn’t defined (or won’t define) these two metrics, the unwritten expectation in the minds of the leaders is that any outage will result in no data loss and near-immediate recoverability. They might tell you ‘best effort’ or give vague requirements, but without formal SLAs in place, an outage will bring out the ‘best’ in people when the systems are offline longer than the business can sustain.
In some cases, then, failure to define the SLAs means the business continuity strategy is left to the implementers in IT without clear design targets. At that point, it becomes best effort. The constraints of designing a modern data platform, whether in the cloud or on-premises, mean the availability and disaster recovery options are limited by the budget of the IT organization, and that budget rarely allows for a design that meets or exceeds these unwritten expectations.
In other cases, members of IT want experience with a certain continuity feature, such as Microsoft SQL Server availability groups. But without defined SLAs as targets, designing a continuity architecture by starting with the features rarely delivers the ‘right’ level of SLA, and it usually overcomplicates the architecture. Overcomplicating the platform almost always results in more outages, or longer ones, over the course of a given year, which defeats the purpose of the architecture.
So, let’s say your organization has now properly defined the SLAs. Examples we usually see in the field are an RPO of no data loss for minor, localized incidents and 30 minutes of data loss for larger incidents, and an RTO of 30 minutes for smaller-scale incidents and 24 hours for a large-scale outage. At this point, we can evaluate our options and start to select the techniques and features that meet (and hopefully exceed) the SLAs.
Three main items now need to be planned – a backup and restoration solution, high availability, and disaster recovery. These tend to blur lines and overlap a bit, depending on the solution. For more complex platforms like database servers, the operating system environment (OSE) and the databases need to be considered separately. If one backup solution can cover both well enough to meet or exceed the SLAs, a single solution is better than multiple solutions that might collide or compete.
Now that the SLAs are defined, I always ask a lot of questions. First, how advanced or “seasoned” are the staff members managing these servers? Would a stand-alone server with no fancy HA or DR configuration meet the SLAs? Can the staff support a more complex architecture, and if not, are they willing (or able) to learn? That question is much harder than it sounds, as most won’t admit they cannot support a solution they might want more experience managing. If not, weigh the architectural options carefully, and consider bringing in outside help to engineer the solution and train the team.
Start mapping out the available features and solutions based on what the staff can support (or what outside partners can help engineer and train the team on). Let’s assume the staff are fairly seasoned and can support a more complex environment. Let’s also assume these servers are ones in my wheelhouse, namely SQL Server VMs. Map out the features available for HA and DR and the SLAs they can deliver.
| Layer | Feature | RPO | RTO | Note |
| --- | --- | --- | --- | --- |
| SQL Server | Availability Groups (synchronous) | Zero | Less than 30 seconds | Assuming replicas are not located on the same storage device and that device does not fail |
| SQL Server | Availability Groups (asynchronous) | Low | Less than 30 seconds | “Low” RPO is based on rate of data change and bandwidth to replicate data to the secondary replica, and is usually less than one minute during business hours |
| SQL Server | Failover Cluster Instance | Zero | Less than five minutes | Assuming the shared storage, if present, does not fail |
| SQL Server | Log shipping | Low | Low | “Low” RPO is based on log replication timing and available bandwidth; “low” RTO is based on staff availability to promote the destination to live and time to change app connection config |
| Infrastructure | Backup replication | Varies | Varies | Varies based on backup software features and platform speed for recovery |
| Infrastructure | Storage replication | Low | Varies | Subject to SAN LUN-level replication windows and platform/application changes required to promote the replicated copy to active |
| Infrastructure | VM-level replication | Zero to low | Varies | Synchronous or asynchronous, depending on bandwidth, but could lack point-in-time recoverability, and RTO varies greatly |
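For the asynchronous availability group row above, the “low” RPO can be observed rather than guessed. Here is a minimal sketch, assuming an existing availability group and run on the primary replica, that shows how far each secondary is behind:

```sql
-- Run on the primary replica. Per database and secondary replica, shows
-- how much log has not yet been sent (data at risk = effective RPO) and
-- how much redo remains on the secondary (contributes to effective RTO).
SELECT
    ag.name                        AS availability_group,
    ar.replica_server_name,
    ar.availability_mode_desc,
    adc.database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size        AS log_send_queue_kb,
    drs.redo_queue_size            AS redo_queue_kb,
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
JOIN sys.availability_databases_cluster AS adc
    ON adc.group_id = drs.group_id
   AND adc.group_database_id = drs.group_database_id
WHERE drs.is_local = 0
ORDER BY ag.name, ar.replica_server_name, adc.database_name;
```

Sampling this during peak business hours tells you whether the “usually less than one minute” figure actually holds in your environment.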
Map out your available options carefully, and note which layer is responsible for the availability action. Knowing which layer is responsible means you can train the people involved, as a system administrator might not be comfortable restoring a SQL Server Availability Group architecture without the help of a DBA. Planning now for the personnel required will help speed up the recovery time in the event of an actual emergency.
Understand the nuances that accompany your specific platform, such as available bandwidth, endpoint latency, or dependencies on other items in the environment. Just know that a database being online and available doesn’t mean the application can connect, and if it can’t, the data platform is still down for the users. Document, and ramp up staff on, the manual processes involved, such as application-level changes that must be made to get the application talking to the database, or public IP addresses that must be updated before a web site comes back up. Factor in variables such as the complexity of the architecture, the processes to follow if key staff are unavailable, and the expected time to recovery for the various scenarios that impact the design. Identify and document common situations that might take down a system or site, from small-scale events like a bad OS-level patch to larger events such as a hurricane hitting your primary datacenter and taking out the power for a week.
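As one example of turning those manual checks into something documented and repeatable: if the design uses an availability group listener, a query like this sketch (assuming a listener exists) confirms the listener name, port, and IP state the applications depend on. Pair it with an actual connection test from an application server through the listener, because that is what the users experience.

```sql
-- Post-failover checklist item: is the listener the applications connect to
-- online, and which IP address is it currently presenting?
SELECT
    ag.name       AS availability_group,
    agl.dns_name  AS listener_name,
    agl.port,
    ip.ip_address,
    ip.state_desc AS ip_state
FROM sys.availability_group_listeners AS agl
JOIN sys.availability_groups AS ag
    ON ag.group_id = agl.group_id
JOIN sys.availability_group_listener_ip_addresses AS ip
    ON ip.listener_id = agl.listener_id;
```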
Build your prototype platform. Validate the architecture with failover and fail-back testing. Organizations rarely factor the fail-back portion into a business continuity strategy, and that gap usually only becomes apparent during the fallout from an actual emergency.
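If the design is built on availability groups, the core of a planned failover and fail-back drill can be as simple as the following sketch. AG_Sales is a placeholder name, and the command is run on the replica you are failing over to, which must be a synchronized synchronous secondary.

```sql
-- Planned failover drill for a hypothetical availability group named AG_Sales.
-- Run on the synchronized synchronous secondary you are failing over TO:
ALTER AVAILABILITY GROUP [AG_Sales] FAILOVER;

-- After the applications are validated against the new primary, fail back by
-- running the same command on the original primary (now a secondary) once its
-- databases report SYNCHRONIZED again:
-- ALTER AVAILABILITY GROUP [AG_Sales] FAILOVER;
```

The commands are the easy part; the drill is really about proving the surrounding runbook, connection strings, and people.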
Finally, test. Test. TEST. And then test some more. And don’t just test until it finally works. Plan to test periodically throughout each year after you take this design to production, as data platforms are fluid architectures subject to constant change. A working BC test today might be completely trashed tomorrow if an undocumented router change is not replicated to the DR equivalent. Incomplete testing almost always results in some setting being missed, which makes for a horrendous experience when an emergency failover is required. The most successful business continuity strategies actually plan a full failover and run FROM the DR site for a portion of the year, so that the end-to-end details of a failover are proven to work.
Proper, recoverable backups are the foundation of any business continuity strategy. If you can’t recover a backup, the rest of the platform is almost useless. You must start here, because recovering the data is arguably the most critical step. Your availability architecture flows from this step based on your SLAs.
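As a starting point for that foundation, here is a sketch with hypothetical database and file names: back up with checksums, verify the file, and then prove it with a periodic full restore to a test server.

```sql
-- Hypothetical database name and paths; adjust for your environment.
-- Back up with checksums so page-level corruption is caught at backup time:
BACKUP DATABASE [Sales]
TO DISK = N'\\backupshare\sql\Sales_full.bak'
WITH CHECKSUM, COMPRESSION, INIT;

-- Confirm the backup file is readable and its checksums are valid:
RESTORE VERIFYONLY
FROM DISK = N'\\backupshare\sql\Sales_full.bak'
WITH CHECKSUM;

-- The only real proof is a periodic full restore to a test server, e.g.:
-- RESTORE DATABASE [Sales_RestoreTest]
-- FROM DISK = N'\\backupshare\sql\Sales_full.bak'
-- WITH MOVE 'Sales'     TO N'D:\Data\Sales_RestoreTest.mdf',
--      MOVE 'Sales_log' TO N'L:\Log\Sales_RestoreTest.ldf';
```

RESTORE VERIFYONLY only proves the file is intact; a scheduled restore test (ideally followed by DBCC CHECKDB) is what proves the data is actually recoverable within your RTO.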