I did a post last month titled “RTO and RPO are myths unless you’ve tested recovery,” but I only briefly covered what RPO and RTO are. This post explains those acronyms in a bit more detail and why they matter. First, let’s go back to my working definitions:
- Recovery Point Objective (RPO) – How much data can you afford to lose?
- Recovery Time Objective (RTO) – How much time do you have to get the system back on-line?
Recovery Point Objective (RPO)
We measure RPO in terms of time, not size. This is why some folks, when first learning the two, get RPO and RTO confused: both are measured in time. RPO is focused on data loss. At what point in time does the amount of data lost exceed what the organization considers acceptable? This is exactly what it sounds like: a business decision.
What is acceptable is going to be based on cost versus loss. We can get to near zero data loss in a recovery situation, especially a disaster recovery situation, if the organization is willing to pay for it. However, the closer to zero we get, the more it is going to cost. For instance, having Always On Availability Groups with more than one secondary replica requires Enterprise Edition. Why would you need more than one secondary replica? If you want a secondary replica in the primary data center as well as at least one replica in another data center, so that you cover HA both locally and across data centers, that’s a minimum of two secondary replicas. Basic Availability Groups, which are limited to a single secondary replica, are out, and Enterprise Edition with Always On Availability Groups is required. Enterprise Edition is much more expensive.
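To make that concrete, here’s a minimal sketch of the topology just described: a primary, a synchronous secondary in the same data center for local HA, and an asynchronous secondary in the DR data center. The availability group, database, and server names are all hypothetical.

```sql
-- Hypothetical topology: primary plus a local synchronous secondary (HA)
-- plus a remote asynchronous secondary (DR). Two secondary replicas,
-- which is why Enterprise Edition is required.
CREATE AVAILABILITY GROUP [AG_Orders]
FOR DATABASE [OrdersDB]
REPLICA ON
    N'SQLDC1NODE1' WITH (
        ENDPOINT_URL      = N'TCP://sqldc1node1.corp.example.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    N'SQLDC1NODE2' WITH (  -- local HA partner in the primary data center
        ENDPOINT_URL      = N'TCP://sqldc1node2.corp.example.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    N'SQLDC2NODE1' WITH (  -- DR replica in the second data center
        ENDPOINT_URL      = N'TCP://sqldc2node1.corp.example.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL);
```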
Backups also have to be more extensive in case something goes wrong with the current installation, and that requires more storage and more compute. A tighter RPO also means more frequent transaction log backups. Also, if we have an availability group across data centers, we need to minimize latency, meaning the network connectivity has to be top notch and sufficient for all traffic going across that line. That’s a cost, too. So the business has to weigh cost versus loss and determine what it’s willing to lose versus what it’s willing to pay.
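As a simple illustration of how RPO drives backup strategy: if the business settles on a 15-minute RPO, the transaction log must be backed up at least every 15 minutes. The database name and backup path below are hypothetical.

```sql
-- Hypothetical: schedule this (e.g., via a SQL Server Agent job) to run
-- every 15 minutes so that, from backups alone, no more than ~15 minutes
-- of data is ever at risk.
BACKUP LOG [OrdersDB]
    TO DISK = N'\\backupshare\OrdersDB\OrdersDB_log.trn'
    WITH COMPRESSION, CHECKSUM;
```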
Recovery Time Objective (RTO)
This is a measure of how long it takes to bring the system back on-line so that it’s accessible by users. Since it, too, is a measure of time, it’s again a tradeoff between cost and loss; in this case, the loss is the business lost while the system is unavailable. Therefore, it is once again a business decision.
There are a lot of technologies we can employ to help improve RTO, but some of our ability to reduce RTO is constrained by the solution itself. For instance, a few years ago I had to review the architecture of an identity management system. It only allows for a warm standby mode: data can only be replicated from the systems in one data center to the system in the other data center asynchronously. Previously it only supported log shipping, but now, at least, an availability group in asynchronous commit mode is allowed. What isn’t allowed is for any services to be running on the application servers at the secondary site. Therefore, there is start-up time required if the secondary cluster needs to be brought on-line.
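For reference, configuring a replica for asynchronous commit looks something like the following; the availability group and replica names are hypothetical. Since asynchronous commit replicas only support manual failover, both options get set.

```sql
-- Hypothetical: put the DR replica in asynchronous commit mode.
ALTER AVAILABILITY GROUP [AG_Identity]
    MODIFY REPLICA ON N'SQLDRNODE1'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

-- Asynchronous commit replicas require manual failover.
ALTER AVAILABILITY GROUP [AG_Identity]
    MODIFY REPLICA ON N'SQLDRNODE1'
    WITH (FAILOVER_MODE = MANUAL);
```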
If you’re thinking, “If I can detect the first system being down, I can automate the restart of the secondary cluster,” that’s correct. However, the fact that the services have to be started means there is some downtime. That’s unavoidable. Also, the recovery scripts may be available, but the organization may want them ready to run rather than automatically executed, in case there are occasional “blips” in connectivity between the data centers. That increases the time to recover the solution. What’s supported definitely constrains our options to improve RTO.
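In a setup like that, the heart of the “ready to run, not automatically executed” recovery script might be a forced failover on the DR replica, something like the sketch below (hypothetical availability group name). Because the replica is asynchronous, this can lose data, which is exactly why a human confirms the primary is really down before running it.

```sql
-- Run manually on the DR secondary only after a human confirms the primary
-- data center is truly down, not just experiencing a connectivity blip.
-- Forced failover on an asynchronous replica can lose unsent transactions.
ALTER AVAILABILITY GROUP [AG_Identity] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```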
One last thing to say about RTO: even “zero down time” solutions can fail, and things can become so bad that restoring from backup is required. RTO should consider this possibility. I’ve seen too many cases where RTO is calculated and reported based on a best case scenario. As IT folks, we need to consider reasonable obstacles to recovery when we communicate solutions for both RPO and RTO. If we are forced to restore from backup, how long will that take? Never assume that backups aren’t necessary. There’s a lesson on that from the security files: a huge worldwide organization was only saved because a single domain controller in a remote site happened to be offline due to bad power after a nation-state action destroyed their systems with NotPetya. They weren’t targeted. They were just “collateral damage.”
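As a starting point for answering that “how long will a restore take?” question, backup history gives a rough baseline: restore time on comparable hardware is usually in the same ballpark as backup time, though only an actual test restore proves it. The database name here is hypothetical.

```sql
-- Pull recent backup sizes and durations from msdb as a first-order
-- estimate of restore time; validate with a real test restore.
SELECT TOP (10)
    bs.database_name,
    bs.type AS backup_type,    -- D = full, I = differential, L = log
    bs.backup_size / 1048576.0 AS size_mb,
    DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) AS duration_sec
FROM msdb.dbo.backupset AS bs
WHERE bs.database_name = N'OrdersDB'
ORDER BY bs.backup_finish_date DESC;
```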