SQL Server 2005 has greatly increased the options for building a high availability database server. Clustering solutions have expanded, Log Shipping has been brought to all editions except Express, replication enhancements have greatly bolstered the possibilities with that technology, and, of course, there is database mirroring.
With the release of Service Pack 1 in April 2006, database mirroring is now a supported technology that you can deploy in your environment, and it is available in both Standard and Enterprise editions. This is one of the more exciting additions to the SQL Server feature list, and I know most DBAs are eager to take a look at it.
While setting up a high availability system is now much easier to accomplish technically, the decisions about how to architect the solution and which technology is best for your environment remain difficult. I am not claiming to be THE expert on HA or DR, but I've learned a few things, and there are mistakes that get made over and over in all kinds of organizations. This article looks to outline some of the considerations and compares the various technologies in a slightly different way than most comparisons do. Rather than examining the technologies themselves, we'll look at the requirements you may have.
High Availability and Performance
The first thing you need to understand about high availability systems is that, to a large extent, they are an insurance investment. In selling this to management, or even in responding to their desire for "100% uptime", you need to stress that high availability will cost money and there will be resources sitting idle.
This is a big point of contention with many managers. After they've written a five- or six-figure check for that cluster and seen no performance gains, they'll be less than thrilled that half of their investment is sitting idle, unused, just waiting for something to happen that may never occur. I know that, theoretically, you will use the failover node regularly and patch one system while the other serves clients, but we'll look at patches later.
This is the case because people often confuse high availability, usually clustering, with scale out and high performance. A high availability system is designed to prevent downtime. It is not designed to improve performance, increase capacity, or balance a load. Some HA systems might incorporate some of these features, but that is not their primary purpose.
Instead, an HA system should prevent extended downtime if hardware fails or there is a software problem. Notice I say "should", because there are still plenty of places where a failure can force downtime.
In comparing the technologies for their ability to improve performance, we can examine them as follows:
Failover Clustering: The secondary server in a cluster, whether there are 2 or more nodes, cannot be used for any performance gain. This node can potentially support another SQL Server instance doing other work, but that doesn't help the performance of the primary node. If you fail over to an active node, performance can actually get worse (see the cost section below for more on this).

Database Mirroring: The mirror database in a database mirroring setup is inaccessible to any other clients, so there are no direct performance gains. However, you can create a database snapshot on the mirror and use that for reporting or read-only queries against the database (a snapshot sketch follows this comparison). Snapshots can age quickly, though, so the dropping and recreating of them can cause issues in some environments.

Log Shipping: Log shipping has various drawbacks, but performance is not one of them. The secondary database in log shipping can be set up for access by clients, giving you the ability to scale out your reporting or read-only queries (see the standby restore sketch at the end of this section). Since the logs can be sent to multiple secondaries, you can potentially scale out to a local and a remote site, having clients query the closest node. One note with log shipping is that it can be combined with the other solutions to gain some of these performance advantages.

Replication: Replication is similar to log shipping in that the secondary or target databases are accessible by clients and can be used for queries. With merge replication, you can conceivably make your secondary databases read/write, resulting in a true scale out of the servers. Again, this can be combined with clustering or mirroring to gain these benefits.
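To illustrate the mirroring point, here is a minimal sketch of creating a database snapshot on the mirror for reporting use. The database name, logical file name, and file path are hypothetical examples, and keep in mind that database snapshots require Enterprise Edition.

    -- Run on the mirror server. SalesDB, its logical data file name, and the
    -- snapshot path are hypothetical; substitute your own.
    CREATE DATABASE SalesDB_Reporting_0800
    ON ( NAME = SalesDB_Data, FILENAME = 'D:\Snapshots\SalesDB_Reporting_0800.ss' )
    AS SNAPSHOT OF SalesDB;

    -- Reporting clients then query the snapshot, not the mirror itself:
    -- SELECT ... FROM SalesDB_Reporting_0800.dbo.Orders;

    -- Refreshing the data means dropping the snapshot and creating a new one,
    -- which is the drop/recreate churn mentioned above.
    DROP DATABASE SalesDB_Reporting_0800;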
None of these technologies delivers a performance gain easily, but if performance ranks above cost or other factors, including uptime, you should consider combining log shipping or replication with your primary technology to improve performance.
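For the log shipping option, the secondary can be restored WITH STANDBY so that it stays readable between log restores. This is a sketch only; the database name and file paths are hypothetical, and users must be disconnected from the secondary while each log backup is applied.

    -- Run on the log shipping secondary for each log backup received.
    -- Database name and file paths are hypothetical examples.
    RESTORE LOG SalesDB
    FROM DISK = 'E:\LogShip\SalesDB_20060901_0815.trn'
    WITH STANDBY = 'E:\LogShip\SalesDB_undo.tuf';

    -- Between restores the database is readable by reporting clients:
    -- SELECT COUNT(*) FROM SalesDB.dbo.Orders;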
Patches
When is your database system likely to go down? Obviously when you patch the system with service packs or hotfixes, but I'd argue the most likely time for the system to be unavailable is when the application is being patched. Whether it's client code changes or database schema changes, this is when your customers are most likely to be unable to use the system.
This is far more likely to happen than a hardware failure, fire, or natural disaster that might require a full-scale disaster recovery plan execution. Those might be more serious, and depending on where you are located, some are more likely than in other places, but software is still responsible for far more issues.
If we examine the impact of patches on these technologies, we get the following:
Failover Clustering: Unless you are lucky enough to be able to afford a second cluster, it is difficult to truly test patches for the SQL Server system itself. You have two points of vulnerability: one when you fail over and another when you fail back, and after applying a patch you need to know that both work. This means applying the patch to a secondary node, then failing over to test it. Next you patch the primary node and then fail back. If you have an N+1 cluster, the situation gets more complex because you may have multiple paths to test.

Database Mirroring: Failover with database mirroring is quick and fairly easy to set up (a failover sketch follows this comparison). A patch might impact your ability to fail back, but you could also set up a temporary third server ready to go. Then if the failback didn't work, you could always break mirroring between servers one and two and move it to two and three, or one and three, depending on how confused you are by this example.

Log Shipping: Log shipping is probably the most tolerant of patching, or at least of recovering from issues related to patches. Since you can set up log shipping to more than one server, meaning more than 2 servers in your set, you can easily remove the failed server from log shipping and keep the application on the other servers, or even move a new server into the setup to replace one that is having issues with patches.

Replication: I think replication may be the least tolerant if something breaks with regard to patches. Switching roles between a publisher and a subscriber is more cumbersome than with the other technologies, although this can be scripted and might be fairly easy for an experienced DBA to handle. This technology can also have multiple subscribers for one publisher. Usually you upgrade subscribers first, then the distributor, and finally the publisher, which should eliminate any issues with failover, but it could result in issues if you cannot perform a failback and need to move the publisher and/or distributor to a new server.
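For the mirroring scenario above, both a manual failover and breaking the session come down to single statements. This is a minimal sketch assuming a synchronous (high-safety) session and a hypothetical database named SalesDB; the third-server endpoint address is also an example.

    -- After patching the mirror, fail over to it from the current principal.
    -- Requires the mirroring session to be SYNCHRONIZED (high-safety mode).
    ALTER DATABASE SalesDB SET PARTNER FAILOVER;

    -- If the failback doesn't work out, break the session entirely:
    ALTER DATABASE SalesDB SET PARTNER OFF;

    -- Then re-establish mirroring with the standby third server, after the
    -- database has been restored there WITH NORECOVERY and that server has
    -- been pointed back at this one.
    ALTER DATABASE SalesDB SET PARTNER = 'TCP://server3.example.com:5022';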
Costs
The big item for most people. After all, implementing any of these technologies is not without cost, usually quite a bit over and above what a single server would otherwise cost. However, if you have existing servers, even if they are not SQL Servers, you may be able to leverage that equipment to save on your initial costs.
Failover Clustering: This is the most expensive option. Between the special hardware, shared disks, and duplicate hardware for the failover node, clustering is usually by far the most expensive route to high availability. Plus, don't forget the training for Windows administrators and SQL Server DBAs. On the positive side, this solution fails over the entire instance, so it usually can handle any application.

Database Mirroring: A much less expensive option, since you can use disparate, off-the-shelf hardware, and failover is faster than with clustering (a setup sketch follows this comparison). The downside is that the failover server may need to be larger than the primary server, especially if you have 2 or more databases failing over to it. Applications might also need to be reworked to take advantage of the database mirroring features through ADO.NET upgrades or SQL Native Client enhancements.

Log Shipping: This should be the least expensive option, since it can use any SQL Server and runs in all editions, including Workgroup, though not Express. You can have multiple failover paths, so as a local and remote failover technology this provides some nice benefits. The downside is that redirecting clients can take time or require more expensive network infrastructure to abstract the connection away from a specific SQL Server instance.

Replication: Also a low-cost solution, using regular hardware and multiple failover paths, but it requires the same network abstraction as log shipping. Another downside is that with the other solutions, a server that exists strictly for failover typically does not need to be licensed. Replication, however, is not positioned as an HA technology, so I would expect Microsoft to want failover servers licensed if they are implemented through replication. The administration cost could also be high here, as a large number of databases could result in some complicated replication schemes.
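To show how little special infrastructure mirroring needs on ordinary hardware, here is a minimal setup sketch. The server names, port, and database are hypothetical, and the client-side change referenced above is the Failover Partner keyword supported by ADO.NET 2.0 and SQL Native Client.

    -- On each partner server, create a mirroring endpoint (port is an example).
    CREATE ENDPOINT Mirroring
        STATE = STARTED
        AS TCP ( LISTENER_PORT = 5022 )
        FOR DATABASE_MIRRORING ( ROLE = PARTNER );

    -- On the mirror, after restoring the database WITH NORECOVERY:
    ALTER DATABASE SalesDB SET PARTNER = 'TCP://principal.example.com:5022';

    -- On the principal:
    ALTER DATABASE SalesDB SET PARTNER = 'TCP://mirror.example.com:5022';

    -- Mirroring-aware clients add the partner to their connection string, e.g.
    -- "Data Source=principal;Failover Partner=mirror;Initial Catalog=SalesDB;..."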
Conclusion
This is kind of a basic look at high availability, but I hope it comes from a different perspective than you might otherwise see. I tried to examine things from a practical standpoint, though I know there is much more to write than I've shown here.
If I had to choose a technology for SQLServerCentral.com, and I probably should, I'd look at database mirroring to fail over to another local instance at our colocation facility and log shipping to get a remote failover going at another location. We're getting close to implementing that type of solution, and I'd probably use some custom scripting to handle the job failover on the remote servers.
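As a rough idea of what that custom scripting might look like, here is a sketch that flips SQL Agent jobs on and off after a role change. The job names follow the LSBackup_/LSRestore_ convention used by the log shipping wizard, but the exact names and which jobs to toggle are assumptions you would adjust for your own environment.

    -- Run on the server taking over the primary role (job names are examples).
    USE msdb;

    EXEC dbo.sp_update_job @job_name = N'LSRestore_SalesDB', @enabled = 0; -- stop restoring logs here
    EXEC dbo.sp_update_job @job_name = N'LSBackup_SalesDB',  @enabled = 1; -- start shipping logs out

    -- On the old primary, once it is reachable again, reverse these settings
    -- so it takes over the secondary role instead.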