Opportunity
For production systems,
uptime is a typical measure of how well we as administrators do our
jobs. It's not the only one, but it's an important one. I was once
told that if DBAs keep their databases up, keep them fast, and can
recover in the event of a failure, then they've earned their
salary no matter what it might be. While some may disagree with
this, it certainly encapsulates the role we all play in the service
we provide to our employers. This article focuses on the third item
– recovery.
Background
At the company I work
for, we have a centralized data center with many database servers and
several branch offices with a couple of database servers in each one.
A few months back, we began a project to assess where we were in
terms of data loss and recovery. After our review, we found that we
were in good shape in our data center, but realized that if one of
the servers in a branch office failed – not a disaster mind
you, but just failed – we would be in dire straits with our
customers who rely on our systems for data exchange. These servers
host databases that didn't necessarily require 24x7 uptime, but
an outage of more than a couple of hours would lead to escalating
fines and affect our ratings with our customers. What to do?
Redundancy
The first thing that
comes to mind for reducing the risks associated with downtime is
installing redundant servers. The idea is that if the production
server does fail, the application can be moved to another server and
availability restored. If you go this route, you have to convince
your boss that the application is critical enough to warrant letting
thousands of dollars' worth of equipment potentially sit
underutilized. If you win this battle, then the next step is to
decide exactly how you’re going to achieve that redundancy.
To understand what
server redundancy really means, let’s look it up. According to
Wikipedia, redundancy can be defined as:
Redundancy,
in general terms, refers to the quality or state of being redundant,
that is: exceeding what is necessary or normal, containing an excess.
This can have a negative connotation, superfluous, but also positive,
serving as a duplicate for preventing failure of an entire system.
Obviously, for our
purposes we’re taking the positive side of the topic and
considering those scenarios where applications must be available to
users whenever users are available to access those same applications.
But due diligence requires that we invest our employer’s money
as efficiently as possible. What are our options?
There are several ways
to maximize uptime for database systems, each with its own advantages
and disadvantages. In this article, I'll review the three
options we considered and explain why we selected the one we
chose.
The options we
considered were:
Redundant Servers – No Synchronization
Clustering
Log Shipping
Redundant
Servers – No Synchronization
Unsynchronized
redundant servers satisfy at least one requirement of maximizing
uptime by having replacement hardware onsite. This requires
purchasing duplicate hardware for single applications, in case of
catastrophic system failure. The advantage is that hardware is
always available, ready to be deployed as needed.
The disadvantages
include
doubling the cost
of hardware
time needed to
restore applications and databases from the production system
backups
Due to the trivial
nature of this solution, we didn’t see much value in going this
way, but we included it for the sake of completeness.
Clustering
Clustering was the
second method we looked at and is defined as a grouping of two or
more servers into a single logical entity known as a cluster.
Clusters typically come in two flavors –
Active/Active
Active/Passive.
An Active/Active
cluster is composed of two or more servers, with all servers in the
cluster providing resources to users. In an Active/Passive cluster,
one server in the cluster responds to user resource demands –
called the Primary, while the other is in standby mode – called
the Secondary. In the event of one of a number of predetermined
failures, the Primary server will transfer processing tasks –
or fail-over – to the Secondary. In our experience, this
transfer disrupts access to the applications installed on the cluster
– such as databases – for a period ranging from 5 seconds
to 2 minutes. The Active/Active configuration is more complex and
can be more expensive than the Active/Passive. Because we were
considering the needs of branch offices where no IT staff was
stationed, we only considered the Active/Passive type of cluster.
In order to evaluate
the fit for clustering at my company, a test environment consisting
of an Active/Passive cluster was set up. SQL Server was installed and
a typical production database restored. We ran a battery of tests on
the cluster and arrived at the following results:
Advantages
the cluster
performed “as advertised” by failing-over after the
occurrence of a failure event
the fail-over
itself took a very short period of time – ranging from 10 to 30
seconds
the fail-over was
automatic, ensuring that in most cases, downtime was minimized
clustering uses a
single name and machine address, allowing applications to always
connect to the same address, regardless of which server is actually
servicing the request
Disadvantages
Setup and
maintenance of a clustered environment was considerably more complex
than other solutions. This includes everything from stopping and
starting services to stopping and starting the individual servers in
the cluster. This was a major drawback for us, again considering
that our solution was to be installed at all of our branch offices.
Cost –
clustering was the most expensive of the 3 options considered,
requiring not only redundant hardware, but an Enterprise Edition
license for Windows Server 2003 (WIN2K3EE) as well as an Enterprise
Edition license for SQL Server 2000 (SQL2KEE) for each server in the
cluster. In addition, the two servers in the cluster are
recommended to be identical, for obvious reasons.
Automatic
fail-over. In addition to being an advantage, this feature can be
problematic as well, especially when IT staff isn’t
experienced in dealing with clusters. Old habits are hard to break.
For example, when stopping and starting services, you must do so
from the Cluster management console. If you don’t, the
cluster will assume that a failure has occurred and fail over. That
was a painful lesson.
Clusters utilize
separate server hardware, but share a single disk drive unit. As a
result, clusters are as susceptible to disk failures as single
servers.
Log Shipping
Log shipping is the
third and final option we considered. Log shipping is intended to
provide a type of fail-over similar to that of clustering with one
major difference - the fail-over process for log shipping is manual,
involving a predefined set of steps.
It’s essentially
a solution composed of two servers, with one server acting as the
Primary – servicing user requests - and the other as the
Secondary. The purpose of the Secondary server is to receive and
then apply database changes from the Primary transaction log at
predefined intervals. By doing so, the Secondary is synchronized
with the Primary up to the last interval. This interval is
configurable and for testing purposes we varied the time span from 5
to 15 minutes.
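To make the mechanism concrete, here is a minimal T-SQL sketch of what one interval of log shipping boils down to. The database name and file paths are hypothetical, and the home-grown scripts we actually use (with SQL LiteSpeed backups) wrap considerably more around this; the sketch shows only the bare idea.

-- One-time seeding (not shown): restore a full backup of the database on
-- the Secondary WITH NORECOVERY so it can accept subsequent log restores.

-- On the Primary (PROD1), at each interval: back up the transaction log.
BACKUP LOG SalesDB
    TO DISK = 'E:\LogShip\SalesDB_200601011200.trn'

-- Copy the file to the Secondary (a batch script or RoboCopy will do),
-- then on the Secondary (LOGSHIP1): apply it, staying in a restoring state
-- so the next interval's log backup can still be applied.
RESTORE LOG SalesDB
    FROM DISK = 'E:\LogShip\SalesDB_200601011200.trn'
    WITH NORECOVERY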
Log shipping is
actually an old technology and can be implemented in various ways,
from the very simple to the very robust. You can implement it by
“rolling your own” or by using the wizards included in
SQL Server 2000 Enterprise Edition (but not Standard). We decided to
use a home-grown version written by Bill Wunder and
graciously provided for free on his website at
http://b.wunder.home.comcast.net/15758.htm.
These scripts are designed to be used with the SQL LiteSpeed product,
which we had already purchased. Thanks, Bill – you rock.
In order to evaluate
log shipping, a test environment was set up, SQL Server was installed
and a typical production database restored. We ran a battery of
tests on this setup and arrived at the following results:
Advantages
Once the
intricacies of setting up our log shipping routines were overcome,
setup was straightforward.
We were able to
accomplish fail-over in the test environment in a matter of minutes.
While this time could reasonably be expected to grow to the range
of 20-40 minutes (it is a manual fail-over after all), we considered
this to be acceptable. It’s important to note here that if
this duration is unacceptable, then log shipping probably won’t
work for you.
Manual fail-over.
Given our experience in managing our clustering tests, the argument
could be made that this is a definite advantage.
Single machine
name and address. Although – strictly speaking – this
was not a function of log shipping, by utilizing an alias, we were
able to eliminate log shipping’s primary drawback –
having to reconfigure applications to look at a new machine name and
address after fail-over.
Cost. Log
shipping can function using Windows Server 2003 Standard Edition and
SQL Server 2000 Standard Edition licenses. You'll need to purchase
two OS and two database licenses, but by using Bill's home-grown
solution and not needing the Enterprise Edition of either
product, you still come out ahead.
Redundancy of
disks. Log Shipping employs two separate servers to accomplish
redundancy – redundancy that includes two separate disk
arrays.
One Secondary
server can service multiple Primary servers.
Maintenance is
much simpler than that required for clustering.
Disadvantages
Manual fail-over.
The duration of system fail-over is longer than that experienced
with clustering.
Decision
For us, log shipping
provided the best mix of benefit versus cost. We
decided to continue with Bill's home-grown solution, as it had
everything that we needed in a single package.
Implementation
Details
It’s important to
note the specifics of our implementation. We felt that one of the
major drawbacks to log shipping was the need to point applications to
a new server after the fail-over. Our solution was to use an
IP alias set up by our network group and point our applications to it
instead of the production server. This alias points to a specific IP
address – the same as the primary's.
For example, given the
following server setup:
Production Server:
Name: PROD1
IP: XXX.XX.XX.223
Secondary Server:
Name: LOGSHIP1
IP: XXX.XX.XX.224
We create an alias
like this:
Alias:
Name: PROD1V
IP: XXX.XX.XX.223
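As a hedged illustration only (the catalog name is made up, and this is not our actual configuration), an OLE DB connection string pointed at the alias rather than the physical server might look like:

Provider=SQLOLEDB;Data Source=PROD1V;Initial Catalog=SalesDB;Integrated Security=SSPI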
At this point, we
changed all references in our application to point to the alias, not
the physical server. The reason is that now, at fail-over:
We restore the
database on the secondary server (LOGSHIP1)
Change the IP of
the primary to an unused IP address
Change the IP of
the secondary to that of the alias
Once this is complete,
client applications can reconnect to the alias (PROD1V) without
realizing that they're actually connecting to the secondary server
(LOGSHIP1).
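The restore step above is where the secondary copy of the database is brought fully online. A minimal T-SQL sketch of that step, assuming the last shipped log backup is already on hand and using hypothetical database, path, and user names (the actual fail-over follows the home-grown scripts' own procedure):

-- On the secondary (LOGSHIP1): apply any remaining log backup, then
-- recover the database so it will accept connections.
RESTORE LOG SalesDB
    FROM DISK = 'E:\LogShip\SalesDB_last.trn'
    WITH NORECOVERY

RESTORE DATABASE SalesDB WITH RECOVERY

-- Database users restored from the primary can be orphaned from the
-- secondary's server logins; report and remap them (SQL Server 2000).
-- 'appuser' and the password are hypothetical; the password is used only
-- if a matching login has to be created.
EXEC sp_change_users_login 'Report'
EXEC sp_change_users_login 'Auto_Fix', 'appuser', NULL, 'TempPassw0rd!'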
Of course, for any
fail-over to work, there are details to attend to, including:
Synchronizing data directories and paths – we synchronize hourly, using a simple batch script and RoboCopy
Synchronizing jobs – we synchronize daily using DTS
Synchronizing logins – we synchronize daily using DTS
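As one illustration of the kind of check the login synchronization implies (our actual transfer runs as a DTS package), the following hedged T-SQL snippet lists SQL logins that exist on the primary but not on the secondary. It assumes a linked server on the secondary named PROD1 pointing at the primary, an assumption made purely for the example:

-- Run on the secondary (LOGSHIP1); PROD1 is an assumed linked server.
SELECT p.name
FROM PROD1.master.dbo.syslogins AS p
LEFT JOIN master.dbo.syslogins AS s
    ON p.name = s.name
WHERE s.name IS NULL
  AND p.isntname = 0   -- skip Windows logins and groups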
Server/Application
redundancy is one of those things that you put in place, then hope
that you never have to use in production. We’ve had the
occasion to fail over using log shipping, and it's worked well for us.
Let me know if you have
any questions or comments.
Mark F. Cook