Well, this is part four of my comments on starting and stopping SQL and
related items (Part 1, Part 2, and Part 3). It's a seemingly small thing until you have to do it for the
first time in production, or something goes wrong. This article will wrap up my
efforts with a look at some ideas that may save you pain.
How long does it take to restart the service? How long to reboot the
computer? Knowing these two items will absolutely increase your comfort level
when you have to do it 'for real' and beyond that, is often critical in making
good decisions. For example, with my current employer we strive for 24x7
availability, but we can easily schedule as much down time as needed any evening
after 9 pm with only a few hours notice. We can also two fifteen minute and one
thirty minute windows (breaks and lunch) during the day that we can use if we
can't wait until 9 pm. Let's say that you have a spid that is hung and you'd
like to restart the service to clear it. Can you do it in 15 minutes? Can you
reboot the server and have it available again in 15 minutes? I'll argue that
this is worth sticking on every data server you have as you may not be the
person doing it - and wondering if something has gone wrong!
Along those lines, let's say that you decide that you're going to stop &
start the service at noon. You send out email letting everyone know that you'll
begin promptly at noon. At 11:55 you're sitting at your desk watching the clock
and playing Solitaire to pass the time. At 12 sharp you right click the server
in Enterprise Manager and select shutdown, then sit back to wait for the
shutdown to complete. At some point it completes, you right click and select
start, then wait for recovery to complete. Easy enough, right?
Yes and no. Yes, you can do it that way and no, you shouldn't do it that way.
Let's look at what BOL has to say about shutdown:
- Disables logins (except for system administrators).
- Performs a CHECKPOINT in every database. However, if you stop an
instance of SQL Server using CTRL+C at the command prompt, it does not
perform a CHECKPOINT in every database. Therefore, the next time the
server is started, recovery time takes longer.
- Waits for all Transact-SQL statements or stored procedures currently
executing to finish.
Know imagine some knucklehead (ah, I mean user) has started at query at 11:55
that will take 20 minutes to run, what does that do to your schedule based on
the third bullet point? Checkpointing is a good idea, but again, you don't know
how long it will take. Together, they represent a fair amount of risk that you
won't meet your time window. Instead, I recommend the following strategy:
- At 10 minutes prior to shutdown, turn off SQL Agent. It can take a
minute or two to terminate depending on what is already in progress, and you
don't want it starting something just before you initiate the shutdown.
- Email the reminder that the server will be unavailable beginning at x.
- At 5 minutes prior to shutdown, run a checkpoint manually in each
database. In practice you may need to start it sooner if you've increased
the recovery interval beyond one minute, or you may able to start it later
if it doesn't take long to complete. You don't have to do this step, but
it's an extra layer of insurance to make sure as much as possible has been
written out.
- Issue 'shutdown with nowait'. This will immediately shut things down
without doing a checkpoint. Note that doing so doesn't mean you'll lose
data, it means that rather than cleaning up on shutdown, you'll have to do
it on startup. The checkpoint we did manually (and outside our window) will
reduce that cleanup to some degree.
- Immediately restart the service using whichever of the methods I've
discussed that you pre fer.
- Restart SQL Agent
- Notify everyone that the server is again available. Document the down
time and reason.
Neither of the above methods guarantees that users get a clean shutdown in
their applications. For example, let's say that your users have an OLTP type
system - order entry perhaps. Typically the application will connect to the
database, get what it needs, then close the connection. When the user saves the
changes, it reopens the connection and issues commands to persist the changes
(hopefully in a transaction). It's entirely possible that they will have started
their session at 11:59 successfully, work three minutes, then try to save at
12:02 and get an error. At that point you have two hopes. One is that users are
trained to not try to save until they are told it's ok (low tech yes, but valid)
or that the application is sturdy enough to give them a chance to retry the
operation until it completes. I'm not saying you shouldn't do the shutdown, just
making sure you see what else is occurring.
Let's change gears a little to a different scenario. Hopefully you're
familiar with how to restore the master database. Briefly, it involves putting
SQL into single user mode by starting it from the command line with the -m
switch. Do it on a practice machine and it's reasonably easy. Not so easy on a
server. Why? Well, they don't mention this in BOL, but you're not the only user
trying to connect to that server! Typically there are tens if not hundreds of
applications, jobs, reports, etc, all failing or being retried while the service
is down. When you start in single user mode there is no guarantee that you'll be
the single user! I ran into this for the first time just this year, replication
agents from another server were connecting as soon as the service came up. Easy
enough to stop them, but there may be other items that are not so easy to
control.
Hopefully some of this four part effort has been useful to you. I look
forward to reading your comments.