SQL Server Availability

  • As part of an annual review we've now added availability as one of out KPI's. Question is, how can i actually report on this? I know there is system uptime but this will reset if the services restart for whatever reason. Is there anything i can check / use to report how much downtime the sql service was down for in a month?

    Or am i asking too much and its a case of timing how long the a failover takes etc etc....?

    any help much appreciated

    _________________________________________________________________________________SQLGeordieWeb:- Jarrin ConsultancyBlog:- www.chrisjarrintaylor.co.ukTwitter:- @SQLGeordie

  • Try to make use of the logic that 'tempdb' database gets created evrytime when sql server service is restarted so log this in a central table and report on how often sql server got restarted or so in a month. Just a suggestion, dont know whether it can help you or not...

    MJ

  • Along the lines of Manu's suggestion...

    Create a table that holds a single datetime value and run a job that updates that value every minute.

    Create another table that will hold two datetime values.

    Create a procedure to excute on SQL startup that inserts the current datetime into the second table, and also pulls in the most recent time from the first table.

    The different between those two values will be the amount of time that the server was down. You can then measure the total downtime (or uptime) of the server over any given time period.



    Shamless self promotion - read my blog http://sirsql.net

  • You need to monitor this as Nicholas suggested so that you have values. By having each server instant monitor itself, it means that there aren't external dependencies.

    Be sure you include this table/db in a DR plan so that it is restored if the server has to be moved. This ensures you can tell how long each instance took to recover.

    I might also add a "reason" field in the 2nd table and try to update this if you have extended downtime.

  • Just a few thoughts from my time at a large insurance company...

    We never used any "self-monitoring" for a SQL Server instance, as that would always rely to some degree on that server being available to provide the data, and having to gather such data from a rather large number of instances presented it's own challenges. The company also had a rather large investment in a problem management system, as well as a significant mainframe environment where Tivoli was installed, and which was configured to have agents on ALL servers (in addtion to all the mainframe images, and not just SQL Servers on the distributed side). TSM took care of automatically opening a case in the problem management system when it detected a server down problem, or even a service down problem. This would force a response from the staff that manned the data center 24x7, which would often include a page to off-hours server support if the monitoring software didn't determine that the server was back up and running in fairly short order, or that the operations staff didn't make that determination.

    Measuring availability was thus done by a report I created and ran on a monthly basis to determine the length of any given outage as determined from numbers and date-times in the problem management system cases that "qualified" as outages. We had very specific criteria related to the SLA (service level agreement) for "availability".

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • The trouble is that there really isn't a perfect monitoring solution.

    Tivoli for example may look at an application being down and associate it to SQL even when it's not a SQL issue. It may not correctly poll SQL and even though it's up send an error.

    There are a great many monitoring issues, what it really comes down to is what you can afford, the level of alarming you require, whether you are looking closely at SLA items and the like.

    Even a simple uptime analysis is difficult if you are running in a cluster as polling intervals on the monitoring side could miss the failover and so downtime would not be recognized or accounted for (of course others use agents that would report even this scenario).

    Ultimately nothing is perfect, you just gotta find what fits your needs and budget.



    Shamless self promotion - read my blog http://sirsql.net

  • I guess I should have added a few of the caveats we had recognized...

    Tivoli most certainly wasn't perfect, and I can recall getting more than 1 off-hours call at oh-god hundred only to discover the allegedly down server up and running and not having any trouble at all. Cases generated from these instances and other similar situations were marked as "no outage", so as to avoid stinking up the stats. Given that there were human being present to monitor both Tivoli and check to ensure that it's reporting was accurate, we also knew that many such cases were generated and no support call was required because those folks checked with the tools at their disposal and discovered the good old "NTF".

    We also know that a server re-booting itself in the middle of the night for no apparent reason was better left alone, as the cost of trying to deal with it exceeded the value, so while the TSM monitoring was designed to cut a case after detecting a down time long enough to not likely just be a re-boot, the operations folks could cancel the case if the server came back up shortly after the case was created, and thus avoid a lot of unnecessary grief. These incidents were tracked and if a given server would repeat the behavior, then a case was cut manually and we'd investigate the cause. It was a relatively rare situation for any given server to repeat such behavior.

    Steve

    (aka smunson)

    :):):)

    Nicholas Cain (1/7/2009)


    The trouble is that there really isn't a perfect monitoring solution.

    Tivoli for example may look at an application being down and associate it to SQL even when it's not a SQL issue. It may not correctly poll SQL and even though it's up send an error.

    There are a great many monitoring issues, what it really comes down to is what you can afford, the level of alarming you require, whether you are looking closely at SLA items and the like.

    Even a simple uptime analysis is difficult if you are running in a cluster as polling intervals on the monitoring side could miss the failover and so downtime would not be recognized or accounted for (of course others use agents that would report even this scenario).

    Ultimately nothing is perfect, you just gotta find what fits your needs and budget.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I've been that human being (well if you can call me that) that sits there overnight to check on alarms and make sure that they are valid before notifying someone.

    It's a thankless task, and I am so very glad that I am a DBA now.



    Shamless self promotion - read my blog http://sirsql.net

  • I used to write my own monitoring tools in which a main server would page if a server is down. It also write to a table on the main monitoring database, and you could write comments on why it's down.

    Becarefull on what you are reporting and committing to as we all know there are unschedule down time and schedule down time. I think the schedule downtime should not be included as part of your performance because it's when you and all other parties agree that you could take it down for maintenance purposes.

    Good luck.

    Sopheap

  • Thanks to all that replied to this thread, your thoughts / ideas are gretly appreciated. I'm gonna go for the 2 table and one job running every min route, simple to implement and will do exactly what i want.

    Sopheap Suy, the KPI is only going to be measured during office hours ie 9am - 8pm Mon-Sat, this will allow for any potential maintenance which may involve downtime to not affect the results.

    Once again, cheers to everyone!

    _________________________________________________________________________________SQLGeordieWeb:- Jarrin ConsultancyBlog:- www.chrisjarrintaylor.co.ukTwitter:- @SQLGeordie

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply