January 28, 2011 at 7:39 am
Agreed - especially today with HA (high availability) systems.
January 28, 2011 at 8:08 am
I agree with Dave62. The types should be counted separately for both hosted and in-house services. I disagree with the earlier post suggesting that in-house shops should get a pass when their systems are only partially out. To paraphrase Yoda: system up it is or system down it is; there is no try.
For many smaller IT shops, any claim greater than 99.9% total uptime is an absolute farce. 0.1% downtime is 8.76 hours per year, planned and unplanned combined.
What we really need is clearer, more realistic expectations. For most in-house shops, 4 scheduled hours per week (about 2.4%) is completely acceptable. Of course, they would whine if you told them this because they believe their reporting system is in the same league as Internet or phone service. Yet these same people would be badly inconvenienced by 20 hours per year (about 0.23%) of unscheduled downtime.
Please do not infer that I am saying we want 4 hours scheduled downtime per week. I just want people to realize that smaller budgets need to have more realistic expectations.
If it takes 50 spindles plus $200k worth of servers to provide what "the big boys have," but all you have is 2 quad-core boxes and 1 RAID cabinet, then you have to change your expectations. Either performance or scheduled downtime has to give. Failure to acknowledge reality will create unhappiness and unscheduled downtime. (Yes, I am facing a client with this attitude right now.)
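For anyone who wants to sanity-check the percentages above, here is a quick back-of-the-envelope sketch (Python; the window sizes are just the examples from this post, and it assumes a 365-day year):

HOURS_PER_YEAR = 365 * 24            # 8760; use 365.25 * 24 if you prefer

def downtime_budget(uptime_pct):
    # hours of downtime per year allowed by a given uptime percentage
    return HOURS_PER_YEAR * (100.0 - uptime_pct) / 100.0

print(downtime_budget(99.9))         # ~8.76 hours/year, planned + unplanned combined
print(downtime_budget(99.999))       # ~0.088 hours/year, i.e. roughly 5 minutes

# the example figures above: 4 scheduled hours/week, 20 unscheduled hours/year
weekly = 4 * 365 / 7                                  # ~208.6 hours/year
print(100 * weekly / HOURS_PER_YEAR)                  # ~2.4% of the year
print(100 * 20 / HOURS_PER_YEAR)                      # ~0.23% of the year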
January 28, 2011 at 8:48 am
It counts.
Our IT dept gives estimated downtime for the year. We always meet management expectations.
However, we never meet user expectations. Usually it's because they couldn't download a 283MB video file and email it to 10 people in their group to watch, or listen to Pandora uninterrupted while inputting data, or their Excel spreadsheet didn't calculate correctly, or their computer was plugged into the surge protector and not the battery backup, or their Blackberry keeps malfunctioning, or the printer is out of toner or paper.
If it plugs in or has a battery, it's an IT problem. We're doomed. :w00t: At least it's Friday.
January 28, 2011 at 9:30 am
We have an agreed maintenance window - once a month to apply any patches or application fixes. If you took that window away and wanted 99.999% uptime, we wouldn't apply those patches. So I ask: what is best? An agreed maintenance window where things (might) be down while we are working on them, which ensures system security and reliability, OR an unpatched, insecure system that could be brought down by the insistence that the application/database always be available?
Of course you could demand both, but I would ask for a more or less unlimited budget / extremely high cost in return. So yes, it counts: maintenance downtime is still downtime, but the more you demand, the higher the cost (and it's not a linear increase).
Even with today's HA options, a cluster requires up to a 30-second failover and a database mirror requires about a 3-second failover. Until SQL Server gets something similar to Oracle RAC, maintenance windows will be required.
January 28, 2011 at 9:52 am
CJ_Cerro (1/28/2011)
We have an agreed maintenance window - once a month to apply any patches or application fixes. If you took that window away and wanted 99.999% uptime, we wouldn't apply those patches. So I ask: what is best? An agreed maintenance window where things (might) be down while we are working on them, which ensures system security and reliability, OR an unpatched, insecure system that could be brought down by the insistence that the application/database always be available?
But you should just factor that into the SLA you are promising. There is nothing sacred about 5 9's. If you have an agreed monthly maintenance window of, say, 2 hours, then the best you could deliver would be about 99.7%. If your users are already willing to agree to a monthly maintenance window, then selling them on something realistic like 99% or 99.5% shouldn't be a stretch (unless they are mathematically challenged).
But at least you'll be measuring the same thing that your users experience: the total uptime of the system. And you'll have some incentive to find ways to reduce your maintenance window as well as reduce your unplanned down time. Maybe you can automate the testing after applying patches so you can bring it back online more quickly?
If you start with an unrealistic SLA, then yes, you'll be motivated to ignore the planned down time so your reported SLA looks better. But if you start with a realistic SLA, then you don't end up needing to game the numbers.
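A quick check of that ~99.7% figure (Python; assumes twelve 2-hour windows a year and no unplanned outages):

HOURS_PER_YEAR = 365 * 24
planned = 12 * 2                                        # one 2-hour window per month
print(round(100 * (1 - planned / HOURS_PER_YEAR), 2))   # 99.73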
January 28, 2011 at 10:57 am
I agree downtime is downtime; just make sure your SLA and measurements are clear on that point and it shouldn't be a problem. We should always be striving to reduce downtime, planned or not, as many of the other comments point out.
Great topic.
January 28, 2011 at 11:28 am
I don't have time to read all these posts at the moment, but my opinion?
Both statistics should be measured and published.
On the one hand, uptime is measured in terms of the 'unexpected'. This metric has its own goals and solutions. You always want it to be minimal, regardless.
On the other hand, maintenance downtime is a mixed bag. If you have zero maintenance downtime then likely something is being neglected, unless you have a wonderful clustering / mirroring solution which permits maintenance without downtime.
Finally, comparing the two enables a good metric to determine whether problems exist or outages are something to focus on.
A single metric is a composite that provides little or no insight into reality.
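If you do measure and publish both, the bookkeeping is simple enough; a minimal sketch (Python, with made-up outage records purely for illustration):

HOURS_PER_YEAR = 365 * 24

# (duration in hours, planned?) -- illustrative numbers only
outages = [(2.0, True), (2.0, True), (1.5, False), (0.5, False)]

planned = sum(h for h, p in outages if p)
unplanned = sum(h for h, p in outages if not p)

print("planned-only availability:   %.3f%%" % (100 * (1 - planned / HOURS_PER_YEAR)))
print("unplanned-only availability: %.3f%%" % (100 * (1 - unplanned / HOURS_PER_YEAR)))
print("total availability:          %.3f%%" % (100 * (1 - (planned + unplanned) / HOURS_PER_YEAR)))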
January 28, 2011 at 11:37 am
If the business decision is to ask for 99.999% availability, and the funds are allocated to provide it, then it's up to us to plan for minimal disruption and failover for maintenance or failures. Numbers alone don't mean anything. A computer could be available 100% of the time for a full year .... when used as a paperweight. So, as others have suggested, if maintenance downtime is acceptable to the users, then it should be reported separately from unscheduled downtime.
January 28, 2011 at 12:53 pm
As old Shakey once said, "A rose by any other name is still a rose." Downtime is downtime.:-D
"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"
January 28, 2011 at 3:28 pm
No way should you classify scheduled and agreed maintenance, testing, implementation, or anything that has prior agreement with the users as "downtime"! Perhaps I just have an old-fashioned interpretation of the word, but to me downtime is anything that is NOT planned and has happened as a result of faults in the system. Monitor both, yes definitely; have separate SLAs, yes why not. But keep it separate - the people who look at these stats are more concerned about unplanned downtime; after all, those are the outages that need to be fixed. Lumping all downtime into one SLA can potentially hide the fact that there are real issues.
January 28, 2011 at 5:40 pm
For a 24 X 7 X 365.25 system, any downtime is downtime. However, you should design your system (e.g., with clustered servers) so that you can do maintenance without disrupting your service. 😉
January 29, 2011 at 1:59 pm
Full Circuits (1/28/2011)
For a 24 X 7 X 365.25 system, any downtime is downtime. However, you should design your system (e.g., with clustered servers) so that you can do maintenance without disrupting your service. 😉
Either that, or if that is too expensive a solution for you, just mirror your important databases and perform rolling upgrades on them.:-D
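For what it's worth, the rolling-upgrade pattern over a mirroring pair is roughly: patch the mirror, do a manual failover, then patch the old principal. A rough orchestration sketch, not a finished implementation - it assumes synchronous (high-safety) mirroring so a manual failover is allowed, and the server names and the patch/monitoring helpers are hypothetical placeholders (Python/pyodbc):

import pyodbc

PRINCIPAL = "SQLNODE1"   # illustrative server names only
MIRROR = "SQLNODE2"
DB = "ImportantDb"

def run_on(server, sql):
    # autocommit because ALTER DATABASE cannot run inside a user transaction
    conn = pyodbc.connect("DRIVER={SQL Server};SERVER=%s;Trusted_Connection=yes;" % server,
                          autocommit=True)
    conn.cursor().execute(sql)
    conn.close()

def apply_patches(server):
    # hypothetical placeholder: hand off to whatever patch tooling you already use
    print("patching", server)

def wait_until_synchronized():
    # hypothetical placeholder: poll sys.database_mirroring until
    # mirroring_state_desc reports SYNCHRONIZED for DB
    pass

apply_patches(MIRROR)            # 1. patch the mirror while users stay on the principal
wait_until_synchronized()
run_on(PRINCIPAL, "ALTER DATABASE [%s] SET PARTNER FAILOVER;" % DB)   # 2. brief, controlled failover
apply_patches(PRINCIPAL)         # 3. patch the old principal (now the mirror); fail back later if desired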
"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"
January 29, 2011 at 7:06 pm
I believe it depends upon the operating SLAs (scheduled vs unscheduled). In our (BIG Pharma) organization, scheduled maintenance is not considered downtime; the dashboards/metrics are designed to count only unscheduled outages as downtime. As for customers, whether they consider maintenance to be downtime usually depends on how long it lasts.
January 29, 2011 at 9:10 pm
Lynchie (1/28/2011)
No way should you classify scheduled and agreed maintenance, testing, implementation, or anything that has prior agreement with the users as "downtime"! Perhaps I just have an old-fashioned interpretation of the word, but to me downtime is anything that is NOT planned and has happened as a result of faults in the system. Monitor both, yes definitely; have separate SLAs, yes why not. But keep it separate - the people who look at these stats are more concerned about unplanned downtime; after all, those are the outages that need to be fixed. Lumping all downtime into one SLA can potentially hide the fact that there are real issues.
I agree with you in one aspect, Lynchie, but there are ways to track both.
The question isn't "are we in trouble for the downtime," it's "does it count against the end users?" A true 5-nines system is a nightmare to actually create. Most systems with that volume of data require some maintenance on occasion, which means all sorts of hoop jumping.
That said, the downtime of any system must be lumped together. In the end only one set of people matters, and they don't care what's happening on the backend: the end consumers. They only care about one thing: is it up and running, or is it not working? It's a bit value to them. The more it's 'off', the more it's a problem.
So yes, because of that, in an SLA the final 'downtime' values need to be the total of both planned and unplanned outages, even if we track both separately. I guess I'll echo 90% of the folks in this thread.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
Twitter: @AnyWayDBA
January 30, 2011 at 3:38 pm
Craig Farrell (1/29/2011)
That said, the downtime of any system must be lumped together. In the end only one set of people matters, and they don't care what's happening on the backend: the end consumers. They only care about one thing: is it up and running, or is it not working? It's a bit value to them. The more it's 'off', the more it's a problem.
Totally agree: downtime is viewed differently by different users. End-users consider *any* time they cannot use the system as downtime (planned or unplanned). IT staff will make a clear distinction between maintenance and failure (the end-user won't give a rat's *ss about any such distinction). Management probably only cares about unexpected downtime, since it likely results in unexpected cost.
In other words: how you count maintenance vs failures depends on which parties the SLA is for (end-users vs the IT department, company vs service provider, etc.). But by all means publish your maintenance windows to the whole world (ram them down the throats of your end-users especially; they won't open any SLA document, they just try to use the system whenever it's convenient for them and will throw a tantrum if it doesn't work, no matter what the reason).