July 31, 2011 at 9:02 pm
Comments posted to this topic are about the item The Impact of Outages
August 1, 2011 at 7:40 am
This points out the necessity of testing patches and updates. Our cycle is long enough to engage the vendor if problems arise in the test environment. My supported systems haven't been negatively affected by patches, but if they ever are, we are prepared to react and sustain the business with minimal outage.
This is possible due to our robust DR environment. This should be a must-have. If you have critical production systems in the cloud, what is your DR strategy and practice? How do you continue to operate when the cord to the cloud is cut? Pencil and paper shouldn't be the first option on your recovery list.
August 1, 2011 at 8:05 am
I really like the chaos monkey idea, but I don't think it makes sense to do it everywhere. The benefit that Netflix has is that everything is supposed to be automated, so the end users are the clients. I work for a hospital system, so the users are employees (office staff, doctors, nurses, etc.) and the clients are patients. Everyone agrees that our employees need to know how to operate without any given system, and each department has plans in place for situations like that.
However, since the backup system provides a lower level of care, I don't know that it's a wise idea to kill services and force the backup system into place. I think it's a good idea to tell people to act as if the primary is down and use the backup to confirm it does work and patients can be cared for. But if there is some issue, I don't want to have to say, "Well, the DB got corrupted when the service was killed, so you need to use the backup for another hour while we do a restore."
August 1, 2011 at 9:16 am
Where is the link to the Netflix piece mentioned?
August 1, 2011 at 9:27 am
My apologies, the links have been fixed.
August 1, 2011 at 9:28 am
From what I've seen, most outages (an extended period during which all or a large number of users cannot access the application or some key feature of it) are either network related (e.g., the DNS servers for a corporation or a major ISP are down) or caused by a third-party service provider (like a credit card processor) being down and creating problems for the organizations that rely on its service.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
August 1, 2011 at 9:36 am
cfradenburg (8/1/2011)
However, since the backup system provides a lower level of care, I don't know that it's a wise idea to kill services and force the backup system into place. I think it's a good idea to tell people to act as if the primary is down and use the backup to confirm it does work and patients can be cared for. But if there is some issue, I don't want to have to say, "Well, the DB got corrupted when the service was killed, so you need to use the backup for another hour while we do a restore."
I'd argue that this makes it even more important that you use the backup system. The chance that something won't work as well is there, and you want a real simulation showing that the backup system can handle the load appropriately. People might not really test the backup system properly without it being forced to act as the primary.
I certainly wouldn't want to unexpectedly make the switch with a chaos monkey, but I might schedule a few outages during the year and really test things.
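A scheduled drill like that doesn't need much tooling. The sketch below is a minimal, hypothetical example of the idea: pick one service from a pre-approved list and stop it gracefully during a planned window so the team can verify the standby takes over. The service names and the systemctl calls are assumptions, not anything from the editorial; adapt them to your own environment and keep the dry-run default until you mean it.

```python
# Minimal sketch of a scheduled "chaos drill" (hypothetical names throughout).
# Pick one service from an approved list and stop it gracefully so the team
# can confirm the backup system carries the load. Assumes a systemd host.
import random
import subprocess

APPROVED_SERVICES = ["reporting-api", "order-queue", "cache-layer"]  # hypothetical

def run_drill(dry_run: bool = True) -> str:
    victim = random.choice(APPROVED_SERVICES)
    if dry_run:
        print(f"[dry run] would stop service: {victim}")
    else:
        # Graceful stop, so recovery is just a restart if something goes wrong.
        subprocess.run(["systemctl", "stop", victim], check=True)
        print(f"Stopped {victim}; confirm the standby picks up the load.")
    return victim

if __name__ == "__main__":
    run_drill(dry_run=True)
```

The point of stopping the service cleanly rather than killing it is exactly the concern raised above: if the drill uncovers a real problem, getting back to normal is just a restart, not a restore.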
August 1, 2011 at 12:58 pm
Steve Jones - SSC Editor (8/1/2011)
I certainly wouldn't want to unexpectedly make the switch with a chaos monkey, but I might schedule a few outages during the year and really test things.
Right. Smoothly shut the server down. That way, if there is an issue that can't be worked around without a lot of trouble, it's just a matter of booting the server back up.
August 1, 2011 at 1:15 pm
You can still schedule outages and practice for the unexpected. Depending on how critical uptime is, perhaps you want to practice offsite.
I worked for a life insurance company about a decade ago. As part of their DR exercises once a year, they had contracted with a third-party auditing company to come up with a series of "chaos monkey" type events that would be added randomly to the exercise. So they put together a series of sealed envelopes with scenarios; each exercise would get one envelope.
Some of the envelopes would be empty (i.e. no twists), but some included:
- bad backup (the set we left with the off-site backup company was replaced with blank disks).
- "dead operator" (the DR manager was declared incapacitated and was not allowed to interact at all from the moment they read the note).
- faulty equipment, etc...
In other words, enough to make life really miserable if you couldn't think on your feet.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?