March 30, 2011 at 10:19 pm
Comments posted to this topic are about the item The Chance of Failure
March 30, 2011 at 10:35 pm
Good article Steve.
I live in Christchurch, New Zealand and I have to admit that I thought the possibility of losing an entire center here was extremely remote. However recent events (two major earthquakes in 6 months) have seen this risk realized, unfortunately.
I wrote two blog posts about this recently
March 31, 2011 at 1:31 am
Large disasters are one thing, but programmers are an optimistic lot who rarely check for error conditions as trivial as the following:
If Flag = '1'
do
.....
else /* assume flag is '2' */
....
end if
Rather than
If Flag = '1'
do
.....
elseif Flag = '2'
....
Else
Error in Program
...
end if
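To make the second pattern concrete, here is a minimal Python sketch of the same idea. The function name `handle_flag` and the returned strings are made up for illustration; the only point is that an unexpected value raises an error instead of being silently treated as '2'.

```python
def handle_flag(flag: str) -> str:
    """Dispatch on a flag value and reject anything unexpected."""
    if flag == "1":
        return "handled case 1"
    elif flag == "2":
        return "handled case 2"
    else:
        # Fail loudly rather than assuming every other value means '2'.
        raise ValueError(f"unexpected flag value: {flag!r}")
```

Calling `handle_flag("3")` raises `ValueError`, so a bad value is caught at the point where it first shows up rather than somewhere downstream.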
March 31, 2011 at 5:36 am
I think Dinosaur comics summed up failure best with
Failure: it's just success rounded down
which is a healthy way to look at it.
March 31, 2011 at 6:15 am
"Hope for the best, plan for the worst..."
Generally good advice if you are talking about sizable companies, data centers and the like, but in one's personal life (if you are a techie) this can get out of hand.
I have always had a network in my home going as far back as the early 80's, and I have always gone to great lengths backing up lots of data. Now, some 30 years later, I have multiple external drives and two very old machines that I use solely for backing things up. Problem is, I am still (to this day) backing up some stuff that must be close to 30 years old and though I constantly promise myself that someday I will review what I am backing up, I never seem to get the energy to go through that task - and instead, I just buy yet another big external drive and keep backing up.
Why do I still save program code from languages now dead? Why do I continue to hold onto Windows fonts that date back (I think) to Windows 3.11? I have close to 20,000 digital images and I still back them up, seemingly sure that one day I will go through them and actually throw away the pictures I inadvertently took of my own feet.
Sure, hope for the best, plan for the worst... But sometimes the worst is self-inflicted, and while I have spent decades ensuring that my "main machines" are clean and up to date, I now have more space occupied in backups than I do in actively used data.
And why didn't I get rid of stuff as the years passed? Well, though I am fairly sure that DOS-based Ryan/McFarland COBOL is not going to make any comeback in these days - if it does (!!!), I will be ready with the programs I wrote for it back when I didn't have grey hair and wasn't a member in good standing of AARP...
March 31, 2011 at 6:55 am
Google has the right idea. Failure WILL happen and the issue is to design around it. Their approach is resilience.
Unfortunately this reality has not fully found its way into the IT world. Failure should be considered an unfortunate byproduct of system operation and handled accordingly.
...
-- FORTRAN manual for Xerox Computers --
March 31, 2011 at 7:40 am
Thanks for the good reminder, Steve, and the links.
This may be covered in the links (haven't read them yet), but I recall reading a similar topic where it mentioned that large companies such as Google invest in a huge amount of what they call commodity hardware -- decent but not top of the line equipment that is designed to be replaced when it fails, rather than a system where all the eggs are in one basket of a single high-end setup. Still requires a large budget, but it does make sense from a disaster mitigation point of view.
Also -- and I am by no means a math expert, so please excuse any errors -- my understanding of the failure rates of systems is that the chance of a failure on any given day for modern equipment may be quite small, but the chance of failure over a longer time is almost certain. So it's only a matter of time before parts fail. I guess it's only a short step from that realization to the realization that if you don't really know when the part will fail, you always need to be ready for it to fail and have a backup plan to handle it.
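A rough sketch of that arithmetic, assuming each day is independent and the daily failure probability stays constant (real hardware does not behave quite this neatly). The 0.1% daily figure is purely illustrative, not a measured rate.

```python
def chance_of_failure(daily_probability: float, days: int) -> float:
    """Probability of at least one failure over `days` independent days."""
    if not 0.0 <= daily_probability <= 1.0:
        raise ValueError("daily_probability must be between 0 and 1")
    if days < 0:
        raise ValueError("days must not be negative")
    # The chance of surviving every single day is (1 - p) ** days;
    # the chance of at least one failure is whatever is left over.
    return 1.0 - (1.0 - daily_probability) ** days


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(years, "year(s):", round(chance_of_failure(0.001, 365 * years), 3))
```

With the assumed 0.1% daily chance, the result is about 0.31 after one year and about 0.97 after ten, which matches the intuition that a small daily risk becomes near certainty over a long enough period.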
Just my two cents....
- webrunner
-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
March 31, 2011 at 7:59 am
For those with RAID, how many actually run consistency checks on a regular basis, to detect single drive corruption?
For single drives, how often is an online disk check (much less an offline disk check) done?
For data, how many people actually generate checksum data (if you use Zip, raise your hand) to detect some corruption*, much less ECC data (par2 or ICE or DVDisaster, etc.) to detect and correct some corruption*?
*I almost never see real corruption. The typical "I can't explain this issue [optional: but a reinstall fixed it]; some file must have gone corrupt" case that I actually check out shows no corruption at all; more commonly the cause is a configuration change of some stripe.
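For anyone who wants to try the checksum part, here is a minimal Python sketch using the standard `hashlib` module. The helper names `sha256_of` and `is_unchanged` are made up for this example, and it only detects a change; it cannot repair one the way par2-style ECC data can.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large backup files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with one recorded earlier."""
    return sha256_of(path) == recorded_digest
```

The idea is to record the digest when the backup is written, store it somewhere other than the drive being checked, and rerun `is_unchanged` on a schedule.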
I would also note that planning for regular failure is both very expensive and very limiting; most products don't support truly transparent high availability with 0% downtime at all. Big mainframe hardware (and perhaps midrange systems) does; commodity x86 based hardware and software typically doesn't, with a few exceptions.
March 31, 2011 at 8:08 am
Disaster recovery is actually easier in the modern IT world than it was in times past. One hundred years ago, if the county courthouse burnt to the ground, many of the archived documents would be lost forever with no backup copy.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
March 31, 2011 at 8:17 am
Left out the 3rd rule...
Hope for the best
Plan/Prepare for the worst
Always, always, always have a plan B!!!
March 31, 2011 at 8:34 am
Nadrek (3/31/2011)
For those with RAID, how many actually run consistency checks on a regular basis, to detect single drive corruption?...
I would also note that planning for regular failure is both very expensive and very limiting; most products don't support truly transparent high availability with 0% downtime at all. Big mainframe hardware (and perhaps midrange systems) does; commodity x86 based hardware and software typically doesn't, with a few exceptions.
0% downtime is extremely expensive. One needs to be realistic. Most apps can tolerate occasional downtimes of varying degrees, and it's a lot less expensive to evaluate those needs realistically.
Interestingly, we used to have a clustered RAID SQL2000 installation. Every system failure we had was in the RAID control system which meant that the clustering did us no good whatsoever in the downtime area. (Fortunately the RAIDs did not lose data during those failures). We have a mirrored system now on completely separate hardware.
...
-- FORTRAN manual for Xerox Computers --
March 31, 2011 at 9:51 am
Eric M Russell (3/31/2011)
Disaster recovery is actually easier in the modern IT world than it was in times past. One hundred years ago, if the county courthouse burnt to the ground, many of the archived documents would be lost forever with no backup copy.
It might be easier, as you say, but you would be surprised at how many shops do not plan for it and do not even back up their databases on a regular basis. It would astound you. I have gone into shops in the past that had not backed up their system databases in over a year! When I asked why, the response was "Well, we didn't need to, everything works just fine, and if it isn't broke we don't fix it..." Absolutely incredible that there are bozos out there in IT like this, but there are. Just because DR has gotten easier doesn't mean people are doing it.
"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"
March 31, 2011 at 10:09 am
If Joe's Bike Shop loses all their data without a backup, then that's a personal tragedy for their business, but it's not really a community-wide or region-wide disaster. A government office losing all your tax records, or a corporation permanently losing all your mortgage paperwork, is practically unheard of.
Well... the mortgage company may sit on your escrow account refund for weeks or months claiming they "misplaced the paperwork", but they can find the records at any point if they really want to. That's not an information technology issue.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
March 31, 2011 at 3:17 pm
This seems like rehashing old news. I have vague memories from a database class I took in the 80s where we talked about an airline's data center (AA or United or somebody big like that). Like I said, my memory is vague on this, but I believe they talked about Mean Time Between Failure of the disk farm on the order of 5 minutes (translation: there will be a disk failure about every 5 minutes).
The technology has changed, but the problems are still there and have to be dealt with.
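A back-of-the-envelope sketch of why a large disk farm sees failures so often, assuming drives fail independently at a constant rate, so the farm fails N times as often as a single drive. The drive MTBF and drive count below are hypothetical round numbers, not figures from that class.

```python
def farm_mtbf_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """Approximate mean time between failures across a whole disk farm."""
    if drive_mtbf_hours <= 0 or drive_count <= 0:
        raise ValueError("both arguments must be positive")
    # With independent drives at a constant failure rate, the failure
    # rates add up, so the farm-wide MTBF shrinks by the drive count.
    return drive_mtbf_hours / drive_count


if __name__ == "__main__":
    # Hypothetical: 1,000 drives, each rated at 30,000 hours MTBF.
    print(farm_mtbf_hours(30_000, 1_000))  # 30.0 hours between failures
```

Under those assumed numbers, a drive that looks good for more than three years on paper still means somebody is swapping a disk roughly every 30 hours.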
March 31, 2011 at 3:25 pm
Tom Bakerman (3/31/2011)
This seems like rehashing old news. I have vague memories from a database class I took in the 80s where we talked about an airline's data center (AA or United or somebody big like that). Like I said, my memory is vague on this, but I believe they talked about Mean Time Between Failure of the disk farm on the order of 5 minutes (translation: there will be a disk failure about every 5 minutes). The technology has changed, but the problems are still there and have to be dealt with.
I'm now picturing some poor kid wandering the halls of the data center with a giant shopping cart of new drives in front of him, just slowly meandering down the aisles looking for red lights with this zombie-fied look on his face at 3 in the morning.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.