Whoops

  • Whoops

    When I first started working in IT, I was a Novell NetWare network admin. I worked in a nuclear power plant, running the network for the administrative staff there. We tried to keep things up 24x7, taking only one break a quarter: midnight to five a.m. Fun.

    We had one server that did something I can no longer remember, but we wanted a quick outage one day to do some preventative maintenance on it. My boss knew it wouldn't survive until our quarterly outage, and we figured it would fail at a bad time, given Mr. Murphy's presence in our lives. He really tried to convince the station manager, but we didn't get the outage. So one Friday we left early and went to get an early dinner. My boss told me we had to do one more thing at the station, so we went back just as the shifts were changing and it was one of the slow times.

    And he told me to turn off the server!

    I couldn't believe it and hesitated, but he said it would be OK. I did it, we started the maintenance, and his pager erupted on the desk. He pulled out his cell phone, called back, and was informed that a server was down: the same server I was working on. He said we'd take a look and hung up. We finished up and called back to let people know things were OK.

    Later he told me that it's easier to ask forgiveness than permission, and this was one of those times. I was reminded of that story by this blog post, which describes a similar situation. The content of that post is actually worthy of an editorial of its own.

    Years later I had a server that was constantly failing, like page-me-in-the-middle-of-the-night failing on a regular basis. We needed to replace some equipment, but despite the failures, the boss wanted it running as much as possible and didn't want us replacing anything.

    So when a fire alarm went off in the building, I saw my chance. I walked around, checking all sides of the building for smoke (we were on the roof), then checked the stairwells. As people streamed outside, I decided to stay, along with the admin working for me. We turned off the server, started replacing equipment, and then began copying data to the new drives. My phone rang.

    It was the boss, down in the parking lot. Where was I? He couldn't find me, some application had gone down for a client, and as soon as the staff were back in the building, he needed it up so they could help the client. I said I'd get right on it.

    And shortly after they got back to the office, we had the server running again 🙂

    PS - Support Katie and her family: Purchase a raffle ticket today for a chance to win some great prizes.

  • While I sympathise with this approach, I think we are setting a bad example for future generations of IT professionals.

    Let me preface the rest of this email by saying that I have done this too. And once or twice it has come back to bite me somewhere painful! A brand new disk added to an array that goes faulty within minutes. Incorrect parts in the box. Something else failing at exactly the wrong moment. It happens.

    The relationship between the "Business" and "IT" has been strained for many years, and this kind of behaviour only reinforces the claim that IT professionals are arrogant loose cannons, incapable of taking direction from the business and determined to do what they want regardless of what the business requires.

    The only way to get the "Business" to see IT as an enabler rather than an obstacle is to show them the risks associated with doing things the way they suggest. It's a longer-term approach, but it pays off in the end.

    In the article you linked, I would suggest that a less "aggressive" approach would have been to analyse what email gets used for and show the business exactly how they would be affected by a 72-hour outage. In this particular case, I have no sympathy for Wilbur. Any IT professional who doesn't adequately warn the business that email (and SMS messages) are an inherently unreliable form of communication (they may not arrive, and they may arrive late) should be taken out and have these things explained. Repeatedly. Using large, heavy objects!

    In short, my thoughts are that this highlights some of the worst habits of IT. We need to break these habits and listen to what we are told, not what we want to hear!

  • Steve,

    I salute your bravery. I would never have dared to do such a thing. Probably we're in different fields.

    My Murphy's law: even the easiest thing can go wrong! It's better to go wrong planned than unplanned. Part of it is me being afraid of losing my job, but I've also considered that my actions might have serious consequences. Sometimes, even when I was doing a planned outage for DB maintenance (not really a major one, it was just reindexing a VLDB, but it involved shutting down 16 servers and bringing them up again), I felt tremendous pressure doing so. Probably because the system is accessed by so many users (thousands of users would thank me for giving them break time if I screwed up doing simple work!).

    I'd rather spend time arguing with the manager, providing them with evidence, etc., than do something behind their back. Of course, at times I can make a decision and inform my manager when things "were" already done, but I know my boundaries. If they're unwilling to approve an outage time, then a system breakdown will teach them to be kinder to the servers.

    But so far my managers have been very understanding and supportive, so it has never had to reach this point. I think it's bad if things reach the point where techies have to make a (corporate) decision that they're not comfortable with. It puts tremendous pressure on them, beyond their scope of responsibility. I wouldn't risk that for my precious sleep.

    Simon

    Simon Liew
    Microsoft Certified Master: SQL Server 2008

  • I agree... bad message to promote and, depending on what the server is doing, if you down it, you could cost someone their life (seriously... 911 control server, traffic light sync, ambulance dispatch, etc., etc.). Even if it's not that serious, you'd better be very right and very prepared, because if the server doesn't come back up, you've got a heap of explaining to do as to why you should be allowed to keep your job.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just one word ... WOW!


    * Noel

  • FYI, that plant wouldn't have been Three Mile Island, would it? I have been a support person for way too many years. Going on 20... whew! In all those years, I took a lot of risks fixing problems proactively, as you were encouraging here. Not once did I get any sort of recognition for preventing a failure that, of course, never happened because I fixed it. However, I nearly got my behind tossed out the door a few times because what I thought was an EASY fix turned out to cause headaches for many people for hours to come.

    Take it from my personal scars. Present your case. Go above your manager if you feel strongly about it... Document your case as well as it can be documented, but under no circumstances do you do something without PROOF that you were given the go-ahead to make a change to a system. CYA!!!! The DBA is always the fall guy. Don't make that easy.

  • Steve said "administrative", not "mission critical". You just would not turn off the controls with even a fraction of an inch of rod sticking out of the top of the SRMs.

    Of course everybody thinks their app is actually "mission critical".  It's critical to their mission.

    I've done stuff like this too. We had a loose wire strand in an audio connection that caused loud crashing sounds every time you walked near the transmitter. Early one morning before air time, I yanked the main power and opened up the master panel. After cleaning up the audio wiring, I put the transmitter back together and powered up. The frequency control crystal was very slow coming back to temperature, so we were a tad off frequency by air time. This might seem trivial, but there can be jail time associated with a stunt like this. Luckily we were within the regulations, but just barely.

    I work with food producers that do refrigerated shipments. Turn off my shipping server and I have trucks full of meat with nowhere to go. Same thing with the pallet scanner. No scanning means meat in boxes at 55°F that should be at 35°F, gaining temperature minute by minute.

    I have impromptu outages during software upgrades, but they last seconds and I make them happen without warning. Auto-restart is built into the apps because, as the bumper sticker says, stuff happens.

    ATB
    Charles Kincaid

  • The funny part is that this is exactly the type of thing I've thought of doing many times at my work.

    The owner of my company refuses to allow me to move our corporate database off my personal work machine (100+ GB worth of different data, etc.), or to purchase me a test server.

    So I had been thinking about what it would be like if I simply took the database offline for a couple of hours on a Thursday, or left it off for an entire afternoon (and blamed it on bad code that I had written or something), just to prove the point that a) the corporate databases shouldn't be on a work machine, and b) a test environment is a very crucial piece of equipment.

    Not to mention all the slowdowns I get trying to do my work when other people are accessing large volumes of data from the DB.


  • As an FYI, in the nuke plants the critical controls, meaning the reactor controls, only used "certified" ancient 1960s technology from Westinghouse and others, with actual vacuum tubes in places, and we had a 1 MB disk drive that was literally 18" or so across. You could see it spin. All this stuff had been highly tested and was redundant, and it was used because everyone knew it worked (along with manual backups).

    While it seemed the PC stuff worked better, it wasn't necessarily reliable (you can't have blue screens or anything else like that), and it was mainly used to make notes, take readings, and handle other non-critical stuff. Important, but not critical.

    I appreciate the comments, and I agree that you have to argue the point and really try to drive things home, but what about when that doesn't work? What if you are sure something will break, or you can see it degrading? Is one day of arguing with the boss enough? One month? Six months? At what point do you take things into your own hands? I'd submit a day is way too little, but I'm not sure about the others.

    Part of this is risk (which needs its own editorial) and part of it is prudence. If I know, from past history, that I've got between one and four weeks before a server fails and needs a reboot, and I've got end-of-quarter processing in four weeks, what do I do if I can't get an outage? I can let the business decide it's worth it, and I have done that. Or I can look at the whole picture and decide that my boss and his boss are being short-sighted and can't make the decision because they don't understand the technical problems as well as their impact. I've done that too. It's a balancing act, and I felt it was worth the risk to my job at the time.

    I've also felt it wasn't worth it at other times and let things fail.

  • I think it's awesome you would do this, Steve! I work in software development, so I've encountered the need for this in its own way. I've learned it is sometimes better to just do what you think is right. It may sound arrogant or smug to some, but as IT professionals, sometimes we really do know better. For software, I've learned that if you don't mention some changes you've made to a system, no one will even notice or care about the change, but as soon as you mention wanting to make it, everyone gives their opinion and nothing gets done.

    Ha! I have done the same thing. Sometimes you just can't fix it after hours because you need third-party support that is only available during "normal" hours. Better down for an hour than for a day or a week. There are no rewards for being proactive. I'm fortunate because here, the price one must pay for being reactive is greater than the price of being proactive.

    It's also better to have a controlled crash than an unplanned one. And for those who say they can't bring down a mission-critical server: you are already living on the edge of a huge problem by not having a redundant or backup server in place. It's a good time to practice your DR plan.

    On a side note, I had a friend who worked at a nuclear plant as a programmer in another country. The wife of a coworker, who had become a friend, mysteriously died from plutonium poisoning, the same kind used at the plant. She did not work there. They eventually figured out why he never drove her car: they found it installed under her driver's seat. The husband went to jail. My friend would not say what happened at the plant after that, but his former country was known to have men in black arrive at night to eliminate problems by methods other than judicial ones.

    Talk about a lapse in security.

    My friend left the industry and the country after that and settled in the US working in IT. Brilliant mind. He suddenly disappeared in December 2004. I traced him as far as South America.

  • Terry:

    There is an old saying that if you don't need to reboot your PC at least four times a day, then you aren't doing anything significant. I'd download a copy of Steve Gibson's wizmo program, write a batch file that calls it with the reboot parameter, and set it as a scheduled task every 90 minutes. Oh, even better, write your own little program that calls the reboot at random. "I dunno, boss. It just started doing that. Of course it does a full disk check on an unexpected reboot. Is this going to bother anybody, or should we just live with it?"
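    For what it's worth, a minimal sketch of that batch-file-plus-scheduled-task idea might look like the following. The C:\Tools folder, the reboot.cmd file name, and the PeriodicReboot task name are just placeholders I made up, and it assumes wizmo.exe has already been downloaded there:

        @echo off
        REM reboot.cmd - calls Steve Gibson's wizmo with the reboot parameter to restart the machine
        C:\Tools\wizmo.exe reboot

    and register it once from an elevated prompt so it fires every 90 minutes:

        schtasks /create /tn "PeriodicReboot" /tr "C:\Tools\reboot.cmd" /sc minute /mo 90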


    ATB
    Charles Kincaid

  • If your boss and his boss don't understand, then you aren't doing your job. It is your job to make them understand enough that they can make a decision. It is entirely possible that their decision is based on something that you don't know anything about.

    Having said that, get the "no" in writing. I make all requests by email so that I can show that I tried to do my job and was refused.

    And sometimes, especially in small businesses, you just have to get stuff done.  One way to do this is to not ask, but tell.  Don't say:  "Can I take the Mail Server down for 4 hours next week?"  Say:  "I will be taking the Mail Server down for maintenance on Tuesday from 8-12."  If no one objects, then go ahead.

    JimFive
