Check The Simple Things

  • Saturday morning was my scheduled move of the SQLServerCentral.com servers. So I wake up early and head over to the co-location center for Viawest. I get there and a couple other people are working, but it's pretty quiet. So I go open the cabinet, front and back, and start shutting down the non-critical servers. Test server, user group server, and the mail server. While they're shutting down, I go check out the new cabinet. I'd gotten an email on Friday with the location and combination, but hadn't checked it. It opens easily enough, but no shelf.

    Grumble, grumble. There was supposed to be a shelf installed to make things easier. So I call support, but since it's 8:20 in the morning, they're not in yet and I can't find my customer code to get routed to the 24/7 center. So I decide to start moving a few things: a plastic bucket that holds spare cables, the non-critical servers, etc. I get them staged by the new cabinet and install the 1/2 shelf I'd brought from home.

    So I get that in along with a load balancer that's been sitting in my basement. I'm walking through the cabinet to get from front to back because it's empty other than a shelf up high for the monitor and because the other two guys are in the way. On one of my trips, I'm coming from back to front, having installed the load balancer to the rear. I duck under it and slip through the cabinet.

    Actually my legs go through pretty well, but the top of my head bounces nicely off the 1/2 shelf. When I regain my senses, I'm leaning against the next row of cabinets, holding the top of my head and wondering why bells are ringing in the data center. One good lesson for those of you working in the data center: walk around the racks.

    I finally find an employee who brings me a shelf and 4 screws. I set them down and move my two rack mounted servers. Actually they're all rack servers, but I only have 2 sets of rails. I get them installed and then measure for the new shelf.

    And I realize it doesn't fit. The screws are too large for the holes. For not checking when I got the shelf, I get to spend 20 minutes on hold until I get through to support and request assistance. I get 4 big holes drilled at my spot to support the shelf. I guess I won't be reconfiguring anytime soon.

    With that shelf in, I can move most of the other non-critical servers. I get those moved and uncover one of my two shelves in the original cabinet. Move that, move a server, move a shelf, move the last server and plug everything back in.

    Now I've moved things a few times, so I know how confusing cabling can get. I'd left everything plugged into the switch and labeled the ends of the cables so I knew which servers they went to. It made it easy to plug everything back in. With all the lights on, I logged into each server and checked connectivity. I could ping each of the servers from the other servers and the firewall. So I fire up a browser and head to Google and ...

    Nothing.

    Can't seem to get out. The PIX firewall we use is lit up on all the lights, but I'm no expert on what to do here. Fortunately my friend Jordan shows up. We were using some fancy Cisco managed switch that he wanted to get back for his CCIE studying and he brought me 2 replacement 3COMs gathering dust at his place. It's good to be connected to geeks.

    So we replace the switch, check all the pings and he fires up his fancy DELL Linux laptop and plugs it into the Viawest port to sniff traffic. We don't see anything and assume the port didn't get moved correctly from the other cabinet. We call support and hang around a bit while we're on hold. Finally we get someone and I give him the phone since I'm not the networking expert. They talk about things and try a few tests before he says, let's move the cable to the switch and check continuity.

    It doesn't work and we realize it's a bad cable. Replace it and it works. Once again the simplest thing, the one we didn't check first, delayed our move by about 30 minutes. Since I was using the same cable, which was never unplugged from the PIX, I didn't think of it, but I should have started there rather than assuming it hadn't been damaged somehow in the move. But everything's moved and I made a little money to show for it.

    Along with an egg on the top of my head.

    Steve Jones
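
A minimal sketch of the "start with the simple things" order of checks, for anyone who wants it written down: probe each hop from the inside out and stop at the first one that fails. This is only an illustration, not what was run that Saturday. The addresses, ports, and the names `tcp_reachable`, `first_failure`, and `example_checks` are all assumptions made up for the example.

```python
import socket
from typing import Callable, List, Optional, Tuple

# A check is a label plus a zero-argument probe that returns True on success.
Check = Tuple[str, Callable[[], bool]]


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run probes in order; return the label of the first one that fails."""
    for label, probe in checks:
        if not probe():
            return label
    return None


def example_checks() -> List[Check]:
    # Hypothetical addresses and ports (192.0.2.0/24 is a documentation range);
    # substitute whatever actually sits at each hop in your cabinet.
    return [
        ("firewall inside interface", lambda: tcp_reachable("192.0.2.1", 22)),
        ("provider gateway", lambda: tcp_reachable("192.0.2.254", 22)),
        ("outside web site", lambda: tcp_reachable("www.google.com", 80)),
    ]
```

Calling `first_failure(example_checks())` returns the label of the first hop that does not answer, or `None` if every hop does. The cable or port just in front of that hop is a reasonable first suspect, before anyone spends time on hold with support.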

  • It's always the last thing that is checked, at least I stop there anyway!


    Kindest Regards,

    The art of doing mathematics consists in finding that special case which contains all the germs of generality.

  • Done it, lived it, and I have the T-shirt as well as several scars to prove it.

    We once went through about two months of calls and escalations to a vendor's top tier support because users kept getting timeouts from the server, only to discover that the server NIC was slowly dying and had been the whole time.  Fifteen minutes after replacing the NIC, the problem was gone forever.  And I was pretty red-faced for some time.

    Another weekend I spent way more time than I should have trying to get a laptop to attach to the network for an upgrade, only to discover (eventually) that it worked fine from my partner's cube...just not mine.  Seems that someone did a cleanup of the network closet the week before and just decided to unpatch any port that did not have a device attached.  Which meant that the network drop I had set aside for laptop upgrades was no longer even patched in.

    Last, but certainly not least, during a recent upgrade I smugly thought I had all of my checklists ready, had walked through the upgrade and the vendor documentation several times, and I was completely and thoroughly prepared.  What I failed to even give a thought to beforehand, however, was that the PXE server where all of our user/client device images are stored was almost out of disk space.  This is a heck of a thing to find out the morning AFTER you've already upgraded all of your servers.  Fortunately, I was able to find a couple of outdated images and purge several gigs of drive space, but if not I probably would have had to roll back the app install on 3 servers (not to mention performing the DB restores, rescheduling the downtime, and severely irritating several members of the management staff).

    Yep, the little things get you every time even if you think that you've been diligent.  But the really important things in these situations are to be patient, avoid panicking, and think things through.  If you're resourceful, you can usually find a way to hack your way around even the ugliest problems. 

    Just my (hard earned) 2 cents for what they're worth. 

    My hovercraft is full of eels.
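
The disk-space surprise above is the kind of thing a pre-upgrade checklist can test mechanically. Here is a rough sketch, assuming Python is available on the box being checked; the function names and any paths or sizes you feed it are hypothetical and not taken from the post.

```python
import shutil
from typing import Dict, List


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def short_on_space(requirements: Dict[str, int]) -> List[str]:
    """Return the paths whose filesystems fall short of their requirement."""
    return [p for p, need in requirements.items() if not has_free_space(p, need)]
```

For example, `short_on_space({"/srv/images": 20 * 1024**3})` would return `["/srv/images"]` if that (made-up) image store had less than 20 GiB free, and an empty list otherwise. Running something like this the day before, rather than the morning after, is the whole point.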

  • Entry needs a header: "I wasted a lot of time today using a bad cable. If you feel your time is well spent by reading about my frustration, continue.  If your time is valuable, go on to something else."

  • Firstly I must say that I had a chuckle... But it was still a little harsh!

    I quite enjoy the non-technical editorials... after all, they're editorials, not white papers!

     

    If I wanted to be constantly consumed by technical documentation I could always go and read the RFC for ... say SGML... or WebDAV... or (insert some overly technical specification here!)

    (hope the noggin's a lot better now!)


    Gavin Baumanis

    Smith and Wesson. The original point and click device.

  • Absolutely superb, Steve - once again you have us in stitches over what should have been a simple move.

    We all have hazardous objects in the server room - yours is the lack of headroom.

    In mine - some well wisher sent me some plastic shelves - which are so lightweight they tend to sway with the slightest movement - sending everything from the top shelf - assigned for hard copy manuals (but usually piled high with girlie magazines!) onto the unsuspecting head of anyone nearby.

     

    Has anyone else got a hazardous item in their server room?  (DBA's excepted of course)

  • I'm still laughing at "DBA's excepted".  Usually in our case though it's a vendor or a CNE that does the most damage.

    And while we're on the subject of stitches I have the following to relate: 

    You need to exercise a lot of caution when working with raised floors.  If you remove a tile, you need to either put up signs or blockades, or stay in the area until you're done to make sure that no one falls.

    Recently some genius started turning the lights out during the evenings in our 24/7 machine room to save energy.  Someone else was working on some cabling in the afternoon one day and took up several tiles without replacing them.  Later on that evening, one of our analysts went in and while hunting around for the light switch, took a bad fall where a tile should have been.  She suffered a gash to her shin that required nearly 50 stitches.

    Fortunately for our management, she's not a litigious person and her injuries weren't worse, or we'd probably have a whale of a lawsuit on our hands.   

     

     

    My hovercraft is full of eels.
