Recently I was participating in a round of storage rearranging with a client (aka the storage dance). It consisted of temporarily moving the log files to a different drive and verifying the original drive was empty, then handing it over to the network team to remove that drive, add space to the data drive, and add back a new log drive, after which I moved the logs back to the log drive.
Not hard work. Ahead of time I reduced the size of the logs so the copy over and back would take less time, reviewed my checklist, and made sure we had a backup on yet another drive, just in case. Simple though the change was, it was on a server that had to be working by 1 am, so I had about a 3 1/2 hour cushion, but the expected outage was an hour.
The work went smoothly. Moved the logs, restarted the service (I’m conservative), the disk change went fine, moved it all back, and then rebooted (yes, I’m conservative – I want to know I can do a cold start). Started checking while I resized the logs back up to a decent size, and found one critical service had not started. Hit start; it failed. Fun. Looked in the event log and found a fairly unhelpful error about FTP not being started. I knew the service moved files, so I suspected an internal error rather than a problem with another service, and started looking for clues.
Didn’t take long to get to the config file on disk, and at the end, way way down, was a setting called “running”, set to No. Thanks, right? But a comment in the file said it was normally only changed by the application. At that point it made sense: the service had thrown an error due to the database being detached, and rather than continue to poll and throw errors, it had disabled itself via the configuration file. Not a bad idea, though in hindsight I’d call it not fully baked. A line in the event log explaining the action would have made this easy to diagnose, as would an email saying the same. I set running to Yes, hit start on the service, success!
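The self-disable pattern itself is reasonable; the gap was the silent flag flip. A minimal sketch of what a better-baked version might look like, in Python with `configparser` – the config path, section, and key names here are hypothetical, not from the actual product:

```python
# Hypothetical sketch of a service disabling itself via its config file,
# with the missing piece added: record WHY before flipping the flag,
# so the next person troubleshooting has a trail to follow.
import configparser
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("file-mover")  # illustrative service name

def disable_service(config_path: str, reason: str) -> None:
    """Set running = No in the config, logging the reason first."""
    # This log line (or an event-log/email equivalent) is what would
    # have made the outage in the story easy to diagnose.
    log.error("Disabling service via config file: %s", reason)
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    if not cfg.has_section("service"):
        cfg.add_section("service")
    cfg.set("service", "running", "No")
    with open(config_path, "w") as f:
        cfg.write(f)

# Example: on a fatal, non-transient error (e.g. database detached),
# disable rather than retry-and-fail forever:
# disable_service("C:/app/service.ini", "database detached; halting polling")
```

The design choice worth copying is that every automatic state change the application makes to its own configuration gets a matching, explicit log entry.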
In hindsight I goofed by stopping the service rather than disabling it. Without that error we would have been done in 45 minutes; instead it took another 15 minutes of checking and looking to figure things out. I don’t know that I should have anticipated this particular problem; I would just expect the service to fail and quit, or fail and keep failing.
For all that, it wasn’t a stressful event for me. I had changed very few things and knew what they were, I had a development machine whose configuration I could check if needed, and I had the luxury of time – time to figure it out, time to call support if needed.
One more reminder that even the simplest changes can have unexpected consequences.