After seeing several cases in the past couple of months where I felt the basics of troubleshooting were violated, I figured it was time to do a post on that again. After all, some of those cases had me as a main culprit. So this is as much a reminder for me as for anyone else.
Know What the System Looks Like When All Is Well
I know I blogged about this recently when I talked about looking at packet traces, but this is really key if you have an existing system. If you know what a system looks like when it is healthy, then you tend to home in on the things that are different when it isn't, and that may help you figure out the problem faster. For a new system, this obviously isn't possible. But what you can do is spend time looking at existing systems and understanding what they do. For instance, if you're trying to troubleshoot a new Kerberos authentication/delegation scenario, having looked at an existing system and understood why it works and what its characteristics are may help when you're trying to find the problem on the new one. I can't recommend this enough. This is my #1 secret when it comes to troubleshooting, and it's the one I find a lot of administrators neglect the most.
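One way to make "know what healthy looks like" concrete is to capture a baseline while the system is well and diff against it when it isn't. Here's a minimal sketch of that idea; the check names and values are hypothetical stand-ins for whatever you'd actually record (SPNs, service accounts, delegation settings, and so on):

```python
from datetime import datetime, timezone

def snapshot(checks):
    """Run each named check and record its result with a timestamp."""
    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "results": {name: check() for name, check in checks.items()},
    }

def diff(baseline, current):
    """Return only the checks whose results differ from the healthy baseline."""
    return {
        name: (baseline["results"][name], value)
        for name, value in current["results"].items()
        if baseline["results"].get(name) != value
    }

# Hypothetical checks for a Kerberos-style scenario.
checks = {
    "service_running": lambda: True,
    "auth_mode": lambda: "kerberos",
}
healthy = snapshot(checks)

# Later, when the system misbehaves, only the differences need attention.
checks["auth_mode"] = lambda: "ntlm"   # simulated regression
broken = snapshot(checks)
print(diff(healthy, broken))           # {'auth_mode': ('kerberos', 'ntlm')}
```

The payoff is focus: instead of staring at everything, you start with the handful of things that changed relative to the known-good state.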
Keep Good Logs and Audit Trails
Probably 90-95% of the time you won't need them. But when you do and you don't have them, you will wish you did. We ran into an issue last week where the logs we had pinpointed the root cause, how it happened, and why it didn't affect some systems that we would have expected it to affect. They also told us what the root cause DID, so we could detect other systems the problem had hit that just hadn't shown up yet as production problems. These logs were invaluable. They gave us the information we needed quickly, they told us what we needed to fix, and they told us the time frame for when the problem happened. This is my #2 troubleshooting secret. I use the logs and whatever audit trails the systems I work on provide me. I try to keep them large enough that they go back a long enough period of time (at least a week or so) for me to track a problem to its original source.
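Much of the value of a good log is being able to slice it to the window when the problem started. A quick sketch of that kind of filtering, assuming a simple `YYYY-MM-DD HH:MM:SS`-prefixed format (the log lines themselves are made up for illustration):

```python
from datetime import datetime

def entries_between(lines, start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Yield log lines whose leading timestamp falls inside [start, end]."""
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], fmt)
        except ValueError:
            continue  # skip lines without a parseable leading timestamp
        if start <= stamp <= end:
            yield line

log = [
    "2024-03-01 09:15:02 service started",
    "2024-03-08 02:00:41 config changed by deploy job",
    "2024-03-09 11:30:05 first user-facing error",
]
# Narrow to the window just before the errors began.
window = list(entries_between(log, datetime(2024, 3, 7), datetime(2024, 3, 9)))
print(window)  # only the config change falls in that window
```

This is also why retention matters: if the log only goes back two days but the triggering change happened a week ago, no amount of filtering will find it.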
Never Assume You Can Exclude Anything, No Matter How Likely
I was burned on this one recently. I was told that something had been excluded while I was troubleshooting a major issue with a Citrix environment. The problem is, I assumed it had been excluded properly. In other words, I assumed the folks had done their due diligence and checked it out thoroughly before throwing it on the scrap pile. I was wrong. They hadn't. And I shouldn't have assumed, either. Because of this bad assumption, I probably worked 8-10 more hours than I would have had I said, "Show me the evidence leading you to exclude it." If there is no evidence, there is no exclusion. Period. I don't care how likely it is. You keep it on the list until you have enough evidence to sufficiently rule it out. This is tied for #3 on my list with the next one.
Always Start with What You Know Recently Changed
Most of the time when problems crop up, it is because of something you know changed. Start there. Those are your most likely culprits, so start collecting the evidence on them. If there is sufficient evidence to rule them out, then do so, but don't make the bad assumption that a recent change is an unlikely cause and go looking for the problem elsewhere. Rule out what changed first, then move on. Again, because of the bad assumption I made (and others made), something that had changed was ruled out and we were looking everywhere else.
Document Every Change You Make While Troubleshooting
This is a biggie, because if you make a change and it doesn't fix the problem, you likely want to roll it back. That way you ensure you're back to a state where you can identify the real issue, not the real issue exacerbated by changes you've made after the fact. Also, if you've got people rotating onto the problem, they need to come up to speed on what has already been looked at, what has been tried, and what changes now exist in the system that weren't present when the problem first manifested itself. On a recent problem, other folks had been working for up to 12 hours, but no one had really documented what changes had been made. So the first hour of my time was spent asking everyone what was different in the system from when the problem first popped up. Had I had a document, something as simple as a text file on a file share, that time would have been better spent. This is #5 on the list, not because it isn't important, but because you have to do the other things to know what to look for, which is an integral part of the troubleshooting process.
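That "text file on a file share" doesn't need to be fancy. Here's a sketch of the kind of one-line-per-change record I mean; the file name, user names, and changes are all hypothetical:

```python
from datetime import datetime, timezone
from pathlib import Path

def record_change(logfile, who, what, rollback):
    """Append one timestamped troubleshooting change to a shared text file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(f"{stamp} | {who} | CHANGE: {what} | ROLLBACK: {rollback}\n")

log = Path("troubleshooting_log.txt")   # hypothetical shared location
record_change(log, "bkelley", "raised linked server query timeout to 600s",
              "set timeout back to 30s")
record_change(log, "netops", "cleared ARP cache on core switch",
              "no rollback needed")
print(log.read_text())
```

Forcing a rollback note into every entry is the point: when a change doesn't fix the problem, the instructions for undoing it are already written down, and anyone rotating on can read the whole history in order.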
Never Stop Looking at Your Own Area
That is, unless you can rule everything out. In one of the problems I attacked recently, the issue was assumed to be the network. I didn't think so; I thought it was a connectivity component interfacing with SQL Server. Some folks assumed that because communication was slow between a client and a SQL Server (over a linked server connection), it was the network, and they wanted to hand off the problem. I know my networking guys. I like my networking guys. They're excellent troubleshooters, and when I work with them they don't respond with the old Air Force Tech Control line of, "The problem is at the distant end." They bulldog the problem right along with me until we find a solution. They never stop looking at their own area when I'm involved, and I give them the same courtesy.

Like I said, I didn't believe the issue was a network problem. So I asked one of the guys, a really smart CCNP, to give me the statistics on the interfaces between the two servers. Everything was clean. Step 1 of ruling out the network accomplished. On to step 2: we copied a large backup, larger than the result set that was taking forever, back and forth. Seconds. Seconds? Step 2 complete; the network was ruled out. Meanwhile, I was digging into wait stats on both sides of the connection, and the dominant wait type was OLEDB. I fired up other linked server connections, created on the fly, and pinpointed that one of the servers was having issues. The other one was just fine. And we continued to do the things that would rule out the network, like ensuring communication was on the same subnet to rule out routing, etc. Had I stopped and said, "It's not me," and kept pointing a finger at the network, we'd never have diagnosed the correct culprit. It was me, or at least my server, and specifically my component on the server, since we consider that under the umbrella of SQL Server support. This and the last basic rule I rank together at #6. They tend to go together.
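The "copy a large backup back and forth" step is really a crude throughput test: if the pipe can move far more data than the slow result set in far less time, the network is probably not your bottleneck. A loopback sketch of the same measurement follows; in a real test the listener would run on the remote host rather than 127.0.0.1, and the payload would be sized like your problem data:

```python
import socket
import threading
import time

PAYLOAD = b"x" * (8 * 1024 * 1024)  # 8 MB stand-in for the "large backup"

def serve(listener):
    """Accept one connection and drain everything the client sends."""
    conn, _ = listener.accept()
    with conn:
        while conn.recv(65536):
            pass

# Listen on an ephemeral loopback port (remote host in a real test).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=serve, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(server.getsockname())
start = time.monotonic()
client.sendall(PAYLOAD)
client.close()
elapsed = time.monotonic() - start
print(f"approx. {len(PAYLOAD) / elapsed / 1e6:.1f} MB/s")
```

The number is approximate (kernel buffering flatters it), but that's fine for this purpose: you're gathering evidence to rule something out, not benchmarking. The wait stats told the rest of the story on the SQL Server side.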
Never Be Afraid to Look Back Over Something That Has Been Looked at Previously
There's something to be said for endlessly going back to something and spinning your wheels. That something is: waste of time. But when you're stuck, when you're out of leads, it doesn't hurt to go back and look at what you've looked at before, especially with a different set of eyes. Perhaps something was missed. On that problem I joined 12 hours in and spent another 12 hours on myself, it turned out we had missed something. Of course, it was something that should never have been ruled out, but the fact of the matter is that one of the server administrators went back and looked at a system specifically for things we might have gone over before. And sure enough, he found the culprit rearing its ugly head. That which had been ruled out prematurely was found to be the root cause. But if he had dismissed going back and looking at something that had been covered before, we might still be troubleshooting a major outage (if we weren't already unemployed).