This editorial was originally published on Apr 28, 2015. It is being re-published as Steve is traveling.
Early in my career in technology, I worked as a system administrator on a Novell network. Periodically we'd have server crashes or application failures on the network, and while we understood that some of these were caused by transient errors, we often invoked a root cause analysis process when an issue repeated itself. I'm sure that the environment in which we worked, a nuclear power plant, contributed to the idea that we should always understand why a system failed.
I was reminded of this while reading Grant Fritchey's "Understand the True Source of Problems," in which he relates a few "best practices" that are likely folklore stemming from ignorance rather than actual good ideas. I've seen a few of these rules in various places over the years, often implemented because they appeared to work. However, we need to remember this:
Correlation != causation
Just because you perform some action and observe an effect doesn't mean that one caused the other. If that were the case, I'd never install new hardware or applications on my computer systems, as I've had crashes during various installations. However, I'll often back out a change, or retry it, and realize that I had a coincidental result rather than a related one. The same thing has occurred in more complex systems, where an action appears to cause an issue but in reality the two items are unrelated, or only loosely related.
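If you want to convince yourself how easily coincidence masquerades as causation, here's a minimal, purely illustrative sketch in Python (the day count, deployment rate, and crash rate are all invented for this example). It generates configuration changes and server crashes as completely independent random events, yet some crashes still land on change days, exactly the pattern an administrator might mistake for cause and effect:

    import random

    random.seed(42)

    DAYS = 365
    CHANGE_PROB = 0.10  # chance we change something on a given day (invented rate)
    CRASH_PROB = 0.05   # chance a server crashes, independent of any change

    changes = [random.random() < CHANGE_PROB for _ in range(DAYS)]
    crashes = [random.random() < CRASH_PROB for _ in range(DAYS)]

    # Count crashes that happen to fall on a day we also made a change.
    coincidences = sum(1 for ch, cr in zip(changes, crashes) if ch and cr)

    print(f"Days with a change: {sum(changes)}")
    print(f"Days with a crash:  {sum(crashes)}")
    print(f"Crashes on a change day (pure coincidence): {coincidences}")

Because the crashes here are generated without any reference to the changes, every overlap is coincidence by construction; in a real system, only a proper root cause analysis could tell you which overlaps actually matter.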
We often don't take the time to determine the root cause of many issues, which is disappointing. While it may not seem a worthwhile use of resources, I bet we'd often learn that actions we are taking (or code we've written) are actually the cause. If we learned from our mistakes and could avoid making the same ones again, we'd greatly improve the quality of our technology systems, with far fewer issues over the years.
Unfortunately, "good enough" is often good enough, even if it does result in a bit of downtime.