It should come as no surprise that the first topic I am covering this week is communication, because the first thing I think anyone should do is communicate that they are troubleshooting an issue. This post will cover why we should communicate, then dig into how to put together an initial alert. The rest of the post covers how to communicate updates and the resolution.
First and foremost, communication prevents your management from being caught by surprise when the VP of Sales calls to ask when they will be able to place orders again.
Communication also prevents duplicated effort. Many times when a system is down, trouble reports come in from everywhere and go to everyone, resulting in a situation where there is no clear problem definition or problem owner. Communicating who owns the problem allows information to flow to a central place, where the problem can be properly defined.
Finally, communication allows people to speak up about a recent change. If another team made a change recently, they may be able to identify aspects of the issue you are working on that relate to what they did. This is not to say someone was doing something sneaky, although that sometimes happens. Usually it means something was done and communicated to all the right people, but was not fully understood by the people it was communicated to, or was lost in turnover between support rotations. Assume the best here, because you need people to speak up sooner rather than later. Treating them badly when they do will only cause trouble in the long run.
So what is the best way to communicate that there is a system issue? It helps to have an email group that includes all IT on-call pagers, management, and other key people. If you do not have one, I suggest setting one up solely for communicating large issues. It is important not to spam this list; treat it like pulling a fire alarm. It should only be used to communicate system issues from discovery through resolution, with updates at regular intervals or at large milestones along the way.
It is also very important that the distribution list for these emails is IT only. People outside of IT may not know the intricacies of your particular implementation, leading to the possibility of spreading misinformation. They are not doing this on purpose; they think they understand, and they like being involved in something exciting, helping get the word out, doing their part. Let your management craft the organizational communications; they will have to answer for what is said in them.
When sending alert emails, keep the subject line general. If you give too much information in the subject line, people can assume it is not their issue and move on. We want to take advantage of that little pit in the stomach everyone gets when something breaks to get them up to speed on the issue.
Finally, the body of the email should provide a broad overview of the issue, including which systems are impacted, any major symptoms including error messages, the number of people impacted, and any location-specific information. Stick to the essentials here: the body must be short enough to be fully read while long enough to include everything important.
The body of your email should also list any resources you need. If you need people, say so directly: “I will be contacting the primary on-call from network engineering,” and so on, to get their attention. Never use a mass communication to say, “I cannot find Mike from the server team. If anyone sees him please tell him I need his help on this issue.” It makes both of you look bad and makes the person on the receiving end less likely to help.
Finally, only state the facts when communicating an issue, and never assign blame. This is such an important part of the communication. It is important to only state what you know; what you think is not important, and who is to blame is even less important. In the end, the person that fixes the problem will be asked to explain what went wrong. Chances are they either made the change that led to the issue or know who did. If there was a hardware failure, they will be able to explain it in depth as well. If you have any doubt about what you are communicating, check with someone who knows more about that particular area.
Once the first alert is sent you generally have 30 minutes to either fix the issue or convene a war room. Thirty minutes is a loose rule that I use because if the fix is easy, you will almost always identify the issue, develop a plan, and fix it within that time. If 30 minutes goes by and you are still trying to figure out what is wrong, it is time to ask for help. Either way, a follow-up alert should go out at the 30-minute mark to update everyone on the issue. The update should follow the same rules as the initial alert, although the subject line should be prefixed with “UPDATE: ”. Updates should continue at regular intervals until the issue is resolved.
At some point all issues get resolved, and that also needs to be communicated. The resolution should reuse the subject line of the original alert, prefixed with “RESOLVED: ”. Due to the potential for wide distribution, the alert should never mention anyone by name. The alert should contain a factual description of what the issue was and what was done to solve it, because facts are just facts and cannot convey opinions. Conclusions, on the other hand, can. Put the facts out there and let people draw their own conclusions. All that really matters is that the people in a position to prevent such a thing in the future recognize what happened and take action to prevent or mitigate the impact next time.
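The alert lifecycle above (a general initial subject, a structured factual body, then the same subject reused with “UPDATE: ” and “RESOLVED: ” prefixes) can be sketched in a small helper. This is a minimal illustration only; the function names and fields are my own, not part of any standard tooling:

```python
from datetime import datetime, timezone


def alert_subject(summary, stage="initial"):
    """Build a subject line for an issue alert.

    stage is "initial", "update", or "resolved"; updates and the
    resolution reuse the original subject with the agreed prefix
    so readers can tie the thread together at a glance.
    """
    prefixes = {"initial": "", "update": "UPDATE: ", "resolved": "RESOLVED: "}
    return prefixes[stage] + summary


def alert_body(systems, symptoms, impact, resources):
    """Assemble a short, factual body: impacted systems, major
    symptoms, scope of impact, and the resources being engaged.
    Facts only -- no speculation, no names, no blame."""
    lines = [
        "Systems impacted: " + ", ".join(systems),
        "Symptoms: " + "; ".join(symptoms),
        "Impact: " + impact,
        "Resources: " + "; ".join(resources),
        "Sent: " + datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC"),
    ]
    return "\n".join(lines)


# Initial alert, a 30-minute update, and the eventual resolution
# all share one summary line (a hypothetical example):
summary = "System issue - order entry"
initial = alert_subject(summary)
update = alert_subject(summary, stage="update")
resolved = alert_subject(summary, stage="resolved")
```

The point of keeping the prefixes in one place is consistency: every update and the resolution are recognizably the same incident, which is exactly what a reader scanning a busy inbox needs.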
I hope you can see why I think properly communicating issues is important. I have outlined a system that has worked well for me. I strongly believe that anyone handling communications in this manner will be recognized for their professionalism and leadership.
What works for you? Please feel free to share it in the comments below.