Introduction: The Problem
No matter our line of work or stage in life, we have all faced situations that required troubleshooting. We have all seen or performed "good" troubleshooting. We have probably also seen or performed "shotgun" troubleshooting. In technology, that looks a bit like this:
· Happy Clicking – no purpose, just clicking without order
· Doing the first thing your search engine tells you to do (in production, without understanding)
· An air of stressed panic adding to the confusion
· Throwing the toolbox at the problem – trying everything whether or not it makes sense
· If lucky, a solution you can’t explain, reproduce or understand
These same missteps, or analogues of them, can be seen in just about any industry and across all aspects of life. In this article, I hope to tackle the shotgun approach and lay out a pattern that applies to SQL and, really, to almost any issue.
A Pattern
Design patterns are used in technology to lay out a generic solution to apply to a type of problem. We can take that pattern and apply it to our specific situation with specific code. They are supposed to help cut down some design time and help give a palette of generic solutions to use.
If we can do this with some degree of success in software problems, why not do it with our soft-skill problems? At a recent user group meeting, I struck up a conversation with someone and went off on a tangent about how our problems are the same as an electrical problem or an ambulance call. (I am a volunteer firefighter and was an EMT-Intermediate, though I recently gave up that license for more family time.) If you can boil a problem down, start at the beginning and work through it methodically, you have the majority of the problem solved. The specific skills come with learning; it’s your troubleshooting ability that helps you apply those skills.
To demonstrate, I want to look at a fictitious ambulance call made up from pieces of the few hundred I have been on. We will go on that ambulance call, look at the steps taken by us, the crew, and then relate them to a pattern. Hopefully we’ll find parallels to the SQL "day job" in the process. The numbers in parentheses relate to the pattern described at the end.
Patient, Instance… Tomato/Tomahto
It’s 2AM; you are awakened by a high-pitched tone coming out of your radio. The dispatcher's voice echoes: "Respond for the chest pain patient, 62-year-old male with cardiac history. He has taken nitro with no relief of pain, difficulty breathing." In the SQL realm, this would be our pager with an alert from a user or server. (1)
You start running the scenarios of this patient in your head. What could cause this situation? What are the possible diagnoses? What can happen to the patient? You are asking what the worst and best cases are and how you would handle each. (2)
As you and your partner rush to the house, you talk about who is responsible for what, what equipment you will want for this call. You talk about the severity and the scenarios you thought about earlier. You formulate your game plan so when you roll up on scene you take in what you need and you know how you will operate. You discuss that it is a possible cardiac call and you are not operating at the paramedic level and you may want backup from a paramedic unit. You let dispatch know to have one stand by or start heading that way. (3, 4, 5)
You arrive on scene, grab the planned equipment and head in. His wife answers the door in a panic. You calmly ask her where he is, what is happening, and for his medication list plus any medical information (doctor’s name, recent hospital discharge paperwork, etc.). You approach the patient and ask him bluntly, “Sir, what is your problem today?” Yes, dispatch already told you, but you want to hear it in his words and you want to prod him for the chief complaint. He may be nauseous, dizzy, etc., but why did he call 911? What is the main problem that woke you up from your dream of query tuning? (6)
As you converse with the patient, your partner is quickly assessing the patient’s vital signs with the equipment you brought. The wife is panicking behind you like a nervous CIO in your cube as she reminds you, “Hurry up! What are you doing?! Let’s get him to the hospital!!” You reassure her and quickly explain you are getting the important info. (7) After only a matter of 1-2 minutes you have your info and have determined this is a “load and go” situation. Patient is going onto the stretcher and into the ambulance. (8)
In the ambulance you start an IV line to have it ready should he get worse or if the medics meet up with you and want to push medicine for his pain. You do a quick electrocardiogram and the printout is pointing to what your questions and partner’s vital sign checks point to: heart attack. Your differential diagnosis is based on what you perceive and what your training and equipment verifies. (9, 10)
Your partner begins driving to the hospital when the patient screams out and stops responding to any stimuli. You look at the monitor and it shows a rhythm that won’t sustain life. You verify this with a pulse check: no pulse. You tell your partner to stop and get in the back with you because the patient is unresponsive. He radios ahead to the medics and climbs in the back. You go through the protocol for cardiac arrest as a team: your partner begins CPR while you prepare the defibrillator by placing the stickers you had ready onto the patient’s chest. This is a rhythm that can respond to defibrillation, so you prepare to do so. You ask your partner to clear the patient, verify you are both clear, verify a second time and deliver a “shock” of electricity through the defibrillator. After a couple of rounds of CPR, defibrillation and medicine through the readied IV, the patient’s pulse returns. The paramedic unit arrives, and a paramedic with added tools and training jumps on board as your partner begins to drive towards the hospital again.
The rest of the ride is uneventful. You transfer care to the ED staff, document your call and prepare the truck for the next crew. (11, 12)
A Pattern Emerges?
Looking at the call above we see some themes emerging in each paragraph. Correlating to the numbers in the story we see:
1. Gather initial information
2. Prepare your mind
3. Work as a team
4. Plan your attack
5. Don’t be afraid to ask for help. It is better for the patient and for you in the long run. Don’t be afraid to ask for that help to be ready early. Give assistance time to respond even if it is not needed in the end.
6. Formulate a problem statement and verify it with all parties! Far too often I have sat through meetings with confusion, frustration and no traction on an issue only to discover we weren’t all on the same problem.
a. We also looked at the entire picture and didn’t develop tunnel vision. In the ambulance, tunnel vision looks like focusing on a flashy symptom and missing the root cause or the bigger problem. In the database world it looks exactly the same: looking at a symptom and missing the root cause or the serious issue.
7. Remain Calm
8. Understand priority and if the issue is stable or declining
9. Anticipate changes and plan ahead for worsening – Readying that IV on a patient who doesn’t need the meds yet is like taking a backup in the initial stages of an issue. If things go well, you don’t need it and it wasn't expensive to do. If you need it and don’t have it, you will wish you did. It’s more challenging to start an IV on someone with no perfusion; it’s practically impossible to back up a database that is completely gone.
10. Use all of the information – Don’t form a conclusion based on opinion or “feel” alone. Verify your thoughts with actual information. Look at your monitoring tools, your error logs, etc.
11. Documentation – We hate that part of ambulance calls too. It’s necessary. Helps paint a picture of what went wrong. Forms a part of the patient’s medical record and covers us should the patient decide to sue. Same in the DBA world: Root cause analysis, reference for the next time and describes what we did in the heat of the moment for change control/auditing purposes.
12. Clean up and prepare for the next call – We restocked and prepared the rig for the next call. On an ambulance this is more about preparation. In the database world this step really looks more like preventing the next call.
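Step 8 in the list above, deciding whether an issue is stable or declining, can be made a little less dependent on "feel" with a few readings taken over time. The following is only a minimal sketch under stated assumptions: the function name classify_trend, the 10% tolerance, and the idea of sampling something like a blocked-session count are all hypothetical and not taken from any particular monitoring tool.

```python
from statistics import mean


def classify_trend(samples, tolerance=0.10):
    """Compare the newer half of the readings with the older half.

    Assumption: higher readings are worse (for example, a count of
    blocked sessions sampled once a minute). Both the metric and the
    10% tolerance are illustrative choices.
    """
    if len(samples) < 4:
        return "unknown"  # too few readings to judge either way
    half = len(samples) // 2
    older = mean(samples[:half])
    newer = mean(samples[half:])
    if newer > older * (1 + tolerance):
        return "declining"  # getting worse: escalate, call for backup
    if newer < older * (1 - tolerance):
        return "improving"
    return "stable"


print(classify_trend([2, 3, 2, 3, 9, 12, 15, 20]))
```

A "declining" answer is the database equivalent of "load and go": it tells you to escalate now rather than keep assessing.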
So now we can play Johnny and Roy, what about SQL?
Hopefully it’s not too huge a stretch to see the pattern emerge. The order may be different in the database realm but the principles really aren’t. When a problem emerges it may be tempting to just start trying “stuff”. If you think about the above steps instead, you should be able to work through the problem, figure it out, work on a solution and reproduce that same success with future problems.
One principle I couldn't outline from the medical call is a benefit we have in SQL Server most of the time: non-production systems. We couldn’t give meds or CPR to a replica dev or test version of our patient. We had to rely on training and the steps to feel good about the proposed solution (which is all the CPR, defibrillation and cardiac drugs were). In our world, we can and should try to replicate a problem in a non-production system and verify results. It is not always expedient to do so, but when we can, we should.
Had the course of treatment not improved the patient, we could have looked at other protocols. Perhaps other drugs, different energy settings on the defibrillator, cardiac pacing, etc. could have changed the outcome. It’s the same when troubleshooting an out-of-control server. The first thing you do, even if reasoned and methodical, may not be the solution. Don’t be discouraged. Have a backup plan, understand why the first attempt may not be working, and change course.
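The idea of rehearsing a fix on a copy before touching production can be sketched in a few lines. This sketch uses Python's built-in sqlite3 module as a stand-in and is not SQL Server; the rehearse_fix helper, the orders table and the "fix" itself are all hypothetical, chosen only to show the shape of the habit: copy, apply, verify, and leave production alone until you understand the result.

```python
import sqlite3


def rehearse_fix(prod, fix_sql, check_sql):
    """Copy prod into a scratch database, apply the fix there,
    and return the first column of the check query's result."""
    scratch = sqlite3.connect(":memory:")
    try:
        prod.backup(scratch)      # full copy; prod itself is only read
        scratch.execute(fix_sql)  # the proposed fix runs on the copy
        scratch.commit()
        return scratch.execute(check_sql).fetchone()[0]
    finally:
        scratch.close()


# Stand-in "production" database with one obviously bad row.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, qty INTEGER)")
prod.executemany("INSERT INTO orders (qty) VALUES (?)", [(3,), (-1,), (7,)])
prod.commit()

bad_rows = "SELECT COUNT(*) FROM orders WHERE qty < 0"
remaining = rehearse_fix(prod, "DELETE FROM orders WHERE qty < 0", bad_rows)
untouched = prod.execute(bad_rows).fetchone()[0]
print(remaining, untouched)  # bad rows left in the copy, bad rows still in prod
```

The copy taken for the rehearsal also serves the purpose of step 9: if the fix goes badly in production later, you have a known state from before you started.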
Conclusion
Had we treated this patient the way I have seen some sorry examples of technology troubleshooting performed, it would have looked like this: checking the internet for chest pain and trying the first remedy that comes up (regardless of its danger or benefit); throwing our drug box at the patient, injecting him with every liquid in the ambulance; shocking the patient when it wasn’t indicated; running around saying “Oh no! This isn’t good!”; crying when the patient’s wife told us to hurry up; or getting angry at our partner while the patient was still suffering. Yes, this was a critical call, but we still slowed down, stepped back and applied a methodology. The extra 1-2 minutes spent on a methodical approach made the difference. Saving those minutes at the price of a worse outcome would have been a poor trade.
Rushing through a critical situation will only make it worse in the end. Take the extra time to understand the problem and understand your solution path, even with the CIO pacing the cube.