Today we have a guest editorial as Steve is traveling.
Recently I joined an effort to reduce the run time of a distributed process from 12 hours to less than 2 hours. As is almost always the case, we started with less-than-perfect information on how all the pieces worked together. Our understanding at the start was that scale-out was possible and that the database portion was going to be a bottleneck. So much so that we discussed using read-only instances to increase our throughput. Assumptions!
We did the first couple of tests with a single database server and the service scale-out in place. For me it was a ho-hum exercise, with 10-15% CPU usage on the database server. I looked at the queries and found that each iteration included one long-running query, sometimes over a minute, but it also returned a fair amount of data - a million rows wasn't unusual. I started working on the query, but based on everything else, it didn't seem like we could get a big win without a redesign, and that wasn't doable in the time available. It also didn't seem like this query was the bottleneck. Interestingly, at that point we couldn't see anything that looked like a bottleneck - nothing was close to maxed out.
I saved off all the captured statements to a table (yes, I used Profiler) and started to look at the process more. Being able to sort and aggregate made it easy to see that this query was by far the biggest chunk of database time, but that was only one part of the puzzle. There were other data stores involved. Was this the bottleneck? If so, how to prove it?
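As a rough illustration of the kind of aggregation I mean, here's a sketch that assumes the trace was saved to a table I'll call dbo.CapturedTrace with the standard Profiler TextData and Duration columns (the table name is made up, and this groups on the raw statement text rather than normalizing parameters):

-- Sketch: rank captured statements by their share of total database time.
-- Assumes a Profiler trace saved to a table; Duration is in microseconds in SQL 2005 and later.
SELECT
    CAST(TextData AS nvarchar(4000)) AS StatementText,
    COUNT(*) AS Executions,
    SUM(Duration) / 1000000.0 AS TotalSeconds,
    100.0 * SUM(Duration) / SUM(SUM(Duration)) OVER () AS PercentOfDatabaseTime
FROM dbo.CapturedTrace
WHERE Duration IS NOT NULL
GROUP BY CAST(TextData AS nvarchar(4000))
ORDER BY TotalSeconds DESC;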
Luckily for us, the one slow statement was a stored procedure. I changed it to return a TOP 1. Doing that meant everything else would still work even though we wouldn't get the correct answer in the end. It's a hack for sure, but a reasonable one in our scenario.
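As a sketch of the hack, using hypothetical procedure, table, and column names rather than our real ones, the change was roughly this shape:

-- Sketch only: a hypothetical procedure altered to return TOP (1)
-- so callers still get a well-formed row without paying for the expensive query.
ALTER PROCEDURE dbo.GetWorkItems
    @BatchId int
AS
BEGIN
    SET NOCOUNT ON;

    SELECT TOP (1)        -- temporary: was an unrestricted SELECT returning up to a million rows
           wi.WorkItemId,
           wi.Payload
    FROM dbo.WorkItems AS wi
    WHERE wi.BatchId = @BatchId;
END;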
Discussion challenge: how would you do the same if it were submitted as dynamic SQL?
This was an easy change to make, easy to revert, and in theory would let us iterate faster because the slowest part would be faster.
This was the bottleneck, or at least the first one, and finding it got us going. More importantly, we could put that problem on the list and keep going without figuring out how to fix it (though we had a couple of ideas already). Continuing on was important because we still didn't understand why we weren't getting the performance we expected from the rest of the hardware. A few more challenged assumptions later, we understood the configuration options and the hardware usage, and we had a list of scenarios to run to validate what we believed. That doesn't solve the problem, but it defines the problems to solve clearly. It's downhill from there.
The lesson? Maybe the main one is that it's rare to figure it all out in one go. Investigate, iterate, repeat. The faster you iterate, the faster you prove or disprove what you think will make a difference. I believe that faster iterations help you stay focused on the end goal because it's way too easy to get bogged down in tuning something that just doesn't matter.
While you're thinking about it, think about this: how do you react when the problem is hard? The can't-find-a-place-to-start kind of hard, the hours-and-hours-into-it-and-still-no-progress kind of hard. Do you stress out? Blow up? Give up? Grind through? Or can you smile at a challenge worthy of you, enough to make you exclaim "Watson, the game is afoot!"?