August 13, 2015 at 8:34 am
I've got a query that is executed very frequently. Sometimes it runs with a sub-optimal execution plan. The details of each execution plan aren't really relevant; in essence, one plan uses the ideal index and the other uses an almost-ideal index. Since the query is called so frequently, the second-best plan translates into a significant increase in CPU.
I can resolve the problem by manually flushing the bad plan from the cache. That led me to investigate why the bad plan is being chosen in the first place. What I found is that the moment the plan enters the cache coincides with an update statistics job updating the exact statistic that would be used for that query... the statistic that corresponds to the "good" index. It turns out the same thing has happened the past few times our update stats job ran. It's as if the statistic was unavailable because it was in the process of being updated when the query ran, so the optimizer used another index instead.
I don't want to use an index hint, for the usual reasons. Is there any solution to this? What about creating an identical statistic on the "good" index so that, even if the one statistic is being updated, the other is available for the query to use?
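To make the idea concrete, this is roughly what I have in mind (a minimal sketch; the table and column names are placeholders, not our real objects):

-- Manually created statistic duplicating the statistic behind the "good"
-- index, so a second copy exists if the index's own stat is mid-update.
CREATE STATISTICS st_KeyColumn_duplicate
ON dbo.MyTable (KeyColumn)
WITH FULLSCAN;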
August 13, 2015 at 9:38 am
Generally what's going to happen is that a recompile event will occur after the stats are updated. You won't see a recompile occur during the statistics update and use a partial set of statistics. If you're not using asynchronous statistics updates, the statistics update will block any compile or recompile event until the statistics update is complete. It could simply be that you're getting a recompile, either just before your statistics update or immediately after it, that is passing a parameter value that leads to the problematic plan.
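If you want to try asynchronous statistics updates, it's a database-level setting (the database name below is a placeholder, and AUTO_UPDATE_STATISTICS has to be on for it to take effect):

ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS_ASYNC ON;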
The best solutions are probably all related to the same things you'd do to fix bad parameter sniffing. That includes using an OPTIMIZE FOR query hint to supply a specified value, if a particular value makes a difference, or UNKNOWN if having a generic plan works better. You could use a RECOMPILE query hint so that you get a fresh plan each time; that way, the one set of parameter values that leads to the plan that is bad for most of your queries can still run with its own plan. You could try updating your statistics more often if the bad plan comes from stale stats (but you'd have to determine whether the recompile to a bad plan occurs before or after the stats update). I'm not a huge fan, but you could also look at plan forcing to keep the same plan in place no matter what. A quick sketch of the hint options follows.
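Here's a rough sketch of those hints, with placeholder table, column, and parameter names; pick whichever fits:

SELECT t.Col1, t.Col2
FROM dbo.MyTable AS t
WHERE t.SomeColumn = @SomeParam
OPTION (OPTIMIZE FOR (@SomeParam = 0));        -- plan built for a specified value
-- OPTION (OPTIMIZE FOR (@SomeParam UNKNOWN)); -- generic plan from the density vector
-- OPTION (RECOMPILE);                         -- fresh plan on every execution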
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 13, 2015 at 9:51 am
Thanks for the reply. I'm skeptical that this is a "normal" scenario where it's choosing a plan that's appropriate for the first parameter executed after the recompile. Every time I manually recompile, it gets the "good" plan, and I can think of no scenario where the "bad" plan would be superior.
I don't want to make it recompile every time; the proc is called so frequently that I'd be concerned about compile locks.
I suppose OPTIMIZE FOR UNKNOWN might be a solution if you're correct that the cause is the parameter it gets immediately after the stats update, but my gut tells me that's not what's happening here.
August 13, 2015 at 9:53 am
On second thought, I guess your explanation is more likely than mine (though I really can't figure out why it would choose the other index then, yet always the right one when I manually recompile).
August 13, 2015 at 11:35 am
Capture the plan when it's gone bad and when it's good. Look at the Properties for the first operator. You'll see the compile-time values used. That should let you know why it's doing what it's doing, or at least give you a good clue.
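If you'd rather pull those compile-time values straight out of the plan cache instead of opening each plan, something along these lines should work (the LIKE filter is a placeholder for your procedure name):

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT st.text,
       p.n.value('@Column', 'sysname') AS parameter_name,
       p.n.value('@ParameterCompiledValue', 'nvarchar(500)') AS compiled_value
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY qp.query_plan.nodes('//ParameterList/ColumnReference') AS p(n)
WHERE st.text LIKE '%YourProcName%';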
Also, UNKNOWN may or may not be preferable. Test that to be sure. It's the choice between going after a specific plan and a generic plan.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 13, 2015 at 11:43 am
Thanks again. I'll have to wait until the next time it runs (probably Monday) to get the compile-time values.
It does choose the "good" plan with OPTIMIZE FOR UNKNOWN.
August 13, 2015 at 11:49 am
Interesting. Then it prefers to use an average of the statistics and a generic plan, not a specific plan based on a value. Sounds like it might be bad parameter sniffing. But it could just be stats being stale. Please post back when you know. I'm curious now.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 13, 2015 at 11:56 am
I think what happened was what you initially suggested: after the recompile, it compiled based on the first call made. I'm guessing that call had an outlier value for one of the parameters. For example, one of the parameters is "roleid": 90% of users have role=0, 9% have role=1, and 1% have role=5. If the first call happens to be one with role=5, it probably caches the "bad" plan. With OPTIMIZE FOR UNKNOWN, it compiles a generic plan, which will be slightly more costly for queries with role=5; but since those are only 1% of the workload, that makes me happy.
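For anyone who finds this thread later, the fix is roughly this shape (proc, table, and column names are simplified from the real code):

ALTER PROCEDURE dbo.GetUsersByRole
    @roleid int
AS
SELECT u.UserId, u.UserName
FROM dbo.Users AS u
WHERE u.RoleId = @roleid
OPTION (OPTIMIZE FOR (@roleid UNKNOWN));
GO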
August 13, 2015 at 12:11 pm
That makes absolute sense.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 13, 2015 at 12:26 pm
Yep. The only thing that's a little suspect is that this has happened three times, on three consecutive Mondays, after updating stats and/or reindexing. It's hard to believe that all three times it hit an outlier right after the recompile.
I think what happened the first two times was slightly different from what happened the third time. I didn't capture any details the first time; I just fixed the problem by clearing the plan from the cache (I was a little too quick to act, as CPU was approaching our "alarm" levels), but looking at the history I could see that the problem started well before the update stats job ran. The second time, I captured some information about when the plan entered the cache and found that it happened while the "good" index was being rebuilt. I figured the plan was cached while the good index was offline due to reindexing, so I changed to online reindexing for that table and thought the problem was resolved. The third time, I was pretty annoyed to see the problem again. This time it coincided with the stats update, as I mentioned above.
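For reference, the manual intervention and the reindexing change were along these lines (object names are placeholders):

-- Find the plan_handle of the bad plan:
SELECT cp.plan_handle, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%YourProcName%';

-- Evict just that one plan rather than the whole cache:
-- DBCC FREEPROCCACHE (<plan_handle from the query above>);

-- The switch to online rebuilds (Enterprise Edition only):
ALTER INDEX IX_GoodIndex ON dbo.MyTable REBUILD WITH (ONLINE = ON);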