What causes a query to suddenly start performing poorly?

  • SQL Server 2012 SP3 CU9 Standard 64 bit running on Win Server 2008R2 (Hypervisor)

This morning the server that hosts an application database was pegged at 100% CPU. This application has been used the same way every morning for the last decade.

Initially what I saw was many sessions sitting in waits, all running the same query. The server was rebooted, and all the queries came right back.

When I analyzed one of them, it indicated a missing index. The query was doing an index scan on a table, and the WHERE clause was checking whether a datetime field fell between a range of values.

I added the missing nonclustered index (one key column with three included columns). Within 30 seconds everything was back to normal: all those sessions disappeared and the server's CPU was normal. I did a schema compare against our test and UAT versions; the only difference was the index I had just added.
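For reference, the index was shaped roughly like this. The table and column names below are placeholders, not the real schema; only the index name matches what was actually created:

```sql
-- Sketch only: real table/column names differ.
-- One key column (the datetime checked in the WHERE clause) plus the
-- selected columns as INCLUDEs, so the range query is covered without
-- scanning the whole table.
CREATE NONCLUSTERED INDEX idx_db_ReportTime
    ON dbo.SomeTable (ReportTime)
    INCLUDE (ColA, ColB, ColC);
```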

Nothing was changed yesterday. The only thing that happened was at 1:15 AM, when the maintenance job that runs Ola Hallengren's IndexOptimize ran. I checked the log, and on that particular table it only updated statistics.
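(With @LogToTable = 'Y', Ola's IndexOptimize logs its actions to dbo.CommandLog in the database where his scripts are installed, so what the job did to a given table can be confirmed with something like the following; the table name is a placeholder:)

```sql
-- What did last night's IndexOptimize run do to this table?
SELECT StartTime, EndTime, CommandType, IndexName, Command
FROM dbo.CommandLog
WHERE ObjectName = 'SomeTable'                 -- placeholder name
  AND StartTime >= DATEADD(DAY, -1, GETDATE())
ORDER BY StartTime;
```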

My assumption is that the execution plan changed for some reason. Also, when I ran sp_BlitzFirst, I saw a message that the plan cache was recently erased, but the server was rebooted and I assume that was the cause.

The logs appear completely normal, showing only regular backup activity.

    Any thoughts on what can cause a query to be OK on Thursday, and very far from OK on Friday?


• Most likely the data volume in production reached a threshold that caused the optimizer to change the execution plan. Unless your test/UAT servers have the same load and data volume, these changes can be hard to detect outside of production without periodic or constant checks.

• I restored the DB from yesterday, and also from the time we were having the problem this morning, into the test environment, which is running the same version of SQL Server.

In TEST, the queries from both databases look the same and perform the same.

We also have a UAT version in production, and comparing the exec plan between that and test, I'm seeing that test has a bunch of Parallelism (Distribute Streams) and Parallelism (Repartition Streams) operators, but none of these appear on the prod server.

Both servers have the same parallelism settings:

cost threshold for parallelism: 5
max degree of parallelism: 0

The main difference is the amount of memory on the two servers, which I guess could have an impact on the exec plan based on what is cached. The query passes in two datetime parameters.


• All execution plans are thrown away during a reboot or restart of the SQL Server service, and you said the server was rebooted or restarted. Someone may then have run the proc with parameters where only a small number of rows came into play. When others ran it expecting a large number of rows, SQL Server said, "Hmmmm... I already have an execution plan for this query," and used it even though it was nowhere near optimal. This is the "bad" kind of "Parameter Sniffing". You may simply need to recompile the stored procedure and then run it with a different set of parameters.
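If that's what happened, the fix can be sketched like this (the procedure name is a placeholder; for an ad hoc statement, an OPTION (RECOMPILE) hint does the same job):

```sql
-- Flag the proc so its next execution compiles a fresh plan
-- for the parameter values actually passed in.
EXEC sp_recompile N'dbo.usp_SomeProc';   -- placeholder name

-- Or, at the statement level, recompile on every execution
-- (trades a little CPU for a plan that fits the parameters):
-- SELECT ... FROM dbo.SomeTable
-- WHERE SomeDate BETWEEN @Start AND @End
-- OPTION (RECOMPILE);
```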

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

• Thanks Jeff. This is actually a Linq query from the app. I think the parameters are relative to the current date, so they do change daily, but they return a similar number of rows on each execution.

When I restored the DB to test, the exec plan was using parallelism and the cost of the one index scan was 5%. In the state it was in that morning in production, it was not using parallelism and the index scan was around 40%.

After I added the index, the query ran fine and was also using parallelism again. Nothing really made sense; it was as if that morning, for some reason, SQL Server changed its mind about how it was going to process that query, in a way that was less efficient.

Looking at reasons parallelism would not be used pointed to the obvious, like MAXDOP 1, but it's set to 0. The other thing is the cost threshold (whatever that means); it's set to 5, and I'm pretty sure the cost was above 5 (a hunch). Another was the availability of worker threads; I'm not sure whether running on a hypervisor can affect this. Affinity settings could matter too, but those are set to defaults.
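Those settings and the worker-thread situation can be checked in one shot with standard DMVs (nothing assumed here beyond VIEW SERVER STATE permission):

```sql
-- Parallelism-related server settings:
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('cost threshold for parallelism',
               'max degree of parallelism');

-- Worker threads: if in-use approaches the max, parallel plans
-- can be throttled back.
SELECT max_workers_count FROM sys.dm_os_sys_info;

SELECT SUM(current_workers_count) AS workers_in_use
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';
```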

I guess my real question here is: how would this be prevented? How do I know a query is about to blow up?

    Or just chalk it up to a fluke and forget about it.

Edit: Also, later that day, as a test we dropped that index. CPU was pegged within about 15 seconds. We put the index back after capturing another exec plan, and everything was fine again.

  • A couple of things there...

First, there's that Cost Threshold for Parallelism thing. 5 is way too low. I have a mixed OLTP and batch-processing system, and I've found the sweet spot for the OLTP side of it is right around 25, so I set it to 25. I did set it to 5 once to see what would happen, and things got really ugly CPU-wise in a hurry. I also tried increments of 5 up to 50, and 25 was the sweet spot.

You said this Linq query is doing a scan even on a good day. That's not a good thing. Can you attach the Actual Execution Plan (not a graphic, but the actual saved plan) so we can take a look for you?

Getting back to the index maintenance you ran that seems to coincide with this incident: Ola's code is fine... the way people use it is not. If you're using the typical settings of REORGANIZE at 10% and REBUILD at 30%, you're actually doing things wrong, despite a whole lot of people claiming it's a best practice.

First of all, NEVER (and I don't say that often) REORGANIZE any index that has a "0" or "100" Fill Factor. It doesn't work the way most people think. It would take an hour to explain all the gazintas here, but REORGANIZE removes critical free space on pages when it's needed the most, frequently does nothing to improve average page density because it can't make extra pages, and isn't the tame little kitty that most people claim it is when it comes to resource usage. I also tell people that the only time you should use REORGANIZE is to compact LOBs, and if you ever need to do that, then you need to see part 3 of my series on index maintenance so you can avoid it in the future.

I'll also tell you to NEVER do index maintenance on anything that has a "0" Fill Factor, unless the purpose of the REBUILDs is to assign proper Fill Factors, which is a bit of an art. Even REBUILDing indexes with a "0" or improperly assigned "100" Fill Factor can cause what it caused for me: MASSIVE BLOCKING on the morning after the index maintenance ran, due to all the bad page splits that occur on most of your indexes, even if they have an ever-increasing key, because no one ever correctly accounts for "ExpAnsive Updates". And, no... simply lowering the Fill Factor does NOT fix that problem. In fact, you may be causing a hell of a lot of wasted memory and disk space with absolutely no benefit.

And, to be sure, Paul Randal is the guy who wrote much of, and was in charge of the people writing, the index maintenance stuff for SQL Server, and he's also the guy who wrote the BOL writeup for sys.dm_db_index_physical_stats. In many of his presentations, he says flat out that MS pressured him for a recommendation, and that's when the 10/30 recommendation was born. If you actually do an in-depth study of his BOL writeup, you'll see that there's much more to index maintenance than just the 10/30 thing that most people have mislabeled as "Best Practice". It's actually better to do no index maintenance than to do it wrong, and the 10/30 thing is SO WRONG! I'm proof of that... I stopped doing any index maintenance for more than 2 years; CPU usage dropped from an average of 22% across 32 CPUs to 8% in 3 months and stayed there for the duration, and blocking has all but vanished.

To be sure, though, some indexes do end up with very low page densities, and you will need to do REBUILDs on those to recover memory and disk space (don't do REORGANIZE on these... it usually won't actually help). I recommend REBUILDing at an 82% Fill Factor anything with page density below 82%, and at 92% anything with page density between 82% and 92%. To be equally sure, that should only be a stop-gap until you can actually determine the correct Fill Factor.
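As a sketch of that stop-gap (object names are placeholders; SAMPLED mode is needed because the fast LIMITED mode doesn't populate the page-density column):

```sql
-- Find indexes with low average page density.
SELECT OBJECT_NAME(ips.object_id)          AS table_name,
       i.name                              AS index_name,
       ips.avg_page_space_used_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
JOIN sys.indexes AS i
  ON  i.object_id = ips.object_id
  AND i.index_id  = ips.index_id
WHERE ips.avg_page_space_used_in_percent < 82
  AND ips.page_count > 1000                 -- ignore trivial indexes
ORDER BY ips.avg_page_space_used_in_percent;

-- Rebuild one of them at the 82% stop-gap Fill Factor.
-- (ONLINE rebuilds aren't available on Standard Edition.)
ALTER INDEX IX_Example ON dbo.SomeTable
    REBUILD WITH (FILLFACTOR = 82);
```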

    So, to summarize this post...

1. Please attach the saved Actual Execution Plan so we can make a recommendation as to what needs to be done to your Linq query, and...
2. Except for the occasional page-density repair described above, stop doing index maintenance on things that have a "0" Fill Factor until you can figure out the best Fill Factor, which includes determining how "ExpAnsive Updates" are affecting your indexes, causing usually unnecessary page splits even on indexes with "Ever-Increasing Keys", and blocking you might not even be aware of.
3. Change the Cost Threshold for Parallelism to 25 to start with and then slowly adjust it to find your "sweet spot". Or just leave it at 25. Some people say set it to 50... I wouldn't go that high as the first and only guess.
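For item 3, the change is server-wide, takes effect immediately, and needs no restart:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;
RECONFIGURE;
```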

    --Jeff Moden



• You might also implement Query Store to get plans to watch and examine. There is a port of the project for older versions here: https://github.com/OpenQueryStore/OpenQueryStore/releases/tag/v2.1.0

Jeff, thank you for all the info and your time; it's greatly appreciated!

1. Plans attached as XML; .sqlplan is not allowed.
2. Will start looking at that; that's a lot of good info.
3. Cost threshold updated.

Here are the IndexOptimize run settings that were in use:

@Databases = 'ALL_DATABASES',
@FragmentationLow = NULL,
@FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationLevel1 = 5,
@FragmentationLevel2 = 30,
@PageCountLevel = 20,
@UpdateStatistics = 'ALL',
@OnlyModifiedStatistics = 'Y',
@LogToTable = 'Y'

The table that I added the index to didn't have any index operations during the optimize job; it was all statistics updates.

Also, most of these tables use a sequential GUID PK, and most tables are loaded by an ETL process.

The new exec plan with the added index (idx_db_ReportTime) is now showing another missing index, which I will look at as well. But right now that query is running OK.

ExecPlanCurrentWithIndex is what's running now; the other one is from before the index, when we had the problem.

    Please let me know if you need any more info.

    Thank You

• Steve, this is running 2012, but when we get it moved, yeah, Query Store for sure.

  • The link is for OpenQueryStore, ported to 2012/2014. That's why I added it.

  • TangoVictor wrote:

    Jeff, thank you for all the info and your time it's greatly appreciated!

    1. plans attached as xml, sqlplan is not allowed
    2. will start looking at that, that's a lot of good info.
    3. Cost threshold updated

Here are the IndexOptimize run settings that were in use:

@Databases = 'ALL_DATABASES',
@FragmentationLow = NULL,
@FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationLevel1 = 5,
@FragmentationLevel2 = 30,
@PageCountLevel = 20,
@UpdateStatistics = 'ALL',
@OnlyModifiedStatistics = 'Y',
@LogToTable = 'Y'

The table that I added the index to didn't have any index operations during the optimize job; it was all statistics updates. Also, most of these tables use a sequential GUID PK, and most tables are loaded by an ETL process. The new exec plan with the added index (idx_db_ReportTime) is now showing another missing index, which I will look at as well. But right now that query is running OK. ExecPlanCurrentWithIndex is what's running now; the other one is from before the index, when we had the problem. Please let me know if you need any more info. Thank You

    Not your fault but let me be perfectly clear here... As good as Ola's stuff is, it is patently NOT an index "optimizer".  Rebuilding or reorganizing an index is not index "optimization".  It's JUST defragging.

    Assigning the correct Fill Factor to an index is part of "optimizing".  Getting rid of "ExpAnsive" updates is "optimizing".  Selecting the correct keys, includes, and type of index for the task is "optimizing".  Index Maintenance can actually "deoptimize" an index if the maintenance isn't done correctly.

Before I comment on your settings for the index maintenance... you said all of your IDs are "sequential" GUIDs. Do you know if any of these tables suffer "ExpAnsive" updates, where variable-width columns go from NULL/blank to something, or something gets larger?

Also, the execution plans you've provided appear to be "Estimated" execution plans rather than "Actual" execution plans (at least for the current plan you attached). Any chance of you providing the "Actual" execution plans?

Also, I don't see anything fundamentally wrong with the query except for the unnecessary wrapper query that Linq put around the inner query. It does appear that you need an index on the DutyBlockID column of the CrewSignIn table. I wouldn't add all of the includes that the missing-index recommendation suggests. If this is the predominant query for your system, consider changing the PK to a nonclustered index and creating a clustered index on DutyBlockID instead. Since that's a fair bit of work, try just creating a nonclustered index on DutyBlockID and see what happens. It's the easiest thing to do, may provide a benefit, and is the easiest thing to undo if it does not provide a benefit.
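That quick experiment is just the following (the dbo schema and the index name are my assumptions; pick whatever name fits your conventions):

```sql
-- Easy to try, easy to undo:
CREATE NONCLUSTERED INDEX IX_CrewSignIn_DutyBlockID
    ON dbo.CrewSignIn (DutyBlockID);

-- If it doesn't pan out:
-- DROP INDEX IX_CrewSignIn_DutyBlockID ON dbo.CrewSignIn;
```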

And the reason I asked about "ExpAnsive" updates is because you said that Reorgs/Rebuilds are never recorded in the logging table for these tables. That usually means there are no page splits, and no resulting fragmentation, going on in them. Just trying to verify whether that's true.

    --Jeff Moden



• Thanks Steve, sorry, I just saw "Query Store" and assumed. I'll check it out.
