Sudden unexplained performance issue

  • I don't even know where to begin.  "I have this customer"...

    Data Warehouse, big 1tb database.  4 processors, 32g ram, well indexed, well maintained database on VMWare.  This thing has been in production for years and despite the skimpy memory/processor counts has been a reliable well performing system.

    Then suddenly... Sometime between 5am 8/30 and 5am 8/31 to be precise... the whole thing went sideways.

    There is this poorly written query, bunch of CTEs strung together that eventually get used in a single select with a bunch of LOJ to tie them all together.  This silly query runs, and has run for over 2 years, in 47 seconds.  Query plan is what you want, indexes all tuned up and ready...just like always.

    Then suddenly... on 8/31; 5 hours in, and it's still running.  There is not another single query running.  Activity Monitor is FLAT at 0 mb on disk IO.  Nearly anything sent to the server that is more than a couple of tables in anything other than a basic join gets "shelved".  Memory and processors are equally dull.  Activity monitor says it has the query, but the server seems to be in a "yeah, not right now... i'll get to that later" mode.  CTFP is 25 and Maxdop is 2.

    I can reorg indexes, update statistics, flush the procedure cache... nothing will bring the old performance back...except.

    Now get this... if (spoiler alert... I don't want this to turn into a MaxDop conversation, because that's not the issue...something else is going on).  If I force that 47 second well-established query with option (maxdop 1)... blamo... 47 seconds.  If I reset Maxdop at the server level to 1... all those other queries that suddenly no longer finish ... all run... well... most of them anyway.

    For the first time ever... TempDB growth is through the roof... 74g of growth in an hour running a single query that used to run in a minute or so...that now won't finish; and this is all pretty basic stuff, established queries on well indexed tables.

    With maxdop 2... the server shows a CXConsumer "GetWaitPipe" waitstate ... If I force maxdop to 1... that goes away.

    So... going back to the beginning... "Nothing in this environment has changed"; Not the SAN, not VMWare, not the VM Architecture, no patches, no updates anywhere... It's like this box just doesn't want to come to work anymore.  I don't even know where to go next.

    Ideas?

  • Can you set MAXDOP in just that misbehaving query so it runs fast while you sort out what's going on?

  • The execution plan stays the same between MaxDop 1 and other settings? Cause overall, I'd say you crossed a threshold on row counts. That's when I've seen things go south. Questionable, but OK, quiery runs for years then suddenly blows up. Looking at the row estimates, we crossed a threshold that changed the plan costs so that a new, badly performing, plan was introduced. However, if the plan is EXACTLY the same (look at the estimated row counts and all the details, use SSMS to compare plans), then I'm not sure.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I'd think you crossed some threshhold for the optimizer as well. These are the maddening problems in SQL Server. Positive no code has changed? I assume you have an older version of the procs in a VCS or a backup. Verify this for sure. No patches or other changes? It's likely data has changed to the point where a new plan is preferred.

     

    FWIW, this is the type of problem MS likes to see in their optimizer research if you have the before/after plans and can repro this.

  • One of the things to try :

    if it works fine with maxdop 1,  enable QueryStore if it's not enabled yet,  get fast plan captured in it and force the query to use fast plan

     

  • Andrey wrote:

    One of the things to try :

    if it works fine with maxdop 1,  enable QueryStore if it's not enabled yet,  get fast plan captured in it and force the query to use fast plan

    And if you do go this route, know that you can enable query store, but disable data capture so that you can force a plan (you'll have to enable capture long enough to get the plan) without having the rest of the overhead of query store. Just an idea.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Without seeing the query it's difficult to say exactly what could improve it. But something I've done before with successful results is to split a query into more than one query. So, for example, you could take the results from each of the CTE's and insert them into temporary tables, then index the temporary tables and use the temporary tables in the final part of the query instead of the CTEs.

  • Thanks for the answers guys; I had wondered about the row threshold, but was not sure if that could even be a possibility.

    I did roll maxdop down to 1 at the server level so that the rest of the daily reports and jobs would run.  As for the one bad query...that's one out of 50 that "suddenly" don't work.  While I can pick on this one query with the stacked CTEs as being a target for "lets talk about query technique"; there are many other queries that were fine pre 8/31 that are not now... This one CTE is just one I had some stats on.

    For forcing the "fast" query plan; I would have to do that for roughly 80% of the production queries on this box since most no longer run timely...it's not just the 1.

    Reducing the number of rows in the database is certainly an option (we have an archive process where we can move "days" off of tables into annualized databases.  What baffles me though... is that this database is not the largest (nor the longest) in the family... why this magical row threshold would be relevant boggles my mind.

    We did run the queries on a dev version of the warehouse (which has much fewer days in the database, thus shorter tables) and the query plan is the same on both...and... runs in the old 47 seconds instead of the new 5 hours.

     

     

  • I don't believe in magic, there should be differences in query plans (prod slow, prod fast maxdop 1, dev fast) and/or statistics.

    Also, it's worth to compare "select @@version" on both servers, resources allocated to SQL server, etc.

     

  • There has to be difference of some kind. Always. If the plans are absolutely identical (and please, not just the graphics are the same, the data behind it is the same, use the comparison tool within SSMS, this blog post gives quick overview on it), then the problem is elsewhere. Server settings. Database settings. Resource contention, blocking, locking, waits. All this could be different. We need more metrics to understand the issue, not just execution time and MAXDOP settings (although, MAXDOP settings changes that are affecting the performance WITHOUT changing the execution plan is very suspect).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant... despite the pre-requisite "nothing has changed" answer from everyone involved in this environment...

    I was discovered this morning that VMWare 6.7u3 was applied (upgrading from VMware 6.7u2) 5 days before the trouble.  However...the hardware version is 14 which may have an incompatibility with u3 in the area of processor access...

    2 options were tossed out today...upgrade to HW version 15... or... roll back u3.  Most of these numbers are greek to me but for the VMWare folks, perhaps this means something.

    Stay tuned...

  • Good luck. I'm pretty ignorant on the VMWare front as well. Nice to find out what changed though.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • No Joy...

    Back to forcing maxdop @ 1 for our sample query gives data in 47 seconds... maxdop 2, the query shows running, but there is NOTHING happening on activity monitor.  I did discover yesterday that as this query is sitting doing "nothing"...tempDB is filling like MAD (for a query that produces < 500 rows and 2 dozen columns. ... and the log is showing TempDB waits...

    Still unable to figure out what changed last Thursday morning.  I remain convinced it is something outside the purview of SQL and Windows, something up @ the VMWare level that is restricting processor access.

  • If it's a "poorly written query", why don't you just rewrite it?

  • Because at the moment... it's a baseline.  And the scenario is more than just 1 query... there are hundreds of otherwise (previously) well behaved queries that now either run for 100 times their previous metrics... or never finish at all.

    Everyone is convinced it is something at the VMWare level; we're standing up a new server to test those scenarios.  But so far... sometimes the server will run your query...sometimes it'll shuffle it off to temp db for hours and never actually work on it; sorta like that guy in the corner that stares into his phone all day and never gets anything done.

    And this is from a server that has been well established and in production for over 2 years...

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply