Performance issues post migration to 2016...looking for recommendations (Resolved)

  • Hello,

    We have been battling issues since going live over the weekend with a migration to a 2-node SQL Server 2016 instance. We have a single availability group that was set up in a multi-subnet configuration: 2 nodes in the same data center and a DR node in a separate data center.

    Through the troubleshooting process we have removed the DR replica and are now running on two servers in the same data center.

    We have looked through the most common things we can find and are still fighting performance issues at the application level.

    As a side note, we moved the database in SQL 2012 compatibility mode. I know, not ideal, but the business unit handed us that requirement due to external constraints we couldn't control.

    Looking at the current wait stats (sorry, I can't figure out the best way to format them):

    WaitType              Wait_S         Resource_S     Signal_S   WaitCount   Percentage   AvgWait_S   AvgRes_S   AvgSig_S
    -------------------   ------------   ------------   --------   ---------   ----------   ---------   --------   --------
    LOGPOOL_CONSUMERSET    15466257.01    15443302.91   22954.10    63143031        72.93      0.2449     0.2446     0.0004
    THREADPOOL              3719972.02     3719970.17       1.86    17229779        17.54      0.2159     0.2159     0.0000
    PAGELATCH_EX             641746.02      631383.73   10362.29    30001684         3.03      0.0214     0.0210     0.0003
    HADR_SYNC_COMMIT         369949.67      346209.12   23740.55    16507194         1.74      0.0224     0.0210     0.0014

    I cannot find much information on the largest wait type (LOGPOOL_CONSUMERSET).

    When I look at current activity, I see that the majority of the spids have this wait type, and the executing statement of the procedure is "ROLLBACK TRANSACTION".
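
    (For reference, the "current activity" check is roughly the following, using the standard sys.dm_exec_requests and sys.dm_exec_sql_text DMVs; this is only a sketch of the idea, not the exact query I ran.)

    -- Current requests, their wait types, and the statement each one is executing
    SELECT r.session_id,
           r.status,
           r.wait_type,
           r.wait_time,
           r.command,
           SUBSTRING(t.text, r.statement_start_offset / 2 + 1,
                     (CASE r.statement_end_offset
                           WHEN -1 THEN DATALENGTH(t.text)
                           ELSE r.statement_end_offset
                      END - r.statement_start_offset) / 2 + 1) AS executing_statement
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id > 50          -- skip system sessions
    ORDER BY r.wait_time DESC;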

    I am looking for any other recommendations to  troubleshoot or review.

    Thanks in advance,
    Lee

  • I would check:
    1. Do you have enough space in physical memory for the log buffer?
    2. Are there any long-running transactions in a rollback-pending state?
    3. The performance of the disk drive where the .ldf files are located (sketch queries for each of these are below).
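
    Rough sketches of how each of these could be checked (the DMVs below are standard; using the MEMORYCLERK_SQLLOGPOOL clerk as a proxy for log buffer/log pool memory in #1 is my assumption):

    -- 1. Memory currently held by the log pool clerk
    SELECT [type], SUM(pages_kb) / 1024.0 AS size_mb
    FROM sys.dm_os_memory_clerks
    WHERE [type] = N'MEMORYCLERK_SQLLOGPOOL'
    GROUP BY [type];

    -- 2. Requests that are rolling back or have been running a long time
    SELECT r.session_id, r.command, r.status, r.percent_complete,
           r.total_elapsed_time / 1000.0 AS elapsed_s
    FROM sys.dm_exec_requests AS r
    WHERE r.command LIKE N'%ROLLBACK%' OR r.total_elapsed_time > 60000;

    -- 3. Average write stall on the log files
    SELECT DB_NAME(vfs.database_id) AS database_name, mf.physical_name,
           1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
         ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    WHERE mf.type_desc = N'LOG'
    ORDER BY avg_write_stall_ms DESC;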

  • The server configuration is 20 virtual CPUs with 275 GB of RAM allocated to SQL Server (the server has a total of 400 GB configured).

    No long-running transactions that are rolling back. We have a lot of small transactions that normally record metrics from logging devices. The specific procedure that does this work is continually rolling back. These calls run in a few milliseconds, have the noted wait type, and when I check the executing statement they are rolling back. I am continuing to review the procedure...

    Is there a good way to check  for your question #1? 

    I am being told by the storage team that the disk is performing well.

    thanks for any input

    Lee

  • If you want to get your problems resolved ASAP, hire a professional tuning expert. You have given us essentially nothing to go on to help resolve what sounds like severe, production-affecting performance issues that have been going on for potentially days. There is no substitute for experience and training in this situation. We could go back and forth for more days and still be no closer to knowing the root cause(s) of your problems. 

    If you can't do that I would start with sp_whoisactive and differential file IO stall and wait stats analyses to identify actual issues. 
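
    (A minimal sketch of the "differential" wait stats idea, not my actual tuning scripts: snapshot the counters, wait a known interval, then diff so you only see waits accumulated during the problem window. The same snapshot-and-diff approach works against sys.dm_io_virtual_file_stats for file IO stalls.)

    -- Snapshot, wait one minute, then report the delta
    SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
    INTO #waits_t0
    FROM sys.dm_os_wait_stats;

    WAITFOR DELAY '00:01:00';

    SELECT w1.wait_type,
           w1.wait_time_ms - w0.wait_time_ms               AS wait_ms_delta,
           w1.waiting_tasks_count - w0.waiting_tasks_count AS waits_delta
    FROM sys.dm_os_wait_stats AS w1
    JOIN #waits_t0 AS w0 ON w0.wait_type = w1.wait_type
    WHERE w1.wait_time_ms - w0.wait_time_ms > 0
    ORDER BY wait_ms_delta DESC;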

    I am curious why you disconnected your DR server. What metric did you use to come to the conclusion that doing so would improve performance? And did it help?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru - Monday, January 22, 2018 6:12 PM

    I agree with your statement to a degree...at the moment nothing appears wrong or is causing an error on the MSSQL side. We did a weekend migration, and Monday morning brought complaints of slowness in some web API portion of an application.

    We have tried to review all the common aspects...wait stats, for example, only yield what I posted above...and when I looked for a definition I found Paul's post on sqlskills.com:

    https://www.sqlskills.com/help/waits/logpool_consumerset/

    The DR node was in Azure, in an async configuration...that was a new configuration as part of this migration. Due to the issues experienced we were asked to remove that replica...and one of the example uses listed in the sqlskills link for this wait was "sending log blocks to asynchronous availability group replicas".

    So we dropped the replica from the AG after running out of things to review. That change did not improve the performance issue from the client side.
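
    (For reference, removing the replica itself was just the standard command run on the primary; the AG and replica names below are placeholders rather than our actual names.)

    -- Run on the primary replica; [MyAG] and 'AzureDRNode' are placeholder names
    ALTER AVAILABILITY GROUP [MyAG] REMOVE REPLICA ON N'AzureDRNode';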

    I guess I didn't really know how to interpret that wait type and was reaching out to see if there was a recommendation on understanding what it means.

  • We were able to figure out the issue, which turned out to be code based. One of the reasons it took a while was partly our fault for not being able to pinpoint the issue on the database side, coupled with the application having a bug...and not getting a lot of visibility from the app owner on that particular area.

    Turns out the procedure referenced previously that was continually rolling back at a high volume was the culprit. Initially, when we brought that to the application team for review, the response was that it was working. Later we circled back around to trace the code a second time and caught a Unique Key constraint violation, which we questioned, but again the app owner said that could happen based on how the code was written (1 failure in a batch could force the resubmit of the entire batch). Once we heard that we focused on that specifically...and it took a very narrow trace to actually see the effects of that business logic.
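
    (For anyone hitting something similar: the "very narrow trace" can be approximated with an Extended Events session filtered to error 2627, the unique/primary key constraint violation error. The session name and file path below are placeholders, not what we actually used.)

    -- Capture only unique key violations, written to a local file target
    CREATE EVENT SESSION [capture_uk_violations] ON SERVER
    ADD EVENT sqlserver.error_reported (
        ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
        WHERE ([error_number] = 2627)
    )
    ADD TARGET package0.event_file (SET filename = N'D:\Traces\uk_violations.xel', max_file_size = (256))
    WITH (MAX_DISPATCH_LATENCY = 5 SECONDS, STARTUP_STATE = OFF);
    GO

    ALTER EVENT SESSION [capture_uk_violations] ON SERVER STATE = START;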

    Bad data and an even worse piece of business logic allowed the process to resubmit calls at an alarming rate. For example...the specific call was executing around 48,000 times a minute at its peak, and the volume of Unique Key violations registered in a 1-minute trace was just as high.

    So it was failing almost every call, with a few new records getting in for each batch, but the batch would fail and the volume would grow as the batch was resubmitted. It wouldn't show up in the general trace of just that procedure call via the GUI, I am guessing because the GUI couldn't keep up, and I didn't even consider a scenario like that...especially when we were investigating several other areas from the top down (F5, app, etc.).

    Thanks for any responses, and I do realize I didn't have a ton of info initially, but I didn't even know where to begin considering the wait type we found.

    Lee

  • 1) Glad you got it fixed.

    2) "...I didn't even know where to begin..." - That is actually the single most important thing on this entire thread, and is the reason why a performance tuning professional should have been called in immediately. This event should have taken no more than 2 hours to discern, and most likely could have been isolated in less than 30 minutes from a first RDP into your environment. Knowing what is actionable/important/meaningful and what is a red herring only comes from extensive experience and training.

    3) GUI profiling is inappropriate for high-volume activity. This should have been done to local disk and then the trace file analyzed. I will add that everyone I know that uses profiler also captures WAY too much data, both from a column and event perspective. Bloat will crush your ability to do anything useful with traces. I have run my performance tuning trace template to local disk on systems with over 10000 batches per second with single-digit overhead on the workload.

    4) High-volume execution (and transaction) metrics can easily be captured using DMVs and perfmon counters. These are among the umpteen things I check for when doing "OMG we're screwed" support for clients both new and old.
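
    As one minimal example (a sketch only, not my actual capture scripts), sys.dm_exec_procedure_stats shows per-procedure execution counts since the plan was cached, which would have surfaced a call running ~48,000 times a minute very quickly:

    -- Top procedures by execution count since they were cached
    SELECT TOP (20)
           DB_NAME(ps.database_id) AS database_name,
           OBJECT_NAME(ps.object_id, ps.database_id) AS procedure_name,
           ps.execution_count,
           ps.execution_count / NULLIF(DATEDIFF(MINUTE, ps.cached_time, SYSDATETIME()), 0) AS executions_per_minute,
           ps.total_elapsed_time / ps.execution_count / 1000.0 AS avg_elapsed_ms
    FROM sys.dm_exec_procedure_stats AS ps
    ORDER BY ps.execution_count DESC;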

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru - Tuesday, January 23, 2018 8:49 AM

    Thanks for the response...I would like to not sound so dumb in my initial post...I kind of painted myself that way.

    We did troubleshoot initially, using pretty standard DMV queries, which didn't point to an obvious sign given the symptoms provided by the app owners (works great for A but not for B, and A and B do the same thing), and we got stumped on that wait type, which I couldn't find any information on. In regard to the Profiler use, I understand your point; I was using a very narrow/minimalist approach...which wasn't written to a trace file initially but was dumped to a table for review so I could aggregate the results and look for clues.
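
    (The aggregation over the dumped table was nothing fancy; roughly the following, where the table and column names are placeholders for whatever the trace-to-table dump produced.)

    -- dbo.NarrowTrace and its Error column are placeholder names for the dumped trace table
    SELECT t.Error, COUNT(*) AS occurrences
    FROM dbo.NarrowTrace AS t
    GROUP BY t.Error
    ORDER BY occurrences DESC;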

    I will never claim to be as savvy as most who are on this board; I can generally get through most things either by researching the issue or just knowing how to approach a given situation...

    I was stumped by that wait type and by what turned out to be inaccurate information from the application owner on what we were battling.

    Literally, once we figured out that the procedure call was continually resubmitting the same data in batches and causing such a volume of errors...the batch process was halted, a simple code change to handle a check was put into place, and the batch process was restarted and finished within minutes.
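
    (Conceptually the fix was just a guard before the insert so a duplicate no longer fails the whole batch; the table, column, and parameter names below are made up for illustration since I can't share the actual procedure, and a real fix would also handle the race between the check and the insert.)

    -- Hypothetical illustration of the "check first" change
    IF NOT EXISTS (SELECT 1
                   FROM dbo.DeviceMetrics
                   WHERE DeviceId = @DeviceId AND MetricTime = @MetricTime)
    BEGIN
        INSERT dbo.DeviceMetrics (DeviceId, MetricTime, MetricValue)
        VALUES (@DeviceId, @MetricTime, @MetricValue);
    END;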

    Totally stupid, but an easy fix.

    Thanks again for taking the time to respond  with your comments!

    Lee

  • I ignore what admins/devs tell me when digging into an outage scenario like this. The data I collect will do all the talking necessary almost without exception. And if you know what you are doing it won't lie/obfuscate/CYA/etc either.  😎 

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
