August 31, 2017 at 8:55 am
We have a SQL 2008 Enterprise Edition server with 32 GB of memory and 16 CPUs.
During quiet times of the day, queries will complete in 1 - 2 seconds. I have optimized these queries and added proper indexing (they used to run much longer).
But as soon as it gets busy and more clients start using the system, queries run exceptionally long. I can understand that queries will run a little longer when it gets busy, but in our case a 1 - 2 second stored proc is now taking 20 - 30 seconds. The CPU usage never really spikes and averages 30 - 40%.
SQL usually reports PAGEIOLATCH waits during this time - am I correct that this is IO contention?
Could it be because there isn't enough memory and SQL has to get data from the disk?
Is adding more RAM the only solution?
August 31, 2017 at 9:54 am
Casper101 - Thursday, August 31, 2017 8:55 AM
1) It always boggles my mind when clients pony up for Enterprise Edition and then hobble it with paltry amounts of RAM.
2) You have an EXTREMELY common scenario, and one that I see ALL the time: not enough RAM coupled with poor IO. Low RAM means more data must be pulled from disk, and when you get busy that overwhelms the IO subsystem. And guess why CPU is low - the queries are all waiting for data to process!!
3) Good on you for recognizing that the above is the cause of your issues!! 🙂
4) More RAM is just mandatory, and is BY FAR the cheapest solution. But it may not be enough. What, EXACTLY, constitutes your IO for the server and how much active data do you have?
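Both point 2 and the IO question in point 4 can be answered with numbers rather than guesses. A minimal sketch, using only the documented sys.dm_os_wait_stats DMV (available on SQL 2008), that quantifies the PAGEIOLATCH waits mentioned in the original post - the counters are cumulative since the last restart, so compare a quiet-time sample against a busy-time sample:
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms   -- rough average ms per wait
FROM   sys.dm_os_wait_stats
WHERE  wait_type LIKE 'PAGEIOLATCH%'
ORDER  BY wait_time_ms DESC;
If avg_wait_ms climbs sharply during the busy window, that lines up with the IO-starvation picture described above.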
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
September 1, 2017 at 12:56 am
Hi,
Thank you for your feedback, and for confirming my suspicions....
I inherited this server so the setup is not great:
* I fixed the maintenance processes with regards to indexing and stats, so I can now rule out fragmentation and outdated stats as the cause of my problems.
* There is just one drive which holds all the mdf and ldf files as well as the tempdb files.
Even though this is a VM server with SAN storage, I do believe that splitting all these onto their own respective drives will also help. Do you agree?
* Looking at Resource Monitor - This disk consistently has a high Queue length with Response Time averaging 600 ms, going up to 1400 ms when things get busy.
* Looking at Perfmon - the Avg disk sec/read and Avg disk sec/write is about 50 ms - and this is when the system is not busy.
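For what it's worth, the same latency picture can be read from inside SQL Server, which removes any doubt about which files are suffering. A minimal sketch using sys.dm_io_virtual_file_stats (figures are cumulative since startup, so sample twice and diff if you want a rate):
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN   sys.master_files AS mf
  ON   mf.database_id = vfs.database_id
 AND   mf.file_id     = vfs.file_id
ORDER  BY avg_read_ms DESC;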
So the fix I believe is:
More RAM
Separating the files to different drives.
And creating 7 more tempdb data files - there currently is just the 1 mdf and 1 ldf.
Do you agree?
September 1, 2017 at 1:04 am
Casper101 - Friday, September 1, 2017 12:56 AM
I would agree with that, but I would also add that tempdb needs to be on its own drive.
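If you do go that route, moving tempdb is just a metadata change plus a service restart. A minimal sketch - the T: drive and folder are hypothetical, and the logical names (tempdev/templog) should be confirmed against sys.master_files first:
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\TempDB\templog.ldf');
-- The files are created at the new location the next time the SQL Server service restarts.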
Thanks
September 1, 2017 at 10:40 am
Casper101 - Friday, September 1, 2017 12:56 AM
A) fragmentation is rarely the cause of significant performance problems
B) With shared storage splitting out data/log/tempdb files onto different "disks" is most often a red herring and provides no benefits. It is only worth pursuing when the storage is explicitly optimized for your SQL Server IO needs - i.e. just about never.
C) That IO is abysmal for a SQL Express installation. It is laughably horrific (and also truly sad) for an Enterprise Edition SQL Server.
D) 7 more tempdb files will almost certainly be a waste. VERY rarely is tempdb allocation a demonstrable problem. If you haven't observed it, do not bother with more files. Too many files on too few spindles (or too little IO bandwidth) will CREATE IO problems, not alleviate them.
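For completeness, here's a minimal sketch of one way to check whether tempdb allocation contention is actually happening before adding files. It looks for PAGELATCH (not PAGEIOLATCH) waits on tempdb pages while the system is busy; the PFS/GAM/SGAM page numbers in the comment are the classic ones to watch for:
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description     -- e.g. 2:1:1 (PFS), 2:1:2 (GAM), 2:1:3 (SGAM)
FROM   sys.dm_os_waiting_tasks AS wt
WHERE  wt.wait_type LIKE 'PAGELATCH%'
  AND  wt.resource_description LIKE '2:%';   -- database_id 2 = tempdb
If that rarely returns rows during the busy window, more tempdb files will not buy you anything.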
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
September 4, 2017 at 9:22 am
Casper101 - Friday, September 1, 2017 12:56 AM
To echo what Kevin stated and to add some emphasis on my part, defragmenting indexes, especially non-temporal indexes, is a waste of time and will cause performance problems right after your defrag unless you pay very close attention to the FILL FACTOR. The culprit will be massive page splits in your tightly packed indexes, which take a lot of extra time and will cause blocking until the "Natural Fill Factor" (my term for it) has created sufficient extra space throughout the index to not require a page split during any given insert. My testing also shows that it's actually bad for memory. A fragmented index will typically end up with a "Natural Fill Factor" of about 80% full pages over time. If you do a defrag, page splits become necessary for just about every insert until there's enough room to prevent further page splits. Initially, and by the very nature of a page split, you end up with a "Natural Fill Factor" of only 50%. It's kind of like shrinking a database... it's just going to grow again, and fragmentation will be painful and rampant.
My recommendation is to simply stop rebuilding/reorganizing your indexes. And, just to let you know that this isn't just some wild speculation on my part, I haven't rebuilt/reorganized the indexes on any of my servers since 17 Jan 2016. Performance got markedly better over the first 3 months and has not degraded since then. The massive blocking that we used to get the day after index maintenance has simply vanished because we don't do index maintenance any more.
And, no, we've not wasted any disk space in the process. The pages that were originally split on day 1 have all filled in quite nicely since then. Since the pages all fill at different rates, full pages don't occur as often as after a defrag, which (unless you have prescribed a Fill Factor) causes virtually all pages to split on first insert; that breathing room establishes itself naturally over time without index maintenance.
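If you want to see what your indexes' "Natural Fill Factor" actually looks like, here's a minimal sketch using sys.dm_db_index_physical_stats in SAMPLED mode (LIMITED mode does not populate the page-density column); the 1000-page cutoff is just an arbitrary filter to ignore tiny indexes:
SELECT OBJECT_NAME(ips.object_id)           AS table_name,
       i.name                               AS index_name,
       ips.avg_page_space_used_in_percent,  -- how full the leaf pages really are
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
JOIN   sys.indexes AS i
  ON   i.object_id = ips.object_id
 AND   i.index_id  = ips.index_id
WHERE  ips.index_level = 0          -- leaf level only
  AND  ips.page_count  > 1000       -- ignore tiny indexes
ORDER  BY ips.avg_page_space_used_in_percent;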
What about contiguous data for batch runs? If you're running on a SAN, there is no such thing. Even if there were, any given process doesn't have exclusive use of the read/write heads on the disk. It's always jumping all over hell's little half acre servicing requests by other processes. 'Tis the nature of the beast.
As for RAM, I agree... depending on the operating system, you can't have too much for the Enterprise Edition of SQL Server.
As to it taking 1-2 seconds per query even on your currently RAM-crippled system, that's bad... real bad, even for a RAM-crippled system. Putting TempDB on its own drive probably won't help unless your SAN admin can guarantee that TempDB (and the same holds true for the MDF/LDF files) for that one database will be limited to its own private spindles, and he won't do that because of today's drive sizes. The SAN admin isn't going to give you even one dedicated spindle because drive sizes are typically measured in the hundreds of gigabytes, where you'll typically only need something less than 20GB (if the code is good). Although you say you've optimized code with indexes and the like, there's still something wrong, because properly optimized code shouldn't need a whole lot of TempDB.
My production system has a 1.3TB database, a 900GB database, and a couple of dozen databases near 100GB... I have TempDB set up as eight 2GB files and they haven't grown one iota in 5 years. In that time, the two large databases started out at about 65GB each and have grown to their current size with no growth of TempDB.
As for the long queue lengths, that's probably because of bad code that has contention, inefficient resource usage, and long run durations. As for the Avg disk sec/read and Avg disk sec/write being 50ms, that seems quite long. You might want to have the SAN admin run a throughput test on the SAN to first eliminate that as a problem, and then do similar from the server to make sure that you're not having communication problems between the server and the SAN. Please see the following articles for more information on these two counters, because they are indicating a bit of a problem.
https://www.mssqltips.com/sqlservertip/2460/perfmon-counters-to-identify-sql-server-disk-bottlenecks/
https://blogs.technet.microsoft.com/askcore/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon/
To summarize the recommendations of the second article, which was written by someone from the "Microsoft Enterprise Platforms Support" group, you'll find the following statement in the "Conclusion" of the article...
To keep the queue under control you have to tune your applications to limit the maximum number of outstanding I/O operations they generate.
If you can prove that the SAN doesn't have a problem (and it usually won't, but don't just assume that) and you can prove that the pipe between the server and the SAN is operating at up-to-snuff speeds instead of at USB 2.0 speeds 😉 (simple things like not running at FULL DUPLEX and having "auto-negotiate" turned on are the usual culprits, along with bad cables/connections or bad switches/NICs/settings), then there's only one thing left, and that's inefficient code.
As for the code, remember that there are two main parts to what a stored procedure or code from the ORM does... yes, there's actual run time, but there's also "compile" time. At work, we had code that would run in 100ms... but it took 2 to 22 seconds to compile, and it had to compile every time because the code was so poorly "written" by the ORM. Most people never look into it because they aren't even aware that such a thing can happen. They just set "Optimize for Ad Hoc Workloads" to "ON" and never think about it again, even after recompiles become the main culprit.
Here's a proc to put into your "DBA" database to find recompile problems. Details are in the comments in the proc.
CREATE PROCEDURE dbo.ShowCompile
/**********************************************************************************************************************
Purpose:
Return the longest code compile times and related information from cache.
-----------------------------------------------------------------------------------------------------------------------
Programmers Notes:
1. The "ShowLongString" function must be available in the current database.
-----------------------------------------------------------------------------------------------------------------------
Usage:
--===== Default usage returns 1000 rows. Will take 10 minutes depending on the number of items in cache.
EXEC dbo.ShowCompile
;
--===== Include the number of rows to return. Still takes a fairly long time to return even with just 10 rows.
EXEC dbo.ShowCompile 10
;
-----------------------------------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 16 Jul 2012 - Jonathan Kehayias - Initial Publication
- Ref: https://www.sqlskills.com/blogs/jonathan/identifying-high-compile-time-statements-from-the-plan-cache/
Rev 01 - 18 Jun 2016 - Jeff Moden
- Modification to make the returned code easier to read and have more than 8k characters visible.
**********************************************************************************************************************/
--===== Define the I/O for this proc
@pRows INT = 1000
AS
--===== Enable dirty reads.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
;
--===== Read the compile times and other information from cache.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (@pRows)
CompileTime_ms,
CompileCPU_ms,
CompileMemory_KB,
qs.execution_count,
qs.total_elapsed_time/1000 AS duration_ms,
qs.total_worker_time/1000 AS cputime_ms,
(qs.total_elapsed_time/qs.execution_count)/1000 AS avg_duration_ms,
(qs.total_worker_time/qs.execution_count)/1000 AS avg_cputime_ms,
qs.max_elapsed_time/1000 AS max_duration_ms,
qs.max_worker_time/1000 AS max_cputime_ms,
(--==== Returns the query snippet as clickable/format-preserved code (Rev 01)
SELECT LongString
FROM dbo.ShowLongString(SUBSTRING(st.text,
(qs.statement_start_offset / 2) + 1,
(CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2 + 1))) AS StmtText,
query_hash,
query_plan_hash
FROM (--==== Gets the compile time information
SELECT c.value('xs:hexBinary(substring((@QueryHash)[1],3))', 'varbinary(max)') AS QueryHash,
c.value('xs:hexBinary(substring((@QueryPlanHash)[1],3))', 'varbinary(max)') AS QueryPlanHash,
c.value('(QueryPlan/@CompileTime)[1]', 'int') AS CompileTime_ms,
c.value('(QueryPlan/@CompileCPU)[1]', 'int') AS CompileCPU_ms,
c.value('(QueryPlan/@CompileMemory)[1]', 'int') AS CompileMemory_KB,
qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY qp.query_plan.nodes('ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS n(c)
) AS tab
JOIN sys.dm_exec_query_stats AS qs
ON tab.QueryHash = qs.query_hash
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY CompileTime_ms DESC
OPTION (RECOMPILE)
;
There is a dependency on a function called "dbo.ShowLongString" in that code. Here's that function, which must be in the same database as the code above.
CREATE FUNCTION dbo.ShowLongString
/**********************************************************************************************************************
Purpose:
Display a string of more than 8000 characters. The string can be Dynamic SQL, XML, or just about anything else.
Usage:
--===== Example with Dynamic SQL
DECLARE @sql VARCHAR(MAX);
SELECT @sql = '
SELECT somecolumnlist
FROM some table with joins
;'
;
SELECT LongString
FROM dbo.ShowLongString(@SQL)
;
--===== Example with a call to a table or view
SELECT sm.Object_ID, Definition = ls.LongString
FROM sys.SQL_Modules sm
CROSS APPLY dbo.ShowLongString(sm.Definition) ls
;
Revision History:
Rev 00 - 20 Sep 2013 - Jeff Moden - Initial creation and test.
- Hats off to Orlando Colamatteo for showing me the trick for this.
**********************************************************************************************************************/
--===== Declare the I/O for this function
(@pLongString VARCHAR(MAX))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
SELECT LongString =
(
SELECT REPLACE(
CAST(
'--' + CHAR(10) + @pLongString + CHAR(10)
AS VARCHAR(MAX))
,CHAR(0),'') --CHAR(0) (Null) cannot be converted to XML.
AS [processing-instruction(LongString)]
FOR XML PATH(''), TYPE
)
;
To summarize it all, you need to add some RAM. I recommend at least 128GB for the 16 processors you have. Leave at least 8GB for the operating system. You need to check SAN performance and the performance of the pipe. Last but probably most important, you need to fix some inefficient code because 1-2 seconds (especially for front-end procs) is terrible. There is no magic bullet for such code. You need to fix the code.
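On the RAM point, whatever amount you end up with, make sure 'max server memory' is actually capped so the OS keeps its share. A minimal sketch - the 122880 MB figure is just an example based on the 128GB / leave-8GB-for-the-OS numbers above, not a one-size-fits-all setting:
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max server memory (MB)', 122880;   -- 128GB box minus 8GB for the OS
RECONFIGURE;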
--Jeff Moden
Change is inevitable... Change for the better is not.
September 4, 2017 at 9:27 am
Ah... almost forgot. Check your connection strings. If they enable MARS (Multiple Active Result Sets), turn that off NOW! It doesn't do what a lot of people think and should only be used in special circumstances. In the presence of "poorly performing" code, it will become your worst nightmare like it did for us. It caused the queries of 19 of 32 CPUs on a 256GB RAM box with lightning fast I/O times to go into a KILLED/ROLLBACK for 10 to 30 minutes at a time. Up to 400 sessions were blocked during these incursions and it caused overall CPU usage to quickly ramp up to > 80%... remember... that was across all 32 CPUs. You just have to know that about 19 of them were slammed to the wall.
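If you're not sure whether anything is actually connecting with MARS, here's a minimal sketch that checks from the server side - sys.dm_exec_connections reports net_transport as 'Session' for MARS logical connections:
SELECT c.session_id,
       c.net_transport,             -- 'Session' for MARS logical connections
       c.parent_connection_id,
       s.program_name
FROM   sys.dm_exec_connections AS c
JOIN   sys.dm_exec_sessions    AS s
  ON   s.session_id = c.session_id
WHERE  c.net_transport = 'Session'
   OR  c.parent_connection_id IS NOT NULL;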
--Jeff Moden
Change is inevitable... Change for the better is not.
September 5, 2017 at 2:10 am
Thank you all for your replies and suggestions! Much appreciated!
September 5, 2017 at 6:02 am
Jeff Moden - Monday, September 4, 2017 9:27 AM
Interesting. Do you have any knowledge of why that happened? And what types of queries were the ones that went out to lunch? Any patterns in anything you can relate to us?
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
September 5, 2017 at 9:15 pm
TheSQLGuru - Tuesday, September 5, 2017 6:02 AM
I do, indeed. Here's an example of the types of queries that were being created by the ORM (legacy Linq2SQL, IIRC) for what we call the "Insurance Lines Update Screen". I didn't put it in a code window because it would drive the formatter behind the scenes absolutely bonkers. I left out the part where the variables are defined because it contained PII.
UPDATE [dbo].[InsuranceLines]
SET [Modified_Date] = @p51,
[IsIntentToCancel] = @p52,
[CsrCancelReinstateCode] = @p53,
[PolicyID] = @p54
WHERE ([line_ID] = @p0)
AND ([loan_id] = @p1)
AND ([freq_id] = @p2)
AND ([carrier_Payee_ID] = @p3)
AND ([Insurance_Type_ID] = @p4)
AND (NOT ([escrowed] = 1))
AND ([Policy_Number] = @p5)
AND ([premium_amount] = @p6)
AND ([Invoice_PremiumAmount] IS NULL)
AND ([premium_term] IS NULL)
AND ([premium_due_date] = @p7)
AND ([agency_Payee_ID] = @p8)
AND ([Comments] IS NULL)
AND ([mortgage_clause_correct] IS NULL)
AND ([Expiration_Date] = @p9)
AND ([Effective_Date] = @p10)
AND ([Cancel_Date] IS NULL)
AND ([Deductible_Amt] = @p11)
AND ([Disbursement_Amt] = @p12)
AND ([Other_Amt] = @p13)
AND (NOT ([No_Premium_Flag] = 1))
AND (NOT ([To_Disburse_Flag] = 1))
AND ([ESS_Tracking_No] IS NULL)
AND ([LastPaymentDate] IS NULL)
AND ([LastPaymentAmount] IS NULL)
AND ([Audit_Action] IS NULL)
AND ([Audit_Type] IS NULL)
AND ([Created_Date] = @p14)
AND ([Created_By] = @p15)
AND ([Modified_Date] = @p16)
AND ([Modified_By] = @p17)
AND ([carrier_payee_id_save] IS NULL)
AND ([preferred_method_of_contact] IS NULL)
AND (NOT ([RushYN] = 1))
AND (NOT ([AddStopYN] = 1))
AND ([Escrow_Balance] IS NULL)
AND ([Deductible_amount] = @p18)
AND ([Hurricane_Windstorm_deductible_amount] IS NULL)
AND ([Insurance_Last_Paid_Date] IS NULL)
AND ([Date_of_Loss] IS NULL)
AND ([Loss_Amount] IS NULL)
AND ([FDArea] IS NULL)
AND ([Late_Fees] IS NULL)
AND ([Replacement_Coverage_Flag] = @p19)
AND (NOT ([Payment] = 1))
AND ([isLPCycle_Flag] IS NULL)
AND (NOT ([Pay_AgentFlag] = 1))
AND (NOT ([wind_excluded_flag] = 1))
AND ([Wind_Deductible_Amount] IS NULL)
AND ([AdapterTransferStatusCodeIn] = @p20)
AND ([AdapterErrorCode] = @p21)
AND ([AdapterRevisionNumber] = @p22)
AND ([AdapterSourceSystemCode] IS NULL)
AND ([AdapterSourceObjectID] = @p23)
AND ([ActivityStatusCode] = @p24)
AND ([CondoAssociationName] IS NULL)
AND ([CancelReinstateReason] IS NULL)
AND ([NextDisbursementDate] IS NULL)
AND ([RCPercentage] IS NULL)
AND ([AddtoPremiumFlag] IS NULL)
AND ([GrandFatheredYN] IS NULL)
AND ([InsuranceLinesStopCode] IS NULL)
AND ([InsuranceLinesCancelReinstateCode] IS NULL)
AND ([AgentName] = @p25)
AND ([AgentPhoneNumber] = @p26)
AND ([AgentFaxNumber] = @p27)
AND ([FloodZone] IS NULL)
AND ([TotalCoverageAmount] = @p28)
AND ([UnitCountRequired] = @p29)
AND ([UnitCountInsured] IS NULL)
AND ([UnitCountReference] = @p30)
AND ([UnitCoverageAutoCalcFlag] = 1)
AND ([KeySequence] = @p31)
AND ([KeyPayeeCode] = @p32)
AND (NOT ([IsCollateralInKey] = 1))
AND (NOT ([IsBusinessKeyValid] = 1))
AND ([Coverage_Amt] = @p33)
AND ([CollateralID] = @p34)
AND ([InsuranceFormCode] = @p35)
AND ([IsMasterPolicy] = @p36)
AND (NOT ([IsGapLine] = 1))
AND (NOT ([IsLenderPlacedLine] = 1))
AND (NOT ([IsContentsCoverage] = 1))
AND (NOT ([IsImprovementsCoverage] = 1))
AND ([IsStructureCoverage] = @p37)
AND (NOT ([IsCoverageKeyValid] = 1))
AND ([Carrier_PayeeCode] IS NULL)
AND ([Agency_PayeeCode] = @p38)
AND ([RenewalExpirationDate] IS NULL)
AND ([PolicyTermMonths] = @p39)
AND ([RemitToPayeeCode] IS NULL)
AND ([RemitToPayeeID] IS NULL)
AND (NOT ([IsNewRemitToPayeeCode] = 1))
AND (NOT ([IsNewCarrierCode] = 1))
AND (NOT ([IsNewAgencyCode] = 1))
AND ([AgentEmailAddress] IS NULL)
AND ([AgentAddressLine1] = @p40)
AND ([AgentAddressLine2] IS NULL)
AND ([AgentCity] = @p41)
AND ([AgentState] = @p42)
AND ([AgentZip] = @p43)
AND ([AgentISOCountry] = @p44)
AND ([pfi_client_id] = @p45)
AND ([IsIncomeCoverage] IS NULL)
AND ([fire_excluded_flag] IS NULL)
AND ([earthquake_excluded_flag] IS NULL)
AND ([terrorism_excluded_flag] IS NULL)
AND ([liability_excluded_flag] IS NULL)
AND ([flood_excluded_flag] IS NULL)
AND ([LiabilityIncidentCoverageAmount] = @p46)
AND ([LiabilityLimitCoverageAmount] = @p47)
AND ([IsIntentToCancel] IS NULL)
AND (NOT ([IsPrimaryLineForPolicy] = 1))
AND ([Modified_TransactionID] IS NULL)
AND ([LenderPlaceStopCode] IS NULL)
AND ([LenderPlaceStopEffectiveDate] IS NULL)
AND ([InsuranceLinesLockCode] = @p48)
AND ([CsrCancelReinstateCode] IS NULL)
AND ([CsrCancelReinstateDate] IS NULL)
AND ([PolicyID] IS NULL)
AND ([EscrowLineID] IS NULL)
AND ([IsLossOfRentCoverage] IS NULL)
AND ([ClientSystemID] = @p49)
AND ([IsREOLine] IS NULL)
AND ([EscrowBindingCode] IS NULL)
AND ([AdapterUpdateFlag] IS NULL)
AND ([InvoiceNumber] IS NULL)
AND ([IsPerpetualDisbursement] IS NULL)
AND ([LPI_CoverageKeyID] IS NULL)
AND ([FormDwellingDeductiblePercentage] IS NULL)
AND ([WindDeductiblePercentage] IS NULL)
AND ([EarthquakeDeductiblePercentage] IS NULL)
AND ([MaximumReplacementPercentage] IS NULL)
AND ([MaximumReplacementAmount] IS NULL)
AND ([EarthquakeDeductibleAmount] IS NULL)
AND ([LineCorrelationKey] = @p50)
AND ([InsuranceOccupancyCode] IS NULL)
AND ([AssociationCoverageAmount] IS NULL)
AND ([PreppedNoticeActivityId] IS NULL)
SELECT
[t1].[BusinessKeyDuplicateID],
[t1].[KeyCollateralID],
[t1].[DuplicateCoverageKeyID],
[t1].[CondensedPolicyNumber],
[t1].[CompressedPolicyNumber]
FROM [dbo].[InsuranceLines] AS [t1]
WHERE ((@@ROWCOUNT) > 0)
AND ([t1].[line_ID] = @p55)
Ironically, the first line of the WHERE clause in the UPDATE contains the PK of the table being updated. This query was fired about 8,000 times per hour.
Now... no one thought it was a problem and everyone thought I was full of hooie (I found it 2 years ago) because it "only" takes about 100ms (8,000 * 100ms = 13.3 minutes per hour of one CPU) to execute. It took me a while to figure out what my gut was telling me (it kept showing up as a duration problem on sp_ShowWorst but CPU was always "fine"), but it turned out that the code is so bad and minor changes are made every time it's fired, and we all know what that means: it's guaranteed to recompile every time it's executed. It turns out that it was taking between 2 seconds and 22 seconds to recompile, with an average of 7 seconds per recompile. Multiply those numbers by 8,000 per hour and you can see why it kept several CPUs busy (it increased CPU from about 10% to 22% total CPU across 32 CPUs).
I got the compile-time analysis code from Jonathan Kehayias' post at https://www.sqlskills.com/blogs/jonathan/identifying-high-compile-time-statements-from-the-plan-cache/ (with a slight modification to make the FullText clickable and complete).
If you look at what MARS does at the following URL, it looks very useful.
https://docs.microsoft.com/en-us/sql/relational-databases/native-client/features/using-multiple-active-result-sets-mars
What most people miss, though, is the following advisory...
Note
A batch or stored procedure which starts a manual or implicit transaction when MARS is enabled must complete the transaction before the batch exits. If it does not, SQL Server rolls back all changes made by the transaction when the batch finishes. Such a transaction is managed by SQL Server as a batch-scoped transaction. This is a new type of transaction introduced in SQL Server 2005 to enable existing well-behaved stored procedures to be used when MARS is enabled. For more information about batch-scoped transactions, see Transaction Statements (Transact-SQL).
Well, it wasn't well-behaved code (compile times were insane) and it wasn't a stored procedure. It was ORM code, and it was so "tender" that if you sneezed on it, it would take just a little too long according to whatever MARS is doing and start the KILLED/ROLLBACK cascading failure. Here's what the blocking capture system and analysis stored procedure I wrote showed... 19 CPUs fully involved in KILLED/ROLLBACKs (not showing the other 520 or so connections here). All of the code was almost the same as the code I posted. Every time one KILLED/ROLLBACK completed, another would spring into place because of the number of backed-up sessions.
It blocked the hell out of a bunch of other sessions. Here's what my normally quiet PerfMon looked like during one of the incursions. It's a good thing that I keep an RDC open to the server at all times because no one else could get in to see what was going on because the server had gotten to the point where it wasn't accepting any new connections.
The heavy Black line is the total CPU usage for 32 CPUs. The Yellow line starting at 40 (which is really 400 on the "Y" scale) is for sessions. They jumped to 560 (56 on the chart), and they were all waiting because they couldn't get a word in edgewise. My normal CPU usage was something less than 10% until this problem reached the tipping point. The heavy Red line is the blocking, which wasn't as bad as I "expected". The increased number of sessions was just crazy. It was all like watching an out-of-control fire... absolutely terrifying because there was nothing I could do about it (rollbacks) and I knew the damage it was causing, but fascinating to watch 32 CPUs average more than 80% for that long. The cooling system earned its money during these incursions.
This was occurring 2 to 4 times a day at different times, and sometimes back to back, for about two weeks (we were testing a fix, not yet knowing about MARS) before we (my boss, actually... he took all the "evidence" I provided and he was the one that figured it out... good to work for such an Enterprise Architect) finally figured out that MARS was the cause and killed it in all connection strings everywhere. What was also bad was the CPU before and after. It was averaging about 22% even after we removed MARS from all the connection strings (which stopped the KILLED/ROLLBACKs, and they didn't happen again). We did, however, finally get the fix in (it took a long time to test), which required a GUI deployment because of the ORM code. The fix was to rewrite what the ORM was doing as a stored procedure using the PK to quickly look up the row to be updated and then doing the checks with a much more sane WHERE clause in the UPDATE. CPU dropped to about 7% as soon as the fix went in. The GUI Developers on our team definitely know how to "get the current date and time" even in T-SQL. 😉
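For anyone wondering what "using the PK with a much more sane WHERE clause" looks like in practice, here's a rough sketch of the shape of that fix, not the actual production proc. The column names come from the posted ORM query; the parameter data types and the single Modified_Date concurrency check are assumptions for illustration:
CREATE PROCEDURE dbo.InsuranceLines_Update
        @pLineID                  INT,          -- PK from the posted query
        @pExpectedModifiedDate    DATETIME,     -- assumed optimistic-concurrency check
        @pModifiedDate            DATETIME,
        @pIsIntentToCancel        BIT,
        @pCsrCancelReinstateCode  VARCHAR(10),  -- data type assumed
        @pPolicyID                INT
AS
SET NOCOUNT ON;

UPDATE dbo.InsuranceLines
   SET Modified_Date          = @pModifiedDate,
       IsIntentToCancel       = @pIsIntentToCancel,
       CsrCancelReinstateCode = @pCsrCancelReinstateCode,
       PolicyID               = @pPolicyID
 WHERE line_ID       = @pLineID                 -- PK seek instead of the huge column-by-column compare
   AND Modified_Date = @pExpectedModifiedDate;  -- one "has anyone else changed it?" test

IF @@ROWCOUNT = 0
    RAISERROR('Row not found or changed by another session.', 16, 1);
The plan for something this simple is trivial and stable, so it compiles once and gets reused instead of recompiling on every execution.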
The really good part for me about all of this is that people in high places insisted that the OPs group add more memory and CPUs to the dedicated box the VM for the server resides on to try to fix the problem. I was going to tell them it was a waste of time but then thought better of it. I now have 384GB of RAM and a 48-CPU fire-breathing monster to care for and feed. Don't need it all but I'll keep it. 😀 Gotta love the headroom.
A side bar to all of this is that this, once again, proves that performance is in the code (or not if you catch my drift 😉 ) Throwing hardware at either of the two problems (KILLED/ROLLBACKs and CPU Intensive code) didn't help at all (and that's been the case with all the problems our predecessors left us).
--Jeff Moden
Change is inevitable... Change for the better is not.
September 6, 2017 at 8:35 am
I would like to ask one more thing:
I am seeing a high ratio of recompilations - on average over 50% of incoming batches and of compilations per second.
How can I find the offending queries that are being recompiled?
September 6, 2017 at 11:16 am
That is an awesome response Jeff!! Thanks. Worthy of an article actually, but it (kinda) sadly wouldn't benefit many people because hardly anyone uses MARS (THANKFULLY!!).
I am quite surprised that it created such an ugly plan with a PK filter in there.
It's too bad your Enterprise Architect boss didn't design NULLs out of that schema.
And that is an incredibly silly use of @@ROWCOUNT!!!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
September 6, 2017 at 12:51 pm
TheSQLGuru - Wednesday, September 6, 2017 11:16 AM
Actually, an article on the subject would still be a good idea even without the MARS stuff. It was still taking (for example) 2 - 22 seconds to compile and it did a recompile every time it ran because of changes in the bloody monster WHERE clause. It also clearly demonstrates how bad ORM code can be. And I can't blame the Enterprise Architect for the NULLs... he wasn't the one that was large and in charge when most of this crap was designed.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 6, 2017 at 12:58 pm
Casper101 - Wednesday, September 6, 2017 8:35 AM
Use the query at the link I posted. It takes a long time to run (about 10 minutes on my system) but it's worth it.
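If you also want to watch the raw numbers behind that ratio, here's a minimal sketch against sys.dm_os_performance_counters - these counters are cumulative, so take two samples a known number of seconds apart and divide the difference to get per-second rates:
SELECT RTRIM(counter_name) AS counter_name,
       cntr_value
FROM   sys.dm_os_performance_counters
WHERE  object_name LIKE '%SQL Statistics%'
  AND  counter_name IN ('Batch Requests/sec',
                        'SQL Compilations/sec',
                        'SQL Re-Compilations/sec');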
--Jeff Moden
Change is inevitable... Change for the better is not.