September 2, 2011 at 10:21 am
I have an unpredictable performance problem.
Symptoms are Increased CPU, LATCH_EX overall.
On average CPU times increase per query 10x.
Loose correlation between re-compilations/sec and the CPU spikes.
System has 48 Cores with 128G Ram.
Typical CPU load is less than 5%, during issues will rise to 10%.
SQL Server Waits will normally hover around 2000 ms/sec, but during these issue times will exceed 30000 ms/s
During issue CPU is 50%
Breakdown SOS (Normally .5%) and Consumed even with Signal Wait (Normally 2%) equal to combined of the other two
Latch being 23% of total (normally 3.5%) and nearly all of it is LATCH_EX
All other stats are negligible.
What makes this confusing, is launching of threads on the sever (NON-SQL was also an issue) so we reduced Max memory to 96G and after it reached that value, the system settled down. Thinking we found the problem, we just left it there.
Now 3 weeks later, we had the problem again. Memory is still at 96G. For a test, reduced it to 64G and it fixed the issue.
I have started capturing historical data from sys.dm_os_performance_counters for the following counters.
Buffer cache hit ratio
Buffer cache hit ratio base
Cache Hit Ratio
Cache Hit Ratio Base
Lock Blocks
Lock Blocks Allocated
Lock Memory (KB)
Lock Owner Blocks
Lock Owner Blocks Allocated
Lock Requests/sec
Lock Timeouts (timeout > 0)/sec
Lock Timeouts/sec
Lock Wait Time (ms)
Lock waits
Lock Waits/sec
SQL Compilations/sec
SQL Re-Compilations/sec
SQL Trace IO Provider Lock Waits
Table Lock Escalations/sec
Any ideas or suggestions on where to look next.
September 2, 2011 at 10:48 am
Latch_Ex on what kind of latch?
Latches (as opposed to PageLatch and PageIOLatch) are memory-related waits. Hence need to know what kind of latchwaits you're seeing.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
September 2, 2011 at 12:41 pm
Unfortunately I am not tracking that data historically, but looking at sys.dm_os_latch_Stats
ACCESS_METHODS_DATASET_PARENT Max of 8331 average of 9
BUFFER Max of 9922 average of 0
All others of note are "Internal use only" DBCC* and NESTING*
I'll start gathering values over time. (at least for those two)
PAGELATCH and PAGEIOLATCH have total wait time of 1/1000th of LATCH
September 2, 2011 at 12:48 pm
Reset counters
DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);
GO
Setup monitor to track value ever 10 minutes... And Now off to do some reading. Just the way to start off a holiday weekend.
September 2, 2011 at 12:49 pm
hmm.... @@version?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
September 2, 2011 at 1:04 pm
Microsoft SQL Server 2008 (SP2) - 10.0.4266.0 (X64) Nov 5 2010 16:11:39 Copyright (c) 1988-2008 Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )
September 2, 2011 at 1:17 pm
I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 2, 2011 at 1:19 pm
Grant Fritchey (9/2/2011)
I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..
There seems to be an echo in here... 😉
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
September 2, 2011 at 5:04 pm
GilaMonster (9/2/2011)
Grant Fritchey (9/2/2011)
I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..There seems to be an echo in here... 😉
HA! I didn't even click through on that one. Probably should have. If I'd noticed the head of the URL I would have known.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 12, 2011 at 9:03 am
Update.
SELECT * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms desc
showed that CXPACKET was in the top 10 (actually 4th highest) waits.
So we concentrated on parallel queries.
First thing I found out is that apparently in practice breaking up a query more than 8 degree's will reduce performance, so the current rule of thumb for MAXDOP is to set the value to the number of cores in a single socket. So in our case it was 6. In addition, some people have seen a benefit in disabling Hyper threading. We are not currently considering that, but might in the future.
I found a great article on parallel queries http://sqlserverpedia.com/blog/sql-server-bloggers/find-query-plans-that-may-utilize-parallelism/
This query is great
--
-- Find query plans that may run in parallel
--
SELECT DB_NAME(p.dbid) AS DBName ,
OBJECT_NAME(p.objectid, p.dbid) AS OBJECT_NAME ,
cp.usecounts ,
p.query_plan ,
q.text ,
p.dbid ,
p.objectid ,
p.number ,
p.encrypted ,
cp.plan_handle
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) p
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS q
WHERE cp.cacheobjtype = 'Compiled Plan'
AND p.query_plan.value('declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; max(//p:RelOp/@Parallel)', 'float') > 0
ORDER BY p.dbid, p.objectID
September 12, 2011 at 9:06 am
Cxpackt at 4th place??? I wouldn't bother, AT ALL, with it atm.
Cost threshold for parallelism (CXPACKET) http://sqlblog.com/blogs/jonathan_kehayias/archive/2010/01/19/tuning-cost-threshold-of-parallelism-from-the-plan-cache.aspx
Paul White: Understanding parallelism http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/
Microsoft White Paper on waits http://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/Performance_Tuning_Waits_Queues.doc
September 12, 2011 at 9:08 am
How long's the peak at 50%?
Sounds to me like you just have a olap query taking a little more ressources the your other "normal" queries.
September 12, 2011 at 10:09 am
The event's usually last between 5-10 minutes.
However, I should have clarified the 4th number. The other three were ignorable waits. In fact, all in the top 10 except CXPACKET where ignorable.
LAZYWRITER_SLEEP 716375247
DISPATCHER_QUEUE_SEMAPHORE 188715613
XE_DISPATCHER_WAIT 182711826
CXPACKET 179763111
BROKER_TASK_STOP 179128691
We found a table that was in need of indexing. Our best guess at this point, is we went into parallel paralysis, when a certain class of reports were run (which depended heavily on this table).
We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.
At
September 12, 2011 at 10:14 am
Bob Fazio (9/12/2011)
The event's usually last between 5-10 minutes.However, I should have clarified the 4th number. The other three were ignorable waits. In fact, all in the top 10 except CXPACKET where ignorable.
LAZYWRITER_SLEEP 716375247
DISPATCHER_QUEUE_SEMAPHORE 188715613
XE_DISPATCHER_WAIT 182711826
CXPACKET 179763111
BROKER_TASK_STOP 179128691
We found a table that was in need of indexing. Our best guess at this point, is we went into parallel paralysis, when a certain class of reports were run (which depended heavily on this table).
We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.
At
I'm really no expert with that big of a server, but good sense tells me that there probably should be a maxdop setting at server level (maybe 8, 12 or 16).
That way a rogue query shouldn't kill the whole server.
That being said, what does this return?
/*
Cost threshold for parallelism (CXPACKET) http://sqlblog.com/blogs/jonathan_kehayias/archive/2010/01/19/tuning-cost-threshold-of-parallelism-from-the-plan-cache.aspx
Paul White: Understanding parallelism http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/
Microsoft White Paper on waits http://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/Performance_Tuning_Waits_Queues.doc
Next query by Paul Randal http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx
*/
WITH Waits AS
(SELECT
wait_type,
wait_time_ms / 1000.0 AS WaitS,
(wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
signal_wait_time_ms / 1000.0 AS SignalS,
waiting_tasks_count AS WaitCount,
100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')
)
SELECT
W1.wait_type AS WaitType,
CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
CAST (W1.WaitS * 1000 / W1.WaitCount AS DECIMAL(14, 2)) AS AvgWait_MS,
CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
W1.WaitCount AS WaitCount,
CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2
ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 97 -- percentage threshold
AND W1.WaitCount > 0;
GO
September 12, 2011 at 10:23 am
WaitTypeWait_SAvgWait_MSResource_SSignal_SWaitCountPercentage
CXPACKET25955.363.8222734.463220.90678771938.08
OLEDB17137.440.1417137.440.0012570371925.14
PAGEIOLATCH_SH8344.7426.258298.3446.4031791312.24
SOS_SCHEDULER_YIELD4127.010.1851.294075.72223269846.05
BACKUPIO2285.140.822273.7311.4127762383.35
WRITELOG1565.000.741444.10120.9021012872.30
LATCH_EX1354.820.161092.97261.8583067901.99
BACKUPTHREAD1235.28135.541235.010.2791141.81
ASYNC_IO_COMPLETION1017.534086.451017.520.012491.49
BACKUPBUFFER980.150.16900.9279.2263068881.44
ASYNC_NETWORK_IO936.730.30899.3837.3631182731.37
SQLTRACE_LOCK487.045.49465.6721.37887360.71
PREEMPTIVE_OS_AUTHENTICATIONOPS389.930.17389.930.0022847360.57
SLEEP_BPOOL_FLUSH328.392.93324.913.481119840.48
Viewing 15 posts - 1 through 15 (of 33 total)
You must be logged in to reply to this topic. Login to reply