July 26, 2011 at 11:00 am
Hi All,
I was wondering what you would advice the value for max degree of parallesim on the server level on an OLTP server with 24 logical CPUs and 256 GB RAM. It is a pretty busy server. Averaging 13k batch request per second.
Non NUMA architecture.
-Roy
July 26, 2011 at 11:59 am
On a personal level, I'd probably start at serverwide MAXDOP 8 with a cost setting of ~10-15 and then monitor the heck out of it looking for the pain points. I assume you're moving an existing server/instance onto the new hardware. What did the old machine look like?
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
July 26, 2011 at 12:15 pm
Roy Ernest (7/26/2011)
I was wondering what you would advice the value for max degree of parallesim on the server level on an OLTP server with 24 logical CPUs and 256 GB RAM. It is a pretty busy server. Averaging 13k batch request per second. Non NUMA architecture.
If it is *truly* an OLTP workload - primarily nested loops joins, very few sorts, single-row lookups, optimal indexes, you might choose server DOP 1. This assumes you have enough query traffic to keep 24 workers busy (i.e. RUNNING or RUNNABLE) most of the time. Any odd queries that would benefit from parallelism, during quieter times, could use an explicit MAXDOP > 1 query hint.
It's tough to give blanket recommendations though, as so much depends on your particular workload. Other people are happy to set server MAXDOP to 2, 4, or even higher, perhaps along with an elevated cost threshold. 35-50 is a popular number, but not necessarily with me 🙂
July 26, 2011 at 12:29 pm
This is on an existing Hardware. When we moved into this server, we had two queries that was having parallel plan. I gave hint for just those two. During the past 2 years I started noticing the CXPACKET wait stats going up. The DBA usually review all objects that are released to this server. (We do check for missing indexes, race condition and even object qualifiers). Somehow I missed some SPs that had parallelism and now I see a higher wait on CXPACKET. I was able to identify all queries that have parallelism. It is just a handful. I can explicitly specify MAXDOP for these. But I am not sure that is the way to go.
The max CPU usage is around 40% when it is really busy. usually it stays around 15%.
-Roy
July 26, 2011 at 12:48 pm
Let me point out something very important... I've never tuned a server with anywhere near that much traffic.
That being said here are a few ideas :
It's a little script I wrote to find out if the current cost threshold might be too low (causing extra cpackets waits). While we're at it, what does the cpacket waits look like (http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx)?
SET IMPLICIT_TRANSACTIONS OFF
GO
CREATE TABLE #configs
(
name nvarchar(35),
minimum INT,
maximum INT,
config_value INT,
run_value INT
)
GO
EXEC sp_configure 'show advanced options', 1;
GO
reconfigure;
GO
INSERT INTO #configs (name, minimum, maximum, config_value, run_value)
EXEC sp_configure 'cost threshold for parallelism'
GO
EXEC sp_configure 'show advanced options', 0;
GO
reconfigure;
GO
SELECT
optz.counter
, optz.occurrence
, CONVERT(DECIMAL(18,2), optz.value) AS AvgValue
, conf.name
, conf.config_value
, conf.run_value
, Uptime.DaysUptime AS [Days Server UPTIME & Last Stats Reset]
, CASE WHEN Uptime.DaysUptime < 45 THEN 'You may not have very meaningful stats because of a recent restart' ELSE NULL END AS [Stats Warning]
, CASE WHEN optz.value > conf.config_value THEN 'Cost threshold for parallelism might be too low' ELSE NULL END AS [Cost Threshold Warning]
, CASE WHEN conf.run_value <> conf.config_value THEN 'Server needs to be restarted for the setting to take effect' ELSE NULL END [Restart Warning]
FROM
sys.dm_exec_query_optimizer_info optz
CROSS JOIN #configs conf
OUTER APPLY(
SELECT
CONVERT(DECIMAL(18 , 2) , DATEDIFF(hh , create_date , GETDATE()) / 24.0) AS DaysUptime
FROM
sys.databases
WHERE
name = 'tempdb'
) Uptime
WHERE
optz.counter = 'final cost'
GO
DROP TABLE #configs
I've also just toyed for about 1 hour to tweak this script. It finds the plan that require parallelism to run. Using the temp table you can get weighted averages. It goes without saying that this is a really heavy query to run. It takes over 1 minute to process 10K plans on our little prod server (2 xeon cpus, 4 Gb ram).
This version will get you the AVG weigthed estimated cost of the queries with parallelism. You can compare that with the previous one which is server wide to make a really good guess for the "correct" cost threshold to set on the server.
A final note, as much as I tried I don't have any plans with any info in DegreeOfParallelism. I'm 99.9% certain my code is correct but that's not 100%.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
WITH XMLNAMESPACES
(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
query_plan AS CompleteQueryPlan,
n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS StatementText,
n.value('(@StatementOptmLevel)[1]', 'VARCHAR(25)') AS StatementOptimizationLevel,
n.value('(@StatementOptmEarlyAbortReason)[1]', 'VARCHAR(25)') AS StatementOptmEarlyAbortReason,
n.value('(@StatementSubTreeCost)[1]', 'DECIMAL(18, 6)') AS StatementSubTreeCost,
n.value('(QueryPlan/@CompileCPU)[1]', 'DECIMAL(18, 0)') AS CompileCPU,
n.value('(QueryPlan/@CompileTime)[1]', 'DECIMAL(18, 0)') AS CompileTime,
n.value('(QueryPlan/@CompileMemory)[1]', 'DECIMAL(18, 0)') AS CompileMemory,
n.value('(QueryPlan/@DegreeOfParallelism)[1]', 'DECIMAL(18, 0)') AS DegreeOfParallelism,
n.query('.').exist('//MissingIndexes') AS HasMissingIndexes,
n.query('.') AS ParallelSubTreeXML,
ecp.usecounts,
ecp.size_in_bytes
INTO #X
FROM sys.dm_exec_cached_plans AS ecp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS eqp
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS qn(n)
WHERE n.query('.').exist('//RelOp[@PhysicalOp="Parallelism"]') = 1
SELECT * FROM #X
SELECT SUM(usecounts * DegreeOfParallelism) / SUM(usecounts) AS Weigthed_Avg_DOP,
SUM(usecounts * StatementSubTreeCost) / SUM(usecounts) AS Weigthed_Avg_SubTreeCost
FROM #X
July 26, 2011 at 1:28 pm
Maybe check out the replies by Grant on another thread can get you started:
http://www.sqlservercentral.com/Forums/FindPost1123444.aspx
I did Grant wrong with this mentioning to read his reply :blush:
I'm sorry Grant :crying:
I should have mentioned "replies" ... I corrected it.
Priceless info Grant.:smooooth: Thank you once more.
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
July 26, 2011 at 1:32 pm
ALZDBA (7/26/2011)
Maybe check out the reply by Grant on another thread can get you started:
Yup, I "gut" my script inspiration after that thread. Very nice info.
July 26, 2011 at 1:35 pm
Your second query is still running. But your first query here is the output.
counterfinal cost
occurrence 3189815
AvgValue2.77
namecost threshold for parallelism
config_value5
run_value5
Days Server UPTIME & Last Stats Reset167.50
Stats WarningNULL
Cost Threshold WarningNULL
Restart Warning NULL
And here is the top 10 Wait type stats.
WAIT TYPE PERCENTAGE
CXPACKET 57.72
LATCH_EX 11.88
CMEMTHREAD 4.74
SOS_SCHEDULER_YIELD4.71
ASYNC_NETWORK_IO1172668.813.6782.72
MSQL_XP916031.632.86
PREEMPTIVE_OS_GETPROCADDRESS2.86
OLEDB2.02
TRACEWRITE2.01
BACKUPBUFFER1.19
-Roy
July 26, 2011 at 1:39 pm
See what happens when you rise the cost threshold for parallisme and be surprised, like I was.
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
July 26, 2011 at 1:43 pm
Roy Ernest (7/26/2011)
Your second query is still running. But your first query here is the output.counterfinal cost
occurrence 3189815
AvgValue2.77
namecost threshold for parallelism
config_value5
run_value5
Days Server UPTIME & Last Stats Reset167.50
Stats WarningNULL
Cost Threshold WarningNULL
Restart Warning NULL
And here is the top 10 Wait type stats.
WAIT TYPE PERCENTAGE
CXPACKET 57.72
LATCH_EX 11.88
CMEMTHREAD 4.74
SOS_SCHEDULER_YIELD4.71
ASYNC_NETWORK_IO1172668.813.6782.72
MSQL_XP916031.632.86
PREEMPTIVE_OS_GETPROCADDRESS2.86
OLEDB2.02
TRACEWRITE2.01
BACKUPBUFFER1.19
What's the avg wait time in ms for cxpackets?
BTW how come you got 57% for cxpackets and 82.72% for async_network_IO???? That doesn't seem to add up really well!
Here's the modified version I use to include the avgs :
WITH Waits AS
(SELECT
wait_type,
wait_time_ms / 1000.0 AS WaitS,
(wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
signal_wait_time_ms / 1000.0 AS SignalS,
waiting_tasks_count AS WaitCount,
100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')
)
SELECT
W1.wait_type AS WaitType,
CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
CAST (W1.WaitS * 1000 / W1.WaitCount AS DECIMAL(14, 2)) AS AvgWait_MS,
CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
W1.WaitCount AS WaitCount,
CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2
ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 97 -- percentage threshold
AND W1.WaitCount > 0;
GO
July 26, 2011 at 1:44 pm
ALZDBA (7/26/2011)
See what happens when you rise the cost threshold for parallisme and be surprised, like I was.
Good or bad surprised? What and how did you trace the changes?
July 26, 2011 at 1:56 pm
I thought I edited it properly. But I guess not.
wait_typewait_time_spct
CXPACKET18468243.6357.72
LATCH_EX3802397.9311.88
CMEMTHREAD1516464.844.74
SOS_SCHEDULER_YIELD1508456.114.71
ASYNC_NETWORK_IO1172668.813.67
MSQL_XP916031.632.86
PREEMPTIVE_OS_GETPROCADDRESS915633.182.86
OLEDB646687.822.02
TRACEWRITE641578.032.01
BACKUPBUFFER381806.381.19
Your Second query ran in 19 min.
Weigthed_Avg_DOPWeigthed_Avg_SubTreeCost
NULL17.060217
Total of 117 rows. Two out of the 117 has timeout as the reason for StatementOptmEarlyAbortReason. There are quite a few that has missing indexes. DegreeOfParallelism column is NULL. But the missing indexes SPs I have to look at and see why it got missed at the time of review.
EDIT (how come you got 57% for cxpackets and 82.72% for async_network_IO) The 82.72% was actually the Running_pct column. The query I use to get the WAIT was done by Glenn Berry. One of his 5 Super DMV queries.
-Roy
July 26, 2011 at 2:04 pm
Hey Roy,
I don't think there's much to choose here between DOP 1 and query hints on the very few queries that would benefit, versus DOP 4 (say) and a higher cost threshold. The effect is likely to be very similar, from what you've told us. If I could see your system and queries for a few hours, I might sway one way or the other, but as it is I say 'meh'. If you have a number of queries that would meet a much higher cost threshold, then by all means go the DOP + cost route. If you have a stable system now where almost all queries are simple and run acceptably fast, I might not take the (small) risk. Only you can really make this decision, though as I say, I don't think there's too much in it either way.
I do want to respond to the comments about CXPACKET though. Very few people understand what CXPACKET truly means (and I do not number storage engine experts among them). For almost all intents and purposes, CXPACKET should be ignored. It is a consequence of running a parallel query, nothing more. People do tend to see CXPACKET 'waits' and assume they mean something similar to PAGEIO_LATCH* or whatever - but they really really don't. Yes, if you run fewer parallel queries, and/or fewer parallel regions, you will see CXPACKET numbers decline. No surprises there. It does not mean your server has become more efficient, or that response time has improved, or much of anything else.
July 26, 2011 at 2:08 pm
Edit : Saw Paul's answer after I posted. I'd deffer to his experience rather than my very limited knowledge with this.
I'm sure you're on top of this but stale stats will increase cxpackets as well.
From what you've sent so far I probabely wouldn't go to 30-50 for the cost threshold, I'd actually try something much lower like 10-15 so to NOT penalize the queries that are using it ATM.
You seem to have either very very low cost queries or relatively cheap "big" queries.
That's why I wanted you to see the real average and then the average of bigger queries. Of course if you use recompile hints you're not going to see those in the cache. @ 13K queries / sec, saving the actual plan with stats is obviously out of the question :w00t:.
The other question is also how long do the plans stay in the cache... that would be another good info to have in those queries.
That bugs me that you don't see anything for the DegreeOfParallelism either. Looks like there's something I don't understand in there (either the caching or the XML query). :unsure:
July 26, 2011 at 2:18 pm
SQLkiwi (7/26/2011)
I do want to respond to the comments about CXPACKET though. Very few people understand what CXPACKET truly means (and I do not number storage engine experts among them). For almost all intents and purposes, CXPACKET should be ignored. It is a consequence of running a parallel query, nothing more. People do tend to see CXPACKET 'waits' and assume they mean something similar to PAGEIO_LATCH* or whatever - but they really really don't. Yes, if you run fewer parallel queries, and/or fewer parallel regions, you will see CXPACKET numbers decline. No surprises there. It does not mean your server has become more efficient, or that response time has improved, or much of anything else.
I'm glad you clarified this Paul. You'd made the comment before on another thread and I had become concerned I'd missed something hugely important. I'd been meaning to do some research but you just eased my mind.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
Viewing 15 posts - 1 through 15 (of 30 total)
You must be logged in to reply to this topic. Login to reply