September 12, 2011 at 6:38 am
I'm looking for some expert DBA guidance for tracking down some performance issues (slow running queries) in our data warehouse environment.
The main fact table has 15 million rows, so it is not a particularly large database at all. Users are seeing poor performance though, so I've taken it upon myself to investigate the causes. I have attached below some Perfmon counters captured over a two-week period.
From the results, what areas would you experienced operators investigate first? To me it looks as if the CPU is not a bottleneck. Likewise, I feel that memory is not necessarily a problem, as there is very little paging despite the high figures for Pages/sec etc.
The disk subsystem seems to be operating OK given the read and write times too. I'm therefore pointed towards database design (table structures, indexing etc.) as the most likely cause, but I do not have sufficient experience to have confidence in that assertion.
Any comments would be greatly appreciated!
Many thanks for reading.
Chris
SQL Server: Access Methods - Full Scans: 3.879
SQL Server: Access Methods - Index Searches: 1610.422
SQL Server: Access Methods - Page Splits/sec: 1.815
SQL Server: Access Methods - Workfiles Created: 2.695
SQL Server: Access Methods - Worktables Created: 0.338
SQL Server: Access Methods - Table Lock Escalations/sec: 0.001
SQL Server: Buffer Manager - Free List Stalls/sec: 0.306
SQL Server: Buffer Manager - Lazy Writes/sec: 1.276
SQL Server: Buffer Manager - Checkpoint Pages/sec: 80.322
SQL Server: Buffer Manager - Page Life Expectancy: 2194.757
SQL Server: Buffer Manager - Page Lookups/sec: 20840.106
SQL Server: Buffer Manager - Page Reads/sec: 3228.451
SQL Server: Buffer Manager - Page Writes/sec: 129.086
SQL Server: Buffer Manager - Readahead/sec: 3066.469
SQL Server: SQL Statistics - Batch Requests/sec: 71.32
SQL Server: SQL Statistics - SQL Attention Rate/sec: 0.001
SQL Server: SQL Statistics - SQL Compilations/sec: 0.282
SQL Server: SQL Statistics - SQL Re-Compilations/sec: 0.003
SQL Server: Locks - Lock Requests/sec: 5317.956
SQL Server: Locks - Lock Timeouts/sec: 0.361
SQL Server: Locks - Number of Deadlocks/sec: 0
Memory - Available MBytes: 271.067
Memory - Pages Input/sec: 112.921
Memory - Pages/sec: 120.108
Paging File - %Usage:
Paging File - %Usage Peak:
System - Processor Queue Length: 0.2
System - Context Switches/sec: 5839.471
Processor - % Processor Time: 12.255
Avg. Disk Sec/Read (H: Data warehouse files): 0.006733
Avg. Disk Sec/Read (I: Database Log files): 0.007272
Avg. Disk Sec/Read (P: TempDB): 0.01147
Avg. Disk Sec/Write (H: Data warehouse files): 0.2543
Avg. Disk Sec/Write (I: Database Log files): 0.008935
Avg. Disk Sec/Write (P: TempDB): 0.03077
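(For what it's worth, I believe the SQL Server counters above can also be sampled from within SQL Server via sys.dm_os_performance_counters. A minimal sketch; the counter names are my best guess at the exact DMV spellings, and the "/sec" counters in the DMV are cumulative, so two samples and a delta are needed for actual rates:)
-- Sample a few of the counters listed above directly from the DMV.
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Page life expectancy',
                       N'Page reads/sec',
                       N'Batch Requests/sec',
                       N'Full Scans/sec');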
September 12, 2011 at 6:52 am
Thanks for the links - they are certainly detailed and ones that I will give careful consideration to.
At the moment, however, I am looking for a little guidance on the step before going into the detail described in those guides. I'm looking for some expert backup to my opinion that database design and indexing are the culprits in our environment and, as such, that we can discount most of the infrastructure as a cause.
The situation I'm in is that I'm not a DBA but have a vested interest in a well-performing data warehouse. There is a belief in my team that the poor performance is being caused primarily by 'back-end' problems and not design choices on our side. As such, I'd like to spend more time investigating the design, indexing, identifying the long-running queries etc., but until I can point at relevant data that backs up this theory, I can't proceed with the kind of analysis those links suggest.
This is my first foray into any kind of DBA-type activity and I have no expertise within the company to rely on, so I'm just looking for a nudge in the right direction from SQL Server Central and not for you to try and solve all our problems from one little set of Perfmon counters 🙂
I'd appreciate any comments or guidance.
Chris
September 12, 2011 at 6:57 am
#1 The links I gave you are exactly for finding the query bottlenecks, so they are extremely relevant to your case.
#2 I'm no Perfmon expert, not even a junior one.
However, here's what I know for sure: unless you have a baseline to compare those results to, you can't know for sure what is wrong with the server, i.e. what's changed compared to "normal".
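If you wanted to start building one, a rough sketch would be to snapshot sys.dm_os_wait_stats on a schedule and diff consecutive samples. The table below is illustrative only, not something that already exists:
-- Hypothetical baseline table; name and shape are just an example.
CREATE TABLE dbo.WaitStatsBaseline
(
    capture_time DATETIME NOT NULL DEFAULT GETDATE(),
    wait_type NVARCHAR(60) NOT NULL,
    wait_time_ms BIGINT NOT NULL,
    signal_wait_time_ms BIGINT NOT NULL,
    waiting_tasks_count BIGINT NOT NULL
);
GO
-- Run from a scheduled job; the deltas between consecutive
-- snapshots give the waits for each interval.
INSERT INTO dbo.WaitStatsBaseline
    (wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count)
SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats;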
That being said, here's what I can do besides giving you the links to find the bottleneck queries and, quite possibly, index tuning opportunities.
Run this and post the results:
/*
Microsoft White Paper on waits http://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/Performance_Tuning_Waits_Queues.doc
Next query by Paul Randal http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx
*/
WITH Waits AS
(SELECT
wait_type,
wait_time_ms / 1000.0 AS WaitS,
(wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
signal_wait_time_ms / 1000.0 AS SignalS,
waiting_tasks_count AS WaitCount,
100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')
)
SELECT
W1.wait_type AS WaitType,
CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
CAST (W1.WaitS * 1000 / W1.WaitCount AS DECIMAL(14, 2)) AS AvgWait_MS,
CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
W1.WaitCount AS WaitCount,
CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2
ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 97 -- percentage threshold
AND W1.WaitCount > 0;
GO
September 12, 2011 at 7:07 am
Thanks for the help! It's very much appreciated.
Results for the query you sent are below (sorry, I don't know the best way to post them, so I've pasted the text and put them in a CSV text file too).
WaitType, Wait_S, AvgWait_MS, Resource_S, Signal_S, WaitCount, Percentage
CXPACKET, 5264658.08, 15.88, 5153445.74, 111212.34, 331482846, 62.68
PAGEIOLATCH_SH, 1942477.47, 5.20, 1926149.50, 16327.97, 373613351, 23.13
ASYNC_NETWORK_IO, 370799.47, 9.96, 367781.17, 3018.30, 37223523, 4.41
PAGEIOLATCH_EX, 177316.44, 6.69, 177111.77, 204.67, 26500254, 2.11
SLEEP_BPOOL_FLUSH, 135805.55, 4.59, 135014.28, 791.27, 29604357, 1.62
ASYNC_IO_COMPLETION, 122218.06, 53510.54, 122216.69, 1.38, 2284, 1.45
BACKUPBUFFER, 120044.61, 12.74, 119815.39, 229.22, 9420539, 1.43
WRITELOG, 59261.33, 15.71, 58794.74, 466.59, 3771819, 0.71
September 12, 2011 at 7:14 am
Looks pretty much like an OLAP system!
Please rerun the same query but filter out the CXPACKET too.
For formatting, I'd try pasting the results into a txt file and saving it as is. The uploaded file should keep the formatting intact.
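(Filtering out CXPACKET just means adding it to the exclusion list in the WHERE clause of the query above; the start of the list becomes something like this, with the rest of the query unchanged:)
WHERE wait_type NOT IN (
    'CXPACKET', -- newly excluded so the remaining waits are easier to read
    'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
    -- ... remaining entries exactly as in the original query ...
    'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')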
September 12, 2011 at 7:20 am
Our SQL Server is used purely for relational queries (the vast majority are not even ad-hoc queries but pre-authored reports in our reporting solution, IBM Cognos 8). If these figures make it look like OLAP, does that mean there are many different queries being thrown at SQL Server that the optimizer is struggling to do any optimization on?
Results attached as requested.
September 12, 2011 at 7:26 am
I'm getting out of my expertise zone here. Entering the data-gathering phase so that others can take over if I can't.
OK then, two more queries that have interesting info.
SELECT * FROM sys.dm_exec_query_optimizer_info
And
Don't do anything other than running this!
GO
-- Capture the configured 'cost threshold for parallelism' via sp_configure
-- (toggling 'show advanced options' on temporarily), then compare it to the
-- optimizer's average final plan cost below.
CREATE TABLE #configs
(
name nvarchar(35),
minimum INT,
maximum INT,
config_value INT,
run_value INT
)
GO
EXEC sp_configure 'show advanced options', 1;
GO
reconfigure;
GO
INSERT INTO #configs (name, minimum, maximum, config_value, run_value)
EXEC sp_configure 'cost threshold for parallelism'
GO
EXEC sp_configure 'show advanced options', 0;
GO
reconfigure;
GO
SELECT
optz.counter
, optz.occurrence
, CONVERT(DECIMAL(18,2), optz.value) AS AvgValue
, conf.name
, conf.config_value
, conf.run_value
, Uptime.DaysUptime AS [Days Server UPTIME & Last Stats Reset]
, CASE WHEN Uptime.DaysUptime < 45 THEN 'You may not have very meaningful stats because of a recent restart' ELSE NULL END AS [Stats Warning]
, CASE WHEN optz.value > conf.config_value THEN 'Cost threshold for parallelism might be too low' ELSE NULL END AS [Cost Threshold Warning]
, CASE WHEN conf.run_value <> conf.config_value THEN 'Server needs to be restarted for the setting to take effect' ELSE NULL END [Restart Warning]
FROM
sys.dm_exec_query_optimizer_info optz
CROSS JOIN #configs conf
OUTER APPLY(
SELECT
CONVERT(DECIMAL(18 , 2) , DATEDIFF(hh , create_date , GETDATE()) / 24.0) AS DaysUptime
FROM
sys.databases
WHERE
name = 'tempdb'
) Uptime
WHERE
optz.counter = 'final cost'
GO
DROP TABLE #configs
September 12, 2011 at 7:30 am
In case you want to read up on CXPACKET and why it is <often> a non-issue:
Cost threshold for parallelism (CXPACKET): http://sqlblog.com/blogs/jonathan_kehayias/archive/2010/01/19/tuning-cost-threshold-of-parallelism-from-the-plan-cache.aspx
Paul White: Understanding parallelism: http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/
September 12, 2011 at 7:45 am
Please find attached the output from the first query.
Unfortunately I do not have the permissions to be able to run the second query (changing the config etc.).
Many thanks for the links. I'll take a look!
Chris
September 12, 2011 at 7:50 am
Can you run this version?
SELECT
optz.counter
, optz.occurrence
, CONVERT(DECIMAL(18,2), optz.value) AS AvgValue
, conf.name
, conf.value
, conf.value_in_use
, Uptime.DaysUptime AS [Days Server UPTIME & Last Stats Reset]
, CASE WHEN Uptime.DaysUptime < 45 THEN 'You may not have very meaningful stats because of a recent restart' ELSE NULL END AS [Stats Warning]
, CASE WHEN optz.value > CONVERT(INT, conf.value) THEN 'Cost threshold for parallelism might be too low' ELSE NULL END AS [Cost Threshold Warning]
, CASE WHEN conf.value_in_use <> conf.value THEN 'Server needs to be restarted for the setting to take effect' ELSE NULL END [Restart Warning]
FROM
sys.dm_exec_query_optimizer_info optz
CROSS JOIN sys.configurations conf
OUTER APPLY(
SELECT
CONVERT(DECIMAL(18 , 2) , DATEDIFF(hh , create_date , GETDATE()) / 24.0) AS DaysUptime
FROM
sys.databases
WHERE
name = 'tempdb'
) Uptime
WHERE
optz.counter = 'final cost'
AND conf.name = 'cost threshold for parallelism'
GO
September 12, 2011 at 7:53 am
I can... attached
Chris
September 12, 2011 at 8:01 am
Are you sure this is OLTP?
I've never seen an avg cost of 36!!!
I'd really look at the top N worst queries by time, IO and CPU. I'm sure you have monster slow movers in there (or big reports you don't know about).
Grant Fritchey would recommend you go to 45-50 for the cost threshold, but I wouldn't try that before trying to tune the indexes.
In the meantime you might want to change the ctp to 30-35, as this is a "good baseline". But I really think you need to go in and check the costliest queries first.
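If you do change it, the change itself is small. A sketch only: it needs sysadmin rights, the setting is dynamic so no restart is required, and it affects the whole instance:
-- Sketch: raise 'cost threshold for parallelism' to 30 instance-wide.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 30;
RECONFIGURE;
EXEC sp_configure 'show advanced options', 0;
RECONFIGURE;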
September 12, 2011 at 8:16 am
It is primarily a SQL Server for data warehouse applications based on the IBM Cognos 8 toolset. Within that toolset, the majority of our SQL interaction is read-only queries coming from reports generated in the IBM Cognos environment (these reports generate their SQL through a business layer/model), with the results gathered and rendered on the IBM Cognos servers.
As such, I'd consider the data warehouse to be in read-only mode for most of the time. Updates to the database are done daily using another IBM ETL tool.
We are not using any of the Microsoft BI tools on this at all!
I'm pretty sure that we should be focusing on the indexing and database design (to help tune performance) but, as I said earlier, the back-end seems to get the blame first! :ermm:
If 36.5 is very high... what kind of value would you expect in the situation I describe above?
I think we have huge gaps in our indexes... if I run the following, looking for missing indexes:
SELECT TOP 100
ROUND(s.avg_total_user_cost *
s.avg_user_impact
* (s.user_seeks + s.user_scans),0)
AS [Total Cost]
, d.[statement] AS [Table Name]
, equality_columns
, inequality_columns
, included_columns
, s.user_seeks AS [User Seeks]
, s.user_scans AS [User Scans]
, s.last_user_seek as [Last Seek]
, s.avg_user_impact AS [Impact]
, ROUND(s.avg_total_user_cost, 0) AS [Avg Cost]
FROM sys.dm_db_missing_index_groups g
INNER JOIN sys.dm_db_missing_index_group_stats s
ON s.group_handle = g.index_group_handle
INNER JOIN sys.dm_db_missing_index_details d
ON d.index_handle = g.index_handle
ORDER BY [Total Cost] DESC
The top 3 results return a 'Total Cost' of 1185013273, 14317430, 12654348
I'm not sure what the avg_total_user_cost refers to in terms of CPU, IO etc., but I know these figures cannot be good! 🙂
Have you got any queries that can quickly show the high-cost queries being run, by CPU, IO etc.?
Thanks,
Chris
September 12, 2011 at 8:21 am
36 is, in my experience, high for (mainly) OLTP. It's not for OLAP.
As I said earlier, you need to read the two articles I posted. It's going to take you maybe 2-4 hours to get the answer you need, but it's going to serve you for the rest of your life.
There's no magic bullet here. You need to find the costliest queries and make them cost less. I can come in for a consult if you want, but I don't think you need me here.
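If it helps you get started, here's a rough sketch against the plan cache (sys.dm_exec_query_stats only covers plans still in cache and resets on restart, so treat the numbers as indicative). Swap the ORDER BY to total_logical_reads or total_elapsed_time for the IO and duration views:
-- Top 20 cached statements by total CPU.
SELECT TOP 20
    qs.total_worker_time / 1000 AS total_cpu_ms, -- total_worker_time is in microseconds
    qs.total_logical_reads,
    qs.total_elapsed_time / 1000 AS total_elapsed_ms,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;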