September 1, 2010 at 4:46 am
Maybe it would be good to capture a replay trace of the activity for 15 minutes or so while this process runs.
I'd go along with Jason's thoughts here. Look at capturing a Profiler trace of the activity during this run; then you can get an idea of the queries being run and look into their execution plans. I don't think your problem lies in tempdb. I'd be inclined to look at the "non-indexed" nature of the database as the issue. 400GB isn't very large, but if the majority of that is in one or two un-indexed tables, and these are being joined by some query running in this black-box process, then that is more likely to be the cause of your problem.
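If a full Profiler/replay trace is too heavy to run for the whole window, a lighter option is to poll the DMVs every so often while the process is active. A minimal sketch (the session_id > 50 filter is just a rough way of skipping system sessions):
-- Snapshot of what is executing right now; run repeatedly (or log to a table) during the job.
SELECT r.session_id, r.status, r.wait_type, r.wait_time, r.cpu_time, r.logical_reads,
       DB_NAME(r.database_id) AS database_name,
       SUBSTRING(st.[text], (r.statement_start_offset / 2) + 1,
                 ((CASE r.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.[text])
                       ELSE r.statement_end_offset
                   END - r.statement_start_offset) / 2) + 1) AS running_statement,
       qp.query_plan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id > 50;  -- skip most system sessions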
September 1, 2010 at 5:10 am
Hi again,
How much memory does your server have? Have you monitored memory pressure? Also, what does the storage layer look like in terms of spindle counts, RAID levels, etc.?
Phillip Cox
September 1, 2010 at 7:16 am
I don't think that the tempdb is the problem either, but the vendor recommended that we add more tempdb data files to see if it helped the process at all. Personally, I think that the vendor's product is full of un-optimized queries and instead of them tuning them, they want to blame our SQL architecture for the problem. So, it is my job to prove that adding more tempdb files won't solve the problem; I need to provide solid evidence to both my manager and the vendor as to why it is not a tempdb problem.
Having 8 equally sized tempdb files, instead of 1, did not help at all. The server has 24 CPUs (hyperthreaded and such) and has 64 GB of RAM. The tempdb is on RAID 5 storage; I know that this is not optimal in any shape or form, but this is my only choice. Knowing that the SAN is RAID 5, would it help with IO to place the tempdb files on local disk? I'm not even sure we have the room to do this, but I am curious. I know that the server sees the SAN disk as "local", but I have yet to see SAN disk perform as well as local disk.
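For reference, if local disk does turn out to be available, the general pattern for relocating tempdb is below; the logical file name and target path are placeholders rather than my actual values, and the change only takes effect after a service restart.
-- Current tempdb file layout (logical names, sizes, physical paths).
SELECT name, physical_name, size * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');

-- Point a tempdb file at a new location; takes effect on the next service restart.
-- 'tempdev' and 'T:\TempDB\tempdb.mdf' are placeholders only.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf');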
And since the process is not currently running, I can't look at allocation contention on tempdb right now using the query that Phillip provided. Boo. Although, I know there is PAGEIOLATCH_SH going on (it appeared on the server after the process was run this past weekend), based on the results of this query:
WITH Waits AS
(SELECT wait_type, wait_time_ms / 1000. AS wait_time_s,
100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK'
,'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR', 'LOGMGR_QUEUE','CHECKPOINT_QUEUE'
,'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH','BROKER_TASK_STOP','CLR_MANUAL_EVENT'
,'CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT'
,'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN'))
SELECT W1.wait_type,
CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2
ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 95; -- percentage threshold
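For the next run, a tempdb allocation-contention check along the lines Phillip suggested might look something like this (a sketch of the general approach, not necessarily his exact query):
-- PAGELATCH_* waits on tempdb pages 2:N:1 (PFS), 2:N:2 (GAM) or 2:N:3 (SGAM)
-- point at allocation contention; PAGEIOLATCH_* waits point at disk I/O instead.
SELECT session_id, wait_type, wait_duration_ms,
       resource_description  -- e.g. 2:1:1 = tempdb, file 1, PFS page
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
  AND resource_description LIKE '2:%';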
I have run traces on the process before and it is painful to watch how the application does things. I suppose I could do research on execution plans and such, but isn't that the vendor's job? Shouldn't they be supporting us and tuning their own queries since we're paying them? LOL. I guess this would only happen in a perfect world 🙂
I know that the queries cause a lot of IO. I ran the following query and most of the results were queries performed during the 40-hour process.
SELECT TOP (20) total_logical_reads/execution_count AS [avg_logical_reads],
total_logical_writes/execution_count AS [avg_logical_writes],
total_worker_time/execution_count AS [avg_cpu_cost], execution_count,
total_worker_time, total_logical_reads, total_logical_writes,
(SELECT DB_NAME(dbid) + ISNULL('..' + OBJECT_NAME(objectid), '')
FROM sys.dm_exec_sql_text([sql_handle])) AS query_database,
(SELECT SUBSTRING(est.[text], statement_start_offset/2 + 1,
(CASE WHEN statement_end_offset = -1
THEN LEN(CONVERT(nvarchar(max), est.[text])) * 2
ELSE statement_end_offset
END - statement_start_offset
) / 2)
FROM sys.dm_exec_sql_text(sql_handle) AS est) AS query_text,
last_logical_reads, min_logical_reads, max_logical_reads,
last_logical_writes, min_logical_writes, max_logical_writes,
total_physical_reads, last_physical_reads, min_physical_reads, max_physical_reads,
(total_logical_reads + (total_logical_writes * 5))/execution_count AS io_weighting,
plan_generation_num, qp.query_plan
FROM sys.dm_exec_query_stats
OUTER APPLY sys.dm_exec_query_plan([plan_handle]) AS qp
WHERE [dbid] >= 5 AND (total_worker_time/execution_count) > 100
ORDER BY io_weighting DESC;
So I guess my problem is what is the best way to configure my SQL Server to get this non-optimal application to run better?
September 1, 2010 at 7:25 am
I should also mention that I cannot create any of my own indexes on the tables, because it voids our support agreement.
September 1, 2010 at 7:39 am
DBAgal (9/1/2010)
I should also mention that I cannot create any of my own indexes on the tables, because it voids our support agreement.
It does not have to.
Assuming you have a dev or QA server, create the indexes you think the process needs on that server and start the data processing task(s).
Tune as long as you see fit, then compare performance for the exact same job with and without indexes, and then go back to the vendor. It also helps a lot if you have a less powerful QA server that needs 60 hours to run the job instead of 40 and then only needs 10 hours after the indexes are changed.
Also, one point to "drill" into their skulls is that adding indexes cannot break the application in any way. Say that you won't change anything else and that you won't add any constraints. Any DBA worth the title is not going to mind that, especially if you ask for a test run of 1-2 weeks.
Also if they are going to check or come in for support, there's nothing stopping you from dropping the indexes before they walk in the door ;-).
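A sketch of what that before/after test could look like on the dev box; the table, column and index names here are hypothetical stand-ins for whatever the vendor's schema actually uses:
-- Show reads and elapsed time for each statement in the session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- 1. Run the worst statement from the trace as-is and note the numbers.
-- 2. Add a candidate index on the dev/QA copy only (hypothetical names):
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, Amount);
-- 3. Re-run the same statement, compare the STATISTICS IO / TIME output,
--    then take both sets of numbers (and the full job timings) to the vendor.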
September 1, 2010 at 8:10 am
Hey,
OK, this sounds very familiar. It reminds me of another application I worked with not so long ago, where poor query design and the vendor's approach of keeping one code base for many RDBMS platforms made life hard!
PAGEIOLATCH_SH is the wait SQL Server registers while a page is being read from disk into the buffer pool, so if this wait type is prominent in your environment, it would be an idea to look into disk performance via the following counters:
PhysicalDisk\Avg Disk Queue Length
PhysicalDisk\Current Disk Queue Length
If you can expand on the storage setup, including the vendor and the number of volumes presented to SQL Server (e.g. data, log and tempdb), this will help to narrow things down. You may have an issue with your storage subsystem, as SQL Server is waiting on buffer pages involved in I/O.
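Alongside the PerfMon counters, per-file latency can also be pulled from inside SQL Server; a sketch (the figures are cumulative since the last restart, so capture them before and after the job for a meaningful delta):
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads, vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_latency_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;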
Please keep us posted.
Phillip Cox
September 1, 2010 at 8:42 am
I've had much the same problem. Here is what we found out.
1) If the server has local disks that are not in use, use them for your tempdb.
2) You don't need redundancy for it; tempdb is recreated by SQL Server every time the service starts.
3) Our solution was to purchase an SSD (solid state disk) and use it only for tempdb.
With option 3, and only a single tempdb data file, we cut the job's execution time from 8 hours to 30 minutes.
Yes, SSD drives cost money, but local disk would be the other approach and is cheap.
We found that the disk setup helped more than having many tempdb files.
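Before committing to a dedicated disk, it's worth knowing how big tempdb actually gets during the job; a quick check to run while it is active (the counts are 8 KB pages):
SELECT SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,
       SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
       SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb,
       SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_mb
FROM tempdb.sys.dm_db_file_space_usage;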
Hope this helps 🙂
Rudy
September 1, 2010 at 10:17 am
DBAgal (9/1/2010)
I don't think that the tempdb is the problem either, but the vendor recommended that we add more tempdb data files to see if it helped the process at all. [...] So I guess my problem is what is the best way to configure my SQL Server to get this non-optimal application to run better?
What is the max server memory setting for SQL Server? It is possible that SQL Server is consuming all of the memory on the server, leaving nothing for the OS, which can have an impact on processing. Just one more thing to look at in this case.
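A quick way to check (and, if needed, cap) the setting; the 56 GB figure below is only an illustration of leaving headroom for the OS on a 64 GB box, not a recommendation for this specific server:
-- Check the current setting.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';

-- Example only: cap SQL Server at 56 GB (57344 MB), leaving ~8 GB for the OS
-- and anything else on the box. Adjust for your environment.
EXEC sp_configure 'max server memory (MB)', 57344;
RECONFIGURE;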
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events