October 5, 2009 at 11:08 am
I am trying to track down odd resource utilization on one of my SQL Server 2005 machines and I am starting to question some of the data I am getting from profiler. Here's a very high level on my issue.
The issue:
I have a SQL 2005 box with 125 DB's and many apps that connect to it. After a server restart CPU utilization is very low. In fact, this is a VERY beefy box so I wouldn't expect actual CPU use to be very high. Also, this is not a primary OLTP server so not nearly as many transactions as 125 DB's might normally imply. Anyway, the server will hum along for one week, maybe 2, averaging about 10% total CPU use. Eventually, "something" happens in my environment and the server will begin to average 40 to 80% CPU use. After this "something" has occurred, CPU averages will remain high until the sqlserver.exe service is restarted. Even in the middle of the night it will average 40% CPU use with no one even on it.
Troubleshooting:
Weeks ago I started capturing baseline profiler traces..and problematic traces. My problem is I have never been able to find a smoking gun. Over similar time periods, the high CPU trace is VERY similar to the baselines with +-15% average increases in number of statements etc. Never being able to find something obvious, I figured I might be suffering from a bug of sorts so I got the machine up to date with SP2 for Windows Server 2003 and SP3 for SQL 2005.
Unfortunately, my issue still exists so I am back to Profiler...which brings me to the actual point of this post (sorry it has been so drawn out). I have always been under the impression that I should be able to capture all RPC Completed events (yes, I know I will miss some adhoc SQL but there should be very little of this. I have a couple apps that don't always use Procs but at least utilize sp_executesql coming from the client) and that the SUM of CPU use should essentially equal what the OS is reporting for sqlserver.exe I know it wouldn't be exact...but close, right?
In my case it is not close at all. And, I have lots of records where CPU use is zero. Should I be looking at this differently. Is there a better method for footing Profiler CPU use to OS stats so I can figure out what this SQL server is wasting it's time doing when things go awry?
Thanks. RH
October 5, 2009 at 1:04 pm
Maybe your stats are going out of date and lots of bad plan/compile is happening. Check how much memory you have. If there is not enough memory, then it will have to do lots of IO operations as well.
Just my 2 cents
-Roy
October 5, 2009 at 1:20 pm
CPU usage within Profiler would not be a good measure for overall CPU use within the system. It's only going to give you information about the execution of the procedures that you're capturing, not any other processes that are occurring, and there could be a lot of those. I'd suggest looking up one of the articles on using wait states and queues as a performance metric. These let you identify what is causing a slow down. From there you can determine which queries are causing it.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 5, 2009 at 1:26 pm
Yes, I've considered that...let me bounce this off you. This server is Win2003 R2 32 bit Enterprise with 3GB switch and AWE enabled with 10GB RAM.
Two things I've found recently (recently..as in, I don't have a good baseline to compare to currently) is that my page life expectency keeps building up...and then every so often (a few times an hour) the exptency drops from about 1000 seconds to 30 seconds and starts climbing again. So, something is totally purging the buffer and rebuilding...not sure what though.
Also, just realized that my SQLServer:Plan Cache:Cache Hit Ratio is averaging between 40 and 45 percent. I've seen a guideline indicated that <70% is something to pay attention to.
I have no idea what these counters look like when the server is behaving.
So, something is purging my buffer cache...and on my to do list is find out why so much of my workload is missing plans. This may be a characteristic of SharePoint workloads as 50 of my 125 DB's are SharePoint DB's and SharePoint is easily my top consumer of DB resources on this server.
Regarding your mention on the bad stats...my expectation is that the server would queue up a ton of auto update stats requests and eventually churn though them. Recall...once this server starts to misbhave it never seems to recover.
Thanks. RH
October 5, 2009 at 1:33 pm
Grant, service levels are currently being adhered to...no queries are excessively blocking...or being blocked. My user environment doesn't ever notice that CPU usage goes awry. Recall that the difference in Profile workload and charteristics between when the server is having a problem and the baseline are very similar. Although, I must admit that the culling through 400K to 600K of profiler records from 60 minutes duration looking for a smoking gun is a bit tough.
As I am typing this my page life expectency went from 1200 seconds to 71 in one second. Maybe all the processor I am seeing is the system feeding the buffer pool over and over again.
I think I need to find what is causing that buffer purge.
October 5, 2009 at 1:34 pm
rhunt (10/5/2009)
Yes, I've considered that...let me bounce this off you. This server is Win2003 R2 32 bit Enterprise with 3GB switch and AWE enabled with 10GB RAM.Two things I've found recently (recently..as in, I don't have a good baseline to compare to currently) is that my page life expectency keeps building up...and then every so often (a few times an hour) the exptency drops from about 1000 seconds to 30 seconds and starts climbing again. So, something is totally purging the buffer and rebuilding...not sure what though.
Also, just realized that my SQLServer:Plan Cache:Cache Hit Ratio is averaging between 40 and 45 percent. I've seen a guideline indicated that <70% is something to pay attention to.
I have no idea what these counters look like when the server is behaving.
So, something is purging my buffer cache...and on my to do list is find out why so much of my workload is missing plans. This may be a characteristic of SharePoint workloads as 50 of my 125 DB's are SharePoint DB's and SharePoint is easily my top consumer of DB resources on this server.
Regarding your mention on the bad stats...my expectation is that the server would queue up a ton of auto update stats requests and eventually churn though them. Recall...once this server starts to misbhave it never seems to recover.
Thanks. RH
30 seconds page life expectancy is still pretty good, but yeah, something is swacking your cache. I have a server that does something similar, but it's on a 24 hour cycle. It just keeps growing til about 6AM and then drops to 3 seconds and starts rebuilding again. Mine is a data load running at that time. Look for large data loads or reports... Something along those lines that is moving a lot of data.
A cache hit ratio that low means almost everything coming in is not getting plan reuse. I haven't watched our Sharepoint db to see if this is normal or not. W/O knowledge, I'd say that's an indicator of a problem, but I'd need to test it to be sure.
From the sounds of it, again, I'd go after waits & queues to identify what is slowing do the system.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 5, 2009 at 10:29 pm
Owing to the random nature of the problem and the fact than RPC Complete isn't showing anything (the bad thing has to actually complete to have RPC Complete capture it), it sounds like one of two things...
Either a use is asking for something very large and ungainly or a particular task has an accidental cross-join in it.
The first thing to do is determine if it's actually SQL Server or Sharepoint or something else. Sounds stupid simple, but the first thing I look at is TaskMgr, particularlly the Processes tab. Follow your nose after that.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 6, 2009 at 6:26 am
It's all sqlserver.exe. This is a dedicated sql box with a couple SharePoint boxes (and other IIS boxes) connecting over the network.
Thanks. RH
October 7, 2009 at 2:25 pm
We had a similar problem last year with CPU usage - when the CPU usage was high we found that Windows would randomly complain of 'insufficient resources' when performing simple tasks such as network file transfers. The CPU usage would be fine for a couple of weeks, then suddenly ramp up with no indiciation as to the cause. The OS is Windows 2003 Server R2 Enterprise 32-bit.
When we removed the /3GB switch and used just /PAE the problem went away and hasn't occurred since. I can't explain why this should be the case, but this fixed the problem that we were experiencing.
Chris
October 7, 2009 at 3:05 pm
Chris Howarth-536003 (10/7/2009)
We had a similar problem last year with CPU usage - when the CPU usage was high we found that Windows would randomly complain of 'insufficient resources' when performing simple tasks such as network file transfers. The CPU usage would be fine for a couple of weeks, then suddenly ramp up with no indiciation as to the cause. The OS is Windows 2003 Server R2 Enterprise 32-bit.When we removed the /3GB switch and used just /PAE the problem went away and hasn't occurred since. I can't explain why this should be the case, but this fixed the problem that we were experiencing.
Chris
BWAA-HAA!!! Wouldn't it be ironic if the /3GB switch had a "memory leak". 😀
Good tip. Thanks for posting it.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 7, 2009 at 7:58 pm
You had me excited...sounded like the ticket...just couldn't remember if we officially took the /3GB dive back when we enabled PAE and AWE. So, I checked my boot.ini and it's looking to me like we're not using /3GB switch.
Boot.ini:
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /noexecute=optout /fastdetect /PAE
Maybe I'll do the opposite...enable it and see if the behavior changes.
October 7, 2009 at 11:38 pm
Just to check the obvious - have you rebooted the server since removing the /3GB switch?
Thinking back there was actually another occasion when we experienced a similar problem with elevated CPU usage with no apparent SQL Server activity, and that was after applying SP2 to SQL Server 2005. That time we logged, every couple of minutes, the VAS-region memory levels (Free 'Total' and 'Maximum') within SQL Server for a month or so and found that the CPU usage jumped when the maximum contiguous block size first fell below approximately 12MB - the CPU usage would then remain high until SQL Server was restarted.
I'll dig out the monitoring query that we used, but you can obtain the same data by running the 'sqbmemory' extended procedure that is shipped with Red-Gate SQL Backup. [EDIT: Here's the query that we used...]
SET NOCOUNT ON
;WITH CTE
AS
(
SELECT CASE WHEN region_state = 0x00001000
THEN 'Commit'
WHEN region_state = 0x00002000
THEN 'Reserve'
WHEN region_state = 0x00010000
THEN 'Free'
END AS 'Type',
MIN(region_size_in_bytes) AS Minimum,
MAX(region_size_in_bytes) AS Maximum,
SUM(region_size_in_bytes) / COUNT(*) AS Average,
COUNT(*) AS [Blk count],
SUM(region_size_in_bytes) AS Total,
CASE WHEN region_state = 0x00001000
THEN 1
WHEN region_state = 0x00002000
THEN 2
WHEN region_state = 0x00010000
THEN 3
END AS SortOrder
FROM sys.dm_os_virtual_address_dump
GROUP BY region_state
UNION ALL
SELECT CASE WHEN region_type = 0x00020000
THEN 'Private'
WHEN region_type = 0x00040000
THEN 'Mapped'
WHEN region_type = 0x01000000
THEN 'Image'
END AS 'Type',
MIN(region_size_in_bytes) AS Minimum,
MAX(region_size_in_bytes) AS Maximum,
SUM(region_size_in_bytes) / COUNT(*) AS Average,
COUNT(*) AS [Blk count],
SUM(region_size_in_bytes) AS Total,
CASE WHEN region_type = 0x00020000
THEN 4
WHEN region_type = 0x00040000
THEN 5
WHEN region_type = 0x01000000
THEN 6
END
FROM sys.dm_os_virtual_address_dump
WHERE region_allocation_base_address <> 0x0
GROUP BY region_type
)
SELECT Type,
Minimum,
Maximum,
Average,
[Blk count],
Total
FROM CTE
ORDER BY SortOrder
GO
It would be good if you could run the above query when your server is showing the symptoms, then post the resultset back to the forum for examination.
In our case the problem turned out to be a memory leak associated with the sp_OAMethod and the sp_OAGetProperty extended procedures which were being called every few minutes by, ironically enough, our 3rd-party monitoring software. After raising a call with MS, they suggested a fix of applying SP3 CU1, which resolved the problem. Here's a link to the KB article describing this particular issue:
http://support.microsoft.com/kb/937277
There are other causes of VAS fragmentation, such as the use of Linked Servers, and the use of sp_xml_preparedocument without a corresponding sp_xml_removedocument, to name a couple. Also, do you have CLR enabled?
Cheers
Chris
October 8, 2009 at 5:21 am
Try running the following
select name, SUM(single_pages_kb + multi_pages_kb)/1024.0 MBUsed
from sys.dm_os_memory_clerks
group by name
order by 2 desc
with particular empasis on figures for Tokenandpermuserstore, sql plans and object plans
October 8, 2009 at 6:28 am
Eoin (10/8/2009)
Try running the followingselect name, SUM(single_pages_kb + multi_pages_kb)/1024.0 MBUsed
from sys.dm_os_memory_clerks
group by name
order by 2 desc
with particular empasis on figures for Tokenandpermuserstore, sql plans and object plans
There were 300 items. Here are the top 10:
NAME MB Used
Default55.578125
SOS_Node48.312500
SchemaMgr Store12.093750
SNIPacket5.664062
Bound Trees2.554687
TokenAndPermUserStore2.304687
SystemRowsetStore1.093750
Lock Manager : Node 01.093750
Broker dormant rowsets0.859375
Object Plans0.601562
October 8, 2009 at 6:34 am
Chris Howarth-536003 (10/7/2009)
Just to check the obvious - have you rebooted the server since removing the /3GB switch?Thinking back there was actually another occasion when we experienced a similar problem with elevated CPU usage with no apparent SQL Server activity, and that was after applying SP2 to SQL Server 2005. That time we logged, every couple of minutes, the VAS-region memory levels (Free 'Total' and 'Maximum') within SQL Server for a month or so and found that the CPU usage jumped when the maximum contiguous block size first fell below approximately 12MB - the CPU usage would then remain high until SQL Server was restarted.
I'll dig out the monitoring query that we used, but you can obtain the same data by running the 'sqbmemory' extended procedure that is shipped with Red-Gate SQL Backup. [EDIT: Here's the query that we used...]
SET NOCOUNT ON
;WITH CTE
AS
(
SELECT CASE WHEN region_state = 0x00001000
THEN 'Commit'
WHEN region_state = 0x00002000
THEN 'Reserve'
WHEN region_state = 0x00010000
THEN 'Free'
END AS 'Type',
MIN(region_size_in_bytes) AS Minimum,
MAX(region_size_in_bytes) AS Maximum,
SUM(region_size_in_bytes) / COUNT(*) AS Average,
COUNT(*) AS [Blk count],
SUM(region_size_in_bytes) AS Total,
CASE WHEN region_state = 0x00001000
THEN 1
WHEN region_state = 0x00002000
THEN 2
WHEN region_state = 0x00010000
THEN 3
END AS SortOrder
FROM sys.dm_os_virtual_address_dump
GROUP BY region_state
UNION ALL
SELECT CASE WHEN region_type = 0x00020000
THEN 'Private'
WHEN region_type = 0x00040000
THEN 'Mapped'
WHEN region_type = 0x01000000
THEN 'Image'
END AS 'Type',
MIN(region_size_in_bytes) AS Minimum,
MAX(region_size_in_bytes) AS Maximum,
SUM(region_size_in_bytes) / COUNT(*) AS Average,
COUNT(*) AS [Blk count],
SUM(region_size_in_bytes) AS Total,
CASE WHEN region_type = 0x00020000
THEN 4
WHEN region_type = 0x00040000
THEN 5
WHEN region_type = 0x01000000
THEN 6
END
FROM sys.dm_os_virtual_address_dump
WHERE region_allocation_base_address <> 0x0
GROUP BY region_type
)
SELECT Type,
Minimum,
Maximum,
Average,
[Blk count],
Total
FROM CTE
ORDER BY SortOrder
GO
It would be good if you could run the above query when your server is showing the symptoms, then post the resultset back to the forum for examination.
In our case the problem turned out to be a memory leak associated with the sp_OAMethod and the sp_OAGetProperty extended procedures which were being called every few minutes by, ironically enough, our 3rd-party monitoring software. After raising a call with MS, they suggested a fix of applying SP3 CU1, which resolved the problem. Here's a link to the KB article describing this particular issue:
http://support.microsoft.com/kb/937277
There are other causes of VAS fragmentation, such as the use of Linked Servers, and the use of sp_xml_preparedocument without a corresponding sp_xml_removedocument, to name a couple. Also, do you have CLR enabled?
Cheers
Chris
Thanks. I'm actually not at all familiar with VAS fragmentation so I'll have to research the topic in general. Thanks for the tip. Answers to some of your questions:
1) We're not using any extended stored procedures in our custom code. However, not sure if SharePoint makes use of them...I'd have to research
2) We're on SP3 (original...no pose CUME's)
3) We do make use of linked servers. The linked server with the most activity is to our AS400. SSIS does a lot of work across that link each night.
4) Turns out the /3Gb switch was never enabled on this server...just PAE and AWE.
5) Here are the results of your query...although I'm not entirely sure how I should use these results:
TYPE MIN MAX AVG Blk Count Total
Commit409684672512 48258 10609 511979520
Reserve40964157440 183412 8179 1500131328
Free40963145728 224018 604 135307264
Private409684672512 105801 18208 1926438912
Mapped40961060864 107751 62 6680576
Image409624948736 141561 558 78991360
Viewing 15 posts - 1 through 15 (of 31 total)
You must be logged in to reply to this topic. Login to reply