CPU Usage Reported By Profiler

  • I am trying to track down odd resource utilization on one of my SQL Server 2005 machines and I am starting to question some of the data I am getting from profiler. Here's a very high level on my issue.

    The issue:

    I have a SQL 2005 box with 125 DB's and many apps that connect to it. After a server restart CPU utilization is very low. In fact, this is a VERY beefy box so I wouldn't expect actual CPU use to be very high. Also, this is not a primary OLTP server so not nearly as many transactions as 125 DB's might normally imply. Anyway, the server will hum along for one week, maybe 2, averaging about 10% total CPU use. Eventually, "something" happens in my environment and the server will begin to average 40 to 80% CPU use. After this "something" has occurred, CPU averages will remain high until the sqlserver.exe service is restarted. Even in the middle of the night it will average 40% CPU use with no one even on it.

    Troubleshooting:

    Weeks ago I started capturing baseline profiler traces..and problematic traces. My problem is I have never been able to find a smoking gun. Over similar time periods, the high CPU trace is VERY similar to the baselines with +-15% average increases in number of statements etc. Never being able to find something obvious, I figured I might be suffering from a bug of sorts so I got the machine up to date with SP2 for Windows Server 2003 and SP3 for SQL 2005.

    Unfortunately, my issue still exists so I am back to Profiler...which brings me to the actual point of this post (sorry it has been so drawn out). I have always been under the impression that I should be able to capture all RPC Completed events (yes, I know I will miss some adhoc SQL but there should be very little of this. I have a couple apps that don't always use Procs but at least utilize sp_executesql coming from the client) and that the SUM of CPU use should essentially equal what the OS is reporting for sqlserver.exe I know it wouldn't be exact...but close, right?

    In my case it is not close at all. And, I have lots of records where CPU use is zero. Should I be looking at this differently. Is there a better method for footing Profiler CPU use to OS stats so I can figure out what this SQL server is wasting it's time doing when things go awry?

    Thanks. RH

  • Maybe your stats are going out of date and lots of bad plan/compile is happening. Check how much memory you have. If there is not enough memory, then it will have to do lots of IO operations as well.

    Just my 2 cents

    -Roy

  • CPU usage within Profiler would not be a good measure for overall CPU use within the system. It's only going to give you information about the execution of the procedures that you're capturing, not any other processes that are occurring, and there could be a lot of those. I'd suggest looking up one of the articles on using wait states and queues as a performance metric. These let you identify what is causing a slow down. From there you can determine which queries are causing it.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Yes, I've considered that...let me bounce this off you. This server is Win2003 R2 32 bit Enterprise with 3GB switch and AWE enabled with 10GB RAM.

    Two things I've found recently (recently..as in, I don't have a good baseline to compare to currently) is that my page life expectency keeps building up...and then every so often (a few times an hour) the exptency drops from about 1000 seconds to 30 seconds and starts climbing again. So, something is totally purging the buffer and rebuilding...not sure what though.

    Also, just realized that my SQLServer:Plan Cache:Cache Hit Ratio is averaging between 40 and 45 percent. I've seen a guideline indicated that <70% is something to pay attention to.

    I have no idea what these counters look like when the server is behaving.

    So, something is purging my buffer cache...and on my to do list is find out why so much of my workload is missing plans. This may be a characteristic of SharePoint workloads as 50 of my 125 DB's are SharePoint DB's and SharePoint is easily my top consumer of DB resources on this server.

    Regarding your mention on the bad stats...my expectation is that the server would queue up a ton of auto update stats requests and eventually churn though them. Recall...once this server starts to misbhave it never seems to recover.

    Thanks. RH

  • Grant, service levels are currently being adhered to...no queries are excessively blocking...or being blocked. My user environment doesn't ever notice that CPU usage goes awry. Recall that the difference in Profile workload and charteristics between when the server is having a problem and the baseline are very similar. Although, I must admit that the culling through 400K to 600K of profiler records from 60 minutes duration looking for a smoking gun is a bit tough.

    As I am typing this my page life expectency went from 1200 seconds to 71 in one second. Maybe all the processor I am seeing is the system feeding the buffer pool over and over again.

    I think I need to find what is causing that buffer purge.

  • rhunt (10/5/2009)


    Yes, I've considered that...let me bounce this off you. This server is Win2003 R2 32 bit Enterprise with 3GB switch and AWE enabled with 10GB RAM.

    Two things I've found recently (recently..as in, I don't have a good baseline to compare to currently) is that my page life expectency keeps building up...and then every so often (a few times an hour) the exptency drops from about 1000 seconds to 30 seconds and starts climbing again. So, something is totally purging the buffer and rebuilding...not sure what though.

    Also, just realized that my SQLServer:Plan Cache:Cache Hit Ratio is averaging between 40 and 45 percent. I've seen a guideline indicated that <70% is something to pay attention to.

    I have no idea what these counters look like when the server is behaving.

    So, something is purging my buffer cache...and on my to do list is find out why so much of my workload is missing plans. This may be a characteristic of SharePoint workloads as 50 of my 125 DB's are SharePoint DB's and SharePoint is easily my top consumer of DB resources on this server.

    Regarding your mention on the bad stats...my expectation is that the server would queue up a ton of auto update stats requests and eventually churn though them. Recall...once this server starts to misbhave it never seems to recover.

    Thanks. RH

    30 seconds page life expectancy is still pretty good, but yeah, something is swacking your cache. I have a server that does something similar, but it's on a 24 hour cycle. It just keeps growing til about 6AM and then drops to 3 seconds and starts rebuilding again. Mine is a data load running at that time. Look for large data loads or reports... Something along those lines that is moving a lot of data.

    A cache hit ratio that low means almost everything coming in is not getting plan reuse. I haven't watched our Sharepoint db to see if this is normal or not. W/O knowledge, I'd say that's an indicator of a problem, but I'd need to test it to be sure.

    From the sounds of it, again, I'd go after waits & queues to identify what is slowing do the system.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Owing to the random nature of the problem and the fact than RPC Complete isn't showing anything (the bad thing has to actually complete to have RPC Complete capture it), it sounds like one of two things...

    Either a use is asking for something very large and ungainly or a particular task has an accidental cross-join in it.

    The first thing to do is determine if it's actually SQL Server or Sharepoint or something else. Sounds stupid simple, but the first thing I look at is TaskMgr, particularlly the Processes tab. Follow your nose after that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It's all sqlserver.exe. This is a dedicated sql box with a couple SharePoint boxes (and other IIS boxes) connecting over the network.

    Thanks. RH

  • We had a similar problem last year with CPU usage - when the CPU usage was high we found that Windows would randomly complain of 'insufficient resources' when performing simple tasks such as network file transfers. The CPU usage would be fine for a couple of weeks, then suddenly ramp up with no indiciation as to the cause. The OS is Windows 2003 Server R2 Enterprise 32-bit.

    When we removed the /3GB switch and used just /PAE the problem went away and hasn't occurred since. I can't explain why this should be the case, but this fixed the problem that we were experiencing.

    Chris

  • Chris Howarth-536003 (10/7/2009)


    We had a similar problem last year with CPU usage - when the CPU usage was high we found that Windows would randomly complain of 'insufficient resources' when performing simple tasks such as network file transfers. The CPU usage would be fine for a couple of weeks, then suddenly ramp up with no indiciation as to the cause. The OS is Windows 2003 Server R2 Enterprise 32-bit.

    When we removed the /3GB switch and used just /PAE the problem went away and hasn't occurred since. I can't explain why this should be the case, but this fixed the problem that we were experiencing.

    Chris

    BWAA-HAA!!! Wouldn't it be ironic if the /3GB switch had a "memory leak". 😀

    Good tip. Thanks for posting it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You had me excited...sounded like the ticket...just couldn't remember if we officially took the /3GB dive back when we enabled PAE and AWE. So, I checked my boot.ini and it's looking to me like we're not using /3GB switch.

    Boot.ini:

    [boot loader]

    timeout=30

    default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS

    [operating systems]

    multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /noexecute=optout /fastdetect /PAE

    Maybe I'll do the opposite...enable it and see if the behavior changes.

  • Just to check the obvious - have you rebooted the server since removing the /3GB switch?

    Thinking back there was actually another occasion when we experienced a similar problem with elevated CPU usage with no apparent SQL Server activity, and that was after applying SP2 to SQL Server 2005. That time we logged, every couple of minutes, the VAS-region memory levels (Free 'Total' and 'Maximum') within SQL Server for a month or so and found that the CPU usage jumped when the maximum contiguous block size first fell below approximately 12MB - the CPU usage would then remain high until SQL Server was restarted.

    I'll dig out the monitoring query that we used, but you can obtain the same data by running the 'sqbmemory' extended procedure that is shipped with Red-Gate SQL Backup. [EDIT: Here's the query that we used...]

    SET NOCOUNT ON

    ;WITH CTE

    AS

    (

    SELECT CASE WHEN region_state = 0x00001000

    THEN 'Commit'

    WHEN region_state = 0x00002000

    THEN 'Reserve'

    WHEN region_state = 0x00010000

    THEN 'Free'

    END AS 'Type',

    MIN(region_size_in_bytes) AS Minimum,

    MAX(region_size_in_bytes) AS Maximum,

    SUM(region_size_in_bytes) / COUNT(*) AS Average,

    COUNT(*) AS [Blk count],

    SUM(region_size_in_bytes) AS Total,

    CASE WHEN region_state = 0x00001000

    THEN 1

    WHEN region_state = 0x00002000

    THEN 2

    WHEN region_state = 0x00010000

    THEN 3

    END AS SortOrder

    FROM sys.dm_os_virtual_address_dump

    GROUP BY region_state

    UNION ALL

    SELECT CASE WHEN region_type = 0x00020000

    THEN 'Private'

    WHEN region_type = 0x00040000

    THEN 'Mapped'

    WHEN region_type = 0x01000000

    THEN 'Image'

    END AS 'Type',

    MIN(region_size_in_bytes) AS Minimum,

    MAX(region_size_in_bytes) AS Maximum,

    SUM(region_size_in_bytes) / COUNT(*) AS Average,

    COUNT(*) AS [Blk count],

    SUM(region_size_in_bytes) AS Total,

    CASE WHEN region_type = 0x00020000

    THEN 4

    WHEN region_type = 0x00040000

    THEN 5

    WHEN region_type = 0x01000000

    THEN 6

    END

    FROM sys.dm_os_virtual_address_dump

    WHERE region_allocation_base_address <> 0x0

    GROUP BY region_type

    )

    SELECT Type,

    Minimum,

    Maximum,

    Average,

    [Blk count],

    Total

    FROM CTE

    ORDER BY SortOrder

    GO

    It would be good if you could run the above query when your server is showing the symptoms, then post the resultset back to the forum for examination.

    In our case the problem turned out to be a memory leak associated with the sp_OAMethod and the sp_OAGetProperty extended procedures which were being called every few minutes by, ironically enough, our 3rd-party monitoring software. After raising a call with MS, they suggested a fix of applying SP3 CU1, which resolved the problem. Here's a link to the KB article describing this particular issue:

    http://support.microsoft.com/kb/937277

    There are other causes of VAS fragmentation, such as the use of Linked Servers, and the use of sp_xml_preparedocument without a corresponding sp_xml_removedocument, to name a couple. Also, do you have CLR enabled?

    Cheers

    Chris

  • Try running the following

    select name, SUM(single_pages_kb + multi_pages_kb)/1024.0 MBUsed

    from sys.dm_os_memory_clerks

    group by name

    order by 2 desc

    with particular empasis on figures for Tokenandpermuserstore, sql plans and object plans

  • Eoin (10/8/2009)


    Try running the following

    select name, SUM(single_pages_kb + multi_pages_kb)/1024.0 MBUsed

    from sys.dm_os_memory_clerks

    group by name

    order by 2 desc

    with particular empasis on figures for Tokenandpermuserstore, sql plans and object plans

    There were 300 items. Here are the top 10:

    NAME MB Used

    Default55.578125

    SOS_Node48.312500

    SchemaMgr Store12.093750

    SNIPacket5.664062

    Bound Trees2.554687

    TokenAndPermUserStore2.304687

    SystemRowsetStore1.093750

    Lock Manager : Node 01.093750

    Broker dormant rowsets0.859375

    Object Plans0.601562

  • Chris Howarth-536003 (10/7/2009)


    Just to check the obvious - have you rebooted the server since removing the /3GB switch?

    Thinking back there was actually another occasion when we experienced a similar problem with elevated CPU usage with no apparent SQL Server activity, and that was after applying SP2 to SQL Server 2005. That time we logged, every couple of minutes, the VAS-region memory levels (Free 'Total' and 'Maximum') within SQL Server for a month or so and found that the CPU usage jumped when the maximum contiguous block size first fell below approximately 12MB - the CPU usage would then remain high until SQL Server was restarted.

    I'll dig out the monitoring query that we used, but you can obtain the same data by running the 'sqbmemory' extended procedure that is shipped with Red-Gate SQL Backup. [EDIT: Here's the query that we used...]

    SET NOCOUNT ON

    ;WITH CTE

    AS

    (

    SELECT CASE WHEN region_state = 0x00001000

    THEN 'Commit'

    WHEN region_state = 0x00002000

    THEN 'Reserve'

    WHEN region_state = 0x00010000

    THEN 'Free'

    END AS 'Type',

    MIN(region_size_in_bytes) AS Minimum,

    MAX(region_size_in_bytes) AS Maximum,

    SUM(region_size_in_bytes) / COUNT(*) AS Average,

    COUNT(*) AS [Blk count],

    SUM(region_size_in_bytes) AS Total,

    CASE WHEN region_state = 0x00001000

    THEN 1

    WHEN region_state = 0x00002000

    THEN 2

    WHEN region_state = 0x00010000

    THEN 3

    END AS SortOrder

    FROM sys.dm_os_virtual_address_dump

    GROUP BY region_state

    UNION ALL

    SELECT CASE WHEN region_type = 0x00020000

    THEN 'Private'

    WHEN region_type = 0x00040000

    THEN 'Mapped'

    WHEN region_type = 0x01000000

    THEN 'Image'

    END AS 'Type',

    MIN(region_size_in_bytes) AS Minimum,

    MAX(region_size_in_bytes) AS Maximum,

    SUM(region_size_in_bytes) / COUNT(*) AS Average,

    COUNT(*) AS [Blk count],

    SUM(region_size_in_bytes) AS Total,

    CASE WHEN region_type = 0x00020000

    THEN 4

    WHEN region_type = 0x00040000

    THEN 5

    WHEN region_type = 0x01000000

    THEN 6

    END

    FROM sys.dm_os_virtual_address_dump

    WHERE region_allocation_base_address <> 0x0

    GROUP BY region_type

    )

    SELECT Type,

    Minimum,

    Maximum,

    Average,

    [Blk count],

    Total

    FROM CTE

    ORDER BY SortOrder

    GO

    It would be good if you could run the above query when your server is showing the symptoms, then post the resultset back to the forum for examination.

    In our case the problem turned out to be a memory leak associated with the sp_OAMethod and the sp_OAGetProperty extended procedures which were being called every few minutes by, ironically enough, our 3rd-party monitoring software. After raising a call with MS, they suggested a fix of applying SP3 CU1, which resolved the problem. Here's a link to the KB article describing this particular issue:

    http://support.microsoft.com/kb/937277

    There are other causes of VAS fragmentation, such as the use of Linked Servers, and the use of sp_xml_preparedocument without a corresponding sp_xml_removedocument, to name a couple. Also, do you have CLR enabled?

    Cheers

    Chris

    Thanks. I'm actually not at all familiar with VAS fragmentation so I'll have to research the topic in general. Thanks for the tip. Answers to some of your questions:

    1) We're not using any extended stored procedures in our custom code. However, not sure if SharePoint makes use of them...I'd have to research

    2) We're on SP3 (original...no pose CUME's)

    3) We do make use of linked servers. The linked server with the most activity is to our AS400. SSIS does a lot of work across that link each night.

    4) Turns out the /3Gb switch was never enabled on this server...just PAE and AWE.

    5) Here are the results of your query...although I'm not entirely sure how I should use these results:

    TYPE MIN MAX AVG Blk Count Total

    Commit409684672512 48258 10609 511979520

    Reserve40964157440 183412 8179 1500131328

    Free40963145728 224018 604 135307264

    Private409684672512 105801 18208 1926438912

    Mapped40961060864 107751 62 6680576

    Image409624948736 141561 558 78991360

Viewing 15 posts - 1 through 15 (of 31 total)

You must be logged in to reply to this topic. Login to reply