Strange processor usage with SQL 2005 64-bit

  • Hello

    In our company we installed our first 64-bit SQL 2005 Cluster.

    Server IBM LS21, 2 CPU AMD Opteron (Dual Core), 4 GB Memory, Windows Std.

    I installed SQL 2005 Cluster as active/active.

    My problem is the CPU usage on one node. I noticed an usage of 25 % without any activities on SQL (it is a dedicated cluster). The strange thing is that CPU 3 and 4 are working CPU 1 and 2 have basically no activity. After running performance counter against processor and process activitie I found out that CPU 3 has a load of 100% through SQL. If I configure the CPU affinity to not use CPU 3 the load pass to CPU 4. Configuring the affinity to use CPU 1 and 2 I got a load of 1% / 2%.

    On the other node I do not have the same behaviour, but memory usage is higher (3.4 GB).

    I spent a few hours searching in the internet without result. Basically CPU affinity should be configured for system with 8 CPU or more.

    Did anyone made the same expirience or do somebody have a suggestion ?

    Thank you in advance

    Moreno

  • Hi Moreno

    Sounds a fun one  Have set Affinity Masking?  If you want sql to use all processors leave it as default.. ..you only use affinity masking when you don't want to use all the processors or you have a reason to bind specific schedulers to specific processors.   If you don't think you have, try running the following query... 

    SELECT s.scheduler_id AS SchedulerID, s.cpu_id AS CPUID, s.is_online AS IsOnline, s.current_tasks_count AS CurrentTasks, s.runnable_tasks_count AS CurrentRunnableTasks, s.load_factor AS LoadFactor FROM sys.dm_os_schedulers s

    If CPUID is anything other than 255 you're schedulers are bound to specific cpu's, if any schedulers have a value of 0 for IsOnline Affinity Masking has been set to stop sql using certain processors.  Any value > 0 on CurrentRunnableTasks will mean the problem is causing a cpu bottleneck on sql.  Also might be interesting to look at Loadfactor, this is used internally to balance the load on the schedulers.

    If your happy affinity masking hasn't been set and the schedulers are free to use all and any cpu's... ...open regedit and create a DWORD value called ThreadNameFormat under the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PerfProc\Performance subkey and set its value to 2 (this enables you to see the os thread numbers in perfmon)... ...open perfmon, look at the thread performance object and add a % Processor Time counter for each sql thread... ...highlight any threads that look to be significantly higher than any others, make a note of the thread number and then run this query using the thread number in the where clause...

    SELECT

    STasks.session_id, SThreads.os_thread_id, b.command FROM sys.dm_os_tasks AS STasks INNER JOIN sys.dm_os_threads AS SThreads ON STasks.worker_address = SThreads.worker_address LEFT OUTER JOIN sys.dm_exec_requests b ON STasks.session_id = b.session_id WHERE STasks.session_id IS NOT NULL ORDER BY STasks.session_id

    This may show you which spid may be the cause of the issue along with what it's running, potentially you could use this information on other dmv's to narrow right down to a particular sql statement that may be causing you an issue...

    Hope this helps

    Ben

  • Hi Ben

    Thank you for the useful advice. After localizing the thread loading one CPU to 100%, I run the queries against the server. The result is not very encouraging: no sql handle, no statement, no blocking...ect.

    I' m afraid that something with the hardware or os configuration is wrong. Well I keep searching.

    Thanks again

    Moreno

  • Hi Moreno, Ben

    I have been experiencing a similar problem on a dev box.  The server is a HP585 with 4x AMD Opteron 874 64 Dual Core, Windows 2003 dev edition, SQL Server 2005 database engine and SSIS only.  Everything is patched up to date.  I have not modified any CPU affinity settings

    I noticed that CPU usage by sqlservr.exe had remained stable at 25% for over a day, despite zero apparent sql server activity, and that the load was alternating between bursts of 100% cpu usage on CPUs 4,5,6,7.

    I could see no sign of database activity via the activity monitor in management studio, and ran profiler for all event classes in turn to try and get some idea of what SQL server was doing to use this cpu resource.  All I could find was a great deal of GhostCleanupTask transactions running against a consistent subset of the databases on the server.  Approx 50 of these txns are starting at the exact same time as each other, and all are ending with a duration of 0.  the whole process repeats after 5 seconds.  I cannot find a lot of information about the GhostCleanuptask, and I have no idea why only a certain set of the databases are being affected in this way.  Certainly there is nothing to distinguish these database from others on the server that are not being 'Cleaned up' every 5 secs.

    I doubt that the repitition of this system task could account for the high and consistent cpu usage, but I can find no other signs of sql server activity.

    For a brief period usage on CPUs 4 and 5 dropped off to zero. CPUs 6 and7 remained busy for no apparent reason and the overall cpu  used by sqlservr.exe fell to 13% for this period. 

    I followed Ben's suggestion and the first query returns the following - indicating a CPU bottleneck on SQL.

    SchedulerID CPUID  IsOnline CurrentTasks CurrentRunnableTasks LoadFactor

    ----------- ------ -------- ------------ -------------------- -----------

    0           255    1        3            0                    3

    1           255    1        5            0                    3

    257         255    1        1            0                    1

    2           255    1        3            0                    3

    3           255    1        5            0                    3

    258         255    1        1            0                    1

    4           255    1        3            0                    3

    5           255    1        7            1                    5

    259         255    1        1            0                    1

    6           255    1        2            0                    4

    7           255    1        4            0                    3

    260         255    1        1            0                    1

    255         255    1        2            0                    1

    In perfmon I identified two threads consistently using 100% cpu time.  These turned out to be two lazy writer threads.

     

    session_id os_thread_id command

    6          3844         LAZY WRITER

    8          3848         LAZY WRITER

     

    I found a blog by Slava Oks on "spinlocks" where he shows that a system thread like lazywriter can result in high cpu usage. http://blogs.msdn.com/slavao/archive/2005/11/12/492119.aspx

    The problem disappears when I close down SQL server Mangement Studio.  ie CPU drops to 0%.  even though the sql server service is still running.  As soon as management studio is opened, CPU back up to 25% - even before connecting to any database! 

    I cannot shed any light on what is going on, but I hope this provides some clues.  Please post if you can figure out what is going on.

    Cheers,

    Jeremy

     

  • on a multi proc/core box you won't get loading across the processors unless there's enough work for the box to do, under light load most of the cores/procs will sit there doing nothing. Don't know about the ms issue, can't say I've seen this. UMSSTATS can help you see the work loads for the procs/cores.

    Can I assume your sql setup is out of the box ? I wasn't able to reproduce your cpu pattern on my boxes ( but I don't have such hardware as yours to hand just now )

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Hi there

    first at all, thanks for the replies...it seems that I have posted an interesting problem :-))

    I analyzed the problem further finding out (thank you Ben) that just one thread is causing the load on one core only. Unfortunatelly, as the server is still not in use, the only thing I could find out is that the lazy writer is involved.

    The installation is out of the box. I submitted the case to Microsoft and they are still searching a solution.

    As soon I will get a solution (I hope so), I will post it here.

    Thanks a lot again for your help and advices.

    Regards Moreno

  • Hi guys,

    Having the same problem here, the post from Ben regarding the registry key and system views was great, helped me determine that the lazy writer was the culprit in my case. I have raised this with Microsoft Premier Support, and they have informed me that a new hot fix has just been finished, titled "High CPU consumption by LazyWriter process on a NUMA enabled SQL Server 2005 x64", which will address this issue. It is being released within the next week. Will let you know if it fixes the issue I am seeing,

    Ed

  • Hi guys

    from Microsoft I got the same hotfix like Edward, but it worked on one node of my cluster only.....

    Finally I got from Microsoft an other Hotfix. This is still not documented and it is a collection of fixes which are not included in SP2. The build no. will be 2216 and the cpu problem will be referred in the KB Article Number(s): 931820, 931821.

    In fact it seems to be a problem of sql internally. Sometimes the lazy writer starts to consume CPU because it cannot handle the memory. Well I know it is a strange explanation, but it is what I got from the supporter. BTW this kind of problem can affect x86 systems as well.

    After a 4 weeks tests and so on, I am happy the cluster is working fine and the customers too :-))))

    Regards,

    Moreno

  • Hi Moreno

    I have the same problem with my cluster, I spoke with my microsoft support and the send me the SP2 and fixes to 931820,

    I was install the first the sp2 (in 2 cluster nodes) and now I can´t install the 931820 hotfix. I can´t select the instance name in setup process

    How you installed the hotfix.

    Thank you

  • Hi Ivan

    I have not installed SP2 yet. According to Microsoft 931820 should not be included in SP2. They advice me to install the hotfix again in case I'm going to upgrade to SP2.

    It may sound like a stupid question, but have you restarted the nodes ? If yes make sure you choose the right sequence installing the hotfix. You should have 6 exe files, start with the sql2005_kb number, then

    AS if you have installed analisys services

    DTS for SSIS

    NS for notificatin services

    RS reporting services

    sqltools2005 for client tools

    If now, I advice you to restart.

    I hope it help.

    Regards

    Moreno

  • Hello,

    Where I can download the hotfix 931820? I can not find it on Microsoft site.

    Thanks,

    Ye

  • You have to ask Microsoft directly for it. It is not public. However have not checked now if it may be included in SP2a.

    Regards,

    Moreno

  • I have same issue till now. This really does not help. Even though after running hotfixes which I got directly from Mircrosoft. After running around, I found that GhostCleanupTask might be the main culprit. Nobody has answer how to avoid this. This GhostCleanupTask is mainly depended on Full Text Search service. I am running select statement over million of transactions to do insert/delete/update, when I reboot server it takes 15 mins-few hrs, but after that it becomes very very slow which takes me 15-24hrs to complete that task. Paging/Indexing/Updatestat after run also no use.

  • Uh, is it really the same Problem ? Perhaps you should try to check the server performance with perfmon to find out if the HW has some problem. In my case I got only one CPU with heavy load without any operations running.

    Moreno

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply