Tracing a performance issue

  • Hello,

    We have a SQL agent job that runs daily. It runs a procedure, which for the purposes of this post, I will refer to as 'Procedure A'. That procedure runs other procedures.

    Every few months we get an issue where it appears to have stalled. Task manager is showing the SQL Server process at close to 100% so it doesn't seem to be non SQL things taking server resources.

    I'm struggling to find the cause and would appreciate some suggestions on what to do next time to try and find the cause or what concept I've not understand and need to go and learn about (I have done some research to try and work it out myself but not found the answer).

    I use the following query to try and establish what's happening:

    SELECT er.session_id,es.host_name,es.program_name, login_name, er.status,command, er.blocking_session_id,er.wait_type,er.wait_time, 
    DB_NAME(er.database_id) AS db_name,
    text, query_plan
    FROM sys.dm_exec_sessions es
    INNER join sys.dm_exec_requests er ON es.session_id=er.session_id
    outer APPLY sys.dm_exec_sql_text(er.sql_handle)
    outer APPLY sys.dm_exec_query_plan(er.plan_handle)
    WHERE er.session_id<>@@SPID
    AND host_name IS NOT null

    This is the relevant columns from the results:

    Sessions

    The wait_type for session 52 makes sense, it's waiting for the other sessions.

    Session 77 - That doesn't appear to be waiting for anything but it had been running for hours when it should be minutes.

    I suspect session 77 is the root cause of the issue but I can't prove it. I wonder if it's taking all of the cpu cores but I've not found a way to prove that?

    I thought about the possibility of a bad execution plan (the data changes every day) but if I look at the plan it looks the same as usual so I don't think it's that but I could be wrong.

    The wait_times for sessions 53 and 54 are much higher than they would normally be (the procedures should complete in a few minutes). If my understanding is correct CXpacket and cxconsumer are parallelism. Maybe they are waiting cpu cores but then they would be in status runnable?

    If locks were an issue then I think the sessions would show as being blocked?

    Or possibly my query could be excluding the real cause. I have tried doing things like just looking at the tables individually and not found anything that looks relevant. Or possibly I'm taking completely the wrong approach?

    My current fix is to stop the SQL agent job and restart it. It then completes OK but I'd like to properly understand what's happening.

    Thanks

  • The SELECT INTOs will hold locks.  Those might be causing issues.

    One good way to avoid that is to an initial SELECT TOP (0) ... INTO to create the table, then use INSERT INTO ... SELECT to actually load the table.  You can still get minimal logging, if otherwise available, but you won't hold as many meta locks in SQL Server that way.

     

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • There are a couple things in your post that may need more clarification:

    as_1234 wrote:

    Task manager is showing the SQL Server process at close to 100% so it doesn't seem to be non SQL things taking server resources.

    Are you looking at in Task Manager by logging onto the server?  Why?  A server is not a desktop.  There are few reasons to RDP into a server as a DBA.  Your RDP session may be contributing to the issue.

    SQL Server process in Task Manager should be close to 100% in memory usage.  That's how its designed to work.  If CPU is close to 100%, is that normal?  You are looking at a small slice of time in Task Manager.

    as_1234 wrote:

    The wait_times for sessions 53 and 54 are much higher than they would normally be (the procedures should complete in a few minutes). 

    Wait times from the sys.dm_exec_requests DMV may not correspond to the actual execution time of a procedure.  This is the definition from MS's documentation: If the request is currently blocked, this column returns the duration in milliseconds, of the current wait. Not nullable.

    as_1234 wrote:

    If my understanding is correct CXpacket and cxconsumer are parallelism. Maybe they are waiting cpu cores but then they would be in status runnable?

    CXPacket and CSConsumer wait types indicate that there is parallelism occurring.  They are not typically a problem.  There is a LOT of bad advice about these wait types.

    as_1234 wrote:

    If locks were an issue then I think the sessions would show as being blocked?

    A lock is normal.  All RDMS need to take locks.  The system will not function without locks. "Block" is not one of the statuses you will get from your query.  These are the only statuses in sys.dm_exec_requests:

    background

    rollback

    running

    runnable

    sleeping

    suspended

    Deadlocks are a potential issue. A lock is absolutely required.

    as_1234 wrote:

    My current fix is to stop the SQL agent job and restart it. It then completes OK but I'd like to properly understand what's happening.

    This is not a fix.  By restarting, you may be losing relevant information to diagnose this issue.

    Have you ruled out settings, other processes, and so forth? Is this a virtual machine?  If so, is the VM configured properly?  It appears that you are digging into the weeds when you need to start at a higher level.

    Have you captured the execution plans, and operational stats, for the various executions of the query and compared them?  Have you noted what other things were executing when this was slow? Query Store will help with this.

    I suggest you look at these things:

    Much more detail on wait types

    https://www.sqlskills.com/help/waits/

    This should be installed on all servers.

    http://whoisactive.com/

    A wealth of information about your system from the DMV queries.

    https://glennsqlperformance.com/

    First Responder kit. What does sp_blitz and sp_blitzcache show?

    https://www.brentozar.com/first-aid/

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thank you both for your help.

    ScottPletcher wrote:

    The SELECT INTOs will hold locks.  Those might be causing issues.

    One good way to avoid that is to an initial SELECT TOP (0) ... INTO to create the table, then use INSERT INTO ... SELECT to actually load the table.  You can still get minimal logging, if otherwise available, but you won't hold as many meta locks in SQL Server that way.

    Thank you for the information. If the locks from sessions 53 and 76 were preventing other sessions from progressing then I think they would show as blocked by 53 or 76?

    Michael L John wrote:

    Are you looking at in Task Manager by logging onto the server?  Why?  A server is not a desktop.  There are few reasons to RDP into a server as a DBA.  Your RDP session may be contributing to the issue.

    SQL Server process in Task Manager should be close to 100% in memory usage.  That's how its designed to work.  If CPU is close to 100%, is that normal?  You are looking at a small slice of time in Task Manager.

    I used RDP to view task manager. The CPU was close to 100% and it was all being used by SQL Server. I checked because I wanted to confirm that it wasn't something else, such as an anti virus process, using the CPU.

    Michael L John wrote:

    Wait times from the sys.dm_exec_requests DMV may not correspond to the actual execution time of a procedure.  This is the definition from MS's documentation: If the request is currently blocked, this column returns the duration in milliseconds, of the current wait. Not nullable.

    The wait times showed in the results were much longer then the normal run time of the entire procedure.

    Michael L John wrote:

    This is not a fix.  By restarting, you may be losing relevant information to diagnose this issue.

    I agree. That was the reason for my post. The query I posted isn't telling me what the issue is so I'm trying to work out what else I can do/what I'm missing.

    Michael L John wrote:

    Have you ruled out settings, other processes, and so forth? Is this a virtual machine?  If so, is the VM configured properly?  It appears that you are digging into the weeds when you need to start at a higher level.

    This has all been checked and I can't see any issues.

    Michael L John wrote:

    Have you captured the execution plans, and operational stats, for the various executions of the query and compared them?  Have you noted what other things were executing when this was slow? Query Store will help with this.

    I have looked in query store and I can't see any difference between different a successful run and an one that doesn't complete.

    Michael L John wrote:

    I suggest you look at these things:

    Much more detail on wait types

    https://www.sqlskills.com/help/waits/

    This should be installed on all servers.

    http://whoisactive.com/

    A wealth of information about your system from the DMV queries.

    https://glennsqlperformance.com/

    First Responder kit. What does sp_blitz and sp_blitzcache show?

    https://www.brentozar.com/first-aid/

    Thank you for the links. I will have a look. I hadn't seen the https://glennsqlperformance.com/ page before.

     

    • This reply was modified 1 month ago by  as_1234. Reason: Trying to fix quotes
  • When you say the VM has been checked, can you elaborate?  This really smells like VMWare ballooning resources.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Apologies for taking a couple of days to reply. Those kind of checks are done by our VM admins as I'm not a VM admin.

    I think by ballooning you mean that the physical host server CPU resources are being used by other VMs at the same time therefore the VM I'm using doesn't have enough CPU resource available to it?

    When I restart our process it immediately starts working OK. That would make it unlikely to be ballooning as the behaviour of the host/other VMs wouldn't be affected by a process restarting on one VM?

  • When the process hangs, is the CPU/memory pegged at 100 percent?

    Please check for CPU pressure or memory pressure.

    Pinal Dave has good articles on wait types.

     

     

     

    DBASupport

  • Also, ensure the MIN/MAX SQL memory is assigned properly.

    Server has 32 GB, Min: 8 GB / Max 24 GB (for example)

    With VM servers, sufficient memory is essential.

     

     

     

     

     

    DBASupport

  • as_1234 wrote:

    Apologies for taking a couple of days to reply. Those kind of checks are done by our VM admins as I'm not a VM admin.

    I think by ballooning you mean that the physical host server CPU resources are being used by other VMs at the same time therefore the VM I'm using doesn't have enough CPU resource available to it?

    When I restart our process it immediately starts working OK. That would make it unlikely to be ballooning as the behaviour of the host/other VMs wouldn't be affected by a process restarting on one VM?

    VMWare has very specific recommendations about configuring the VM for SQL Server.   I have had to educate far too many "VM Admins" regarding the configuration of the VM.  I suggest you read the following white paper from VMWare, and try to work with your VM Admins to determine if it is configured properly.

    https://www.vmware.com/docs/architecting-microsoft-sql-server-on-vmware-vsphere

    as_1234 wrote:

    When I restart our process it immediately starts working OK. That would make it unlikely to be ballooning as the behaviour of the host/other VMs wouldn't be affected by a process restarting on one VM?

    When the process starts, it will need more resources and VMWare is probably allocating them.  This is exactly what I would expect.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • cyrusbaratt wrote:

    Also, ensure the MIN/MAX SQL memory is assigned properly.

    Server has 32 GB, Min: 8 GB / Max 24 GB (for example)

    With VM servers, sufficient memory is essential.

    How did you come up with 24 GB allocated to SQL if there is 32 GB allocated to the VM?  That may not be the correct allocation of resources.  There are many factors to consider for this calculation.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply