December 2, 2024 at 5:13 pm
Hello,
We have a SQL agent job that runs daily. It runs a procedure, which for the purposes of this post, I will refer to as 'Procedure A'. That procedure runs other procedures.
Every few months we get an issue where it appears to have stalled. Task manager is showing the SQL Server process at close to 100% so it doesn't seem to be non SQL things taking server resources.
I'm struggling to find the cause and would appreciate some suggestions on what to do next time to try and find the cause or what concept I've not understand and need to go and learn about (I have done some research to try and work it out myself but not found the answer).
I use the following query to try and establish what's happening:
SELECT er.session_id,es.host_name,es.program_name, login_name, er.status,command, er.blocking_session_id,er.wait_type,er.wait_time,
DB_NAME(er.database_id) AS db_name,
text, query_plan
FROM sys.dm_exec_sessions es
INNER join sys.dm_exec_requests er ON es.session_id=er.session_id
outer APPLY sys.dm_exec_sql_text(er.sql_handle)
outer APPLY sys.dm_exec_query_plan(er.plan_handle)
WHERE er.session_id<>@@SPID
AND host_name IS NOT null
This is the relevant columns from the results:
The wait_type for session 52 makes sense, it's waiting for the other sessions.
Session 77 - That doesn't appear to be waiting for anything but it had been running for hours when it should be minutes.
I suspect session 77 is the root cause of the issue but I can't prove it. I wonder if it's taking all of the cpu cores but I've not found a way to prove that?
I thought about the possibility of a bad execution plan (the data changes every day) but if I look at the plan it looks the same as usual so I don't think it's that but I could be wrong.
The wait_times for sessions 53 and 54 are much higher than they would normally be (the procedures should complete in a few minutes). If my understanding is correct CXpacket and cxconsumer are parallelism. Maybe they are waiting cpu cores but then they would be in status runnable?
If locks were an issue then I think the sessions would show as being blocked?
Or possibly my query could be excluding the real cause. I have tried doing things like just looking at the tables individually and not found anything that looks relevant. Or possibly I'm taking completely the wrong approach?
My current fix is to stop the SQL agent job and restart it. It then completes OK but I'd like to properly understand what's happening.
Thanks
December 2, 2024 at 6:41 pm
The SELECT INTOs will hold locks. Those might be causing issues.
One good way to avoid that is to an initial SELECT TOP (0) ... INTO to create the table, then use INSERT INTO ... SELECT to actually load the table. You can still get minimal logging, if otherwise available, but you won't hold as many meta locks in SQL Server that way.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
December 2, 2024 at 10:11 pm
There are a couple things in your post that may need more clarification:
Task manager is showing the SQL Server process at close to 100% so it doesn't seem to be non SQL things taking server resources.
Are you looking at in Task Manager by logging onto the server? Why? A server is not a desktop. There are few reasons to RDP into a server as a DBA. Your RDP session may be contributing to the issue.
SQL Server process in Task Manager should be close to 100% in memory usage. That's how its designed to work. If CPU is close to 100%, is that normal? You are looking at a small slice of time in Task Manager.
The wait_times for sessions 53 and 54 are much higher than they would normally be (the procedures should complete in a few minutes).
Wait times from the sys.dm_exec_requests DMV may not correspond to the actual execution time of a procedure. This is the definition from MS's documentation: If the request is currently blocked, this column returns the duration in milliseconds, of the current wait. Not nullable.
If my understanding is correct CXpacket and cxconsumer are parallelism. Maybe they are waiting cpu cores but then they would be in status runnable?
CXPacket and CSConsumer wait types indicate that there is parallelism occurring. They are not typically a problem. There is a LOT of bad advice about these wait types.
If locks were an issue then I think the sessions would show as being blocked?
A lock is normal. All RDMS need to take locks. The system will not function without locks. "Block" is not one of the statuses you will get from your query. These are the only statuses in sys.dm_exec_requests:
background
rollback
running
runnable
sleeping
suspended
Deadlocks are a potential issue. A lock is absolutely required.
My current fix is to stop the SQL agent job and restart it. It then completes OK but I'd like to properly understand what's happening.
This is not a fix. By restarting, you may be losing relevant information to diagnose this issue.
Have you ruled out settings, other processes, and so forth? Is this a virtual machine? If so, is the VM configured properly? It appears that you are digging into the weeds when you need to start at a higher level.
Have you captured the execution plans, and operational stats, for the various executions of the query and compared them? Have you noted what other things were executing when this was slow? Query Store will help with this.
I suggest you look at these things:
Much more detail on wait types
https://www.sqlskills.com/help/waits/
This should be installed on all servers.
A wealth of information about your system from the DMV queries.
https://glennsqlperformance.com/
First Responder kit. What does sp_blitz and sp_blitzcache show?
https://www.brentozar.com/first-aid/
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
December 3, 2024 at 2:20 pm
Thank you both for your help.
The SELECT INTOs will hold locks. Those might be causing issues.
One good way to avoid that is to an initial SELECT TOP (0) ... INTO to create the table, then use INSERT INTO ... SELECT to actually load the table. You can still get minimal logging, if otherwise available, but you won't hold as many meta locks in SQL Server that way.
Thank you for the information. If the locks from sessions 53 and 76 were preventing other sessions from progressing then I think they would show as blocked by 53 or 76?
Are you looking at in Task Manager by logging onto the server? Why? A server is not a desktop. There are few reasons to RDP into a server as a DBA. Your RDP session may be contributing to the issue.
SQL Server process in Task Manager should be close to 100% in memory usage. That's how its designed to work. If CPU is close to 100%, is that normal? You are looking at a small slice of time in Task Manager.
I used RDP to view task manager. The CPU was close to 100% and it was all being used by SQL Server. I checked because I wanted to confirm that it wasn't something else, such as an anti virus process, using the CPU.
Wait times from the sys.dm_exec_requests DMV may not correspond to the actual execution time of a procedure. This is the definition from MS's documentation: If the request is currently blocked, this column returns the duration in milliseconds, of the current wait. Not nullable.
The wait times showed in the results were much longer then the normal run time of the entire procedure.
This is not a fix. By restarting, you may be losing relevant information to diagnose this issue.
I agree. That was the reason for my post. The query I posted isn't telling me what the issue is so I'm trying to work out what else I can do/what I'm missing.
Have you ruled out settings, other processes, and so forth? Is this a virtual machine? If so, is the VM configured properly? It appears that you are digging into the weeds when you need to start at a higher level.
This has all been checked and I can't see any issues.
Have you captured the execution plans, and operational stats, for the various executions of the query and compared them? Have you noted what other things were executing when this was slow? Query Store will help with this.
I have looked in query store and I can't see any difference between different a successful run and an one that doesn't complete.
I suggest you look at these things:
Much more detail on wait types
https://www.sqlskills.com/help/waits/
This should be installed on all servers.
A wealth of information about your system from the DMV queries.
https://glennsqlperformance.com/
First Responder kit. What does sp_blitz and sp_blitzcache show?
Thank you for the links. I will have a look. I hadn't seen the https://glennsqlperformance.com/ page before.
December 3, 2024 at 3:13 pm
When you say the VM has been checked, can you elaborate? This really smells like VMWare ballooning resources.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
December 5, 2024 at 8:14 pm
Apologies for taking a couple of days to reply. Those kind of checks are done by our VM admins as I'm not a VM admin.
I think by ballooning you mean that the physical host server CPU resources are being used by other VMs at the same time therefore the VM I'm using doesn't have enough CPU resource available to it?
When I restart our process it immediately starts working OK. That would make it unlikely to be ballooning as the behaviour of the host/other VMs wouldn't be affected by a process restarting on one VM?
January 2, 2025 at 3:03 am
When the process hangs, is the CPU/memory pegged at 100 percent?
Please check for CPU pressure or memory pressure.
Pinal Dave has good articles on wait types.
DBASupport
January 2, 2025 at 3:08 am
Also, ensure the MIN/MAX SQL memory is assigned properly.
Server has 32 GB, Min: 8 GB / Max 24 GB (for example)
With VM servers, sufficient memory is essential.
DBASupport
January 2, 2025 at 2:48 pm
Apologies for taking a couple of days to reply. Those kind of checks are done by our VM admins as I'm not a VM admin.
I think by ballooning you mean that the physical host server CPU resources are being used by other VMs at the same time therefore the VM I'm using doesn't have enough CPU resource available to it?
When I restart our process it immediately starts working OK. That would make it unlikely to be ballooning as the behaviour of the host/other VMs wouldn't be affected by a process restarting on one VM?
VMWare has very specific recommendations about configuring the VM for SQL Server. I have had to educate far too many "VM Admins" regarding the configuration of the VM. I suggest you read the following white paper from VMWare, and try to work with your VM Admins to determine if it is configured properly.
https://www.vmware.com/docs/architecting-microsoft-sql-server-on-vmware-vsphere
When I restart our process it immediately starts working OK. That would make it unlikely to be ballooning as the behaviour of the host/other VMs wouldn't be affected by a process restarting on one VM?
When the process starts, it will need more resources and VMWare is probably allocating them. This is exactly what I would expect.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
January 2, 2025 at 2:50 pm
Also, ensure the MIN/MAX SQL memory is assigned properly.
Server has 32 GB, Min: 8 GB / Max 24 GB (for example)
With VM servers, sufficient memory is essential.
How did you come up with 24 GB allocated to SQL if there is 32 GB allocated to the VM? That may not be the correct allocation of resources. There are many factors to consider for this calculation.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply