In part 1 I explained how I was hunting for the root cause of some query timeouts that happened every once in a while. I knew for certain that the affected queries were waiting for 30 seconds to acquire an intent exclusive lock on something before the timeout occurred.
Our good old friend sys.sysprocesses actually gives us this kind of information. Let me use the same setup as in part 1 to trigger a blocking scenario. From connection 1 I'm executing the following, which leaves an open transaction that holds a lock on DummyTable:
CREATE TABLE DummyTable (col1 INT)
INSERT INTO DummyTable (col1) VALUES (1)
GO
BEGIN TRAN
GO
DELETE FROM DummyTable
WAITFOR DELAY '00:01:00'
GO
From connection 2 I'm executing the following, which will stay blocked until connection 1 either commits or rolls back:
DELETE FROM DummyTable
GO
From a third connection I'm running this query against sys.sysprocesses:
SELECT procs.spid,
       procs.blocked AS BlockingSPID,
       DB_NAME(procs.dbid) AS DatabaseName,
       procs.program_name,
       procs.loginame,
       procs.waitresource,
       procs.lastwaittype
FROM sys.sysprocesses procs
WHERE spid IN (54, 55)
SPID = 54 is my connection 1, and SPID = 55 is my second connection.
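If you want to play along, your SPID values will of course be different on your system. A quick way to find the SPID to put in the WHERE clause is to run this from each connection itself (@@SPID returns the session id of the current connection):

--Returns the session id of the connection running the query
SELECT @@SPID AS MySpid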
The output shows that my connection 2 (spid 55) is blocked by spid 54 while waiting to acquire a LCK_M_U lock on the resource RID:6:1:169:0. This is pretty useful information, if we could just get the SQL text returned as well. And luckily we can.
We can extend the query with an OUTER APPLY on sys.dm_exec_sql_text. This will give us the text of the queries, so let's see how that goes:
SELECT procs.spid,
       procs.blocked AS BlockingSPID,
       DB_NAME(procs.dbid) AS DatabaseName,
       procs.program_name,
       procs.loginame,
       procs.waitresource,
       procs.lastwaittype,
       text
FROM sys.sysprocesses procs
OUTER APPLY sys.dm_exec_sql_text(sql_handle)
WHERE spid IN (54, 55)
Now we are actually able to see the exact query that is causing the blocking.
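As a small side note, the waitresource value RID:6:1:169:0 can be decoded as database_id : file_id : page_id : slot, so at least the database part is easy to translate. A minimal sketch, using the database id 6 from my example output:

--The first number in RID:6:1:169:0 is the database id
SELECT DB_NAME(6) AS BlockedDatabase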
Because I knew that the blocking was going on for at least 30 seconds, I simply needed to sample this data every 15 seconds until I had caught the sinner. I created a temporary table to hold the data, and then created a loop that would run until I stopped it manually. By creating a recursive query, I could also find the blocking chain, so the final query looked like this:
CREATE TABLE #Blocking
(
    Timestamp DATETIME,
    SPID SMALLINT,
    BlockingSPID INT,
    DatabaseName VARCHAR(255),
    LoginName VARCHAR(255),
    ProgramName VARCHAR(255),
    BlockingStatement VARCHAR(MAX),
    Waitresource VARCHAR(500),
    lastwaittype VARCHAR(500),
    RowNo BIGINT,
    LevelRow INT
)

--Run until manually stopped
WHILE (1 = 1)
BEGIN
    ;WITH Processes AS
    (
        SELECT procs.spid,
               procs.blocked AS BlockingSPID,
               DB_NAME(procs.dbid) AS DatabaseName,
               procs.program_name AS ProgramName,
               procs.loginame AS LoginName,
               procs.waitresource,
               procs.lastwaittype,
               CAST(text AS VARCHAR(MAX)) AS Text
        FROM sys.sysprocesses procs
        CROSS APPLY sys.dm_exec_sql_text(sql_handle)
        WHERE procs.spid > 50
    ),
    Blocking AS
    (
        --Define base of recursive query
        SELECT procs.SPID,
               procs.BlockingSPID,
               procs.DatabaseName,
               procs.LoginName,
               procs.ProgramName,
               procs.Text,
               procs.Waitresource,
               procs.lastwaittype,
               ROW_NUMBER() OVER (ORDER BY procs.SPID) AS RowNo,
               0 AS LevelRow
        FROM Processes procs
        JOIN Processes procs2 ON procs.SPID = procs2.BlockingSPID
        WHERE procs.BlockingSPID = 0
        UNION ALL
        --UNION the recursive step
        SELECT ProcsR.SPID,
               ProcsR.BlockingSPID,
               ProcsR.DatabaseName,
               ProcsR.LoginName,
               ProcsR.ProgramName,
               ProcsR.Text,
               ProcsR.Waitresource,
               ProcsR.lastwaittype,
               d.RowNo,
               d.LevelRow + 1
        FROM Processes ProcsR
        JOIN Blocking d ON ProcsR.BlockingSPID = d.SPID
        WHERE ProcsR.BlockingSPID > 0
    )
    INSERT INTO #Blocking
    SELECT GETDATE() AS Timestamp, *
    FROM Blocking

    --Wait for 15 seconds before starting over
    WAITFOR DELAY '00:00:15'
END
This query uses multiple levels of CTEs, with the first level, named Processes, being pretty much identical to the query shown before. The next CTE, called Blocking, creates the recursive structure that makes it possible to identify the blocking chain. My connections 1 and 2 were still running, so running the above query for just a few seconds sampled the data I wanted:
--After stopping the sampling manually:
SELECT *
FROM #Blocking
ORDER BY Timestamp, RowNo, LevelRow
Each sample has the same value in the Timestamp column. Within a sample, all rows belonging to the same blocking chain share the same RowNo, and LevelRow orders the statements within that chain.
In my example, I only ran the sampling query long enough to get one sample. And the only blocking happening on the system was my two connections, so only one blocking chain was captured… the one with RowNo = 1.
LevelRow = 0 is the root of the blocking chain, and in my example this is my connection 1.
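That also means that once the table holds many samples, the roots of all blocking chains are easy to pull out. A small variation on the query above, which I find handy, lists the root blocker of every chain in every sample:

--One row per blocking chain per sample: the session at the root of the chain
SELECT Timestamp, RowNo, SPID, LoginName, ProgramName, BlockingStatement, lastwaittype
FROM #Blocking
WHERE LevelRow = 0
ORDER BY Timestamp, RowNo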
So by running the sampling query on the production system, and yet again waiting for the timeouts to occur, I was actually able to identify the root cause of all the timeouts I was debugging. The blocking query was holding a table lock for a long time, which perfectly explained what was happening. Fixing the root cause is another story, which is out of scope for this post.
Wrap up
A system was experiencing intermittent query timeouts. My debugging process was:
- Use a classic server-side trace to catch the queries that timed out, to figure out whether they were actually using up resources for the full 30 seconds or were waiting for something.
- It turned out they were waiting for something, so I ran an XEvent session to track all wait events with a duration > 200 ms (part 1 of this blog post series).
- The XEvent session revealed that the queries were waiting to acquire an intent exclusive lock, so the next step was to sample blocking chains to find the root query (This blog post)
- The blocking sampling revealed the query that caused the heavy table locks, and now it was easy to go ahead and fix the problem.
But remember..
This debugging process will not hunt down the root cause in every scenario. The outcome of each step determines what the next step should be. If my initial server-side trace had shown heavy CPU usage for the entire 30 seconds, my next step would have been to look into execution plans etc., and it wouldn't have made sense to set up an XEvent session capturing wait_info and continue down that path.
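Just to give an idea of what that alternative path could look like: had it been a CPU problem, I would probably have started with the plan cache DMVs instead. A rough sketch (not part of the original debugging) could be something like this, listing the heaviest CPU consumers among the currently cached plans:

--Top CPU consumers among cached plans (total_worker_time is in microseconds)
SELECT TOP (20)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       st.text AS sql_text,
       qp.query_plan
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY qs.total_worker_time DESC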
Nevertheless, I hope this little blog post series gave you an idea of how you could hunt down your own nasty queries.