December 11, 2018 at 3:00 pm
I've never seen this before, and I have no idea what to do (other than kill the SPID).
Third-party vendor code running on SQL 2014 SP2 generated a slowness complaint, and the vendor was engaged to help understand what's going on. The workload is an ETL; while transforming some data, the query goes parallel with a DOP of 8, then stalls and makes no progress for hours. They asked me to take a look, and after some digging this is what I found:
session_id  scheduler_id  worker_address      context_switch_count  task_state  wait_type  wait_duration_ms
165         0             0x000000375BF28160  6793019               RUNNING     NULL       NULL
165         0             0x000000375514E160  443453                SUSPENDED   CXPACKET   12692158
165         1             0x0000002BF5AC8160  69815                 SUSPENDED   CXPACKET   12692158
165         2             0x00000028B24D6160  48514                 SUSPENDED   CXPACKET   12692158
165         3             0x0000001E97970160  799102                SUSPENDED   CXPACKET   12692158
165         4             0x000000065A910160  74906                 SUSPENDED   CXPACKET   12692158
165         5             0x000000380650E160  342172                SUSPENDED   CXPACKET   12692358
165         5             0x000000314495E160  5047486               SUSPENDED   CXPACKET   12692158
165         6             0x0000002EAD0E6160  352198                SUSPENDED   CXPACKET   12692158
165         7             0x000000199D466160  106761                SUSPENDED   CXPACKET   12692158
The documented wait type for the running worker is SOS_SCHEDULER_YIELD, which to me means it's using up its quantum and going back to the runnable list. It really seems like it's deadlocked or something, but SQL isn't detecting it? I checked memory grants and everything seems OK. Nothing is waiting in sys.dm_io_pending_io_requests, and the SPID is just accumulating CXPACKET waits.
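For reference, output like the above comes from joining the task/worker DMVs; roughly this (a sketch, not necessarily the exact query I ran):
-- Per-worker task and wait info for the stalled session (165 in this case)
SELECT  t.session_id,
        t.scheduler_id,
        t.worker_address,
        w.context_switch_count,
        t.task_state,
        wt.wait_type,
        wt.wait_duration_ms
FROM    sys.dm_os_tasks AS t
JOIN    sys.dm_os_workers AS w
        ON w.worker_address = t.worker_address
LEFT JOIN sys.dm_os_waiting_tasks AS wt
        ON wt.waiting_task_address = t.task_address
WHERE   t.session_id = 165;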
Thoughts?
December 11, 2018 at 5:24 pm
Not nearly enough information. My recommendation is to download a copy of sp_WhoIsActive by Adam Machanic (Google for it, please) and find out what is running (it'll show the actual code) when this happens. From the stall I see, I suspect an "illegal update" but there's no way of knowing without being able to see the code that's running.
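For example, a quick call like this (parameters per the sp_WhoIsActive documentation; just a sketch) shows the running statement, its plan, and the per-task wait detail:
-- Show the running statement, its plan, and per-task wait detail
EXEC sp_WhoIsActive
     @get_plans = 1,
     @get_task_info = 2;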
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2018 at 7:18 am
The query is an insert that is part of a stored procedure. They are using a TABLOCK hint on the table being written to as well.
(table and some column names have been changed). The only thing I've noticed is that they appear to have duplicated one of the inner join conditions in the WHERE clause as well.
INSERT
INTO dbo.TEMP WITH (TABLOCK)
SELECT
T1.GROUP,
T1.INVOICE,
NULL AS LAST_TRANSACTION_ID,
NULL AS LAST_TXN_NUM,
NULL AS LAST_CODE
FROM
dbo.TEMP_WORK T1
INNER JOIN
TEMP_WORK T2
ON T2.GROUP = T1.GROUP AND
T2.INVOICE = T1.INVOICE AND
T2.CODE <> T1.CODE AND
T2.POST_PERIOD = T1.POST_PERIOD
WHERE
T1.CODE <> T2.CODE AND
T1.CUR_AR <> 0 AND
T2.CUR_AR <> 0
GROUP BY
T1.GROUP,
T1.INVOICE
The other detail I should have mentioned is that there were no IOs accumulating -- nothing was being read by the parallel worker threads, and nothing was being written either. One of the parallel threads showed pending IOs, but when I checked the pending IO DMV there wasn't anything listed. I find that a bit weird.
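For completeness, the cross-check was roughly along these lines (a sketch, not the exact queries): the worker-level pending IO count against the actual pending IO request list.
-- Pending IO count as reported on each worker for the session
SELECT  t.session_id,
        t.scheduler_id,
        w.pending_io_count
FROM    sys.dm_os_tasks AS t
JOIN    sys.dm_os_workers AS w
        ON w.worker_address = t.worker_address
WHERE   t.session_id = 165;

-- Individual outstanding IO requests (came back empty here)
SELECT  io_type,
        io_pending,
        io_pending_ms_ticks,
        scheduler_address
FROM    sys.dm_io_pending_io_requests;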
December 12, 2018 at 6:24 pm
It appears to me that this is an attempt at the "simple" deduplication of data (Temp_Work is self joined) combined with a check for valid data (the <> 0 stuff). That leads me to believe that a simple DISTINCT or partitioned ROW_NUMBER() would do the trick quite nicely here. I'm looking deeper.
In the meantime, can you confirm that's what the intent of this code is?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2018 at 7:09 pm
Adding a bit of a correction, it looks like the dupe-check goes a bit further. It also looks like the only time a particular group/invoice should be returned is if there's more than one distinct code for any given group/invoice/post_period. Is that also what was intended by this code?
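If that's the intent, a self-join-free version could look something like the following (just an untested sketch against the changed names; [GROUP] is bracketed only because it's a reserved word):
INSERT INTO dbo.TEMP WITH (TABLOCK)
SELECT DISTINCT
        q.[GROUP],
        q.INVOICE,
        NULL AS LAST_TRANSACTION_ID,
        NULL AS LAST_TXN_NUM,
        NULL AS LAST_CODE
FROM    (
        -- group/invoice/post_period combinations with more than one distinct code
        SELECT  [GROUP],
                INVOICE
        FROM    dbo.TEMP_WORK
        WHERE   CUR_AR <> 0
        GROUP BY [GROUP], INVOICE, POST_PERIOD
        HAVING  COUNT(DISTINCT CODE) > 1
        ) AS q;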
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2018 at 10:05 am
Jeff Moden - Wednesday, December 12, 2018 7:09 PM
Adding a bit of a correction, it looks like the dupe-check goes a bit further. It also looks like the only time a particular group/invoice should be returned is if there's more than one distinct code for any given group/invoice/post_period. Is that also what was intended by this code?
I'm not sure exactly what they are trying to do here, but while doing some testing around this (applying a MAXDOP of 1) they realized there was duplicate data upstream of them that had caused an explosion in the number of records.
Not sure if that is actually the true root cause of the query hanging and not making any progress, but it certainly could explain the initial slowness complaint. Since I wasn't familiar with the data itself, I didn't pick up on it.
Thanks for your help, and hopefully this will also lead to some clean up of some of the more inefficient queries that were found.
Jim
December 15, 2018 at 10:01 pm
No problem. Thanks for the feedback.
--Jeff Moden
Change is inevitable... Change for the better is not.