December 11, 2018 at 3:00 pm
I've never seen this before, and I have no idea what to do (other than kill the SPID).
Third-party vendor code running on SQL 2014 SP2 generated a slowness complaint, and the vendor was engaged to help understand what's going on. The workload is an ETL; while transforming some data, the query goes parallel with a DOP of 8, then stalls and makes no progress for hours. They asked me to take a look, and after some digging this is what I found:
session_id  scheduler_id  worker_address      context_switch_count  task_state  wait_type  wait_duration_ms
165         0             0x000000375BF28160  6793019               RUNNING     NULL       NULL
165         0             0x000000375514E160  443453                SUSPENDED   CXPACKET   12692158
165         1             0x0000002BF5AC8160  69815                 SUSPENDED   CXPACKET   12692158
165         2             0x00000028B24D6160  48514                 SUSPENDED   CXPACKET   12692158
165         3             0x0000001E97970160  799102                SUSPENDED   CXPACKET   12692158
165         4             0x000000065A910160  74906                 SUSPENDED   CXPACKET   12692158
165         5             0x000000380650E160  342172                SUSPENDED   CXPACKET   12692358
165         5             0x000000314495E160  5047486               SUSPENDED   CXPACKET   12692158
165         6             0x0000002EAD0E6160  352198                SUSPENDED   CXPACKET   12692158
165         7             0x000000199D466160  106761                SUSPENDED   CXPACKET   12692158
The documented wait type for the running worker is SOS_SCHEDULER_YIELD, which to me means it's using up its quantum and going back to the runnable list. It really seems like it's deadlocked or something, but SQL isn't detecting it? I checked memory grants and everything seems OK. Nothing is waiting in sys.dm_io_pending_io_requests, and the SPID is just accumulating CXPACKET waits.
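For reference, output like the above comes from joining the task/worker DMVs; roughly this (a sketch, not necessarily the exact query I ran):
-- Per-worker task and wait info for the stalled session (165 in this case)
SELECT  t.session_id,
        t.scheduler_id,
        t.worker_address,
        w.context_switch_count,
        t.task_state,
        wt.wait_type,
        wt.wait_duration_ms
FROM    sys.dm_os_tasks AS t
JOIN    sys.dm_os_workers AS w
        ON w.worker_address = t.worker_address
LEFT JOIN sys.dm_os_waiting_tasks AS wt
        ON wt.waiting_task_address = t.task_address
WHERE   t.session_id = 165;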
Thoughts?
December 11, 2018 at 5:24 pm
Not nearly enough information. My recommendation is to download a copy of sp_WhoIsActive by Adam Machanic (Google for it, please) and find out what is running (it'll show the actual code) when this happens. From the stall I see, I suspect an "illegal update" but there's no way of knowing without being able to see the code that's running.
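For example, a quick call like this (parameters per the sp_WhoIsActive documentation; just a sketch) shows the running statement, its plan, and the per-task wait detail:
-- Show the running statement, its plan, and per-task wait detail
EXEC sp_WhoIsActive
     @get_plans = 1,
     @get_task_info = 2;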
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2018 at 7:18 am
The query is an insert that is part of a stored procedure. They are using a TABLOCK hint on the table being written to as well.
(table and some column names have been changed). The only thing I've noticed is that they appear to have duplicated one of the inner join conditions in the WHERE clause as well.
INSERT
INTO dbo.TEMP WITH (TABLOCK)
SELECT
T1.GROUP,
T1.INVOICE,
NULL AS LAST_TRANSACTION_ID,
NULL AS LAST_TXN_NUM,
NULL AS LAST_CODE
FROM
dbo.TEMP_WORK T1
INNER JOIN
TEMP_WORK T2
ON T2.GROUP = T1.GROUP AND
T2.INVOICE = T1.INVOICE AND
T2.CODE <> T1.CODE AND
T2.POST_PERIOD = T1.POST_PERIOD
WHERE
T1.CODE <> T2.CODE AND
T1.CUR_AR <> 0 AND
T2.CUR_AR <> 0
GROUP BY
T1.GROUP,
T1.INVOICE
The other detail I should have mentioned is that there were no IOs accumulating -- nothing was being read by the parallel worker threads, and nothing was being written either. One of the parallel threads showed pending IOs, but when I checked the pending IO DMV there wasn't anything listed. I find that a bit weird.
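For completeness, the cross-check was roughly along these lines (a sketch, not the exact queries): the worker-level pending IO count against the actual pending IO request list.
-- Pending IO count as reported on each worker for the session
SELECT  t.session_id,
        t.scheduler_id,
        w.pending_io_count
FROM    sys.dm_os_tasks AS t
JOIN    sys.dm_os_workers AS w
        ON w.worker_address = t.worker_address
WHERE   t.session_id = 165;

-- Individual outstanding IO requests (came back empty here)
SELECT  io_type,
        io_pending,
        io_pending_ms_ticks,
        scheduler_address
FROM    sys.dm_io_pending_io_requests;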
December 12, 2018 at 6:24 pm
It appears to me that this is an attempt at the "simple" deduplication of data (Temp_Work is self joined) combined with a check for valid data (the <> 0 stuff). That leads me to believe that a simple DISTINCT or partitioned ROW_NUMBER() would do the trick quite nicely here. I'm looking deeper.
In the meantime, can you confirm that's what the intent of this code is?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2018 at 7:09 pm
Adding a bit of a correction, it looks like the dupe-check goes a bit further. It also looks like the only time a particular group/invoice should be returned is if there's more than one distinct code for any given group/invoice/post_period. Is that also what was intended by this code?
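If that's the intent, a self-join-free version could look something like the following (just an untested sketch against the changed names; [GROUP] is bracketed only because it's a reserved word):
INSERT INTO dbo.TEMP WITH (TABLOCK)
SELECT DISTINCT
        q.[GROUP],
        q.INVOICE,
        NULL AS LAST_TRANSACTION_ID,
        NULL AS LAST_TXN_NUM,
        NULL AS LAST_CODE
FROM    (
        -- group/invoice/post_period combinations with more than one distinct code
        SELECT  [GROUP],
                INVOICE
        FROM    dbo.TEMP_WORK
        WHERE   CUR_AR <> 0
        GROUP BY [GROUP], INVOICE, POST_PERIOD
        HAVING  COUNT(DISTINCT CODE) > 1
        ) AS q;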
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2018 at 10:05 am
Jeff Moden - Wednesday, December 12, 2018 7:09 PM
Adding a bit of a correction, it looks like the dupe-check goes a bit further. It also looks like the only time a particular group/invoice should be returned is if there's more than one distinct code for any given group/invoice/post_period. Is that also what was intended by this code?
I'm not sure exactly what they are trying to do here, but while doing some testing around this (applying a MAXDOP of 1) they realized there was duplicate data upstream of them that had caused an explosion in the number of records.
Not sure if that is actually the true root cause of the query hanging and not making any progress, but it certainly could explain the initial slowness complaint. Since I wasn't familiar with the data itself, I didn't pick up on it.
Thanks for your help, and hopefully this will also lead to some clean up of some of the more inefficient queries that were found.
Jim
December 15, 2018 at 10:01 pm
No problem. Thanks for the feedback.
--Jeff Moden
Change is inevitable... Change for the better is not.