Islands And Gaps - I'm Stuck...Again

Question

Islands And Gaps - I'm Stuck...Again

autoexcrement

SSCertifiable

Points: 5956
More actions
July 25, 2018 at 5:34 pm

#354946

I feel like I've solved this type of problem before, and yet I'm stuck and need some expert help. Here is some DDL so you can see a relevant snippet of what I'm working with:
CREATE TABLE #Scratch (     PayeeID CHAR(3) NOT NULL,     HRAttribute VARCHAR(25) NOT NULL,     ValueText VARCHAR(25) NOT NULL,     StartDate DATE NOT NULL, ) GO
INSERT INTO #Scratch (PayeeID, HRAttribute, ValueText, StartDate) SELECT '123', 'SUPERVISOR ID', '111', '09/29/2014' UNION ALL SELECT '123', 'SUPERVISOR ID', '111', '10/07/2014' UNION ALL SELECT '123', 'SUPERVISOR ID', '111', '04/01/2015' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '09/25/2015' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '12/22/2015' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '04/01/2016' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '10/05/2016' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '01/09/2017' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '03/01/2017' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '10/01/2017' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '01/01/2018' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '03/01/2018' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '06/29/2018' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '07/12/2018' GO
From there I basically want to identify "islands" of data--that is, contiguous groups of rows where PayeeID, HRAttribute, and ValueText are identical, when sorted by StartDate. Here's the SQL I have so far, which seems to almost be working but not quite:
SELECT     *,     RN1 = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate),     RN2 = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute ORDER BY StartDate),     GROUP_NUMBER = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute ORDER BY StartDate) - ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate) FROM #Scratch ORDER BY StartDate ;
PayeeID HRAttribute ValueText StartDate RN1      RN2      GROUP_NUMBER ----------------------------------------------------------------------------- 123 SUPERVISOR ID 111 2014-09-29 1       1       0 123 SUPERVISOR ID 111 2014-10-07 2       2       0 123 SUPERVISOR ID 111 2015-04-01 3       3       0 123 SUPERVISOR ID 222 2015-09-25 1       4       3 123 SUPERVISOR ID 333 2015-12-22 1       5       4 123 SUPERVISOR ID 333 2016-04-01 2       6       4 123 SUPERVISOR ID 222 2016-10-05 2       7       5 123 SUPERVISOR ID 333 2017-01-09 3       8       5 123 SUPERVISOR ID 333 2017-03-01 4       9       5 123 SUPERVISOR ID 333 2017-10-01 5       10       5 123 SUPERVISOR ID 333 2018-01-01 6       11       5 123 SUPERVISOR ID 333 2018-03-01 7       12       5 123 SUPERVISOR ID 222 2018-06-29 3       13       10 123 SUPERVISOR ID 222 2018-07-12 4       14       10
As you can see, my column GROUP_NUMBER would work fine to identify "islands" if ValueText never reverted to a value it previously held. But in this case, value can change to a new value and then back to a previous value. On 10/05/2016, when this Payee's Supervisor ID changes back to '222', I want RN1 to reset to 1, but instead it continues to 2. I understand why that is, the ROW_NUMBER code is easy for me to comprehend, but I'm not sure what to change to get the desired results. Thoughts?
"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 1

Dang. I've run into this problem before and resolved it but I've misplaced my notes on how I fixed it and so doing this totally from memory, which might not be all that it's cracked up to be. 😀

The following works for what you have and, IIRC, will maintain the temporal order because of the values of R1-R2 but I don't have the time to test all of the scenarios that I tested with before and so I might not be recalling correctly. The ",FinalGrouping" column is what you're after. I left all the other columns in place just so you can see how it works. What people forget is that R1-R2 is a grouping for each (in this case) ValueText and so you need that extra grouping using DENSE_RANK to resolve what looks like a tie but actually isn't.

WITH cte AS ( SELECT * ,R1 = ROW_NUMBER() OVER (PARTITION BY PayeeID, HRAttribute ORDER BY StartDate, ValueText) ,R2 = ROW_NUMBER() OVER (PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate, ValueText) FROM #Scratch ) SELECT * ,R1R2 = R1-R2 ,FinalGrouping = DENSE_RANK() OVER (PARTITION BY PayeeID, HRAttribute ORDER BY R1-R2,ValueText) FROM cte ORDER BY PayeeID, HRAttribute, StartDate, ValueText ;

Please do test the be-jeebers out of it and forgive me if I've remembered incorrectly.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

autoexcrement SSCertifiable Points: 5956 More actions · Answer 2

Awesome, thank you Jeff! I will test this out tomorrow and post back!

"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 3

Here is a more efficient way as it eliminates 3 out of 4 sort operators found in the ROW_NUMBER solution's execution plan.
😎
USE TEEST; GO SET NOCOUNT ON;

IF OBJECT_ID(N'tempdb..#Scratch') IS NOT NULL DROP TABLE #Scratch; CREATE TABLE #Scratch ( PayeeID CHAR(3) NOT NULL, HRAttribute VARCHAR(25) NOT NULL, ValueText VARCHAR(25) NOT NULL, StartDate DATE NOT NULL, )

INSERT INTO #Scratch (PayeeID, HRAttribute, ValueText, StartDate) SELECT '123', 'SUPERVISOR ID', '111', '09/29/2014' UNION ALL SELECT '123', 'SUPERVISOR ID', '111', '10/07/2014' UNION ALL SELECT '123', 'SUPERVISOR ID', '111', '04/01/2015' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '09/25/2015' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '12/22/2015' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '04/01/2016' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '10/05/2016' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '01/09/2017' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '03/01/2017' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '10/01/2017' UNION ALL SELECT '123', 'SUPERVISOR ID', '444', '10/02/2017' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '01/01/2018' UNION ALL SELECT '123', 'SUPERVISOR ID', '333', '03/01/2018' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '06/29/2018' UNION ALL SELECT '123', 'SUPERVISOR ID', '222', '07/12/2018' ; ;WITH GROUPED_EMP AS ( SELECT CASE WHEN SC.ValueText = LAG(SC.ValueText,1,SC.ValueText) OVER ( PARTITION BY SC.PayeeID ,SC.HRAttribute ORDER BY SC.[StartDate] ASC ) THEN 0 ELSE 1 END AS GRP_FLG ,SC.PayeeID ,SC.HRAttribute ,SC.ValueText ,SC.StartDate FROM #Scratch SC ) SELECT GE.PayeeID ,GE.HRAttribute ,GE.ValueText ,GE.StartDate ,SUM(GE.GRP_FLG) OVER ( PARTITION BY GE.PayeeID ,GE.HRAttribute ORDER BY GE.StartDate ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS GRP_NO FROM GROUPED_EMP GE;

Output
PayeeID HRAttribute ValueText StartDate GRP_NO ------- -------------- ---------- ---------- ------- 123 SUPERVISOR ID 111 2014-09-29 0 123 SUPERVISOR ID 111 2014-10-07 0 123 SUPERVISOR ID 111 2015-04-01 0 123 SUPERVISOR ID 222 2015-09-25 1 123 SUPERVISOR ID 333 2015-12-22 2 123 SUPERVISOR ID 333 2016-04-01 2 123 SUPERVISOR ID 222 2016-10-05 3 123 SUPERVISOR ID 333 2017-01-09 4 123 SUPERVISOR ID 333 2017-03-01 4 123 SUPERVISOR ID 333 2017-10-01 4 123 SUPERVISOR ID 444 2017-10-02 5 123 SUPERVISOR ID 333 2018-01-01 6 123 SUPERVISOR ID 333 2018-03-01 6 123 SUPERVISOR ID 222 2018-06-29 7 123 SUPERVISOR ID 222 2018-07-12 7

An added bonus is that it increments the Group Number in chronological order.

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 4

I completely missed that this was posted on a 2012 forum. LAG is definitely the way to go for this. Thanks, Eirikur.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jason A. Long SSC-Insane Points: 23790 More actions · Answer 5

To add to Eirikur's good code... You'll want to add, what Itzik Ben-Gan has dubbed, a POC index.
If you check the execution plan for the code that Eirikur supplied, you'll notice that the most expensive operation is the initial sort caused by LAG function.
Adding the following index will completely eliminate the need for that sort.

CREATE INDEX ix_ScratchPayeeID_HRAttribute_StartDate ON #Scratch ( PayeeID, HRAttribute, StartDate ) INCLUDE ( ValueText );

autoexcrement SSCertifiable Points: 5956 More actions · Answer 6

Thanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips.

Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.

"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 7

Jason A. Long - Thursday, July 26, 2018 8:47 AM
To add to Eirikur's good code... You'll want to add, what Itzik Ben-Gan has dubbed, a POC index.
If you check the execution plan for the code that Eirikur supplied, you'll notice that the most expensive operation is the initial sort caused by LAG function.
Adding the following index will completely eliminate the need for that sort.
CREATE INDEX ix_ScratchPayeeID_HRAttribute_StartDate ON #Scratch ( PayeeID, HRAttribute, StartDate ) INCLUDE ( ValueText );

Thanks Jason, good point, thought of adding the Partition Order Covering (POC) index addition but ran out of time this morning.
😎

Here is Brent Ozar's brilliant write-up on the subject.

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 8

autoexcrement - Thursday, July 26, 2018 9:14 AM
Thanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips.
Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.

You are very welcome.
😎

I use the method of "results to text" in SSMS, chop off the trailing things with columnar editing (the ALT KEY and mouse select), paste it into a Plain IF code section and finally format it in a fixed width font. Not perfect but serves the purpose.
My annoyance is that there is no "PRE" type IF code which would make it better.

autoexcrement SSCertifiable Points: 5956 More actions · Answer 9

Wow, never used results to text, and never knew about "alt + select"! Thanks. Still, getting any semblance of clean formatting in my post took ages. Hope it's helpful to someone. 🙂

"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

Jason A. Long SSC-Insane Points: 23790 More actions · Answer 10

autoexcrement - Thursday, July 26, 2018 9:14 AM
Thanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips.
Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.

It's tough to make specific reccomendations without knowing what the source table actually looks like.or how this fits into the rest of the query.
Basically there are two options... #1 keep it as is. #2 is to dump the UNPIVOT results into a #temp table and index that.
There are positives and negatives to both...
Using a temp table has the advantages of simplifying the orerall query by breaking the unpivot and gaps & islands into separate commands. It also allows SQL Server to create stastics for the #temp table and it would allow you to create the afformentioned POC index.
The downside, is that creating temp tables will cause a recompolition when placed in a stored procedure . The impact of this type of recompolition can usually be negated by simply moving all DDL statements to the very beginning of the procedure.
Also, it's entirely possible that the time needed create the POC index may completly negate the benefite of having in place for the gaps & islands command.

My gut says that the upsides of using a #temp table would outweight the downsides but the only way to know for sure is to write it both ways and see which one performs better.

autoexcrement SSCertifiable Points: 5956 More actions · Answer 11

I came to the same conclusion. 🙂 I dumped the UNPIVOT results into a temp table and indexed it as you suggested.

"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

Jason A. Long SSC-Insane Points: 23790 More actions · Answer 12

autoexcrement - Thursday, July 26, 2018 12:15 PM
I came to the same conclusion. 🙂 I dumped the UNPIVOT results into a temp table and indexed it as you suggested.

Nice. Thanks for the feedback.