July 25, 2018 at 5:34 pm
I feel like I've solved this type of problem before, and yet I'm stuck and need some expert help. Here is some DDL so you can see a relevant snippet of what I'm working with:CREATE TABLE #Scratch
(
PayeeID CHAR(3) NOT NULL,
HRAttribute VARCHAR(25) NOT NULL,
ValueText VARCHAR(25) NOT NULL,
StartDate DATE NOT NULL,
)
GO
INSERT INTO #Scratch (PayeeID, HRAttribute, ValueText, StartDate)
SELECT '123', 'SUPERVISOR ID', '111', '09/29/2014' UNION ALL
SELECT '123', 'SUPERVISOR ID', '111', '10/07/2014' UNION ALL
SELECT '123', 'SUPERVISOR ID', '111', '04/01/2015' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '09/25/2015' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '12/22/2015' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '04/01/2016' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '10/05/2016' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '01/09/2017' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '03/01/2017' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '10/01/2017' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '01/01/2018' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '03/01/2018' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '06/29/2018' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '07/12/2018'
GO
From there I basically want to identify "islands" of data--that is, contiguous groups of rows where PayeeID, HRAttribute, and ValueText are identical, when sorted by StartDate. Here's the SQL I have so far, which seems to almost be working but not quite:SELECT
*,
RN1 = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate),
RN2 = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute ORDER BY StartDate),
GROUP_NUMBER = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute ORDER BY StartDate) - ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate)
FROM #Scratch
ORDER BY StartDate
;
PayeeID HRAttribute ValueText StartDate RN1 RN2 GROUP_NUMBER
-----------------------------------------------------------------------------
123 SUPERVISOR ID 111 2014-09-29 1 1 0
123 SUPERVISOR ID 111 2014-10-07 2 2 0
123 SUPERVISOR ID 111 2015-04-01 3 3 0
123 SUPERVISOR ID 222 2015-09-25 1 4 3
123 SUPERVISOR ID 333 2015-12-22 1 5 4
123 SUPERVISOR ID 333 2016-04-01 2 6 4
123 SUPERVISOR ID 222 2016-10-05 2 7 5
123 SUPERVISOR ID 333 2017-01-09 3 8 5
123 SUPERVISOR ID 333 2017-03-01 4 9 5
123 SUPERVISOR ID 333 2017-10-01 5 10 5
123 SUPERVISOR ID 333 2018-01-01 6 11 5
123 SUPERVISOR ID 333 2018-03-01 7 12 5
123 SUPERVISOR ID 222 2018-06-29 3 13 10
123 SUPERVISOR ID 222 2018-07-12 4 14 10
As you can see, my column GROUP_NUMBER would work fine to identify "islands" if ValueText never reverted to a value it previously held. But in this case, value can change to a new value and then back to a previous value. On 10/05/2016, when this Payee's Supervisor ID changes back to '222', I want RN1 to reset to 1, but instead it continues to 2. I understand why that is, the ROW_NUMBER code is easy for me to comprehend, but I'm not sure what to change to get the desired results. Thoughts?
July 25, 2018 at 9:21 pm
Dang. I've run into this problem before and resolved it but I've misplaced my notes on how I fixed it and so doing this totally from memory, which might not be all that it's cracked up to be. 😀
The following works for what you have and, IIRC, will maintain the temporal order because of the values of R1-R2 but I don't have the time to test all of the scenarios that I tested with before and so I might not be recalling correctly. The ",FinalGrouping" column is what you're after. I left all the other columns in place just so you can see how it works. What people forget is that R1-R2 is a grouping for each (in this case) ValueText and so you need that extra grouping using DENSE_RANK to resolve what looks like a tie but actually isn't.
WITH cte AS
(
SELECT *
,R1 = ROW_NUMBER() OVER (PARTITION BY PayeeID, HRAttribute ORDER BY StartDate, ValueText)
,R2 = ROW_NUMBER() OVER (PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate, ValueText)
FROM #Scratch
)
SELECT *
,R1R2 = R1-R2
,FinalGrouping = DENSE_RANK() OVER (PARTITION BY PayeeID, HRAttribute ORDER BY R1-R2,ValueText)
FROM cte
ORDER BY PayeeID, HRAttribute, StartDate, ValueText
;
Please do test the be-jeebers out of it and forgive me if I've remembered incorrectly.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2018 at 9:47 pm
Awesome, thank you Jeff! I will test this out tomorrow and post back!
July 26, 2018 at 12:11 am
Here is a more efficient way as it eliminates 3 out of 4 sort operators found in the ROW_NUMBER solution's execution plan.
😎
USE TEEST;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'tempdb..#Scratch') IS NOT NULL DROP TABLE #Scratch;
CREATE TABLE #Scratch
(
PayeeID CHAR(3) NOT NULL,
HRAttribute VARCHAR(25) NOT NULL,
ValueText VARCHAR(25) NOT NULL,
StartDate DATE NOT NULL,
)
INSERT INTO #Scratch (PayeeID, HRAttribute, ValueText, StartDate)
SELECT '123', 'SUPERVISOR ID', '111', '09/29/2014' UNION ALL
SELECT '123', 'SUPERVISOR ID', '111', '10/07/2014' UNION ALL
SELECT '123', 'SUPERVISOR ID', '111', '04/01/2015' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '09/25/2015' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '12/22/2015' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '04/01/2016' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '10/05/2016' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '01/09/2017' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '03/01/2017' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '10/01/2017' UNION ALL
SELECT '123', 'SUPERVISOR ID', '444', '10/02/2017' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '01/01/2018' UNION ALL
SELECT '123', 'SUPERVISOR ID', '333', '03/01/2018' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '06/29/2018' UNION ALL
SELECT '123', 'SUPERVISOR ID', '222', '07/12/2018'
;
;WITH GROUPED_EMP AS
(
SELECT
CASE
WHEN SC.ValueText = LAG(SC.ValueText,1,SC.ValueText) OVER
(
PARTITION BY SC.PayeeID
,SC.HRAttribute
ORDER BY SC.[StartDate] ASC
) THEN 0
ELSE 1
END AS GRP_FLG
,SC.PayeeID
,SC.HRAttribute
,SC.ValueText
,SC.StartDate
FROM #Scratch SC
)
SELECT
GE.PayeeID
,GE.HRAttribute
,GE.ValueText
,GE.StartDate
,SUM(GE.GRP_FLG) OVER
(
PARTITION BY GE.PayeeID
,GE.HRAttribute
ORDER BY GE.StartDate ASC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS GRP_NO
FROM GROUPED_EMP GE;
Output
PayeeID HRAttribute ValueText StartDate GRP_NO
------- -------------- ---------- ---------- -------
123 SUPERVISOR ID 111 2014-09-29 0
123 SUPERVISOR ID 111 2014-10-07 0
123 SUPERVISOR ID 111 2015-04-01 0
123 SUPERVISOR ID 222 2015-09-25 1
123 SUPERVISOR ID 333 2015-12-22 2
123 SUPERVISOR ID 333 2016-04-01 2
123 SUPERVISOR ID 222 2016-10-05 3
123 SUPERVISOR ID 333 2017-01-09 4
123 SUPERVISOR ID 333 2017-03-01 4
123 SUPERVISOR ID 333 2017-10-01 4
123 SUPERVISOR ID 444 2017-10-02 5
123 SUPERVISOR ID 333 2018-01-01 6
123 SUPERVISOR ID 333 2018-03-01 6
123 SUPERVISOR ID 222 2018-06-29 7
123 SUPERVISOR ID 222 2018-07-12 7
An added bonus is that it increments the Group Number in chronological order.
July 26, 2018 at 4:58 am
I completely missed that this was posted on a 2012 forum. LAG is definitely the way to go for this. Thanks, Eirikur.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 26, 2018 at 8:47 am
To add to Eirikur's good code... You'll want to add, what Itzik Ben-Gan has dubbed, a POC index.
If you check the execution plan for the code that Eirikur supplied, you'll notice that the most expensive operation is the initial sort caused by LAG function.
Adding the following index will completely eliminate the need for that sort.
CREATE INDEX ix_ScratchPayeeID_HRAttribute_StartDate ON #Scratch (
PayeeID,
HRAttribute,
StartDate
)
INCLUDE (
ValueText
);
July 26, 2018 at 9:14 am
Thanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips.
Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.
July 26, 2018 at 9:16 am
Jason A. Long - Thursday, July 26, 2018 8:47 AMTo add to Eirikur's good code... You'll want to add, what Itzik Ben-Gan has dubbed, a POC index.
If you check the execution plan for the code that Eirikur supplied, you'll notice that the most expensive operation is the initial sort caused by LAG function.
Adding the following index will completely eliminate the need for that sort.
CREATE INDEX ix_ScratchPayeeID_HRAttribute_StartDate ON #Scratch (
PayeeID,
HRAttribute,
StartDate
)
INCLUDE (
ValueText
);
Thanks Jason, good point, thought of adding the Partition Order Covering (POC) index addition but ran out of time this morning.
😎
July 26, 2018 at 9:37 am
autoexcrement - Thursday, July 26, 2018 9:14 AMThanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips.Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.
You are very welcome.
😎
I use the method of "results to text" in SSMS, chop off the trailing things with columnar editing (the ALT KEY and mouse select), paste it into a Plain IF code section and finally format it in a fixed width font. Not perfect but serves the purpose.
My annoyance is that there is no "PRE" type IF code which would make it better.
July 26, 2018 at 10:16 am
Wow, never used results to text, and never knew about "alt + select"! Thanks. Still, getting any semblance of clean formatting in my post took ages. Hope it's helpful to someone. 🙂
July 26, 2018 at 11:05 am
autoexcrement - Thursday, July 26, 2018 9:14 AMThanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips.Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.
It's tough to make specific reccomendations without knowing what the source table actually looks like.or how this fits into the rest of the query.
Basically there are two options... #1 keep it as is. #2 is to dump the UNPIVOT results into a #temp table and index that.
There are positives and negatives to both...
Using a temp table has the advantages of simplifying the orerall query by breaking the unpivot and gaps & islands into separate commands. It also allows SQL Server to create stastics for the #temp table and it would allow you to create the afformentioned POC index.
The downside, is that creating temp tables will cause a recompolition when placed in a stored procedure . The impact of this type of recompolition can usually be negated by simply moving all DDL statements to the very beginning of the procedure.
Also, it's entirely possible that the time needed create the POC index may completly negate the benefite of having in place for the gaps & islands command.
My gut says that the upsides of using a #temp table would outweight the downsides but the only way to know for sure is to write it both ways and see which one performs better.
July 26, 2018 at 12:15 pm
I came to the same conclusion. 🙂 I dumped the UNPIVOT results into a temp table and indexed it as you suggested.
July 26, 2018 at 1:50 pm
autoexcrement - Thursday, July 26, 2018 12:15 PMI came to the same conclusion. 🙂 I dumped the UNPIVOT results into a temp table and indexed it as you suggested.
Nice. Thanks for the feedback.
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply