Is there another way to solve this "group in 5 day spans" problem?

Question

Is there another way to solve this "group in 5 day spans" problem?

Jeff Moden

SSC Guru

Points: 1004410
More actions
December 26, 2022 at 2:49 am

#4130131
I ran into an interesting problem over on StackOverflow at the following link...
https://stackoverflow.com/questions/47388557/group-by-a-set-of-data-and-order-items-in-each-part-separately-from-other-parts
To summarize, here's the data the op posted with 1 row removed because removing that row broke the "Accepted" solution.
```
   DROP TABLE IF EXISTS #MyHead;
GO
 SELECT *
   INTO #MyHead
   FROM (VALUES
         (1  ,'2017-11-24 09:45:00.000') --These should be in Grp #1.
        ,(3  ,'2017-11-23 09:43:00.000') --The 24th thru the 20th is a 5 day span
        ,(4  ,'2017-11-22 09:40:00.000')
        ,(5  ,'2017-11-21 09:39:00.000')
        ,(12 ,'2017-11-20 09:49:00.000')
        ,(9  ,'2017-11-20 09:37:00.000')
                                         --<--<< Notice that there are 2 days in this gap.
        ,(7  ,'2017-11-17 09:38:00.000') --Group 2 starts here, as does a new 
        ,(2  ,'2017-11-15 09:41:00.000') --span start date. Even though there are
        ,(0  ,'2017-11-14 09:44:00.000') --less than 5 entries, the span goes from
        ,(15 ,'2017-11-13 09:47:00.000') --the 17 to the 13th, which is 5 days
                                         --<--<< Notice that this actually isn't a gap.
        ,(7  ,'2017-11-12 09:46:00.000') --Group 3 starts here, as does a new 
        ,(20 ,'2017-11-11 09:42:00.000') --span start date
        ,(4  ,'2017-11-10 09:28:00.000')
        )v(Score,ActivationTime)
;
```
That also contains an explanation of what the OP wants the code to do. Basically, start at the latest date and include all rows that are within the 5 days (inclusive of that first date). Then, starting with the next row (which is a couple of days earlier), use that date as a new starting date for the next 5 day group. Notice that for the 3rd group (following the same rules), there are no missed days.
Here's the output from the "Quirky Update" method that I used to solve it.
The OP didn't want the "Base Date" column. I added that just to see what I was doing. The OP DOES want the Grp column. Also notice that within each 5 day group, it's sorted in order by the score and not the date. I know... that makes no sense to me either but that's what they wanted, perhaps just for easy reading for some form of analysis. That's the easy part, though. Getting the groups of 5 day spans is the hard part.
This poor ol' head of mine isn't working like it used to and so I thought I'd ask for some help. Is there a way to produce the output of the left 3 columns (Score, ActivationTime, Grp) in a manner (other than RBAR) to do this with Windowing Functions or some such? If so, I sure would like to see it.
Thanks for the help folks. Looking forward to answers even if it's "Nope... can't be done" (without the Quirky Update).
I posted this in the 2022 forum just in case there's a new wonder there but it would be cool to see it solved using SQL Server 2016.
--Jeff Moden
RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
Change is inevitable... Change for the better is not.
Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 35 total)

You must be logged in to reply to this topic. Login to reply

Jonathan AC Roberts SSCoach Points: 18363 More actions · Answer 1

I think I would resort to the RBAR cursor method:

DROP TABLE IF EXISTS #Results;
CREATE TABLE #Results(Score int, ActivationTime datetime, Grp int, BaseDate date)

DECLARE @Score int, @ActivationTime datetime
DECLARE @BaseDate datetime, @Counter int = 0, @Grp int = 1

DECLARE myCursor CURSOR LOCAL FAST_FORWARD
    FOR select h.Score, h.ActivationTime
          from #MyHead h
         order by ActivationTime desc

OPEN myCursor
FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
SET @BaseDate = @ActivationTime
WHILE @@FETCH_STATUS = 0 BEGIN
    IF @BaseDate > DATEADD(dd, 5, @ActivationTime) BEGIN
        SELECT @BaseDate=@ActivationTime, @Grp+=1
    END
    INSERT INTO #Results(Score, ActivationTime, Grp, BaseDate) VALUES (@Score, @ActivationTime, @Grp, @BaseDate)
    FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
END
CLOSE myCursor
DEALLOCATE myCursor

SELECT * 
  FROM #Results
 ORDER BY Grp, Score DESC;

results

The difference between 2017-11-12 09:46 and 2017-11-17 09:38 is 8 minutes less than 5 days so is in Grp 2.

What SQL did you use to solve it?

Jeff Moden SSC Guru Points: 1004410 More actions · Answer 2

Jonathan AC Roberts wrote:

What SQL did you use to solve it?

Ah... sorry. Like I said, I used the "Quirky Update" method, which allows variables to interplay with columns as each row passes, just like RBAR methods but at set-based speeds. Microsoft has it documented but they say it's not controllable for more than 1 row at a time. I've been using it literally for decades (it was the only non-RBAR method to do running totals prior to 2012) and I wish two things... that MS would figure out that it actually is controllable and I wish they'd figure out how to make it so we could do it in SELECTs. The closest they've come, so far, is the Windowing functions. I'm not sure how to limit one in such a fashion to effectively find the totally unpredictable "BaseDate"s in this problem.

This could probably be done using Windowing functions but I think the resource usage would be a whole lot higher and performance wouldn't be quite as good.

Here's my code that uses the test data above. I've repeated the test data here just to make things easy.

(Lordy, the double-spacing of code the code windows on this site does makes this look crazy long... It's actually kinda short for what it does).

--=================================================================================================
--      Original data by the OP but sorted by date in descending order so we can see things.
--=================================================================================================
   DROP TABLE IF EXISTS #MyHead;
GO
 SELECT *
   INTO #MyHead
   FROM (VALUES
 (1  ,'2017-11-24 09:45:00.000') --5 day span in this group
,(3  ,'2017-11-23 09:43:00.000')
,(4  ,'2017-11-22 09:40:00.000')
,(5  ,'2017-11-21 09:39:00.000')
,(12 ,'2017-11-20 09:49:00.000')
,(9  ,'2017-11-20 09:37:00.000')

,(7  ,'2017-11-17 09:38:00.000') -- This is what group 2 should be.
,(2  ,'2017-11-15 09:41:00.000') 
,(0  ,'2017-11-14 09:44:00.000')
,(15 ,'2017-11-13 09:47:00.000')
--,(3  ,'2017-11-13 09:37:57.570') --Removing this row of data caused "Accepted" solution to fail
                                   --but the QU method has no issues whether it's present or not.                                
,(7  ,'2017-11-12 09:46:00.000')
,(20 ,'2017-11-11 09:42:00.000')
,(4  ,'2017-11-10 09:28:00.000')
        )v(Score,ActivationTime)
;
GO
--=================================================================================================
--      My High performance "Quirky Update" solution.
--=================================================================================================
--===== Create and populate a tempory table, much like a Windowing function does but physical
     -- so we can reuse it.  Note that the CI is quintessential here and the temp table is
     -- necessary because the original table may have other indexes on it or the CI may change.
     -- Again, this is similar to how Windowing functions work but without the CI.
   DROP TABLE IF EXISTS #QU;
GO
 CREATE TABLE #QU
        (
         RowNum         INT PRIMARY KEY CLUSTERED
        ,Score          INT
        ,ActivationTime DATETIME
        ,WholeDate      AS CONVERT(DATE,ActivationTime) PERSISTED
        ,Grp            INT
        ,BaseDate       DATE
        )
;
--===== This does the "sequencing"
 INSERT INTO #QU WITH (TABLOCK) --For Minimal Logging if not in FULL Recovery Model
        (RowNum,Score,ActivationTime)
 SELECT  RowNum = ROW_NUMBER() OVER (ORDER BY ActivationTime DESC, Score DESC)
        ,Score
        ,ActivationTIme
   FROM #MyHead
  ORDER BY  ActivationTime DESC --Pedantic assurance of order and Minimal Logging
           ,Score DESC 
 OPTION (MAXDOP 1)              --Pedantic assurance.
;
--===== Uncomment to see the contents of the Temp Table.
-- SELECT *
--   FROM #QU 
--  ORDER BY ActivationTime DESC, Score DESC
--;
---------------------------------------------------------------------------------------------------
--===== Some necessary variables much like a RBAR method would require.
DECLARE
         @RowSafety INT  = 1
        ,@Grp       INT  = 0
        ,@LastDate  DATE = '99991231'
        ,@GrpDate   DATE = '99991230'
;
--===== Solve the problem with the "Quirky Update" method.
     -- Works exactly the same as a While loop but without the RBAR.
     -- It uses the "Pseudo-Cursor" of the table rows.
 UPDATE tgt
    SET @Grp        = tgt.Grp  = @Grp 
                               + IIF(DATEDIFF(dd,tgt.WholeDate,@GrpDate)>=5 
                                     AND @LastDate <> tgt.WholeDate
                                     ,1, 0)
        ,@GrpDate   = BaseDate = IIF(DATEDIFF(dd,tgt.WholeDate,@GrpDate)>=5 
                                     AND @LastDate <> tgt.WholeDate
                                    ,tgt.WholeDate, @GrpDate)
        ,@LastDate  = tgt.WholeDate
        ,@RowSafety = @RowSafety + IIF(RowNum = @RowSafety, 1, 1/0) --Produces an error ONLY if not synced
   FROM #QU tgt WITH (INDEX(1)) --Force the usage of the CI in the temp table to control the logical order.
 OPTION (MAXDOP 1) --Absolutely required to prevent parallelism for this serial action.
;
--===== Display the results.
     -- Notice that the final display order is differnt than the processing order
     -- because of the requirement to sort by Score within each group.
 SELECT Score, ActivationTime, Grp, BaseDate FROM #QU
  ORDER BY Grp ASC, Score ASC, ActivationTime DESC
;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004410 More actions · Answer 3

p.s. Your "Group 2" is incorrect according to the requirements. Just to be sure, they're based on WHOLE days and the date span must be 5 days or less based on the data.

The fellow with the "Accepted" answer made the same mistake and worse because he used DENSE_RANK on the days. Unfortunately for him, there were enough days in the original data for group 2 to make it look like it worked but resulted in the silent failure.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Ken McKelvey SSCoach Points: 18856 More actions · Answer 4

I do not think this type of problem is solvable with windowed functions as the result is dependent on previous values in the series. In this case I cannot see anyway to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18). Maybe someone with a more mathematical background can prove this.

Like you I suspect the quirky update will be the quickest t-sql solution but a cursor or recursion may be fine with less than 5,000 rows. If the result is needed for an application the best solution is probably to off load the grouping and additional sorting to the application servers as they will be easier and cheaper to scale.

This reply was modified 1 year, 12 months ago by Ken McKelvey.

Steve Collins SSCrazy Points: 2174 More actions · Answer 5

Jeff Moden wrote:

The fellow with the "Accepted" answer made the same mistake and worse because he used DENSE_RANK on the days. Unfortunately for him, there were enough days in the original data for group 2 to make it look like it worked but resulted in the silent failure.

Yes DENSE_RANK seems inappropriate and only appeared to work imo. The issue seems to be I'm not finding a solution without 'time' (the series summation of prior gap adjustments) being part of the equation. The inter-group gaps would need to be carried forward afaik. Idk

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Jeff Moden SSC Guru Points: 1004410 More actions · Answer 6

@ken McKelvey. Thanks for looking and thanks for the confirmation in a couple of areas, here, Ken.

I've seen some mathematical magic in the past with "Gaps and Islands", which this problem actually is, but with the twist of limiting how long the island can be. I'm going to look a bit more deeply into that possibility.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004410 More actions · Answer 7

@ Peter Larsson...

You've been popping up on some of the formula related posts that I've posted to and you're listed as a "follower". Any ideas/solutions here?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004410 More actions · Answer 8

@ Steve Collins ,

Agreed and, except for RBAR or "Quirky Update" solutions, I'm not seeing a way to do this. This is also kind of like the "Bin Fill" problem. I've not seen a "Single Query" solution even with multiple sub-queries in the form of CTEs, actual sub-queries, APPLYs, what have you.

I was thinking about Itzik's Ben-Gan's ultimate "Packing Intervals" solution, which also has interval forgiveness, but it doesn't have the nuance of 5 day Bins with separate "Bin Start Dates" that aren't necessarily less than 5 days apart.

Fun stuff.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004410 More actions · Answer 9

@ Jonathan, Ken, and Steve,

Thank all 3 of you for lending and ear and basically confirming that this is actually a real booger of a problem. I really appreciate it, especially between the two end of year holidays.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Collins SSCrazy Points: 2174 More actions · Answer 10

One approach could be to use a WHILE loop to increment grp values based on row counts and the DATEDIFF interval

Starting with Jeff's code:

--=================================================================================================
--      Original data by the OP but sorted by date in descending order so we can see things.
--=================================================================================================
   DROP TABLE IF EXISTS #MyHead;
GO
 SELECT *
   INTO #MyHead
   FROM (VALUES
 (1  ,'2017-11-24 09:45:00.000') --5 day span in this group
,(3  ,'2017-11-23 09:43:00.000')
,(4  ,'2017-11-22 09:40:00.000')
,(5  ,'2017-11-21 09:39:00.000')
,(12 ,'2017-11-20 09:49:00.000')
,(9  ,'2017-11-20 09:37:00.000')

,(7  ,'2017-11-17 09:38:00.000') -- This is what group 2 should be.
,(2  ,'2017-11-15 09:41:00.000') 
,(0  ,'2017-11-14 09:44:00.000')
,(15 ,'2017-11-13 09:47:00.000')
--,(3  ,'2017-11-13 09:37:57.570') --Removing this row of data caused "Accepted" solution to fail
                                   --but the QU method has no issues whether it's present or not.                                
,(7  ,'2017-11-12 09:46:00.000')
,(20 ,'2017-11-11 09:42:00.000')
,(4  ,'2017-11-10 09:28:00.000')
        )v(Score,ActivationTime)
;
GO
--=================================================================================================
--      My High performance "Quirky Update" solution.
--=================================================================================================
--===== Create and populate a tempory table, much like a Windowing function does but physical
     -- so we can reuse it.  Note that the CI is quintessential here and the temp table is
     -- necessary because the original table may have other indexes on it or the CI may change.
     -- Again, this is similar to how Windowing functions work but without the CI.
declare
  @qu_row_count             int=0,
  @grp_row_count            int=0,
  @grp                      int=1;

DROP TABLE IF EXISTS #QU;
 CREATE TABLE #QU
        (
         RowNum         INT PRIMARY KEY CLUSTERED
        ,Score          INT
        ,ActivationTime DATETIME
        ,WholeDate      AS CONVERT(DATE,ActivationTime) PERSISTED
        ,Grp            INT
        ,BaseDate       DATE
        )
;
--===== This does the "sequencing"
 INSERT INTO #QU WITH (TABLOCK) --For Minimal Logging if not in FULL Recovery Model
        (RowNum,Score,ActivationTime)
 SELECT  RowNum = ROW_NUMBER() OVER (ORDER BY ActivationTime DESC, Score DESC)
        ,Score
        ,ActivationTIme
   FROM #MyHead
  ORDER BY  ActivationTime DESC --Pedantic assurance of order and Minimal Logging
           ,Score DESC 
 OPTION (MAXDOP 1)              --Pedantic assurance.
;
select @qu_row_count=@@rowcount;


while @qu_row_count<>@grp_row_count
    begin
        with top_cte as (
            select top(1) ActivationTime
            from #qu
            where grp is null
            order by ActivationTime desc)
        update qu
        set grp=@grp
        from #qu qu
             cross join top_cte tc
        where qu.grp is null
              and datediff(day, qu.ActivationTime, tc.ActivationTime)<5;
        
        select @grp_row_count=@grp_row_count+@@rowcount;
        select @grp=@grp+1;
    end


--===== Uncomment to see the contents of the Temp Table.
 SELECT *
   FROM #QU 
  ORDER BY ActivationTime DESC, Score DESC
;

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Jeff Moden SSC Guru Points: 1004410 More actions · Answer 11

I could be wrong but that looks like it will do a full scan of the table for each grouping of rows. Why not just do a single pass with the While loop using logic similar to what's in the Quirky Update?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Collins SSCrazy Points: 2174 More actions · Answer 12

Jeff Moden wrote:

I could be wrong but that looks like it will do a full scan of the table for each grouping of rows. Why not just do a single pass with the While loop using logic similar to what's in the Quirky Update?

It said "One approach" not the best or only way. The WHERE clause is not sarge-able, I agree. When I switched to what seemingly seemed to be equivalent code the row result was not the same. Not sure why:

qu.ActivationTime>dateadd(day, 5, tc.ActivationTime);

One item which seems important and yet hasn't been established afaik is whether or not the 'ActivationTime' column is a unique key. It seems possible there are duplicate rows and if so self joins seem risky but I'm not sure. My WHILE loop checks to confirm the correct number of rows were updated

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Ken McKelvey SSCoach Points: 18856 More actions · Answer 13

Ken McKelvey wrote:

In this case I cannot see anyway to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18).

This is a delayed message as I was away from a computer for three days. It occurred to me later that 2017-11-18 and 2017-11-19 are actually a Saturday and Sunday so for the actual published test data the complicated 5 day rule can be replace by iso_week.

SELECT Score, ActivationTime
    ,DENSE_RANK() OVER
    (
        ORDER BY YEAR(ActivationTime) * 100 + DATEPART(iso_week, ActivationTime) DESC
    ) AS Grp
FROM #MyHead
ORDER BY Grp, Score DESC;

Of course, it would be easy for additional test data to make this approach wrong.

frederico_fonseca SSCoach Points: 16167 More actions · Answer 14

my attempt at this.

Do note that the 5 days interval is partially misleading as to the final result because of the time part.

,(7 ,'2017-11-17 09:38:00.000') --Group 2 starts here

,(7 ,'2017-11-12 09:46:00.000') --Group 3 starts here, but if using the time part on the 5 days rule then this should be on the bottom of group 2

this can be tested easily by changing the cast to date below to be datetime instead

drop table if exists #myhead
 SELECT score
      , cast(Activationtime as datetime) as ActivationTime
   INTO #MyHead
   FROM (VALUES
         (1  ,'2017-11-24 09:45:00.000') --These should be in Grp #1.
        ,(3  ,'2017-11-23 09:43:00.000') --The 24th thru the 20th is a 5 day span
        ,(4  ,'2017-11-22 09:40:00.000')
        ,(5  ,'2017-11-21 09:39:00.000')
        ,(12 ,'2017-11-20 09:49:00.000')
        ,(9  ,'2017-11-20 09:37:00.000')
                                         --<--<< Notice that there are 2 days in this gap.
        ,(7  ,'2017-11-17 09:38:00.000') --Group 2 starts here, as does a new 
        ,(2  ,'2017-11-15 09:41:00.000') --span start date. Even though there are
        ,(0  ,'2017-11-14 09:44:00.000') --less than 5 entries, the span goes from
        ,(15 ,'2017-11-13 09:47:00.000') --the 17 to the 13th, which is 5 days
                                         --<--<< Notice that this actually isn't a gap.
        ,(7  ,'2017-11-12 09:46:00.000') --Group 3 starts here, as does a new 
        ,(20 ,'2017-11-11 09:42:00.000') --span start date
        ,(4  ,'2017-11-10 09:28:00.000')
        )v(Score,ActivationTime)
;

with tops as
(select *
 from (select top 1 ActivationTime
       from #MyHead
       order by ActivationTime desc
      ) t
union all
select t2.activationtime
from tops tp1
outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
                  , ActivationTime
             from #MyHead mh1
             --where cast(mh1.ActivationTime as datetime) <= dateadd(day, -5, cast(tp1.ActivationTime as datetime))
             where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
             ) t2
where t2.rownum = 1
)
, base as 
(select *
      , row_number() over (order by tp1.ActivationTime desc) as grp
--      , dateadd(day, -5, tp1.activationtime) as lowdate -- CHANGE to line below
      , dateadd(day, -5, convert(date, tp1.activationtime)) as lowdate

 from tops tp1
)
select mh.Score
     , mh.ActivationTime
     , base.grp
from base 
inner join #MyHead mh
--on mh.ActivationTime > base.lowdate -- CHANGE to line below
on convert(date, mh.ActivationTime) > base.lowdate 
and mh.ActivationTime <= base.ActivationTime
order by base.grp
       , mh.ActivationTime desc
       , mh.Score
OPTION (MAXRECURSION 0)

This reply was modified 1 year, 12 months ago by frederico_fonseca. Reason: change to remove an error - lines changed tagged with " -- CHANGE to line below"