Is there another way to solve this "group in 5 day spans" problem?

  • Ken McKelvey wrote:

    Ken McKelvey wrote:

In this case I cannot see any way to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18).

This is a delayed message as I was away from a computer for three days. It occurred to me later that 2017-11-18 and 2017-11-19 are actually a Saturday and a Sunday, so for the actual published test data the complicated 5-day rule can be replaced by iso_week.

    SELECT Score, ActivationTime
    ,DENSE_RANK() OVER
    (
    ORDER BY YEAR(ActivationTime) * 100 + DATEPART(iso_week, ActivationTime) DESC
    ) AS Grp
    FROM #MyHead
    ORDER BY Grp, Score DESC;

    Of course, it would be easy for additional test data to make this approach wrong.

While it works with the data you have, it fails badly with bigger volumes.

Try a table like the following and look at what Group 1 is:

    drop table if exists #myhead;

    WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ), --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS ( --=== This provides the "zero base" and limits the number of rows right up front
    -- for both a performance gain and prevention of accidental "overruns"
    SELECT 0 UNION ALL
    SELECT TOP (900) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
    select n + 1 as Score
    , dateadd(day, -1 * n, '2017-11-24 09:45:00') as ActivationTime
    into #myHead
    from cteTally
    ;
  • frederico_fonseca wrote:

My attempt at this.

Do note that the 5-day interval is partially misleading as to the final result because of the time part.

,(7 ,'2017-11-17 09:38:00.000') --Group 2 starts here
,(7 ,'2017-11-12 09:46:00.000') --Group 3 starts here, but if the time part is used in the 5-day rule then this row should be at the bottom of Group 2

This can be tested easily by changing the cast to date below to datetime instead.

    drop table if exists #myhead
    SELECT score
    , cast(Activationtime as datetime) as ActivationTime
    INTO #MyHead
    FROM (VALUES
    (1 ,'2017-11-24 09:45:00.000') --These should be in Grp #1.
    ,(3 ,'2017-11-23 09:43:00.000') --The 24th thru the 20th is a 5 day span
    ,(4 ,'2017-11-22 09:40:00.000')
    ,(5 ,'2017-11-21 09:39:00.000')
    ,(12 ,'2017-11-20 09:49:00.000')
    ,(9 ,'2017-11-20 09:37:00.000')
    --<--<< Notice that there are 2 days in this gap.
    ,(7 ,'2017-11-17 09:38:00.000') --Group 2 starts here, as does a new
    ,(2 ,'2017-11-15 09:41:00.000') --span start date. Even though there are
    ,(0 ,'2017-11-14 09:44:00.000') --less than 5 entries, the span goes from
    ,(15 ,'2017-11-13 09:47:00.000') --the 17 to the 13th, which is 5 days
    --<--<< Notice that this actually isn't a gap.
    ,(7 ,'2017-11-12 09:46:00.000') --Group 3 starts here, as does a new
    ,(20 ,'2017-11-11 09:42:00.000') --span start date
    ,(4 ,'2017-11-10 09:28:00.000')
    )v(Score,ActivationTime)
    ;
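-- "tops" walks recursively from the newest ActivationTime: each step jumps to the
-- newest row whose date is at least 5 days older, and those rows become the group
-- anchor (start) dates.  "base" numbers the anchors (grp) and computes each group's
-- lower date bound (lowdate).  The final select then attaches every row to the
-- anchor whose 5-day window contains it.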

with tops as
(
    select *
    from (select top 1 ActivationTime
          from #MyHead
          order by ActivationTime desc
         ) t
    union all
    select t2.activationtime
    from tops tp1
    outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
                      , ActivationTime
                 from #MyHead mh1
                 --where cast(mh1.ActivationTime as datetime) <= dateadd(day, -5, cast(tp1.ActivationTime as datetime))
                 where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
                ) t2
    where t2.rownum = 1
)
, base as
(
    select *
         , row_number() over (order by tp1.ActivationTime desc) as grp
         , dateadd(day, -5, tp1.activationtime) as lowdate
    from tops tp1
)
select mh.Score
     , mh.ActivationTime
     , base.grp
from base
inner join #MyHead mh
        on mh.ActivationTime > base.lowdate
       and mh.ActivationTime <= base.ActivationTime
order by base.grp
       , mh.ActivationTime desc
       , mh.Score
OPTION (MAXRECURSION 0)

    Interesting.  I'll take a look and have a go at it.  Thanks Frederico.

To be sure, it appeared that the OP was stressing "Days", and so I took that to mean "ignore times". The OP marked the "Accepted Answer" as one that ignored times, so that's a bit of confirmation there. Also worth noting: that "Accepted Answer" on the SO thread breaks unless the data is "just right".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ken McKelvey wrote:

    Ken McKelvey wrote:

In this case I cannot see any way to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18).

This is a delayed message as I was away from a computer for three days. It occurred to me later that 2017-11-18 and 2017-11-19 are actually a Saturday and a Sunday, so for the actual published test data the complicated 5-day rule can be replaced by iso_week.

    SELECT Score, ActivationTime
    ,DENSE_RANK() OVER
    (
    ORDER BY YEAR(ActivationTime) * 100 + DATEPART(iso_week, ActivationTime) DESC
    ) AS Grp
    FROM #MyHead
    ORDER BY Grp, Score DESC;

    Of course, it would be easy for additional test data to make this approach wrong.

Great observation about the gap being on the weekend but, as you said, additional data would make this approach wrong. Also, just to provide a bit of feedback, the "Accepted Answer" on the SO thread also used DENSE_RANK, which is why it broke when I changed the data by removing that one line of sample data.

    --Jeff Moden



@jeff-moden - I made a small change to the original code to "correct" one duplicated entry that was in both Group 2 and Group 3 (darn time part messing things up).

  • @frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.

    I've spent a couple hours on this myself and so far have not been able to come up with anything better than Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.

And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.
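For reference, a "Quirky Update" pass at this problem might look something like the sketch below. This is only an illustration of the technique being referred to, assuming the #MyHead test table from earlier in the thread; the #Work table name is just for the example and this is not necessarily the code Jeff posted. It also leans on the usual quirky-update assumptions (clustered index driving the update order, serial plan, TABLOCKX), which are not documented guarantees.

-- Illustrative sketch only, not necessarily the code posted earlier in the thread.
-- Carry the current span-start date down the rows, newest first, relying on the
-- clustered index to drive the scan order.
DROP TABLE IF EXISTS #Work;

SELECT Score
      ,ActivationTime
      ,SpanStart = CAST(NULL AS date)
  INTO #Work
  FROM #MyHead;

CREATE CLUSTERED INDEX IX_Work ON #Work (ActivationTime DESC);

DECLARE @SpanStart date;

UPDATE w
   SET @SpanStart = SpanStart =
           CASE WHEN @SpanStart IS NULL
                  OR CAST(ActivationTime AS date) <= DATEADD(dd, -5, @SpanStart)
                THEN CAST(ActivationTime AS date) -- row is outside the current 5-day window: start a new span
                ELSE @SpanStart
           END
  FROM #Work w WITH (TABLOCKX)
OPTION (MAXDOP 1);

-- Number the spans to get the groups.
SELECT Score
      ,ActivationTime
      ,Grp = DENSE_RANK() OVER (ORDER BY SpanStart DESC)
  FROM #Work
 ORDER BY Grp, Score DESC;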


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

That certainly solves the problem without an explicit WHILE loop or a "Quirky Update". I used the following code to add 5 more sets of 3 groups each to "play" a bit.

    INSERT INTO #MyHead
    (Score,ActivationTime)
    SELECT Score,ActivationTime = DATEADD(yy,N,ActivationTime)
    FROM #MyHead
    CROSS APPLY (VALUES(-1),(-2),(-3),(-4),(-5))t(N)

To quantify the data, that means there are 18 "top" dates and 78 rows in total (the original 13 rows x 6 copies = 78 rows; 3 groups x 6 = 18 "top" dates). The code produces the correct answer but, oh my... there are two Cartesian products in the execution plan, each reading 1,404 (18*78) rows. I don't know whether the execution plan would change to reduce that if a larger number of rows were introduced to the source table. I'll test that, but I don't believe it's going to change the execution plan enough to get rid of the Cartesian products.

Despite that, it IS some interesting code and thanks for taking the time to design it, write it, and post it!

    --Jeff Moden



I added 1000 "sets" of 3 groups, each with 13 dates. The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows (3,003 "top" dates x 13,013 rows).

    Too bad because, like I said, that's some interesting code you wrote!  Thanks again.

    --Jeff Moden



  • autoexcrement wrote:

    @frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.

    I've spent a couple hours on this myself and so far have not been able to come up with anything better than Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.

And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.

My code does not care about the size of the gap, nor about how many groups there are (there is a recursion limit, though). Whether the gap is 1 day or 200 days, it will still identify the next "top entry" for the 5-day block.

I did try it with 900 days' worth of data and it produced 180 groups.

See it using this code, which has both gaps and over 160 groups:

    drop table if exists #myhead;

    WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ), --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS ( --=== This provides the "zero base" and limits the number of rows right up front
    -- for both a performance gain and prevention of accidental "overruns"
    SELECT 0 UNION ALL
    SELECT TOP (900) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
    select n + 1 as Score
    , dateadd(day, -1 * n, '2017-11-24 09:45:00') as ActivationTime
    into #myHead
    from cteTally
    where not (n between 10 and 12
    or n between 25 and 35
    or n between 100 and 150
    or n between 175 and 177
    or n between 201 and 202
    or n between 500 and 600)
    ;
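-- Same approach as the earlier query; the main difference is the convert(date, ...)
-- on lowdate and on the join predicate, which removes the row that previously landed
-- in both Group 2 and Group 3 because of the time part.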

with tops as
(
    select *
    from (select top 1 ActivationTime
          from #MyHead
          order by ActivationTime desc
         ) t
    union all
    select t2.activationtime
    from tops tp1
    outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
                      , ActivationTime
                 from #MyHead mh1
                 --where cast(mh1.ActivationTime as datetime) <= dateadd(day, -5, cast(tp1.ActivationTime as datetime))
                 where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
                ) t2
    where t2.rownum = 1
)
, base as
(
    select *
         , row_number() over (order by tp1.ActivationTime desc) as grp
         , dateadd(day, -5, convert(date, tp1.activationtime)) as lowdate
    from tops tp1
)
select mh.Score
     , mh.ActivationTime
     , base.grp
     , base.*
from base
inner join #MyHead mh
        on convert(date, mh.ActivationTime) > base.lowdate
       and mh.ActivationTime <= base.ActivationTime
order by base.grp
       , mh.ActivationTime desc
       , mh.Score desc
OPTION (MAXRECURSION 0)
  • Jeff Moden wrote:

    I added 1000 "sets" of 3 groups, each with 13 dates.  The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows.

    Too bad because, like I said, that's some interesting code you wrote!  Thanks again.

Mind giving me the code to generate those sets?

  • autoexcrement wrote:

    @frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.

    I've spent a couple hours on this myself and so far have not been able to come up with anything better than Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.

And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.

My first answer worked for date/times; this one works for dates.

    SET NOCOUNT ON
    DROP TABLE IF EXISTS #Results;
    CREATE TABLE #Results(Score int, ActivationTime datetime, Grp int, BaseDate date /*, CONSTRAINT PK_#Results PRIMARY KEY CLUSTERED (ActivationTime DESC,Score) */);

    DECLARE @Score int, @ActivationTime datetime
    DECLARE @BaseDate datetime, @Counter int = 0, @Grp int = 1

    DECLARE myCursor CURSOR FAST_FORWARD
    FOR select h.Score, h.ActivationTime
    from #MyHead h
    order by ActivationTime desc

    DECLARE @Handle int;
    EXEC sys.sp_prepare @Handle OUTPUT,
    N'@Score int, @ActivationTime datetime, @Grp int, @BaseDate date',
    N'INSERT INTO #Results WITH (TABLOCK) (Score, ActivationTime, Grp, BaseDate) VALUES (@Score, @ActivationTime, @Grp, @BaseDate)';
    OPEN myCursor
    FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
    SET @BaseDate = @ActivationTime
    WHILE @@FETCH_STATUS = 0 BEGIN
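    -- Start a new group whenever this row's date has dropped out of the 5-day
    -- window that began at the current @BaseDate.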
    IF @BaseDate > DATEADD(dd, 5, CONVERT(DATE, @ActivationTime))
    BEGIN
        SELECT @BaseDate = @ActivationTime, @Grp += 1
    END
    EXEC sp_execute @Handle, @Score, @ActivationTime, @Grp, @BaseDate
    FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
END
    CLOSE myCursor
    DEALLOCATE myCursor
    EXEC sp_unprepare @Handle

    SELECT *
    FROM #Results
    ORDER BY Grp, Score DESC;

    I tried using sp_prepare to speed the inner loop up but it made no difference.

With this data (about 217k rows)

    DROP TABLE IF EXISTS #MyHead;
    GO
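-- dbo.DateRange is a user-written helper (not a built-in); with these arguments it
-- presumably returns one row per minute from 2023-01-01 to 2023-06-01, which is where
-- the ~217k rows come from.  Each row gets a pseudo-random Score between 0 and 29.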
    select CRYPT_GEN_RANDOM(2) % 30 Score, convert(datetime,value) ActivationTime
    into #MyHead
    from dbo.DateRange('20230101','20230601','mi',1);

it runs in 16 seconds on my machine, compared to about 5 seconds for Jeff's quirky update.

  • frederico_fonseca wrote:

    Jeff Moden wrote:

    I added 1000 "sets" of 3 groups, each with 13 dates.  The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows.

    Too bad because, like I said, that's some interesting code you wrote!  Thanks again.

Mind giving me the code to generate those sets?

It's very similar to the code I used to generate the 5 extra sets, except it uses "fnTally" to create the numbers for the DATEADD(). Here's the code...

    INSERT INTO #MyHead
    (Score,ActivationTime)
    SELECT Score,ActivationTime = DATEADD(yy,N,ActivationTime)
    FROM #MyHead
    CROSS APPLY dbo.fnTally(1,1000)t

    --Jeff Moden



Thanks, Jeff.

Regarding the performance and Cartesian joins: splitting the CTE so that the BASE dates are inserted into a temp table first, and then joining that temp table to the main table, will for sure solve the issue you noticed.

An index on ActivationTime is also required for the CTE part if volumes are significant.

A small test I did with 259k rows of input, resulting in 1,422 groups, runs (fully) in 5 seconds with that temp table and index.

(This is on my laptop: 16 GB RAM (only 8 GB for SQL), 4 cores/8 threads, i7, and an SSD.)
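A rough sketch of the split being described: this is an illustration only, assuming the #MyHead table from earlier; the #base table and index names are just for the example, and it may differ from the exact code frederico tested.

-- Index to support the repeated "newest row at least 5 days older" probes.
create clustered index ix_myhead on #MyHead (ActivationTime desc);

drop table if exists #base;

-- Step 1: materialise only the group anchor dates into a temp table.
with tops as
(
    select *
    from (select top 1 ActivationTime
          from #MyHead
          order by ActivationTime desc
         ) t
    union all
    select t2.activationtime
    from tops tp1
    outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
                      , ActivationTime
                 from #MyHead mh1
                 where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
                ) t2
    where t2.rownum = 1
)
select ActivationTime
     , row_number() over (order by ActivationTime desc) as grp
     , dateadd(day, -5, convert(date, ActivationTime)) as lowdate
into #base
from tops
option (maxrecursion 0);

create clustered index ix_base on #base (lowdate, ActivationTime);

-- Step 2: join the small anchor table back to the main table, so the optimizer
-- joins a known-size temp table to #MyHead instead of the recursive CTE output.
select mh.Score, mh.ActivationTime, b.grp
from #base b
inner join #MyHead mh
        on convert(date, mh.ActivationTime) > b.lowdate
       and mh.ActivationTime <= b.ActivationTime
order by b.grp, mh.ActivationTime desc, mh.Score desc;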

     

  • @frederico_fonseca I apologize, I missed that you were doing a recursive loop there. My bad.


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • autoexcrement wrote:

    @frederico_fonseca I apologize, I missed that you were doing a recursive loop there. My bad.

Don't worry about it.

  • frederico_fonseca wrote:

Thanks, Jeff.

Regarding the performance and Cartesian joins: splitting the CTE so that the BASE dates are inserted into a temp table first, and then joining that temp table to the main table, will for sure solve the issue you noticed.

An index on ActivationTime is also required for the CTE part if volumes are significant.

A small test I did with 259k rows of input, resulting in 1,422 groups, runs (fully) in 5 seconds with that temp table and index.

(This is on my laptop: 16 GB RAM (only 8 GB for SQL), 4 cores/8 threads, i7, and an SSD.)

    Thanks, Frederico.  I'll check it out.

    --Jeff Moden


