Is there another way to solve this "group in 5 day spans" problem?

  • Ken McKelvey wrote:

    Ken McKelvey wrote:

In this case I cannot see any way to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18).

This is a delayed message as I was away from a computer for three days. It occurred to me later that 2017-11-18 and 2017-11-19 are actually a Saturday and a Sunday, so for the actual published test data the complicated 5-day rule can be replaced by iso_week.

    SELECT Score, ActivationTime
    ,DENSE_RANK() OVER
    (
    ORDER BY YEAR(ActivationTime) * 100 + DATEPART(iso_week, ActivationTime) DESC
    ) AS Grp
    FROM #MyHead
    ORDER BY Grp, Score DESC;

    Of course, it would be easy for additional test data to make this approach wrong.

While it works with the data you have, it fails badly with bigger volumes.

Try a table like the following and look at what Group 1 is:

    drop table if exists #myhead;

    WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ), --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS ( --=== This provides the "zero base" and limits the number of rows right up front
    -- for both a performance gain and prevention of accidental "overruns"
    SELECT 0 UNION ALL
    SELECT TOP (900) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
    select n + 1 as Score
    , dateadd(day, -1 * n, '2017-11-24 09:45:00') as ActivationTime
    into #myHead
    from cteTally
    ;
  • frederico_fonseca wrote:

My attempt at this.

Do note that the 5-day interval is partially misleading as to the final result because of the time part.

,(7 ,'2017-11-17 09:38:00.000') --Group 2 starts here
,(7 ,'2017-11-12 09:46:00.000') --Group 3 starts here, but if the time part is used in the 5-day rule then this row should be at the bottom of Group 2

This can be tested easily by changing the cast to date below to datetime instead.

    drop table if exists #myhead
    SELECT score
    , cast(Activationtime as datetime) as ActivationTime
    INTO #MyHead
    FROM (VALUES
    (1 ,'2017-11-24 09:45:00.000') --These should be in Grp #1.
    ,(3 ,'2017-11-23 09:43:00.000') --The 24th thru the 20th is a 5 day span
    ,(4 ,'2017-11-22 09:40:00.000')
    ,(5 ,'2017-11-21 09:39:00.000')
    ,(12 ,'2017-11-20 09:49:00.000')
    ,(9 ,'2017-11-20 09:37:00.000')
    --<--<< Notice that there are 2 days in this gap.
    ,(7 ,'2017-11-17 09:38:00.000') --Group 2 starts here, as does a new
    ,(2 ,'2017-11-15 09:41:00.000') --span start date. Even though there are
    ,(0 ,'2017-11-14 09:44:00.000') --less than 5 entries, the span goes from
    ,(15 ,'2017-11-13 09:47:00.000') --the 17 to the 13th, which is 5 days
    --<--<< Notice that this actually isn't a gap.
    ,(7 ,'2017-11-12 09:46:00.000') --Group 3 starts here, as does a new
    ,(20 ,'2017-11-11 09:42:00.000') --span start date
    ,(4 ,'2017-11-10 09:28:00.000')
    )v(Score,ActivationTime)
    ;
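-- "tops" walks recursively from the newest ActivationTime: each step jumps to the
-- newest row whose date is at least 5 days older, and those rows become the group
-- anchor (start) dates.  "base" numbers the anchors (grp) and computes each group's
-- lower date bound (lowdate).  The final select then attaches every row to the
-- anchor whose 5-day window contains it.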

with tops as
(
    select *
    from (select top 1 ActivationTime
          from #MyHead
          order by ActivationTime desc
         ) t
    union all
    select t2.activationtime
    from tops tp1
    outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
                      , ActivationTime
                 from #MyHead mh1
                 --where cast(mh1.ActivationTime as datetime) <= dateadd(day, -5, cast(tp1.ActivationTime as datetime))
                 where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
                ) t2
    where t2.rownum = 1
)
, base as
(
    select *
         , row_number() over (order by tp1.ActivationTime desc) as grp
         , dateadd(day, -5, tp1.activationtime) as lowdate
    from tops tp1
)
select mh.Score
     , mh.ActivationTime
     , base.grp
from base
inner join #MyHead mh
        on mh.ActivationTime > base.lowdate
       and mh.ActivationTime <= base.ActivationTime
order by base.grp
       , mh.ActivationTime desc
       , mh.Score
OPTION (MAXRECURSION 0)

    Interesting.  I'll take a look and have a go at it.  Thanks Frederico.

To be sure, it appeared that the OP was stressing "Days", and so I took that to mean "ignore times". The OP marked the "Accepted Answer" as one that ignored times, so that's a bit of confirmation there. Also worth noting: that "Accepted Answer" on the SO thread breaks unless the data is "just right".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ken McKelvey wrote:

    Ken McKelvey wrote:

In this case I cannot see any way to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18).

This is a delayed message as I was away from a computer for three days. It occurred to me later that 2017-11-18 and 2017-11-19 are actually a Saturday and a Sunday, so for the actual published test data the complicated 5-day rule can be replaced by iso_week.

    SELECT Score, ActivationTime
    ,DENSE_RANK() OVER
    (
    ORDER BY YEAR(ActivationTime) * 100 + DATEPART(iso_week, ActivationTime) DESC
    ) AS Grp
    FROM #MyHead
    ORDER BY Grp, Score DESC;

    Of course, it would be easy for additional test data to make this approach wrong.

Great observation about the gap being on the weekend but, as you said, additional data would make this approach wrong. Also, just to provide a bit of feedback, the "Accepted Answer" on the SO thread also used DENSE_RANK, which is why it broke when I changed the data by removing that one line of sample data.

    --Jeff Moden



@jeff-moden - I made a small change to the original code to "correct" one duplicated entry that was in both Group 2 and Group 3 (darn time part messing things up).

  • @frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.

    I've spent a couple hours on this myself and so far have not been able to come up with anything better than Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.

And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.
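For reference, a "Quirky Update" pass at this problem might look something like the sketch below. This is only an illustration of the technique being referred to, assuming the #MyHead test table from earlier in the thread; the #Work table name is just for the example and this is not necessarily the code Jeff posted. It also leans on the usual quirky-update assumptions (clustered index driving the update order, serial plan, TABLOCKX), which are not documented guarantees.

-- Illustrative sketch only, not necessarily the code posted earlier in the thread.
-- Carry the current span-start date down the rows, newest first, relying on the
-- clustered index to drive the scan order.
DROP TABLE IF EXISTS #Work;

SELECT Score
      ,ActivationTime
      ,SpanStart = CAST(NULL AS date)
  INTO #Work
  FROM #MyHead;

CREATE CLUSTERED INDEX IX_Work ON #Work (ActivationTime DESC);

DECLARE @SpanStart date;

UPDATE w
   SET @SpanStart = SpanStart =
           CASE WHEN @SpanStart IS NULL
                  OR CAST(ActivationTime AS date) <= DATEADD(dd, -5, @SpanStart)
                THEN CAST(ActivationTime AS date) -- row is outside the current 5-day window: start a new span
                ELSE @SpanStart
           END
  FROM #Work w WITH (TABLOCKX)
OPTION (MAXDOP 1);

-- Number the spans to get the groups.
SELECT Score
      ,ActivationTime
      ,Grp = DENSE_RANK() OVER (ORDER BY SpanStart DESC)
  FROM #Work
 ORDER BY Grp, Score DESC;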


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

That certainly solves the problem without an explicit WHILE loop or a "Quirky Update". I used the following code to add 5 more sets of 3 groups each to "play" a bit.

    INSERT INTO #MyHead
    (Score,ActivationTime)
    SELECT Score,ActivationTime = DATEADD(yy,N,ActivationTime)
    FROM #MyHead
    CROSS APPLY (VALUES(-1),(-2),(-3),(-4),(-5))t(N)

To quantify the data, that means there are 18 "top" dates and 78 rows in total (the original 13 rows x 6 copies = 78 rows; 3 groups x 6 = 18 "top" dates). The code produces the correct answer but, oh my... there are two Cartesian products in the execution plan, each reading 1,404 (18*78) rows. I don't know whether the execution plan would change to reduce that if a larger number of rows were introduced to the source table. I'll test that, but I don't believe it's going to change the execution plan enough to get rid of the Cartesian products.

Despite that, it IS some interesting code and thanks for taking the time to design it, write it, and post it!

    --Jeff Moden



I added 1000 "sets" of 3 groups, each with 13 dates. The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows (3,003 "top" dates x 13,013 rows).

    Too bad because, like I said, that's some interesting code you wrote!  Thanks again.

    --Jeff Moden



  • autoexcrement wrote:

    @frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.

    I've spent a couple hours on this myself and so far have not been able to come up with anything better than Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.

And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.

My code does not care about the size of the gap, nor about how many groups there are (there is a recursion limit, though). Whether the gap is 1 day or 200 days, it will still identify the next "top entry" for the 5-day block.

I did try it with 900 days' worth of data and it produced 180 groups.

See it using this code, which has both gaps and over 160 groups:

    drop table if exists #myhead;

    WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ), --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS ( --=== This provides the "zero base" and limits the number of rows right up front
    -- for both a performance gain and prevention of accidental "overruns"
    SELECT 0 UNION ALL
    SELECT TOP (900) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
    select n + 1 as Score
    , dateadd(day, -1 * n, '2017-11-24 09:45:00') as ActivationTime
    into #myHead
    from cteTally
    where not (n between 10 and 12
    or n between 25 and 35
    or n between 100 and 150
    or n between 175 and 177
    or n between 201 and 202
    or n between 500 and 600)
    ;
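-- Same approach as the earlier query; the main difference is the convert(date, ...)
-- on lowdate and on the join predicate, which removes the row that previously landed
-- in both Group 2 and Group 3 because of the time part.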

with tops as
(
    select *
    from (select top 1 ActivationTime
          from #MyHead
          order by ActivationTime desc
         ) t
    union all
    select t2.activationtime
    from tops tp1
    outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
                      , ActivationTime
                 from #MyHead mh1
                 --where cast(mh1.ActivationTime as datetime) <= dateadd(day, -5, cast(tp1.ActivationTime as datetime))
                 where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
                ) t2
    where t2.rownum = 1
)
, base as
(
    select *
         , row_number() over (order by tp1.ActivationTime desc) as grp
         , dateadd(day, -5, convert(date, tp1.activationtime)) as lowdate
    from tops tp1
)
select mh.Score
     , mh.ActivationTime
     , base.grp
     , base.*
from base
inner join #MyHead mh
        on convert(date, mh.ActivationTime) > base.lowdate
       and mh.ActivationTime <= base.ActivationTime
order by base.grp
       , mh.ActivationTime desc
       , mh.Score desc
OPTION (MAXRECURSION 0)
  • Jeff Moden wrote:

    I added 1000 "sets" of 3 groups, each with 13 dates.  The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows.

    Too bad because, like I said, that's some interesting code you wrote!  Thanks again.

Mind giving me the code to generate those sets?

  • autoexcrement wrote:

    @frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.

    I've spent a couple hours on this myself and so far have not been able to come up with anything better than Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.

And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.

My first answer worked for date/times; this one works for dates.

    SET NOCOUNT ON
    DROP TABLE IF EXISTS #Results;
    CREATE TABLE #Results(Score int, ActivationTime datetime, Grp int, BaseDate date /*, CONSTRAINT PK_#Results PRIMARY KEY CLUSTERED (ActivationTime DESC,Score) */);

    DECLARE @Score int, @ActivationTime datetime
    DECLARE @BaseDate datetime, @Counter int = 0, @Grp int = 1

    DECLARE myCursor CURSOR FAST_FORWARD
    FOR select h.Score, h.ActivationTime
    from #MyHead h
    order by ActivationTime desc

    DECLARE @Handle int;
    EXEC sys.sp_prepare @Handle OUTPUT,
    N'@Score int, @ActivationTime datetime, @Grp int, @BaseDate date',
    N'INSERT INTO #Results WITH (TABLOCK) (Score, ActivationTime, Grp, BaseDate) VALUES (@Score, @ActivationTime, @Grp, @BaseDate)';
    OPEN myCursor
    FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
    SET @BaseDate = @ActivationTime
    WHILE @@FETCH_STATUS = 0 BEGIN
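    -- Start a new group whenever this row's date has dropped out of the 5-day
    -- window that began at the current @BaseDate.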
    IF @BaseDate > DATEADD(dd, 5, CONVERT(DATE, @ActivationTime))
    BEGIN
        SELECT @BaseDate = @ActivationTime, @Grp += 1
    END
    EXEC sp_execute @Handle, @Score, @ActivationTime, @Grp, @BaseDate
    FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
END
    CLOSE myCursor
    DEALLOCATE myCursor
    EXEC sp_unprepare @Handle

    SELECT *
    FROM #Results
    ORDER BY Grp, Score DESC;

    I tried using sp_prepare to speed the inner loop up but it made no difference.

With this data (about 217k rows)

    DROP TABLE IF EXISTS #MyHead;
    GO
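-- dbo.DateRange is a user-written helper (not a built-in); with these arguments it
-- presumably returns one row per minute from 2023-01-01 to 2023-06-01, which is where
-- the ~217k rows come from.  Each row gets a pseudo-random Score between 0 and 29.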
    select CRYPT_GEN_RANDOM(2) % 30 Score, convert(datetime,value) ActivationTime
    into #MyHead
    from dbo.DateRange('20230101','20230601','mi',1);

it runs in 16 seconds on my machine, compared to about 5 seconds for Jeff's quirky update.

  • frederico_fonseca wrote:

    Jeff Moden wrote:

    I added 1000 "sets" of 3 groups, each with 13 dates.  The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows.

    Too bad because, like I said, that's some interesting code you wrote!  Thanks again.

Mind giving me the code to generate those sets?

It's very similar to the code I used to generate the 5 extra sets, except it uses "fnTally" to create the numbers for the DATEADD(). Here's the code...

    INSERT INTO #MyHead
    (Score,ActivationTime)
    SELECT Score,ActivationTime = DATEADD(yy,N,ActivationTime)
    FROM #MyHead
    CROSS APPLY dbo.fnTally(1,1000)t

    --Jeff Moden



Thanks, Jeff.

Regarding the performance and Cartesian joins: splitting the CTE so that the BASE dates are inserted into a temp table first, and then joining that temp table to the main table, will for sure solve the issue you noticed.

An index on ActivationTime is also required for the CTE part if volumes are significant.

A small test I did with 259k rows of input, resulting in 1,422 groups, runs (fully) in 5 seconds with that temp table and index.

(This is on my laptop: 16 GB RAM (only 8 GB for SQL), 4 cores/8 threads, i7, and an SSD.)
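A rough sketch of the split being described: this is an illustration only, assuming the #MyHead table from earlier; the #base table and index names are just for the example, and it may differ from the exact code frederico tested.

-- Index to support the repeated "newest row at least 5 days older" probes.
create clustered index ix_myhead on #MyHead (ActivationTime desc);

drop table if exists #base;

-- Step 1: materialise only the group anchor dates into a temp table.
with tops as
(
    select *
    from (select top 1 ActivationTime
          from #MyHead
          order by ActivationTime desc
         ) t
    union all
    select t2.activationtime
    from tops tp1
    outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
                      , ActivationTime
                 from #MyHead mh1
                 where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
                ) t2
    where t2.rownum = 1
)
select ActivationTime
     , row_number() over (order by ActivationTime desc) as grp
     , dateadd(day, -5, convert(date, ActivationTime)) as lowdate
into #base
from tops
option (maxrecursion 0);

create clustered index ix_base on #base (lowdate, ActivationTime);

-- Step 2: join the small anchor table back to the main table, so the optimizer
-- joins a known-size temp table to #MyHead instead of the recursive CTE output.
select mh.Score, mh.ActivationTime, b.grp
from #base b
inner join #MyHead mh
        on convert(date, mh.ActivationTime) > b.lowdate
       and mh.ActivationTime <= b.ActivationTime
order by b.grp, mh.ActivationTime desc, mh.Score desc;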

     

  • @frederico_fonseca I apologize, I missed that you were doing a recursive loop there. My bad.


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • autoexcrement wrote:

    @frederico_fonseca I apologize, I missed that you were doing a recursive loop there. My bad.

Don't worry about it.

  • frederico_fonseca wrote:

Thanks, Jeff.

Regarding the performance and Cartesian joins: splitting the CTE so that the BASE dates are inserted into a temp table first, and then joining that temp table to the main table, will for sure solve the issue you noticed.

An index on ActivationTime is also required for the CTE part if volumes are significant.

A small test I did with 259k rows of input, resulting in 1,422 groups, runs (fully) in 5 seconds with that temp table and index.

(This is on my laptop: 16 GB RAM (only 8 GB for SQL), 4 cores/8 threads, i7, and an SSD.)

    Thanks, Frederico.  I'll check it out.

    --Jeff Moden


