December 30, 2022 at 1:31 pm
Ken McKelvey wrote: In this case I cannot see any way to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18).
This is a delayed message as I was away from a computer for three days. It occurred to me later that 2017-11-18 and 2017-11-19 are actually a Saturday and Sunday, so for the actual published test data the complicated 5-day rule can be replaced by iso_week.
SELECT Score, ActivationTime
,DENSE_RANK() OVER
(
ORDER BY YEAR(ActivationTime) * 100 + DATEPART(iso_week, ActivationTime) DESC
) AS Grp
FROM #MyHead
ORDER BY Grp, Score DESC;
Of course, it would be easy for additional test data to make this approach wrong.
While it works with the data you have, it fails badly with bigger volumes.
Try a table as follows and look at what group 1 is (hint: 2017-01-01 falls in ISO week 52, so YEAR * 100 + iso_week gives it a key of 201752, which sorts above the genuine top week of 201747).
drop table if exists #myhead;
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS ( --=== This provides the "zero base" and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT 0 UNION ALL
SELECT TOP (900) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select n + 1 as Score
, dateadd(day, -1 * n, '2017-11-24 09:45:00') as ActivationTime
into #myHead
from cteTally
;
December 30, 2022 at 4:36 pm
My attempt at this.
Do note that the 5-day interval is partially misleading as to the final result because of the time part:
,(7 ,'2017-11-17 09:38:00.000') --Group 2 starts here
,(7 ,'2017-11-12 09:46:00.000') --Group 3 starts here, but if using the time part in the 5-day rule then this should be at the bottom of group 2
This can be tested easily by changing the CAST to date below to datetime instead.
drop table if exists #myhead;
SELECT score
, cast(Activationtime as datetime) as ActivationTime
INTO #MyHead
FROM (VALUES
(1 ,'2017-11-24 09:45:00.000') --These should be in Grp #1.
,(3 ,'2017-11-23 09:43:00.000') --The 24th thru the 20th is a 5 day span
,(4 ,'2017-11-22 09:40:00.000')
,(5 ,'2017-11-21 09:39:00.000')
,(12 ,'2017-11-20 09:49:00.000')
,(9 ,'2017-11-20 09:37:00.000')
--<--<< Notice that there are 2 days in this gap.
,(7 ,'2017-11-17 09:38:00.000') --Group 2 starts here, as does a new
,(2 ,'2017-11-15 09:41:00.000') --span start date. Even though there are
,(0 ,'2017-11-14 09:44:00.000') --less than 5 entries, the span goes from
,(15 ,'2017-11-13 09:47:00.000') --the 17 to the 13th, which is 5 days
--<--<< Notice that this actually isn't a gap.
,(7 ,'2017-11-12 09:46:00.000') --Group 3 starts here, as does a new
,(20 ,'2017-11-11 09:42:00.000') --span start date
,(4 ,'2017-11-10 09:28:00.000')
)v(Score,ActivationTime)
;
with tops as
(select *
from (select top 1 ActivationTime
from #MyHead
order by ActivationTime desc
) t
union all
select t2.activationtime
from tops tp1
outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
, ActivationTime
from #MyHead mh1
--where cast(mh1.ActivationTime as datetime) <= dateadd(day, -5, cast(tp1.ActivationTime as datetime))
where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
) t2
where t2.rownum = 1
)
, base as
(select *
, row_number() over (order by tp1.ActivationTime desc) as grp
, dateadd(day, -5, tp1.activationtime) as lowdate
from tops tp1
)
select mh.Score
, mh.ActivationTime
, base.grp
from base
inner join #MyHead mh
on mh.ActivationTime > base.lowdate
and mh.ActivationTime <= base.ActivationTime
order by base.grp
, mh.ActivationTime desc
, mh.Score
OPTION (MAXRECURSION 0)
Interesting. I'll take a look and have a go at it. Thanks Frederico.
To be sure, it appeared that the OP was stressing "Days" and so I took that to mean "ignore times". The OP marked the "Accepted Answer" as one that ignored times, so that's a bit of confirmation there. Again and just to be sure, that "Accepted Answer" on the SO thread breaks unless the data is "just right".
--Jeff Moden
Change is inevitable... Change for the better is not.
December 30, 2022 at 4:41 pm
Ken McKelvey wrote: In this case I cannot see any way to predict when missing dates are needed (2017-11-16) or not needed (2017-11-19 & 2017-11-18).
This is a delayed message as I was away from a computer for three days. It occurred to me later that 2017-11-18 and 2017-11-19 are actually a Saturday and Sunday, so for the actual published test data the complicated 5-day rule can be replaced by iso_week.
SELECT Score, ActivationTime
,DENSE_RANK() OVER
(
ORDER BY YEAR(ActivationTime) * 100 + DATEPART(iso_week, ActivationTime) DESC
) AS Grp
FROM #MyHead
ORDER BY Grp, Score DESC;
Of course, it would be easy for additional test data to make this approach wrong.
Great observation about the gap being on the weekend but, as you said, additional data would make this approach wrong. Also and just to provide a bit of feedback, the "Accepted Answer" on the SO thread also used "Dense_Rank", which is why it broke when I changed the data by removing that one line of sample data.
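For instance (a hypothetical extra row, not part of the posted test data), a single entry on Saturday 2017-11-18 lands in the same ISO week as the 17th through the 13th, yet under the 5-day rule a group anchored on the 18th could only reach back to the 14th, so the two approaches disagree:
INSERT INTO #MyHead (Score, ActivationTime)
VALUES (8, '2017-11-18 09:30:00.000'); --hypothetical Score and time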
--Jeff Moden
Change is inevitable... Change for the better is not.
December 30, 2022 at 5:19 pm
@jeff-moden - I made a small change to the original code to "correct" one duplicated entry that was in both group 2 and group 3 (darn time part messing things up).
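Comparing the code above with the revised listing posted further down the thread, the change appears to be limited to stripping the time part in two places (shown here only for reference; this is an inference from the two listings, not frederico's own summary):
-- before: dateadd(day, -5, tp1.activationtime) as lowdate
-- after : dateadd(day, -5, convert(date, tp1.activationtime)) as lowdate
-- before: on mh.ActivationTime > base.lowdate
-- after : on convert(date, mh.ActivationTime) > base.lowdate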
December 30, 2022 at 5:24 pm
@frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.
I've spent a couple of hours on this myself and so far have not been able to come up with anything better than a Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.
And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.
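For anyone following along: the "Quirky Update" referenced above isn't shown on this page of the thread. A minimal sketch of the general technique applied to this problem - not Jeff's actual code, and relying on the usual undocumented-behavior caveats (update driven by the clustered index order, single-threaded, table locked) - might look like this:
ALTER TABLE #MyHead ADD Grp int;
CREATE CLUSTERED INDEX IX_MyHead ON #MyHead (ActivationTime DESC);
GO
--===== Variables carry state from row to row as the UPDATE walks the clustered index
DECLARE @Grp int = 1, @BaseDate datetime;
UPDATE mh
   SET @Grp = Grp = CASE WHEN @BaseDate > DATEADD(dd, 5, CONVERT(date, ActivationTime))
                         THEN @Grp + 1 ELSE @Grp END
      ,@BaseDate   = CASE WHEN @BaseDate IS NULL
                            OR @BaseDate > DATEADD(dd, 5, CONVERT(date, ActivationTime))
                          THEN ActivationTime ELSE @BaseDate END
  FROM #MyHead mh WITH (TABLOCKX)
OPTION (MAXDOP 1);
SELECT Score, ActivationTime, Grp FROM #MyHead ORDER BY Grp, Score DESC;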
December 30, 2022 at 5:29 pm
That certainly solves the problem without an explicit WHILE loop or a "Quirky Update". I used the following code to add 5 more sets of 3 groups each to "play" a bit.
INSERT INTO #MyHead
(Score,ActivationTime)
SELECT Score,ActivationTime = DATEADD(yy,N,ActivationTime)
FROM #MyHead
CROSS APPLY (VALUES(-1),(-2),(-3),(-4),(-5))t(N)
To quantify the data, that means there are 18 "top" dates and 78 rows in total. The code produces the correct answer but, oh my... there are two Cartesian products in the execution plan, each reading 1404 (18*78) rows. I don't know whether the execution plan would change to reduce that if a larger number of rows were introduced to the source table. I'll test with that, but I believe it's not going to change the execution plan to get rid of the Cartesian products.
Despite that, it IS some interesting code and thanks for taking the time to design it, write it, and post it!
--Jeff Moden
Change is inevitable... Change for the better is not.
December 30, 2022 at 5:36 pm
I added 1000 "sets" of 3 groups, each with 13 dates. The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows.
Too bad because, like I said, that's some interesting code you wrote! Thanks again.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 30, 2022 at 6:36 pm
@frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.
I've spent a couple of hours on this myself and so far have not been able to come up with anything better than a Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.
And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.
My code does not care about the size of the gap nor about how many groups there are (there is a recursion limit, though) - whether the gap is 1 day or 200 days, it will still identify the next "top entry" for the 5-day block.
And I did try it with 900 days' worth of data, and it produced 180 groups.
See it using this code, with both gaps and over 160 groups:
drop table if exists #myhead;
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS ( --=== This provides the "zero base" and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT 0 UNION ALL
SELECT TOP (900) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select n + 1 as Score
, dateadd(day, -1 * n, '2017-11-24 09:45:00') as ActivationTime
into #myHead
from cteTally
where not (n between 10 and 12
or n between 25 and 35
or n between 100 and 150
or n between 175 and 177
or n between 201 and 202
or n between 500 and 600)
;
with tops as
(select *
from (select top 1 ActivationTime
from #MyHead
order by ActivationTime desc
) t
union all
select t2.activationtime
from tops tp1
outer apply (select row_number() over (order by mh1.ActivationTime desc) as rownum
, ActivationTime
from #MyHead mh1
--where cast(mh1.ActivationTime as datetime) <= dateadd(day, -5, cast(tp1.ActivationTime as datetime))
where cast(mh1.ActivationTime as date) <= dateadd(day, -5, cast(tp1.ActivationTime as date))
) t2
where t2.rownum = 1
)
, base as
(select *
, row_number() over (order by tp1.ActivationTime desc) as grp
, dateadd(day, -5, convert(date, tp1.activationtime)) as lowdate
from tops tp1
)
select mh.Score
, mh.ActivationTime
, base.grp
, base.*
from base
inner join #MyHead mh
on convert(date, mh.ActivationTime) > base.lowdate
and mh.ActivationTime <= base.ActivationTime
order by base.grp
, mh.ActivationTime desc
, mh.Score desc
OPTION (MAXRECURSION 0)
December 30, 2022 at 6:37 pm
I added 1000 "sets" of 3 groups, each with 13 dates. The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows.
Too bad because, like I said, that's some interesting code you wrote! Thanks again.
Mind giving the code to generate those sets?
December 30, 2022 at 6:40 pm
@frederico_fonseca I had the same idea as you initially, but your solution will not work either. What if there were 20 groups instead of 3? What if the gaps between the groups were huge? Etc.
I've spent a couple of hours on this myself and so far have not been able to come up with anything better than a Quirky Update. But I subbed to this thread to see if someone thinks of something amazing.
And I think we should all settle on Jeff's interpretation that the OP only cares about dates, not times. But regardless, I haven't seen any proposed answers other than Jeff's that will consistently work for dates OR times yet.
My first answer worked for date/times; this one works for dates.
SET NOCOUNT ON
DROP TABLE IF EXISTS #Results;
CREATE TABLE #Results(Score int, ActivationTime datetime, Grp int, BaseDate date /*, CONSTRAINT PK_#Results PRIMARY KEY CLUSTERED (ActivationTime DESC,Score) */);
DECLARE @Score int, @ActivationTime datetime
DECLARE @BaseDate datetime, @Counter int = 0, @Grp int = 1
DECLARE myCursor CURSOR FAST_FORWARD
FOR select h.Score, h.ActivationTime
from #MyHead h
order by ActivationTime desc
DECLARE @Handle int;
EXEC sys.sp_prepare @Handle OUTPUT,
N'@Score int, @ActivationTime datetime, @Grp int, @BaseDate date',
N'INSERT INTO #Results WITH (TABLOCK) (Score, ActivationTime, Grp, BaseDate) VALUES (@Score, @ActivationTime, @Grp, @BaseDate)';
OPEN myCursor
FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
SET @BaseDate = @ActivationTime
WHILE @@FETCH_STATUS = 0 BEGIN
IF @BaseDate > DATEADD(dd, 5, CONVERT(DATE,@ActivationTime)) BEGIN
SELECT @BaseDate = @ActivationTime, @Grp += 1
END
EXEC sp_execute @Handle, @Score, @ActivationTime, @Grp, @BaseDate
FETCH NEXT FROM myCursor INTO @Score, @ActivationTime
END
CLOSE myCursor
DEALLOCATE myCursor
EXEC sp_unprepare @Handle
SELECT *
FROM #Results
ORDER BY Grp, Score DESC;
I tried using sp_prepare to speed the inner loop up but it made no difference.
With this data (about 217k rows):
DROP TABLE IF EXISTS #MyHead;
GO
select CRYPT_GEN_RANDOM(2) % 30 Score, convert(datetime,value) ActivationTime
into #MyHead
from dbo.DateRange('20230101','20230601','mi',1);
it runs in 16 seconds on my machine compared to about 5 seconds for Jeff's quirky update.
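dbo.DateRange isn't included in the post. For anyone wanting to reproduce the test, a stand-in inline TVF along these lines should produce roughly the same 217k rows (the parameter meanings - start, end, datepart, increment - are assumed from the call above, and only minute steps are handled in this sketch):
CREATE FUNCTION dbo.DateRange
        (@StartDate datetime, @EndDate datetime, @DatePart varchar(2), @Increment int)
RETURNS TABLE AS
RETURN
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(N)) --10 rows
    ,Tally(N) AS (SELECT TOP (DATEDIFF(mi, @StartDate, @EndDate) / @Increment + 1)
                         ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
                    FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f) --up to 1,000,000 rows
SELECT value = DATEADD(mi, N * @Increment, @StartDate) --@DatePart ignored in this sketch
  FROM Tally;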
December 30, 2022 at 7:03 pm
Jeff Moden wrote: I added 1000 "sets" of 3 groups, each with 13 dates. The Cartesian products persisted in the execution plan, each of them doing reads of 39,078,039 rows.
Too bad because, like I said, that's some interesting code you wrote! Thanks again.
Mind giving the code to generate those sets?
It's very similar to the code that I used to generate the 5 extra sets except it uses "fnTally" to create the numbers for the DATEADD(). Here's the code...
INSERT INTO #MyHead
(Score,ActivationTime)
SELECT Score,ActivationTime = DATEADD(yy,N,ActivationTime)
FROM #MyHead
CROSS APPLY dbo.fnTally(1,1000)t
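fnTally isn't reproduced in this thread either; a stand-in inline TVF can be built from the same cteTally pattern frederico posted earlier (the parameter meanings - @ZeroOrOne as the starting value, @MaxN as the top of the range, returning a column named N - are assumed from the call above):
CREATE FUNCTION dbo.fnTally (@ZeroOrOne bigint, @MaxN bigint)
RETURNS TABLE AS
RETURN
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(N)) --10 rows
    ,E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d)  --10,000 rows
    ,E8(N) AS (SELECT 1 FROM E4 a, E4 b)               --100,000,000 rows max
SELECT N = 0 WHERE @ZeroOrOne = 0 --optional zero row
UNION ALL
SELECT TOP (@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
  FROM E8;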
--Jeff Moden
Change is inevitable... Change for the better is not.
December 30, 2022 at 7:14 pm
Thanks, Jeff.
Regarding the "performance and Cartesian joins" - splitting the CTE so that the BASE dates are inserted into a temp table, and then joining that temp table to the main table, will for sure solve the issue you noticed.
An index on ActivationTime is also required for the CTE part if volumes are significant.
A small test I did with 259k rows of input, resulting in 1,422 groups, runs (fully) in 5 seconds with the said temp table and index.
(This is on my laptop - 16 GB RAM (only 8 GB for SQL), 4 cores/8 threads, i7, and an SSD.)
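A sketch of that restructure, in case it helps (this is my reading of the suggestion; frederico's actual revised code isn't posted):
CREATE NONCLUSTERED INDEX IX_MyHead_ActivationTime ON #MyHead (ActivationTime DESC);
DROP TABLE IF EXISTS #Base;
WITH tops AS
(   SELECT *
      FROM (SELECT TOP (1) ActivationTime FROM #MyHead ORDER BY ActivationTime DESC) t
    UNION ALL
    SELECT t2.ActivationTime
      FROM tops tp1
     OUTER APPLY (SELECT ROW_NUMBER() OVER (ORDER BY mh1.ActivationTime DESC) AS rownum
                        ,ActivationTime
                    FROM #MyHead mh1
                   WHERE CAST(mh1.ActivationTime AS date) <= DATEADD(day, -5, CAST(tp1.ActivationTime AS date))
                 ) t2
     WHERE t2.rownum = 1
)
SELECT ActivationTime
      ,ROW_NUMBER() OVER (ORDER BY ActivationTime DESC) AS grp
      ,DATEADD(day, -5, CONVERT(date, ActivationTime)) AS lowdate
  INTO #Base --materialize the "top" dates so the final join no longer touches the recursive CTE
  FROM tops
OPTION (MAXRECURSION 0);
SELECT mh.Score, mh.ActivationTime, b.grp
  FROM #Base b
  JOIN #MyHead mh
    ON CONVERT(date, mh.ActivationTime) > b.lowdate
   AND mh.ActivationTime <= b.ActivationTime
 ORDER BY b.grp, mh.ActivationTime DESC, mh.Score DESC;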
December 30, 2022 at 7:50 pm
@frederico_fonseca I apologize, I missed that you were doing a recursive loop there. My bad.
December 30, 2022 at 8:35 pm
Thanks, Jeff.
Regarding the "performance and Cartesian joins" - splitting the CTE so that the BASE dates are inserted into a temp table, and then joining that temp table to the main table, will for sure solve the issue you noticed.
An index on ActivationTime is also required for the CTE part if volumes are significant.
A small test I did with 259k rows of input, resulting in 1,422 groups, runs (fully) in 5 seconds with the said temp table and index.
(This is on my laptop - 16 GB RAM (only 8 GB for SQL), 4 cores/8 threads, i7, and an SSD.)
Thanks, Frederico. I'll check it out.
--Jeff Moden
Change is inevitable... Change for the better is not.