Overlapping datetime ranges.

  • Overlap in ranges.

I have a set of data rows, each with a start_time and an end_time.

    Each row should get a New_starttime, so that periods do not overlap.

A period cannot be split into two periods.

    Example:

    Periods (overlapping):

    ----1111111111----------------------------------------

    -------2222-------------------------------------------

    ------------333333333---------------------------------

Periods with a New_starttime, removing overlaps:

    ----1111111111---------------------------------------- New_starttime is equal to starttime

    ------------------------------------------------------ New_starttime is equal to endtime

--------------3333333--------------------------------- New_starttime is equal to endtime of 1

    In this example the first period is kept 'complete'.

    The second period effectively gets a duration of 0.

The third period gets a new starttime and is effectively shortened.
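The rule above can be sketched outside SQL. Here is a minimal Python illustration (hypothetical helper, not from the thread), treating the ASCII picture as numeric half-open positions: period 1 = [4, 14), period 2 = [7, 11), period 3 = [12, 21):

```python
# Overlap removal: each row keeps its end; its start is pushed forward to the
# largest end seen so far (capped at its own end, so a covered row gets length 0).
def remove_overlaps(rows):
    """rows: list of (start, end) sorted by start; returns (new_start, end) per row."""
    result = []
    max_end = None
    for start, end in rows:
        if max_end is None or start >= max_end:
            new_start = start                  # no overlap: keep the original start
        else:
            new_start = min(max_end, end)      # shortened, or zero-length if fully covered
        result.append((new_start, end))
        max_end = end if max_end is None else max(max_end, end)
    return result

# Periods 1, 2, 3 from the ASCII picture:
print(remove_overlaps([(4, 14), (7, 11), (12, 21)]))
# → [(4, 14), (11, 11), (14, 21)]: row 1 unchanged, row 2 zero-length, row 3 shortened
```

Note that a row starting exactly where an earlier one ends is left unchanged, matching the "same times do not overlap" rule in the edit below.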

    -- See also the thread :

    -- http://www.sqlservercentral.com/Forums/Topic1760478-3412-1.aspx

I have looked at that thread and am still working on it. I think it is a good starting point, both for a set of sample data and for a solution. I am posting the problem here because there might already be a 'canned' solution for it, or it may be an interesting problem. I'll be working on a solution myself; I think I can come up with something.

So, to restate the question: the number of rows should remain the same, but with the New_starttime the overlap should be removed from the periods.

    Thanks for your time and attention.

A solution which also works in 2005/2008 is preferred.

    Ben

    Edit:

    The example uses periods which start and end exactly on a date.

In my situation the periods start and end at an exact 'time'.

Overlap is then defined as an overlapping period. (The overlap can be large (a day) or very small (a microsecond); identical times do not overlap.)

    Sorry I missed this when posting this for the first time.

  • Most approaches that I've seen collapse the ranges. That is, they don't retain a record with a zero length. Is there a particular reason that you need to retain the zero length?

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (12/20/2016)


    Most approaches that I've seen collapse the ranges. That is, they don't retain a record with a zero length. Is there a particular reason that you need to retain the zero length?

    Drew

    Ben,

What Drew says, and what is your actual objective in collapsing these time intervals? If we know the why, we may be far better able to help determine a useful approach. Just collapsing them without eliminating any zero-duration events doesn't necessarily make much sense without some larger context in which there's a need to retain their existence.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Here is something to start with:

    SELECT ID, FromValue, ToValue

    INTO #Periods

    FROM (

    SELECT 1, 2.1, 8.0

    UNION

    SELECT 2, 3., 4.5

    UNION

    SELECT 3, 4., 10.

    UNION

    SELECT 4, 9., 20.

    ) I (ID, FromValue, ToValue)

SELECT P2.ID, P2.FromValue, MAX(P1.ToValue) newFromValue, P2.ToValue, MAX(CASE WHEN P2.ToValue < P1.ToValue THEN P1.ToValue ELSE P2.ToValue END) newToValue

FROM #Periods p1

INNER JOIN #Periods p2 ON P2.FromValue > P1.FromValue AND p2.FromValue < p1.ToValue

GROUP BY P2.ID, P2.FromValue, P2.ToValue

    DROP TABLE #Periods

    _____________
    Code for TallyGenerator

  • drew.allen (12/20/2016)


    Most approaches that I've seen collapse the ranges. That is, they don't retain a record with a zero length. Is there a particular reason that you need to retain the zero length?

    Drew

The question was asked by a colleague, so I am not totally informed of the goals.

    But as I understood, the rows contain more information than just the period information.

    So for some purposes the complete period is needed.

    For some other purposes the periods should not overlap.

For example, a 'project' can consist of a number of rows. The rows represent resources. More resources are there for a period; periods can overlap. A resource can be a 'project leader', but at any time there is only one main project leader. Only when a main project leader ends does another resource become the leader.

(Sorry for 'rephrasing' the problem in this way, but I cannot show the actual situation here.)

I have come up with a solution; I'll work that out and add it as a reply.

    Ben

Thanks, all, for your time and attention.

I started off with the post:

    http://www.sqlservercentral.com/Forums/Topic1760478-3412-1.aspx

    And tried to construct a solution from the solution presented there.

That was a bad idea. I spent some time on that solution, but didn't get it correct.

A night's rest did help; I came up with a far simpler solution:

    ------------------------------------------------------------------------------------------

    ------------------------------------------------------------------------------------------

    -- Solution --

    ------------------------------------------------------------------------------------------

    ------------------------------------------------------------------------------------------

    -- stef ten bras (Aka Ben)

    -- 20161222

    -- sarb net fets

    --

    -- Table/column names :

    -- TestTable Actual table.

    -- StartDate Start of the period including starttime.

    -- Enddate End of the period EXCLUDING end time.

    -- PreMaxDate Temporary column.

    -- NewStartDate New column with a new pseudo start of the period. This new pseudo start does not overlap with the previous period.

-- Nr Ordering number within a single subscriber.

    --

    --

-- Table A is prepared outside the WITH; this gives better performance.

    -- Instead of this table, the table A can also be used directly in the WITH statement.

    --

    If exists (select * from information_schema.tables where table_name like 'A') drop table A

    ;

    WITH

    A AS (SELECT row_number() OVER (PARTITION BY SubscriberID ORDER BY StartDate) Nr, * FROM TestTable)

    select * into A from A

    alter table A alter column subscriberid int not null

    alter table A alter column nr INT not null

    GO

    ALTER TABLE A ADD CONSTRAINT PK_A PRIMARY KEY CLUSTERED (SubscriberID,nr);

    -- SELECT * FROM a

    If exists (select * from tempdb.information_schema.tables where table_name like '##N') drop table ##N

    ;

    WITH

    --

-- Use the prepared A for efficiency. (Or use the commented-out version.)

    --

    --A AS (SELECT row_number() OVER (PARTITION BY SubscriberID ORDER BY StartDate) Nr, * FROM TestTable)

    --,

    CTE as (SELECT CONVERT(datetime, '1900-01-01') PreMaxDate, * FROM A where Nr = 1

    union all

    SELECT

    CASE WHEN s.PreMaxDate >= s.EndDate THEN s.PreMaxDate

    ELSE s.EndDate

    END PreMaxDate

    , E.* FROM CTE S join a E on S.Nr+1 = E.nr and S.SubscriberID = E.SubscriberID)

    , D as (SELECT

    CASE WHEN Nr= 1 THEN StartDate

    WHEN PreMaxDate >= StartDate THEN PreMaxDate

    ELSE StartDate

    END NewStartDate

    , * FROM CTE )

    SELECT * into ##N FROM D ORDER BY SubscriberID, StartDate OPTION (MAXRECURSION 0)

    select * from ##N

    --

-- Check result: there should be no gap sizes less than 0.

    --

    SELECT COUNT(*) Number_of_rows_TestTable FROM TestTable

    SELECT COUNT(*) Number_of_rows_should_be_the_same_as_in_testtable FROM ##N

    ;

    With

    A AS (select * from ##N)

    , B as (select P.EndDate pENDDATE, R.* FROM A p RIGHT JOIN A R ON P.NR+1 = R.NR AND P.Subscriberid = R.Subscriberid)

    SELECT CONVERT(int, newstartdate-penddate) gapsize,* FROM b

Preparing table A gave significantly better performance.
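The recursion above can be read as a per-subscriber running maximum of EndDate. A rough Python equivalent of the PreMaxDate/NewStartDate logic (illustrative only; the function and tuple layout are my own, the column names mirror the T-SQL):

```python
from itertools import groupby
from operator import itemgetter

def new_start_dates(rows):
    """rows: (subscriber_id, start, end) tuples.
    Mirrors the CTE: PreMaxDate = max EndDate of all earlier rows of the same
    subscriber; NewStartDate = max(StartDate, PreMaxDate)."""
    out = []
    for sid, grp in groupby(sorted(rows, key=itemgetter(0, 1)), key=itemgetter(0)):
        pre_max = None                      # PreMaxDate for the first row (Nr = 1)
        for _, start, end in grp:
            new_start = start if pre_max is None else max(start, pre_max)
            out.append((sid, new_start, end))
            pre_max = end if pre_max is None else max(pre_max, end)
    return out

# As in the T-SQL, a fully covered row gets NewStartDate = PreMaxDate,
# which may lie past its own end date.
print(new_start_dates([(1, 4, 14), (1, 7, 11), (1, 12, 21)]))
# → [(1, 4, 14), (1, 14, 11), (1, 14, 21)]
```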

Jeff Moden, thanks for the data-generating code in the thread. (Itzik Ben-Gan, thanks for the islands solution.)

    All thanks,

Have a very merry Christmas,

    Ben

The data generation is taken from the example of Jeff Moden and Itzik (thanks).

Small adjustments: the end time is 'corrected' so that it does NOT include the end day (or end time, if time is used).

And some 'noise' is generated to force the startdates not to be 'sorted'.

    --

    --=====================================================================================================================

    -- Create up to 10 million row test table.

    --=====================================================================================================================

    --===== If the test table exists, drop it to make reruns in SSMS easier

    SET NOCOUNT ON;

    IF OBJECT_ID(N'TestTable') IS NOT NULL DROP TABLE TestTable

    ;

    GO

    --===== Create and populate a 10 million row test table on the fly.

    -- 1,000,000 random IDs with ~10 random date spans of 0 to 14 days each

    Declare @widestring varchar(4000) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    SET @widestring = @widestring+@widestring+@widestring+@widestring

    SET @widestring = @widestring+@widestring+@widestring+@widestring

    -- Parameterize

    DECLARE @No_subscriptions int = 10000000

    DECLARE @No_subscripberids int = 1000000

DECLARE @STARTDATE_TEST datetime = '2000' -- subscriptions start after this date.

DECLARE @ENDDATE_TEST datetime = '20160310' -- subscriptions start before this date.

DECLARE @SPAN_TEST INT = 15 -- Max length (days) of the timespan of a subscription

    DECLARE @REALISM_TEST INT = 36 -- GENERATE SOME 'NOISE' FOR ORDERING IN startdate.

    -- second set

    SET @No_subscriptions = 1000

    SET @No_subscripberids = 10

    --SET @No_subscriptions = 1000000

    --SET @No_subscripberids = 10000

SET @STARTDATE_TEST = '2015' -- subscriptions start after this date.

SET @ENDDATE_TEST = '2016' -- subscriptions start before this date.

SET @SPAN_TEST = 10 -- Max length (days) of the timespan of a subscription

    SET @REALISM_TEST = 36 -- GENERATE SOME 'NOISE' FOR ORDERING IN startdate.

    exec sp_timing 'Aanmaken TestTable'

    ;

    WITH

    cteGenDates AS

    (

    SELECT TOP (@No_subscriptions) -- 10E6 rows

    SubscriberID = ABS(CHECKSUM(NEWID()))%@No_subscripberids+1

    ,StartDate = DATEADD(dd,ABS(CHECKSUM(NEWID()))%DATEDIFF(dd,@STARTDATE_TEST,@ENDDATE_TEST),@STARTDATE_TEST)

    ,Span = ABS(CHECKSUM(NEwID()))%(@SPAN_TEST) -- period max 15 days.

    ,realism = ABS(CHECKSUM(NEwID()))%(@REALISM_TEST) -- period max 36 days. (For (spread) sorting only)

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    )

    SELECT SubscriptionID = IDENTITY(INT,1,1)

    ,SubscriberID

    ,StartDate

    ,EndDate = DATEADD(dd,Span,StartDate)

    ,substring(@widestring,ABS(CHECKSUM(NEWID()))%400, 999) wide

    INTO TestTable

    FROM cteGenDates

ORDER BY (StartDate + realism) -- For a bit of realism

    ;

    alter table testtable alter column subscriberid int not null

    --===== Create the expected/needed indexes

    ALTER TABLE TestTable ADD CONSTRAINT PK_TestTable PRIMARY KEY CLUSTERED (SubscriberID,SubscriptionID);

    CREATE UNIQUE INDEX IX_Begin_Unique ON TestTable(SubscriberID, StartDate, SubscriptionID);

    CREATE UNIQUE INDEX IX_End_Unique ON TestTable(SubscriberID, EndDate , SubscriptionID);

    CREATE UNIQUE INDEX IX_cover ON TestTable(EndDate ,StartDate, SubscriberID, SubscriptionID);

    --

    -- Enddate is 'pushed' one day, to ensure that the enddate is not inclusive.

-- The period runs from the start date (inclusive) to the end date (exclusive).

    --

    update testtable set enddate = enddate+1

  • Sergiy (12/20/2016)


    Here is something to start with:

    SELECT ID, FromValue, ToValue

    INTO #Periods

    FROM (

    SELECT 1, 2.1, 8.0

    UNION

    SELECT 2, 3., 4.5

    UNION

    SELECT 3, 4., 10.

    UNION

    SELECT 4, 9., 20.

    ) I (ID, FromValue, ToValue)

SELECT P2.ID, P2.FromValue, MAX(P1.ToValue) newFromValue, P2.ToValue, MAX(CASE WHEN P2.ToValue < P1.ToValue THEN P1.ToValue ELSE P2.ToValue END) newToValue

FROM #Periods p1

INNER JOIN #Periods p2 ON P2.FromValue > P1.FromValue AND p2.FromValue < p1.ToValue

GROUP BY P2.ID, P2.FromValue, P2.ToValue

    DROP TABLE #Periods

Thanks. I rewrote the code to fit the example data.

SELECT P2.SubscriberID, P2.StartDate, MAX(P1.Enddate) newFromValue, P2.Enddate, MAX(CASE WHEN P2.Enddate < P1.Enddate THEN P1.Enddate ELSE P2.Enddate END) newToValue

FROM TestTable p1 INNER JOIN TestTable p2

ON P2.StartDate > P1.StartDate AND p2.StartDate < p1.Enddate AND p1.SubscriberID = p2.SubscriberID

GROUP BY P2.SubscriberID, P2.StartDate, P2.Enddate

    order by SubscriberID,startdate

    But this does not give the correct number of rows.

    And the data within the rows (newFromValue) does not make sense to me.

    Because of the group by data gets 'lost'.

So to me, this does not look usable.

    Thanks,

    Have a very Merry Christmas,

    Ben

  • This should be a simpler & faster solution as it only takes a single pass over the data...

    -- Setup some test data...

    IF OBJECT_ID('tempdb..#TestData', 'U') IS NOT NULL

    DROP TABLE #TestData;

    CREATE TABLE #TestData (

    ID INT NOT NULL PRIMARY KEY,

    BegDate DATE NOT NULL,

    EndDate DATE NOT NULL

    );

    INSERT #TestData (ID, BegDate, EndDate) VALUES

    (1, '2016-01-01', '2016-01-10'),

    (2, '2016-01-04', '2016-01-07'),

    (3, '2016-01-09', '2016-01-17'),

    (4, '2016-02-01', '2016-02-20'),

    (5, '2016-02-05', '2016-02-25'),

    (6, '2016-01-15', '2016-02-10');

    -- Create a "POC" index to get rid of the sort that would otherwise be

    -- generated by the window function.

    CREATE NONCLUSTERED INDEX ix_TestData_POC ON #TestData (BegDate, EndDate, ID);

    --============================================================================

    -- The actual solution...

    WITH

    cte_AlteredValues AS (

    SELECT

    td.ID,

    td.BegDate,

    td.EndDate,

    MaxPrevEnd = MAX(td.EndDate) OVER (ORDER BY td.BegDate, td.EndDate, td.ID ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

    FROM

    #TestData td

    )

    SELECT

    av.ID,

    av.BegDate,

    av.EndDate,

    [.] = '||',

    NewBeg = CASE WHEN av.BegDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.BegDate END,

    NewEnd = CASE WHEN av.EndDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.EndDate END

    FROM

    cte_AlteredValues av;

    Results...

    ID BegDate EndDate . NewBeg NewEnd

    ----------- ---------- ---------- ---- ---------- ----------

    1 2016-01-01 2016-01-10 || 2016-01-01 2016-01-10

    2 2016-01-04 2016-01-07 || 2016-01-11 2016-01-11

    3 2016-01-09 2016-01-17 || 2016-01-11 2016-01-17

    6 2016-01-15 2016-02-10 || 2016-01-18 2016-02-10

    4 2016-02-01 2016-02-20 || 2016-02-11 2016-02-20

    5 2016-02-05 2016-02-25 || 2016-02-21 2016-02-25

Here's what the solution looks like laid out graphically in Excel (note: I changed a few dates to compact the ranges...)

  • ben.brugman (12/22/2016)


    Sergiy (12/20/2016)


    Here is something to start with:

    SELECT ID, FromValue, ToValue

    INTO #Periods

    FROM (

    SELECT 1, 2.1, 8.0

    UNION

    SELECT 2, 3., 4.5

    UNION

    SELECT 3, 4., 10.

    UNION

    SELECT 4, 9., 20.

    ) I (ID, FromValue, ToValue)

SELECT P2.ID, P2.FromValue, MAX(P1.ToValue) newFromValue, P2.ToValue, MAX(CASE WHEN P2.ToValue < P1.ToValue THEN P1.ToValue ELSE P2.ToValue END) newToValue

FROM #Periods p1

INNER JOIN #Periods p2 ON P2.FromValue > P1.FromValue AND p2.FromValue < p1.ToValue

GROUP BY P2.ID, P2.FromValue, P2.ToValue

    DROP TABLE #Periods

Thanks. I rewrote the code to fit the example data.

SELECT P2.SubscriberID, P2.StartDate, MAX(P1.Enddate) newFromValue, P2.Enddate, MAX(CASE WHEN P2.Enddate < P1.Enddate THEN P1.Enddate ELSE P2.Enddate END) newToValue

FROM TestTable p1 INNER JOIN TestTable p2

ON P2.StartDate > P1.StartDate AND p2.StartDate < p1.Enddate AND p1.SubscriberID = p2.SubscriberID

GROUP BY P2.SubscriberID, P2.StartDate, P2.Enddate

    order by SubscriberID,startdate

    But this does not give the correct number of rows.

    And the data within the rows (newFromValue) does not make sense to me.

    Because of the group by data gets 'lost'.

So to me, this does not look usable.

    Thanks,

    Have a very Merry Christmas,

    Ben

    Can you post your TestTable definition?

    _____________
    Code for TallyGenerator

  • ben.brugman (12/22/2016)


    But this does not give the correct number of rows.

    And the data within the rows (newFromValue) does not make sense to me.

    Because of the group by data gets 'lost'.

So to me, this does not look usable.

    Thanks,

    Have a very Merry Christmas,

    Ben

    As I said - that was something for you to start with.

    The query returns only the records which need to be changed.

    It's not because of GROUP BY, but because of INNER JOIN.

    "newFromValue" is what should be in FromValue for updated rows after the adjustment.

    Same for "newToValue".

Try changing the INNER JOIN to one of the OUTER JOINs in my query and see if it appears more usable to you.

    _____________
    Code for TallyGenerator

  • Sergiy (12/22/2016)


    Can you post your TestTable definition?

    The testtable definition was in the post before my previous post.

At the start of that code the text is: -- Create up to 10 million row test table.

    Ben

  • Jason A. Long (12/22/2016)


Here's what the solution looks like laid out graphically in Excel (note: I changed a few dates to compact the ranges...)

    Exactly what was asked, thank you.

    I am at home enjoying my Christmas holidays. At home I only have version 2008 installed.

    So I haven't used your solution yet. And for me this is some new syntax to learn. Thanks for educating me. I have to study the code to understand it and if I do not use it this time, it will be a valuable addition to my toolbox.

    Thanks, happy holidays, happy Christmas to all of you,

    Ben

    Edit:

I took a better look at the code. I am not familiar with the "BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING" part.

But I do think I understand the code. (I still have to try it.)

    Remarks :

I do not like BETWEEN in SQL because, in general, it is scale dependent. I like to write code in such a way that it still works whether the starttime/endtime are expressed in days, minutes, seconds, months, or any other scale. BETWEEN rarely works scale independently. (But I still have to figure that out for this situation.)

The second indication for me that this is not scale independent is the 'add a day' in the next part.

Within the hour of the first message I made an edit explaining that the actual 'problem' was not on the scale of days but on a time scale; I should have been clearer about this. (Sorry.)

In general, when working with ranges I normally include the beginning and exclude the end.

So if I say from 2 to 7 kilometers, I could also say from 2000 to 7000 meters. The same applies to datetime formats.

If I say from 15 to 16 hours, I could also say from 15:00:00.000 to 16:00:00.000. With BETWEEN this works differently, so I prefer to include the begin time and exclude the end 'timestamp'.

  • ben.brugman (12/23/2016)


    Jason A. Long (12/22/2016)


Here's what the solution looks like laid out graphically in Excel (note: I changed a few dates to compact the ranges...)

    Exactly what was asked, thank you.

    I am at home enjoying my Christmas holidays. At home I only have version 2008 installed.

    So I haven't used your solution yet. And for me this is some new syntax to learn. Thanks for educating me. I have to study the code to understand it and if I do not use it this time, it will be a valuable addition to my toolbox.

    Thanks, happy holidays, happy Christmas to all of you,

    Ben

    No problem. This was a fun exercise. Happy Holidays & Merry Christmas to you as well. 🙂

    Edit:

I took a better look at the code. I am not familiar with the "BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING" part.

But I do think I understand the code. (I still have to try it.)

    This is called a "window frame"... AKA framing a section of data for a windowed function.

    It simply says, "Order the rows by the criteria in the ORDER BY clause and then look at all of the rows before this row (UNBOUNDED PRECEDING) and do not include the current row (1 PRECEDING)".

    It's a new(ish) syntax that was added to SQL Server in 2012.
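As a tiny illustration of that frame, the running maximum over all preceding rows (excluding the current one) can be sketched in plain Python (hypothetical helper, not SQL Server code):

```python
def max_preceding(values):
    """For each position: MAX over ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING,
    i.e. the maximum of all earlier values; None for the first row (empty frame)."""
    out, running = [], None
    for v in values:
        out.append(running)                      # frame ends one row before the current row
        running = v if running is None else max(running, v)
    return out

print(max_preceding([10, 7, 17, 12]))
# → [None, 10, 10, 17]
```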

Itzik Ben-Gan does a superb job of covering it in his book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions

He also has a 3-part blog series on the subject...

    Part 1

    Part 2

    Part 3

    Remarks :

I do not like BETWEEN in SQL because, in general, it is scale dependent. I like to write code in such a way that it still works whether the starttime/endtime are expressed in days, minutes, seconds, months, or any other scale. BETWEEN rarely works scale independently. (But I still have to figure that out for this situation.)

The second indication for me that this is not scale independent is the 'add a day' in the next part.

Within the hour of the first message I made an edit explaining that the actual 'problem' was not on the scale of days but on a time scale; I should have been clearer about this. (Sorry.)

In general, when working with ranges I normally include the beginning and exclude the end.

So if I say from 2 to 7 kilometers, I could also say from 2000 to 7000 meters. The same applies to datetime formats.

If I say from 15 to 16 hours, I could also say from 15:00:00.000 to 16:00:00.000. With BETWEEN this works differently, so I prefer to include the begin time and exclude the end 'timestamp'.

    I agree with you about using BETWEEN in the WHERE clause but this is a different animal altogether.

When using window frames, you're a little more restricted when it comes to the syntax you can use. In fact, learning the syntax can be a bit of a challenge because it's not what we're used to. That said, it's extremely powerful and well worth learning. (At least it isn't anywhere near as bad as XML.)

The code works off a simple idea...

If you order the rows by start date, the end dates of earlier rows control the begin dates of all subsequent rows.

    If the end date of a subsequent row comes before the end date of a previous row, it is effectively collapsed into nothing (the begin and end are equal).

    Here is the solution part w/ some additional comments...

    WITH

    cte_AlteredValues AS (

    SELECT

    td.ID,

    td.BegDate,

    td.EndDate,

    -- get the MAX(EndDate) of all rows that have come before the current row (ordered by td.BegDate, td.EndDate, td.ID)

    -- "ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING" = Looks at all rows that came before the current row while excluding the current row.

-- UNBOUNDED PRECEDING = go to the very beginning.

    -- 1 PRECEDING = stop at the row just before the current row.

    MaxPrevEnd = MAX(td.EndDate) OVER (ORDER BY td.BegDate, td.EndDate, td.ID ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

    FROM

    #TestData td

    )

    SELECT

    av.ID,

    av.BegDate,

    av.EndDate,

    [.] = '||',

    -- If the current BegDate is <= MaxPrevEnd then add a day to MaxPrevEnd and make that the new current BegDate... If not, use the existing current BegDate.

    NewBeg = CASE WHEN av.BegDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.BegDate END,

    -- If the current EndDate is <= MaxPrevEnd then add a day to MaxPrevEnd and make that the new current EndDate... If not, use the existing current EndDate.

    NewEnd = CASE WHEN av.EndDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.EndDate END

    -- If the net result of the two CASE expressions is that BegDate = EndDate, you know that it has been reduced to nothingness.

    FROM

    cte_AlteredValues av;
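For readers on pre-2012 versions, the same single-pass idea can be expressed procedurally. Here is a plain-Python sketch of the frame-based solution (the function name and tuple layout are my own; the dates follow the #TestData sample, but this is an illustration, not the T-SQL itself):

```python
from datetime import date, timedelta

def collapse(rows):
    """rows: (id, beg, end) tuples. Mirrors the query:
    MaxPrevEnd = MAX(EndDate) over all preceding rows (ordered by beg, end, id);
    NewBeg/NewEnd = MaxPrevEnd + 1 day whenever the original value is <= MaxPrevEnd."""
    out, max_prev_end = [], None
    for rid, beg, end in sorted(rows, key=lambda r: (r[1], r[2], r[0])):
        new_beg = beg if max_prev_end is None or beg > max_prev_end else max_prev_end + timedelta(days=1)
        new_end = end if max_prev_end is None or end > max_prev_end else max_prev_end + timedelta(days=1)
        out.append((rid, new_beg, new_end))
        max_prev_end = end if max_prev_end is None else max(max_prev_end, end)
    return out

# Same six rows as the #TestData sample above:
rows = [(1, date(2016, 1, 1), date(2016, 1, 10)), (2, date(2016, 1, 4), date(2016, 1, 7)),
        (3, date(2016, 1, 9), date(2016, 1, 17)), (4, date(2016, 2, 1), date(2016, 2, 20)),
        (5, date(2016, 2, 5), date(2016, 2, 25)), (6, date(2016, 1, 15), date(2016, 2, 10))]
for r in collapse(rows):
    print(r)  # row 2 collapses to a single day: (2016-01-11, 2016-01-11)
```

Running this reproduces the NewBeg/NewEnd columns of the results table shown earlier.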

Viewing 15 posts - 1 through 15 (of 21 total)
