Overlapping datetime ranges.

  • Overlap in ranges.

I have a set of data rows, each with a start_time and an end_time.

    Each row should get a New_starttime, so that periods do not overlap.

A period cannot be split into two periods.

    Example:

    Periods (overlapping):

    ----1111111111----------------------------------------

    -------2222-------------------------------------------

    ------------333333333---------------------------------

Periods with a New_starttime, removing overlaps:

    ----1111111111---------------------------------------- New_starttime is equal to starttime

    ------------------------------------------------------ New_starttime is equal to endtime

--------------3333333--------------------------------- New_starttime is equal to endtime of 1

    In this example the first period is kept 'complete'.

    The second period effectively gets a duration of 0.

The third period gets a new starttime and is effectively shortened.
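The rule above can be sketched outside SQL. Here is a minimal Python illustration (hypothetical helper, not from the thread), treating the ASCII picture as numeric half-open positions: period 1 = [4, 14), period 2 = [7, 11), period 3 = [12, 21):

```python
# Overlap removal: each row keeps its end; its start is pushed forward to the
# largest end seen so far (capped at its own end, so a covered row gets length 0).
def remove_overlaps(rows):
    """rows: list of (start, end) sorted by start; returns (new_start, end) per row."""
    result = []
    max_end = None
    for start, end in rows:
        if max_end is None or start >= max_end:
            new_start = start                  # no overlap: keep the original start
        else:
            new_start = min(max_end, end)      # shortened, or zero-length if fully covered
        result.append((new_start, end))
        max_end = end if max_end is None else max(max_end, end)
    return result

# Periods 1, 2, 3 from the ASCII picture:
print(remove_overlaps([(4, 14), (7, 11), (12, 21)]))
# → [(4, 14), (11, 11), (14, 21)]: row 1 unchanged, row 2 zero-length, row 3 shortened
```

Note that a row starting exactly where an earlier one ends is left unchanged, matching the "same times do not overlap" rule in the edit below.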

    -- See also the thread :

    -- http://www.sqlservercentral.com/Forums/Topic1760478-3412-1.aspx

I have looked at that thread and am still working on it. I think it is a good starting point, both for a set of sample data and for a solution. I am posting the problem here because there might already be a 'canned' solution for it, or it may be an interesting problem. I'll be working on a solution myself; I think I can come up with something.

So, to restate the question: the number of rows should remain the same, but with the New_starttime the overlap should be removed from the periods.

    Thanks for your time and attention.

A solution which also works in 2005/2008 is preferred.

    Ben

    Edit:

    The example uses periods which start and end exactly on a date.

In my situation the periods start and end at an exact 'time'.

Overlap is then defined as an overlapping period. (The overlap can be large (a day) or very small (a microsecond); identical times do not overlap.)

    Sorry I missed this when posting this for the first time.

  • Most approaches that I've seen collapse the ranges. That is, they don't retain a record with a zero length. Is there a particular reason that you need to retain the zero length?

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (12/20/2016)


    Most approaches that I've seen collapse the ranges. That is, they don't retain a record with a zero length. Is there a particular reason that you need to retain the zero length?

    Drew

    Ben,

What Drew says, and what is your actual objective in collapsing these time intervals? If we know the why, we may be far better able to help determine a useful approach. Just collapsing them without eliminating any zero-duration events doesn't necessarily make much sense without some larger context in which there's a need to retain their existence.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Here is something to start with:

    SELECT ID, FromValue, ToValue

    INTO #Periods

    FROM (

    SELECT 1, 2.1, 8.0

    UNION

    SELECT 2, 3., 4.5

    UNION

    SELECT 3, 4., 10.

    UNION

    SELECT 4, 9., 20.

    ) I (ID, FromValue, ToValue)

SELECT P2.ID, P2.FromValue, MAX(P1.ToValue) newFromValue, P2.ToValue, MAX(CASE WHEN P2.ToValue < P1.ToValue THEN P1.ToValue ELSE P2.ToValue END) newToValue

FROM #Periods p1

INNER JOIN #Periods p2 ON P2.FromValue > P1.FromValue AND p2.FromValue < p1.ToValue

GROUP BY P2.ID, P2.FromValue, P2.ToValue

    DROP TABLE #Periods

    _____________
    Code for TallyGenerator

  • drew.allen (12/20/2016)


    Most approaches that I've seen collapse the ranges. That is, they don't retain a record with a zero length. Is there a particular reason that you need to retain the zero length?

    Drew

The question was asked by a colleague, so I am not totally informed of the goals.

    But as I understood, the rows contain more information than just the period information.

    So for some purposes the complete period is needed.

    For some other purposes the periods should not overlap.

For example, a 'project' can consist of a number of rows. The rows represent resources. More resources are there for a period; periods can overlap. A resource can be a 'project leader', but at any time there is only one main project leader. Only when a main project leader ends does another resource become the leader.

(Sorry for 'rephrasing' the problem in this way, but I cannot show the actual situation here.)

I have come up with a solution; I'll work that out and add it as a reply.

    Ben

Thanks, all, for your time and attention.

I started off with the post:

    http://www.sqlservercentral.com/Forums/Topic1760478-3412-1.aspx

    And tried to construct a solution from the solution presented there.

That was a bad idea. I spent some time on that solution, but didn't get it correct.

A night's rest did help; I came up with a far simpler solution:

    ------------------------------------------------------------------------------------------

    ------------------------------------------------------------------------------------------

    -- Solution --

    ------------------------------------------------------------------------------------------

    ------------------------------------------------------------------------------------------

    -- stef ten bras (Aka Ben)

    -- 20161222

    -- sarb net fets

    --

    -- Table/column names :

    -- TestTable Actual table.

    -- StartDate Start of the period including starttime.

    -- Enddate End of the period EXCLUDING end time.

    -- PreMaxDate Temporary column.

    -- NewStartDate New column with a new pseudo start of the period. This new pseudo start does not overlap with the previous period.

-- Nr Ordering number within a single subscriber.

    --

    --

-- Table A is prepared outside the WITH; this gives better performance.

    -- Instead of this table, the table A can also be used directly in the WITH statement.

    --

    If exists (select * from information_schema.tables where table_name like 'A') drop table A

    ;

    WITH

    A AS (SELECT row_number() OVER (PARTITION BY SubscriberID ORDER BY StartDate) Nr, * FROM TestTable)

    select * into A from A

    alter table A alter column subscriberid int not null

    alter table A alter column nr INT not null

    GO

    ALTER TABLE A ADD CONSTRAINT PK_A PRIMARY KEY CLUSTERED (SubscriberID,nr);

    -- SELECT * FROM a

    If exists (select * from tempdb.information_schema.tables where table_name like '##N') drop table ##N

    ;

    WITH

    --

-- Use the prepared A for efficiency. (Or use the commented-out version.)

    --

    --A AS (SELECT row_number() OVER (PARTITION BY SubscriberID ORDER BY StartDate) Nr, * FROM TestTable)

    --,

    CTE as (SELECT CONVERT(datetime, '1900-01-01') PreMaxDate, * FROM A where Nr = 1

    union all

    SELECT

    CASE WHEN s.PreMaxDate >= s.EndDate THEN s.PreMaxDate

    ELSE s.EndDate

    END PreMaxDate

    , E.* FROM CTE S join a E on S.Nr+1 = E.nr and S.SubscriberID = E.SubscriberID)

    , D as (SELECT

    CASE WHEN Nr= 1 THEN StartDate

    WHEN PreMaxDate >= StartDate THEN PreMaxDate

    ELSE StartDate

    END NewStartDate

    , * FROM CTE )

    SELECT * into ##N FROM D ORDER BY SubscriberID, StartDate OPTION (MAXRECURSION 0)

    select * from ##N

    --

-- Check result: there should be no gap sizes less than 0.

    --

    SELECT COUNT(*) Number_of_rows_TestTable FROM TestTable

    SELECT COUNT(*) Number_of_rows_should_be_the_same_as_in_testtable FROM ##N

    ;

    With

    A AS (select * from ##N)

    , B as (select P.EndDate pENDDATE, R.* FROM A p RIGHT JOIN A R ON P.NR+1 = R.NR AND P.Subscriberid = R.Subscriberid)

    SELECT CONVERT(int, newstartdate-penddate) gapsize,* FROM b

Preparing table A gave significantly better performance.
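The recursion above can be read as a per-subscriber running maximum of EndDate. A rough Python equivalent of the PreMaxDate/NewStartDate logic (illustrative only; the function and tuple layout are my own, the column names mirror the T-SQL):

```python
from itertools import groupby
from operator import itemgetter

def new_start_dates(rows):
    """rows: (subscriber_id, start, end) tuples.
    Mirrors the CTE: PreMaxDate = max EndDate of all earlier rows of the same
    subscriber; NewStartDate = max(StartDate, PreMaxDate)."""
    out = []
    for sid, grp in groupby(sorted(rows, key=itemgetter(0, 1)), key=itemgetter(0)):
        pre_max = None                      # PreMaxDate for the first row (Nr = 1)
        for _, start, end in grp:
            new_start = start if pre_max is None else max(start, pre_max)
            out.append((sid, new_start, end))
            pre_max = end if pre_max is None else max(pre_max, end)
    return out

# As in the T-SQL, a fully covered row gets NewStartDate = PreMaxDate,
# which may lie past its own end date.
print(new_start_dates([(1, 4, 14), (1, 7, 11), (1, 12, 21)]))
# → [(1, 4, 14), (1, 14, 11), (1, 14, 21)]
```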

Jeff Moden, thanks for the data-generating code in the thread. (Itzik Ben-Gan, thanks for the islands solution.)

    All thanks,

Have a very merry Christmas,

    Ben

The data generation is taken from the example of Jeff Moden and Itzik (thanks).

Small adjustments: the end time is 'corrected' so that it does NOT include the end day (or end time, if time is used).

And some 'noise' is generated to force the startdates not to be 'sorted'.

    --

    --=====================================================================================================================

    -- Create up to 10 million row test table.

    --=====================================================================================================================

    --===== If the test table exists, drop it to make reruns in SSMS easier

    SET NOCOUNT ON;

    IF OBJECT_ID(N'TestTable') IS NOT NULL DROP TABLE TestTable

    ;

    GO

    --===== Create and populate a 10 million row test table on the fly.

    -- 1,000,000 random IDs with ~10 random date spans of 0 to 14 days each

    Declare @widestring varchar(4000) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    SET @widestring = @widestring+@widestring+@widestring+@widestring

    SET @widestring = @widestring+@widestring+@widestring+@widestring

    -- Parameterize

    DECLARE @No_subscriptions int = 10000000

    DECLARE @No_subscripberids int = 1000000

DECLARE @STARTDATE_TEST datetime = '2000' -- subscriptions start after this date.

DECLARE @ENDDATE_TEST datetime = '20160310' -- subscriptions start before this date.

DECLARE @SPAN_TEST INT = 15 -- Max length (days) of the timespan of a subscription

    DECLARE @REALISM_TEST INT = 36 -- GENERATE SOME 'NOISE' FOR ORDERING IN startdate.

    -- second set

    SET @No_subscriptions = 1000

    SET @No_subscripberids = 10

    --SET @No_subscriptions = 1000000

    --SET @No_subscripberids = 10000

SET @STARTDATE_TEST = '2015' -- subscriptions start after this date.

SET @ENDDATE_TEST = '2016' -- subscriptions start before this date.

SET @SPAN_TEST = 10 -- Max length (days) of the timespan of a subscription

    SET @REALISM_TEST = 36 -- GENERATE SOME 'NOISE' FOR ORDERING IN startdate.

    exec sp_timing 'Aanmaken TestTable'

    ;

    WITH

    cteGenDates AS

    (

    SELECT TOP (@No_subscriptions) -- 10E6 rows

    SubscriberID = ABS(CHECKSUM(NEWID()))%@No_subscripberids+1

    ,StartDate = DATEADD(dd,ABS(CHECKSUM(NEWID()))%DATEDIFF(dd,@STARTDATE_TEST,@ENDDATE_TEST),@STARTDATE_TEST)

    ,Span = ABS(CHECKSUM(NEwID()))%(@SPAN_TEST) -- period max 15 days.

    ,realism = ABS(CHECKSUM(NEwID()))%(@REALISM_TEST) -- period max 36 days. (For (spread) sorting only)

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    )

    SELECT SubscriptionID = IDENTITY(INT,1,1)

    ,SubscriberID

    ,StartDate

    ,EndDate = DATEADD(dd,Span,StartDate)

    ,substring(@widestring,ABS(CHECKSUM(NEWID()))%400, 999) wide

    INTO TestTable

    FROM cteGenDates

ORDER BY (StartDate + realism) -- For a bit of realism

    ;

    alter table testtable alter column subscriberid int not null

    --===== Create the expected/needed indexes

    ALTER TABLE TestTable ADD CONSTRAINT PK_TestTable PRIMARY KEY CLUSTERED (SubscriberID,SubscriptionID);

    CREATE UNIQUE INDEX IX_Begin_Unique ON TestTable(SubscriberID, StartDate, SubscriptionID);

    CREATE UNIQUE INDEX IX_End_Unique ON TestTable(SubscriberID, EndDate , SubscriptionID);

    CREATE UNIQUE INDEX IX_cover ON TestTable(EndDate ,StartDate, SubscriberID, SubscriptionID);

    --

    -- Enddate is 'pushed' one day, to ensure that the enddate is not inclusive.

-- The period runs from the start date (inclusive) to the end date (exclusive).

    --

    update testtable set enddate = enddate+1

  • Sergiy (12/20/2016)


    Here is something to start with:

    SELECT ID, FromValue, ToValue

    INTO #Periods

    FROM (

    SELECT 1, 2.1, 8.0

    UNION

    SELECT 2, 3., 4.5

    UNION

    SELECT 3, 4., 10.

    UNION

    SELECT 4, 9., 20.

    ) I (ID, FromValue, ToValue)

SELECT P2.ID, P2.FromValue, MAX(P1.ToValue) newFromValue, P2.ToValue, MAX(CASE WHEN P2.ToValue < P1.ToValue THEN P1.ToValue ELSE P2.ToValue END) newToValue

FROM #Periods p1

INNER JOIN #Periods p2 ON P2.FromValue > P1.FromValue AND p2.FromValue < p1.ToValue

GROUP BY P2.ID, P2.FromValue, P2.ToValue

    DROP TABLE #Periods

Thanks. I rewrote the code to fit the example data.

SELECT P2.SubscriberID, P2.StartDate, MAX(P1.Enddate) newFromValue, P2.Enddate, MAX(CASE WHEN P2.Enddate < P1.Enddate THEN P1.Enddate ELSE P2.Enddate END) newToValue

FROM TestTable p1 INNER JOIN TestTable p2

ON P2.StartDate > P1.StartDate AND p2.StartDate < p1.Enddate AND p1.SubscriberID = p2.SubscriberID

GROUP BY P2.SubscriberID, P2.StartDate, P2.Enddate

    order by SubscriberID,startdate

    But this does not give the correct number of rows.

    And the data within the rows (newFromValue) does not make sense to me.

    Because of the group by data gets 'lost'.

So to me, this does not look usable.

    Thanks,

    Have a very Merry Christmas,

    Ben

  • This should be a simpler & faster solution as it only takes a single pass over the data...

    -- Setup some test data...

    IF OBJECT_ID('tempdb..#TestData', 'U') IS NOT NULL

    DROP TABLE #TestData;

    CREATE TABLE #TestData (

    ID INT NOT NULL PRIMARY KEY,

    BegDate DATE NOT NULL,

    EndDate DATE NOT NULL

    );

    INSERT #TestData (ID, BegDate, EndDate) VALUES

    (1, '2016-01-01', '2016-01-10'),

    (2, '2016-01-04', '2016-01-07'),

    (3, '2016-01-09', '2016-01-17'),

    (4, '2016-02-01', '2016-02-20'),

    (5, '2016-02-05', '2016-02-25'),

    (6, '2016-01-15', '2016-02-10');

    -- Create a "POC" index to get rid of the sort that would otherwise be

    -- generated by the window function.

    CREATE NONCLUSTERED INDEX ix_TestData_POC ON #TestData (BegDate, EndDate, ID);

    --============================================================================

    -- The actual solution...

    WITH

    cte_AlteredValues AS (

    SELECT

    td.ID,

    td.BegDate,

    td.EndDate,

    MaxPrevEnd = MAX(td.EndDate) OVER (ORDER BY td.BegDate, td.EndDate, td.ID ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

    FROM

    #TestData td

    )

    SELECT

    av.ID,

    av.BegDate,

    av.EndDate,

    [.] = '||',

    NewBeg = CASE WHEN av.BegDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.BegDate END,

    NewEnd = CASE WHEN av.EndDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.EndDate END

    FROM

    cte_AlteredValues av;

    Results...

    ID BegDate EndDate . NewBeg NewEnd

    ----------- ---------- ---------- ---- ---------- ----------

    1 2016-01-01 2016-01-10 || 2016-01-01 2016-01-10

    2 2016-01-04 2016-01-07 || 2016-01-11 2016-01-11

    3 2016-01-09 2016-01-17 || 2016-01-11 2016-01-17

    6 2016-01-15 2016-02-10 || 2016-01-18 2016-02-10

    4 2016-02-01 2016-02-20 || 2016-02-11 2016-02-20

    5 2016-02-05 2016-02-25 || 2016-02-21 2016-02-25

Here's what the solution looks like laid out graphically in Excel (note: I changed a few dates to compact the ranges...)

  • ben.brugman (12/22/2016)


    Sergiy (12/20/2016)


    Here is something to start with:

    SELECT ID, FromValue, ToValue

    INTO #Periods

    FROM (

    SELECT 1, 2.1, 8.0

    UNION

    SELECT 2, 3., 4.5

    UNION

    SELECT 3, 4., 10.

    UNION

    SELECT 4, 9., 20.

    ) I (ID, FromValue, ToValue)

SELECT P2.ID, P2.FromValue, MAX(P1.ToValue) newFromValue, P2.ToValue, MAX(CASE WHEN P2.ToValue < P1.ToValue THEN P1.ToValue ELSE P2.ToValue END) newToValue

FROM #Periods p1

INNER JOIN #Periods p2 ON P2.FromValue > P1.FromValue AND p2.FromValue < p1.ToValue

GROUP BY P2.ID, P2.FromValue, P2.ToValue

    DROP TABLE #Periods

Thanks. I rewrote the code to fit the example data.

SELECT P2.SubscriberID, P2.StartDate, MAX(P1.Enddate) newFromValue, P2.Enddate, MAX(CASE WHEN P2.Enddate < P1.Enddate THEN P1.Enddate ELSE P2.Enddate END) newToValue

FROM TestTable p1 INNER JOIN TestTable p2

ON P2.StartDate > P1.StartDate AND p2.StartDate < p1.Enddate AND p1.SubscriberID = p2.SubscriberID

GROUP BY P2.SubscriberID, P2.StartDate, P2.Enddate

    order by SubscriberID,startdate

    But this does not give the correct number of rows.

    And the data within the rows (newFromValue) does not make sense to me.

    Because of the group by data gets 'lost'.

So to me, this does not look usable.

    Thanks,

    Have a very Merry Christmas,

    Ben

    Can you post your TestTable definition?

    _____________
    Code for TallyGenerator

  • ben.brugman (12/22/2016)


    But this does not give the correct number of rows.

    And the data within the rows (newFromValue) does not make sense to me.

    Because of the group by data gets 'lost'.

So to me, this does not look usable.

    Thanks,

    Have a very Merry Christmas,

    Ben

    As I said - that was something for you to start with.

    The query returns only the records which need to be changed.

    It's not because of GROUP BY, but because of INNER JOIN.

    "newFromValue" is what should be in FromValue for updated rows after the adjustment.

    Same for "newToValue".

Try changing the INNER JOIN to one of the OUTER JOINs in my query and see if it appears more usable to you.

    _____________
    Code for TallyGenerator

  • Sergiy (12/22/2016)


    Can you post your TestTable definition?

    The testtable definition was in the post before my previous post.

At the start of that code the text is: -- Create up to 10 million row test table.

    Ben

  • Jason A. Long (12/22/2016)


Here's what the solution looks like laid out graphically in Excel (note: I changed a few dates to compact the ranges...)

    Exactly what was asked, thank you.

    I am at home enjoying my Christmas holidays. At home I only have version 2008 installed.

    So I haven't used your solution yet. And for me this is some new syntax to learn. Thanks for educating me. I have to study the code to understand it and if I do not use it this time, it will be a valuable addition to my toolbox.

    Thanks, happy holidays, happy Christmas to all of you,

    Ben

    Edit:

I took a better look at the code. I am not familiar with the "BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING" part.

But I do think I understand the code. (I still have to try it.)

    Remarks :

I do not like BETWEEN in SQL because, in general, it is scale dependent. I like to write code in such a way that it still works whether the starttime/endtime are expressed in days, minutes, seconds, months, or any other scale. BETWEEN rarely works scale independently. (But I still have to figure that out for this situation.)

The second indication for me that this is not scale independent is the 'add a day' in the next part.

Within the hour of the first message I made an edit explaining that the actual 'problem' was not on the scale of days but on a time scale; I should have been clearer about this. (Sorry.)

In general, when working with ranges I normally include the beginning and exclude the end.

So if I say from 2 to 7 kilometers, I could also say from 2000 to 7000 meters. The same applies to datetime formats.

If I say from 15 to 16 hours, I could also say from 15:00:00.000 to 16:00:00.000. With BETWEEN this works differently, so I prefer to include the begin time and exclude the end 'timestamp'.

  • ben.brugman (12/23/2016)


    Jason A. Long (12/22/2016)


Here's what the solution looks like laid out graphically in Excel (note: I changed a few dates to compact the ranges...)

    Exactly what was asked, thank you.

    I am at home enjoying my Christmas holidays. At home I only have version 2008 installed.

    So I haven't used your solution yet. And for me this is some new syntax to learn. Thanks for educating me. I have to study the code to understand it and if I do not use it this time, it will be a valuable addition to my toolbox.

    Thanks, happy holidays, happy Christmas to all of you,

    Ben

    No problem. This was a fun exercise. Happy Holidays & Merry Christmas to you as well. 🙂

    Edit:

I took a better look at the code. I am not familiar with the "BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING" part.

But I do think I understand the code. (I still have to try it.)

    This is called a "window frame"... AKA framing a section of data for a windowed function.

    It simply says, "Order the rows by the criteria in the ORDER BY clause and then look at all of the rows before this row (UNBOUNDED PRECEDING) and do not include the current row (1 PRECEDING)".

    It's a new(ish) syntax that was added to SQL Server in 2012.
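As a tiny illustration of that frame, the running maximum over all preceding rows (excluding the current one) can be sketched in plain Python (hypothetical helper, not SQL Server code):

```python
def max_preceding(values):
    """For each position: MAX over ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING,
    i.e. the maximum of all earlier values; None for the first row (empty frame)."""
    out, running = [], None
    for v in values:
        out.append(running)                      # frame ends one row before the current row
        running = v if running is None else max(running, v)
    return out

print(max_preceding([10, 7, 17, 12]))
# → [None, 10, 10, 17]
```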

Itzik Ben-Gan does a superb job of covering it in his book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions

He also has a 3-part blog series on the subject...

    Part 1

    Part 2

    Part 3

    Remarks :

I do not like BETWEEN in SQL because, in general, it is scale dependent. I like to write code in such a way that it still works whether the starttime/endtime are expressed in days, minutes, seconds, months, or any other scale. BETWEEN rarely works scale independently. (But I still have to figure that out for this situation.)

The second indication for me that this is not scale independent is the 'add a day' in the next part.

Within the hour of the first message I made an edit explaining that the actual 'problem' was not on the scale of days but on a time scale; I should have been clearer about this. (Sorry.)

In general, when working with ranges I normally include the beginning and exclude the end.

So if I say from 2 to 7 kilometers, I could also say from 2000 to 7000 meters. The same applies to datetime formats.

If I say from 15 to 16 hours, I could also say from 15:00:00.000 to 16:00:00.000. With BETWEEN this works differently, so I prefer to include the begin time and exclude the end 'timestamp'.

    I agree with you about using BETWEEN in the WHERE clause but this is a different animal altogether.

When using window frames, you're a little more restricted when it comes to the syntax you can use. In fact, learning the syntax can be a bit of a challenge because it's not what we're used to. That said, it's extremely powerful and well worth learning. (At least it isn't anywhere near as bad as XML.)

The code works off a simple idea...

If you order the rows by start date, the end dates of earlier rows control the begin dates of all subsequent rows.

    If the end date of a subsequent row comes before the end date of a previous row, it is effectively collapsed into nothing (the begin and end are equal).

    Here is the solution part w/ some additional comments...

    WITH

    cte_AlteredValues AS (

    SELECT

    td.ID,

    td.BegDate,

    td.EndDate,

    -- get the MAX(EndDate) of all rows that have come before the current row (ordered by td.BegDate, td.EndDate, td.ID)

    -- "ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING" = Looks at all rows that came before the current row while excluding the current row.

-- UNBOUNDED PRECEDING = go to the very beginning.

    -- 1 PRECEDING = stop at the row just before the current row.

    MaxPrevEnd = MAX(td.EndDate) OVER (ORDER BY td.BegDate, td.EndDate, td.ID ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

    FROM

    #TestData td

    )

    SELECT

    av.ID,

    av.BegDate,

    av.EndDate,

    [.] = '||',

    -- If the current BegDate is <= MaxPrevEnd then add a day to MaxPrevEnd and make that the new current BegDate... If not, use the existing current BegDate.

    NewBeg = CASE WHEN av.BegDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.BegDate END,

    -- If the current EndDate is <= MaxPrevEnd then add a day to MaxPrevEnd and make that the new current EndDate... If not, use the existing current EndDate.

    NewEnd = CASE WHEN av.EndDate <= av.MaxPrevEnd THEN DATEADD(dd, 1, av.MaxPrevEnd) ELSE av.EndDate END

    -- If the net result of the two CASE expressions is that BegDate = EndDate, you know that it has been reduced to nothingness.

    FROM

    cte_AlteredValues av;
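For readers on pre-2012 versions, the same single-pass idea can be expressed procedurally. Here is a plain-Python sketch of the frame-based solution (the function name and tuple layout are my own; the dates follow the #TestData sample, but this is an illustration, not the T-SQL itself):

```python
from datetime import date, timedelta

def collapse(rows):
    """rows: (id, beg, end) tuples. Mirrors the query:
    MaxPrevEnd = MAX(EndDate) over all preceding rows (ordered by beg, end, id);
    NewBeg/NewEnd = MaxPrevEnd + 1 day whenever the original value is <= MaxPrevEnd."""
    out, max_prev_end = [], None
    for rid, beg, end in sorted(rows, key=lambda r: (r[1], r[2], r[0])):
        new_beg = beg if max_prev_end is None or beg > max_prev_end else max_prev_end + timedelta(days=1)
        new_end = end if max_prev_end is None or end > max_prev_end else max_prev_end + timedelta(days=1)
        out.append((rid, new_beg, new_end))
        max_prev_end = end if max_prev_end is None else max(max_prev_end, end)
    return out

# Same six rows as the #TestData sample above:
rows = [(1, date(2016, 1, 1), date(2016, 1, 10)), (2, date(2016, 1, 4), date(2016, 1, 7)),
        (3, date(2016, 1, 9), date(2016, 1, 17)), (4, date(2016, 2, 1), date(2016, 2, 20)),
        (5, date(2016, 2, 5), date(2016, 2, 25)), (6, date(2016, 1, 15), date(2016, 2, 10))]
for r in collapse(rows):
    print(r)  # row 2 collapses to a single day: (2016-01-11, 2016-01-11)
```

Running this reproduces the NewBeg/NewEnd columns of the results table shown earlier.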

Viewing 15 posts - 1 through 15 (of 21 total)
