February 11, 2016 at 8:57 am
Would appreciate help and/or hints with the following scenario:
Magazine subscriptions are registered on a per-subscriber basis with a subscription id, subscriber id, start date, and end date.
A subscriber can have any number of subscriptions. A subscription has a unique id, a start date, and an end date.
Subscriptions for a subscriber could be like in the image below, where each line describes a unique subscription with a start date and an end date.
How do I handle overlaps on a per-subscriber basis so that the result set described in the image below can be delivered?
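Since the attached image may not come through here, a made-up illustration of the idea (not the actual image):
Subscriptions for one subscriber:
2014-01-01 to 2014-03-01
2014-02-01 to 2014-05-01
2014-04-01 to 2014-06-01
2015-02-01 to 2015-04-01
Desired result (overlaps merged, gaps kept):
2014-01-01 to 2014-06-01
2015-02-01 to 2015-04-01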
Thank you in advance.
February 11, 2016 at 9:11 am
Here's a live topic on the subject.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 11, 2016 at 11:39 am
I set up some sample data (only one subscriber) and came up with the following:
WITH Subscriptions AS (
SELECT *
FROM (
VALUES
(1, CAST('2014-01-01' AS DATE), CAST('2014-03-01' AS DATE))
,(1, '2014-02-01', '2014-05-01')
,(1, '2014-04-01', '2014-06-01')
,(1, '2014-07-01', '2014-11-01')
,(1, '2014-08-01', '2014-10-01')
,(1, '2014-09-01', '2014-12-01')
,(1, '2015-02-01', '2015-04-01')
,(1, '2015-03-01', '2015-05-01')
,(1, '2015-06-01', '2015-07-01')
) v(subscriber_id, start_dt, end_dt)
)
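-- Unpivot each subscription into a start event (+1) and an end event (-1), then keep a running
-- total of the events per subscriber: while the total is above zero, the subscriber is covered.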
, subscription_status AS (
SELECT *
, CASE WHEN SUM(v.is_subscribed) OVER(PARTITION BY s.subscriber_id ORDER BY v.dt, v.is_subscribed DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > 0 THEN 1 ELSE 0 END AS subscribe_status
FROM Subscriptions s
CROSS APPLY (
VALUES
(s.start_dt, 1)
,(s.end_dt, -1)
) v(dt, is_subscribed)
)
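-- Standard islands trick: the difference between the two ROW_NUMBER()s stays constant within each
-- unbroken run of the same subscribe_status, so it works as a group identifier.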
, subscription_groups AS (
SELECT ss.subscriber_id, ss.start_dt, ss.end_dt, ss.subscribe_status
, ROW_NUMBER() OVER(PARTITION BY ss.subscriber_id ORDER BY ss.dt, ss.subscribe_status DESC) - ROW_NUMBER() OVER(PARTITION BY ss.subscriber_id, ss.subscribe_status ORDER BY dt) AS subscription_group
FROM subscription_status ss
)
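-- Collapse each covered group into a single row running from its earliest start to its latest end.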
SELECT sg.subscriber_id, MIN(sg.start_dt) AS start_dt, MAX(end_dt) AS end_dt
FROM subscription_groups sg
WHERE sg.subscribe_status = 1
GROUP BY sg.subscriber_id, sg.subscription_group
ORDER BY sg.subscriber_id, start_dt
I haven't done extensive testing on this, so I'm not sure how well it scales.
While the "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" isn't absolutely necessary, it will perform better than the default "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" as long as the number of rows processed is under a certain threshold.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 11, 2016 at 7:43 pm
Itzik Ben-Gan wrote about this quite a while back, but he took the post down and I haven't seen it reposted. It might still be around somewhere.
Anyway, I'm assuming that some decent scalability is needed, so here's a 10 million row table covering a period of 10 years, with 1 million subscribers each having about 10 subscriptions that last from 0 to 5 years. It takes about 63 seconds to build the rows and the 3 indexes this problem needs.
--=====================================================================================================================
-- Create a 10 million row test table.
--=====================================================================================================================
--===== If the test table exists, drop it to make reruns in SSMS easier
SET NOCOUNT ON;
IF OBJECT_ID(N'tempdb..#TestTable') IS NOT NULL
DROP TABLE #TestTable
;
GO
--===== Create and populate a 10 million row test table on the fly.
-- 1,000,000 random IDs with ~10 random date spans of 0 to 5 years each
WITH
cteGenDates AS
(
SELECT TOP 10000000
SubscriberID = ABS(CHECKSUM(NEWID()))%1000000+1
,StartDate = DATEADD(dd,ABS(CHECKSUM(NEWID()))%DATEDIFF(dd,'2010','2020'),'2010')
,Span = ABS(CHECKSUM(NEWID()))%(365*5)
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
)
SELECT SubscriptionID = IDENTITY(INT,1,1)
,SubscriberID
,StartDate
,EndDate = DATEADD(dd,Span,StartDate)
INTO #TestTable
FROM cteGenDates
ORDER BY StartDate --For a bit of realism
;
--===== Create the expected/needed indexes
ALTER TABLE #TestTable ADD CONSTRAINT PK_#TestTable PRIMARY KEY CLUSTERED (SubscriptionID);
CREATE UNIQUE INDEX IX_Begin_Unique ON #TestTable(SubscriberID, StartDate, SubscriptionID);
CREATE UNIQUE INDEX IX_End_Unique ON #TestTable(SubscriberID, EndDate , SubscriptionID);
So much for any magic on my part. The rest is all from Itzik. This solves for all spans for all subscribers in the entire table in about 22 seconds and saves the results in another table.
SET STATISTICS TIME,IO ON;
--===== If the results table already exists, drop it to make reruns easier.
IF OBJECT_ID('tempdb..#Results','U') IS NOT NULL
DROP TABLE #Results
;
--===== Solve the problem using Itzik's count up/count down method of grouping.
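-- In a nutshell (my reading of Itzik's method): C1 unpivots each subscription into a start event
-- (Type = +1, numbered S in start-date order) and an end event (Type = -1, numbered E in end-date
-- order, with the date shifted one day so back-to-back spans get packed together). C2 numbers all
-- events together as SE. C3 keeps only the events that open a packed span (as many ends as starts
-- before them) or close one (as many starts as ends up to and including them) and pairs them off
-- with GrpNum. The final SELECT turns each open/close pair back into a start and end date, undoing
-- the one-day shift.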
WITH
C1 AS
(
SELECT SubscriptionID
,SubscriberID
,TS = StartDate
,Type = +1
,E = NULL
,S = ROW_NUMBER() OVER (PARTITION BY SubscriberID ORDER BY StartDate, SubscriptionID)
FROM #TestTable
UNION ALL
SELECT SubscriptionID
,SubscriberID
,TS = EndDate+1
,Type = -1
,E = ROW_NUMBER() OVER(PARTITION BY SubscriberID ORDER BY EndDate, SubscriptionID)
,S = NULL
FROM #TestTable
)
,C2 AS
(
SELECT c1.*
,SE = ROW_NUMBER() OVER (PARTITION BY SubscriberID ORDER BY TS, Type DESC, SubscriptionID)
FROM C1 c1
)
,C3 AS
(
SELECT SubscriberID
,TS
,GrpNum = FLOOR((ROW_NUMBER() OVER(PARTITION BY SubscriberID ORDER BY TS)-1)/2+1)
FROM C2
WHERE COALESCE(S-(SE-S)-1, (SE-E)-E) = 0
)
SELECT SubscriberID
,StartDate = MIN(TS)
,EndDate = MAX(TS-1)
INTO #Results
FROM C3
GROUP BY SubscriberID, GrpNum
ORDER BY SubscriberID, StartDate
;
SET STATISTICS TIME,IO OFF;
You can change the "3" in the following code to compare results between the two tables.
--===== This displays the raw rows from subscriber #3 to compare with spans that follow.
DECLARE @SubscriberID INT = 3 --<<<< Change the 3 to compare other subscribers.
SELECT * FROM #TestTable WHERE SubscriberID = @SubscriberID ORDER BY SubscriberID,StartDate;
SELECT * FROM #Results WHERE SubscriberID = @SubscriberID ORDER BY SubscriberID,StartDate;
The really neat thing is that this code works in 2005 and up.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 12, 2016 at 2:55 am
Thanks a lot. Works like a dream. I mark this as the solution because I can apply it directly to my subscription table. Now I just have to go learn SQL :-) Any suggestions as to where I can do that at the level presented here? Thank you in advance.
February 12, 2016 at 2:57 am
Also, thank you to you, drew.allen. Your query also works well. Since this level of SQL is a bit beyond my present skill range, I had some difficulties running it on my subscription table. Not your fault, just my lack of skills :-)
February 12, 2016 at 2:59 am
Thank you for the link, ChrisM@Work.
February 12, 2016 at 8:34 pm
stig.benning (2/12/2016)
Thanks a lot. Works like a dream. I mark this as the solution because I can apply it directly to my subscription table. Now I just have to go learn SQL :-) Any suggestions as to where I can do that at the level presented here? Thank you in advance.
The SQL is the easy part. The math that Itzik used is brilliant (and surprisingly simple, which is part of why it's so brilliant) and is the key to this. Break the code up and run each cascading CTE in turn and see the results. The SQL will come auto-magically when you do that.
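For example, leave the WITH C1 ... C2 definitions exactly as they are, comment out C3 and the final SELECT (and the comma in front of C3), and finish with something like this to see the numbered events for one subscriber (the 3 is just whichever SubscriberID you want to inspect):
SELECT * FROM C2 WHERE SubscriberID = 3 ORDER BY TS, Type DESC;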
As for the test data, you can find out how to do such a thing in the following two articles...
http://www.sqlservercentral.com/articles/Data+Generation/87901/
http://www.sqlservercentral.com/articles/Test+Data/88964/
--Jeff Moden
Change is inevitable... Change for the better is not.
February 13, 2016 at 12:32 pm
Hi Jeff, Thanks for the links and your suggestion. And for solving my problem :-) Stig
February 13, 2016 at 12:37 pm
stig.benning (2/13/2016)
Hi Jeff, Thanks for the links and your suggestion. And for solving my problem :-) Stig
You bet. Thank you very much for the feedback.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 16, 2016 at 9:48 am
Hello Jeff,
Thanks for your scripts.
I had a gaps-and-islands method which was a tad slower, so I am happy with the script.
But I did make some alterations.
During testing I changed #TestTable to TestTable.
But that did not significantly change the timing.
Because in my situation the end dates are not always known, I added:
-- randomly add some open ended situations.
UPDATE TestTable SET EndDate = NULL WHERE CONVERT(int, StartDate)%113 = 23;
UPDATE TestTable SET EndDate = NULL WHERE CONVERT(int, StartDate)%113 = 42;
UPDATE TestTable SET EndDate = NULL WHERE CONVERT(int, StartDate)%113 = 57;
UPDATE TestTable SET EndDate = NULL WHERE CONVERT(int, StartDate)%113 = 97;
And made the following changes to the code:
-- For the enddate a coalesce(EndDate+1,'20401231') is used.
SELECT SubscriptionID
,SubscriberID
,TS = coalesce(EndDate+1,'20401231') -- Testing with open end dates
-- ,TS = EndDate+1
,Type = -1
,E = ROW_NUMBER() OVER(PARTITION BY SubscriberID ORDER BY coalesce(EndDate,'20401231'), SubscriptionID)
-- ,E = ROW_NUMBER() OVER(PARTITION BY SubscriberID ORDER BY EndDate, SubscriptionID)
,S = NULL
FROM TestTable
I did some trial runs with some variations.
The number of NULLs does influence the timing, but only to a limited extent.
Clearing the cache also influences the timing, but again only slightly.
(Strange effect: rerunning without clearing the cache made the timing worse. :ermm:)
The code as supplied ran in about 23 to 25 seconds (on my system).
With the open-ended end dates the timing became 66 to 85 seconds.
With an open start as well, the timing became about 122 seconds.
(The code change for the start date is similar, but with a 'low' date.)
Thanks for the code example.
Open EndDate:
Any suggestions on how to improve this code when adjusted for open end dates (NULLs)?
Open end dates are part of the design.
Open StartDate:
Open start dates were included as an exercise only.
Their occurrence is rare (only in some places in the design, and even then rare).
I am still working on situations where we want to select a specific period and the subscribers active at that moment.
And on subscribers active 'now'.
Any comments are welcome.
(I am thinking about two selections, one without the open end dates and one with, and a UNION of the two.)
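A rough sketch of what I mean, untested and just to show the shape of it, using my TestTable with the NULL end dates:
DECLARE @Moment datetime = GETDATE();

SELECT SubscriberID, StartDate, EndDate
FROM   TestTable
WHERE  StartDate <= @Moment
  AND  EndDate   >= @Moment   -- subscriptions with a known end date covering the moment
UNION ALL
SELECT SubscriberID, StartDate, EndDate
FROM   TestTable
WHERE  StartDate <= @Moment
  AND  EndDate IS NULL        -- open-ended subscriptions that had already started
;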
But for now my time is up, my wife expects me for supper :-D.
Ben
February 16, 2016 at 8:36 pm
ben.brugman (2/16/2016)
Still working on situations where we want to select a specific period and subscribers at that moment.
Please see the following article for that...
http://www.sqlservercentral.com/articles/T-SQL/105968/
You can easily warp it to make the "Previous Month" virtually any period you desire.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 18, 2016 at 1:44 am
Jeff Moden (2/16/2016)
ben.brugman (2/16/2016)
Still working on situations where we want to select a specific period and subscribers at that moment.
Please see the following article for that...
Did read your article. Thanks.
The problem is not writing 'a correct SQL selection'. The problem is having a fast (and of course correct) SQL solution.
During each (limited) period only a limited number of customers have a 'subscription'. Almost all 'subscriptions' are short-lived. The most frequently asked question is something about the current customers.
The main problem is that although most current customers have a startdate not far in the past and an enddate which is NULL or not far in the future, some customers have a startdate a long time ago and/or an enddate far in the future. Even when using indexes, there is no 'ready' way to index the startdate and/or enddate so that we get a fast solution. *)
The given solution is something to build on.
Ben
*)
In our legacy system we used to have a copy of the 'customers' table holding only the 'current' customers; it contained only a fraction of the customers. Maybe something with an indexed view which only contains the 'active' **) customers would be a solution (rough sketch below)?
**)
'active' customers is of course a bit more complex.
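Something like this is roughly what I have in mind for the indexed view (made-up permanent table name, and ignoring for now that 'active' is more complex than just an open end date):
CREATE VIEW dbo.CurrentSubscriptions
WITH SCHEMABINDING
AS
SELECT SubscriptionID, SubscriberID, StartDate
FROM   dbo.Subscriptions
WHERE  EndDate IS NULL
;
GO
-- The unique clustered index is what actually materializes the view.
CREATE UNIQUE CLUSTERED INDEX UCX_CurrentSubscriptions ON dbo.CurrentSubscriptions (SubscriptionID);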
February 18, 2016 at 1:55 am
ben.brugman (2/18/2016)
Jeff Moden (2/16/2016)
ben.brugman (2/16/2016)
Still working on situations where we want to select a specific period and subscribers at that moment.
Please see the following article for that...
Did read your article. Thanks.
The problem is not writing 'a correct SQL selection'. The problem is having a fast (and of course correct) SQL solution.
During each (limited) period only a limited number of customers have a 'subscription'. Almost all 'subscriptions' are short-lived. The most frequently asked question is something about the current customers.
The main problem is that although most current customers have a startdate not far in the past and an enddate which is NULL or not far in the future, some customers have a startdate a long time ago and/or an enddate far in the future. Even when using indexes, there is no 'ready' way to index the startdate and/or enddate so that we get a fast solution. *)
The given solution is something to build on.
Ben
*)
In our legacy system we used to have a copy of the 'customers' table holding only the 'current' customers; it contained only a fraction of the customers. Maybe something with an indexed view which only contains the 'active' **) customers would be a solution (?)
**)
'active' customers is of course a bit more complex.
Hi Ben
Most of the oldtimers around here will have worked with time-expired rows and will be in a position to help you with this. How about starting a new thread with some sample data thrown in? A whole bunch of sample data would be good if you're looking for performance. Depending on what you're trying to do, optimum indexing can be quite unexpected. For instance (from a recent gig), you might expect that an index on startdate, enddate would be fastest when in practice it could be enddate, startdate - but the optimiser won't use it if the slower startdate, enddate index exists...
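Just to illustrate what I mean (table and index names made up):
-- Two candidate indexes for "who was active during this period" queries;
-- which one the optimiser actually favours depends on the data and the query.
CREATE INDEX IX_Subs_Start_End ON dbo.Subscriptions (StartDate, EndDate) INCLUDE (SubscriberID);
CREATE INDEX IX_Subs_End_Start ON dbo.Subscriptions (EndDate, StartDate) INCLUDE (SubscriberID);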
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 18, 2016 at 9:00 am
ben.brugman (2/18/2016)
Jeff Moden (2/16/2016)
ben.brugman (2/16/2016)
Still working on situations where we want to select a specific period and subscribers at that moment.
Please see the following article for that...
Did read your article. Thanks.
The problem is not writing 'a correct SQL selection'. The problem is having a fast (and of course correct) SQL solution.
During each (limited) period only a limited number of customers have a 'subscription'. Almost all 'subscriptions' are short-lived. The most frequently asked question is something about the current customers.
The main problem is that although most current customers have a startdate not far in the past and an enddate which is NULL or not far in the future, some customers have a startdate a long time ago and/or an enddate far in the future. Even when using indexes, there is no 'ready' way to index the startdate and/or enddate so that we get a fast solution. *)
The given solution is something to build on.
Ben
*)
In our legacy system we used to have a copy of the 'customers' table holding only the 'current' customers; it contained only a fraction of the customers. Maybe something with an indexed view which only contains the 'active' **) customers would be a solution (?)
**)
'active' customers is of course a bit more complex.
Just to be sure, let's say we have a table with a million rows of subscriptions across, say, 5 years. What "period" would you be looking for out of that?
Also, the NULL end-date is always a problem for performance. It would be MUCH better to use '9999-01-01' for an indeterminate end-date than a NULL. MUCH better.
And, if that comes to pass, make sure you do NOT use '9999-12-31'. Because there is no valid date after that, it screws up some rather standard/convenient/accurate/fast code if you do.
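For example, against Ben's TestTable, something like this (just a sketch of the idea, not tested here):
--===== Replace the unknown end dates with a far-future sentinel so that EndDate+1 still works
     -- and the EndDate index stays useful for range predicates.
 UPDATE TestTable
    SET EndDate = '9999-01-01'
  WHERE EndDate IS NULL
;
--===== "Who is subscribed right now?" then becomes a simple closed-range predicate.
 SELECT SubscriberID, StartDate, EndDate
   FROM TestTable
  WHERE StartDate <= GETDATE()
    AND EndDate   >= GETDATE()
;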
--Jeff Moden
Change is inevitable... Change for the better is not.