March 20, 2015 at 9:51 am
Hi Guys,
I need help figuring this out. I am trying to find the beginning date from multiple date ranges, for example:
RowNumber  id       Begin Dt    End Dt
1          0793319  2014-09-02  2015-06-30
2          0793319  2013-09-03  2014-09-01
3          0793319  2012-09-04  2013-09-02
4          0793319  2011-09-06  2012-09-03
For this id: 0793319, my beginning date is 2011-09-06
RowNumber  id       Begin Dt    End Dt
1          0820349  2014-09-02  2015-06-30
2          0820349  2013-09-03  2014-09-01
3          0820349  2012-09-04  2013-09-02
4          0820349  2011-12-12  2012-07-03  -- not a continuous date range
For this id: 0820349, my beginning date is 2012-09-04

RowNumber  id       Begin Dt    End Dt
1          0820349  2014-09-02  2015-06-30
For this id: 0820349, my beginning date is 2014-09-02
To find the continuous range, you compare the beginning date in row 1 with the end date in row 2; if there is no break in dates, you then compare the row 2 beginning date with the row 3 end date, and you continue that way until the last row as long as there is no break.
There could be up to 12 date ranges per id to check for a break in dates; if there is a break, I need to display the beginning date of the last continuous range.
I cannot figure out the best way to do it. Please Help!!!
Monika R.
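A compact way to express the rule described above (a minimal sketch, assuming a table named #spans with placeholder columns ID, Begin_DT and End_DT, and SQL Server 2012 or later for LAG()) is to flag every row whose Begin_DT is not exactly one day after the previous End_DT for the same id, then take the latest flagged Begin_DT per id:

-- Minimal sketch; #spans(ID, Begin_DT, End_DT) is a placeholder table, requires SQL Server 2012+ for LAG()
WITH Flagged AS
(
    SELECT ID, Begin_DT, End_DT,
           -- 1 = this row starts a new continuous run (first row for the id, or a gap/overlap before it)
           CASE WHEN DATEADD(day, 1, LAG(End_DT) OVER (PARTITION BY ID ORDER BY Begin_DT)) = Begin_DT
                THEN 0 ELSE 1
           END AS IsRunStart
    FROM   #spans
)
SELECT ID,
       MAX(CASE WHEN IsRunStart = 1 THEN Begin_DT END) AS LastRunBegin_DT
FROM   Flagged
GROUP BY ID;

For the sample above, this should return 2011-09-06 for the first id and 2012-09-04 for the id with the broken range.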
March 20, 2015 at 10:44 am
I'm sure there's a prettier solution to this, but this should give you the start and end date of each contiguous group. If you want to see only groups with more than one record, you can filter it by group level.
CREATE TABLE #temp(ID varchar(30), BEGIN_DT datetime, END_DT datetime)
INSERT INTO #temp VALUES ('0793319','2014-09-02','2015-06-30'), ('0793319', '2013-09-03', '2014-09-01'),
('0793319', '2012-09-04', '2013-09-02'), ('0793319', '2011-09-06', '2012-09-03'),
('0820349', '2014-09-02', '2015-06-30'), ('0820349', '2013-09-03', '2014-09-01'),
('0820349','2012-09-04', '2013-09-02'), ('0820349','2011-12-12', '2012-07-03')
;WITH
TEMP_CTE_START_GROUP AS
(
    SELECT TEMP_ONE.ID, TEMP_ONE.BEGIN_DT, TEMP_ONE.END_DT,
           TEMP_TWO.BEGIN_DT AS TEMP_TWO_BEGIN_DT, 1 AS GROUP_LEVEL
    FROM #temp TEMP_ONE
    LEFT OUTER JOIN #temp TEMP_TWO
      ON TEMP_ONE.ID = TEMP_TWO.ID
     AND TEMP_ONE.BEGIN_DT = DATEADD(day, 1, TEMP_TWO.END_DT)
),
TEMP_CTE_RECURSE AS
(
    SELECT ID, BEGIN_DT, END_DT, GROUP_LEVEL
    FROM TEMP_CTE_START_GROUP
    WHERE TEMP_TWO_BEGIN_DT IS NULL
    UNION ALL
    SELECT START_GROUP.ID, START_GROUP.BEGIN_DT, GROUP_TWO.END_DT, START_GROUP.GROUP_LEVEL + 1
    FROM TEMP_CTE_RECURSE START_GROUP, TEMP_CTE_START_GROUP GROUP_TWO
    WHERE START_GROUP.ID = GROUP_TWO.ID
      AND START_GROUP.END_DT = DATEADD(day, -1, GROUP_TWO.BEGIN_DT)
),
TEMP_CTE_FINAL AS
(
    SELECT ID, BEGIN_DT, END_DT, GROUP_LEVEL,
           ROW_NUMBER() OVER (PARTITION BY ID, BEGIN_DT ORDER BY GROUP_LEVEL DESC) AS ROW_NUM
    FROM TEMP_CTE_RECURSE
)
SELECT * FROM TEMP_CTE_FINAL
WHERE ROW_NUM = 1
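As noted above, adding AND GROUP_LEVEL > 1 to that final WHERE clause keeps only the groups that span more than one row. And if all that's wanted is the single beginning date per id that the question asks for (the start of the latest continuous run), the final SELECT of the CTE chain can be collapsed; this is an illustrative variation, not part of the query as posted:

-- Start of the latest continuous run per ID; ROW_NUM = 1 leaves one row per contiguous group
SELECT ID, MAX(BEGIN_DT) AS LastRunBegin_DT
FROM TEMP_CTE_FINAL
WHERE ROW_NUM = 1
GROUP BY ID;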
March 20, 2015 at 11:10 am
Thank you ZZartin soooo much! It works beautifully:)
Have a great weekend:)
March 21, 2015 at 3:55 am
Just for fun, here are two alternative methods: a single-table-scan solution without recursion, and a dual self-join.
😎
USE tempdb;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'dbo.TBL_DATE_SAMPLE') IS NULL
BEGIN
CREATE TABLE dbo.TBL_DATE_SAMPLE
(
ROW_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED
,ID VARCHAR(30) NOT NULL
,BEGIN_DT DATETIME NOT NULL
,END_DT DATETIME NOT NULL
)
INSERT INTO dbo.TBL_DATE_SAMPLE(ID,BEGIN_DT,END_DT)
VALUES ('0793319','2014-09-02','2015-06-30')
,('0793319','2013-09-03','2014-09-01')
,('0793319','2012-09-04','2013-09-02')
,('0793319','2011-09-06','2012-09-03')
,('0820349','2014-09-02','2015-06-30')
,('0820349','2013-09-03','2014-09-01')
,('0820349','2012-09-04','2013-09-02')
,('0820349','2011-12-12','2012-07-03')
;
END
/**********************************************************
Self joining method
**********************************************************/
/* Adding a row_number for self-joining
*/
;WITH ROWNUMBER_DATA AS
(
SELECT
DS.ID
,ROW_NUMBER() OVER
(
PARTITION BY DS.ID
ORDER BY DS.BEGIN_DT
) AS DS_RID
,DS.BEGIN_DT
,DS.END_DT
FROM dbo.TBL_DATE_SAMPLE DS
)
/* Self-join X(from)-X-X(to)
*/
,INTERVAL_IDENTIFIED_DATA AS
(
SELECT
RD.ID
,RD.DS_RID
,CASE
WHEN RD.DS_RID = 1 THEN RD.BEGIN_DT
WHEN DATEDIFF(DAY,RD_FROM.END_DT,RD.BEGIN_DT) > 1 THEN RD.BEGIN_DT
END AS START_DT
,CASE
WHEN DATEDIFF(DAY,RD.END_DT,ISNULL(RD2.BEGIN_DT,RD.END_DT)) > 1 THEN RD.END_DT
WHEN RD2.ID IS NULL THEN RD.END_DT
END AS STOP_DT
FROM ROWNUMBER_DATA RD
LEFT OUTER JOIN ROWNUMBER_DATA RD2
ON RD.ID = RD2.ID
AND RD.DS_RID = RD2.DS_RID - 1
LEFT OUTER JOIN ROWNUMBER_DATA RD_FROM
ON RD.ID = RD_FROM.ID
AND RD.DS_RID = RD_FROM.DS_RID + 1
)
/* Add group identifiers for the final
aggregation to compress the result set
*/
,GROUPED_DATA AS
(
SELECT
IID.ID
,ROW_NUMBER() OVER
(
PARTITION BY IID.ID
ORDER BY IID.DS_RID
) + CASE
WHEN IID.STOP_DT IS NULL THEN 1
ELSE 0
END AS IID_GRP
,IID.START_DT
,IID.STOP_DT
FROM INTERVAL_IDENTIFIED_DATA IID
WHERE IID.START_DT IS NOT NULL
OR IID.STOP_DT IS NOT NULL
)
/* Filter entries which are either start
or end entries
*/
SELECT
GD.ID
,MAX(GD.START_DT)
,MAX(GD.STOP_DT)
FROM GROUPED_DATA GD
GROUP BY GD.ID
,GD.IID_GRP;
/**********************************************************
Unpivot and aggregation row shift method
**********************************************************/
/* Unpivot, group entries by marking start and end
*/
;WITH BASE_DATA AS
(
SELECT
ROW_NUMBER() OVER
(
PARTITION BY DS.ID
ORDER BY DS.BEGIN_DT
,DS.END_DT
) + UNX.DS_END AS DS_END_GRP
,DS.ID
,UNX.DS_START
,UNX.DS_END
,UNX.DS_DATE
FROM dbo.TBL_DATE_SAMPLE DS
CROSS APPLY (SELECT 1 AS DS_START, 0 AS DS_END, DS.BEGIN_DT AS DS_DATE UNION ALL
SELECT 0 AS DS_START, 1 AS DS_END, DS.END_DT AS DS_DATE) AS UNX
)
/* Flag start and end
*/
,MARKED_PERIODS AS
(
SELECT
BD.ID
,BD.DS_START
,BD.DS_END
,BD.DS_DATE
,CASE
WHEN MIN(BD.DS_DATE) OVER
(
PARTITION BY BD.ID
,BD.DS_END_GRP
) = MAX(BD.DS_DATE) OVER
(
PARTITION BY BD.ID
,BD.DS_END_GRP
) THEN 1
WHEN DATEDIFF(DAY,MIN(BD.DS_DATE) OVER
(
PARTITION BY BD.ID
,BD.DS_END_GRP
)
,MAX(BD.DS_DATE) OVER
(
PARTITION BY BD.ID
,BD.DS_END_GRP
)) > 1 THEN 1
ELSE 0
END AS P_FLAG
FROM BASE_DATA BD
)
/* Filter entries which are either start
or end entries
*/
,FINAL_SET AS
(
SELECT
MP.ID
,ROW_NUMBER() OVER
(
ORDER BY(SELECT NULL)
) + MP.DS_START AS GRP_ID
,CASE WHEN MP.DS_START = 1 THEN MP.DS_DATE END AS BEGIN_DT
,CASE WHEN MP.DS_END = 1 THEN MP.DS_DATE END AS END_DT
FROM MARKED_PERIODS MP
WHERE MP.P_FLAG = 1
)
/* Aggregation to compress the final set
*/
SELECT
FS.ID
,MAX(FS.BEGIN_DT)
,MAX(FS.END_DT)
FROM FINAL_SET FS
GROUP BY FS.ID
,FS.GRP_ID;
Results
ID        BEGIN_DT                END_DT
--------- ----------------------- -----------------------
0793319 2011-09-06 00:00:00.000 2015-06-30 00:00:00.000
0820349 2011-12-12 00:00:00.000 2012-07-03 00:00:00.000
0820349 2012-09-04 00:00:00.000 2015-06-30 00:00:00.000
Edit: Added self-join, code clean-up and comments
March 22, 2015 at 12:32 pm
One of the major faults with requirements like this is that the problem definition is based on given data and people code to only the conditions in the given data. In this case, it would appear that a contiguous set of temporal intervals is defined as those rows that meet the general formula of "PreviousEndDate + Exactly 1 Day = NextBeginDate". The solutions to that problem certainly meet that requirement but the requirement does not include any enforcement of the "PreviousEndDate + Exactly 1 Day = NextBeginDate" formula within the data itself. If there is no such enforcement at the table level, data that may still fall into the category of an "overlap", thus making the interval "contiguous", will someday creep into the data and none of the solutions I tested (ZZartin and Eirikur's "unpivot" solution) produce the expected answers when that happens.
You can see for yourself. The given test data has been modified as noted below so that it now contains overlapping temporal intervals (note that I changed the table to a temp table here and in all of the code tested)...
SET NOCOUNT ON;
IF OBJECT_ID(N'tempdb..#TestTable') IS NOT NULL
DROP TABLE #TestTable
;
GO
CREATE TABLE #TestTable
(
ROW_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED
,ID VARCHAR(30) NOT NULL
,Begin_DT DATETIME NOT NULL
,End_DT DATETIME NOT NULL
)
INSERT INTO #TestTable(ID,Begin_DT,End_DT)
VALUES ('0793319','2014-09-02','2015-06-30')
,('0793319','2013-09-03','2014-10-01') --Was 2014-09-01
,('0793319','2012-09-04','2013-09-02')
,('0793319','2011-09-06','2012-09-03')
,('0820349','2014-09-02','2015-06-30')
,('0820349','2013-09-03','2014-09-01')
,('0820349','2012-09-04','2013-09-02')
,('0820349','2011-12-12','2012-07-03')
;
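As a quick sanity check on that modification (an added sketch that assumes only the #TestTable definition just above), a simple self-join flags any pair of rows for the same ID whose ranges overlap, and the edited 2014-10-01 row shows up in it:

-- List pairs of rows for the same ID whose date ranges overlap one another
SELECT a.ID, a.Begin_DT, a.End_DT,
       b.Begin_DT AS Overlaps_Begin_DT, b.End_DT AS Overlaps_End_DT
FROM   #TestTable a
JOIN   #TestTable b
  ON   a.ID = b.ID
 AND   a.ROW_ID < b.ROW_ID            -- report each pair only once
 AND   a.Begin_DT <= b.End_DT
 AND   b.Begin_DT <= a.End_DT;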
ZZartin's code produces the following results...
ID Begin_DT End_DT GROUP_LEVEL ROW_NUM
------- ----------------------- ----------------------- ----------- -------
0793319 2011-09-06 00:00:00.000 2014-10-01 00:00:00.000 3 1
0793319 2014-09-02 00:00:00.000 2015-06-30 00:00:00.000 1 1
0820349 2011-12-12 00:00:00.000 2012-07-03 00:00:00.000 1 1
0820349 2012-09-04 00:00:00.000 2015-06-30 00:00:00.000 3 1
... and Eirikur's code produces the following results along with a warning...
ID
------------------------------ ----------------------- -----------------------
0793319 2011-09-06 00:00:00.000 2014-10-01 00:00:00.000
0793319 2014-09-02 00:00:00.000 2015-06-30 00:00:00.000
0820349 2011-12-12 00:00:00.000 2012-07-03 00:00:00.000
0820349 2012-09-04 00:00:00.000 2015-06-30 00:00:00.000
Warning: Null value is eliminated by an aggregate or other SET operation.
The desired answer, according to the given overlaps, would still be (with or without additional columns)...
ID Begin_DT End_DT
------- ----------------------- -----------------------
0793319 2011-09-06 00:00:00.000 2015-06-30 00:00:00.000
0820349 2011-12-12 00:00:00.000 2012-07-03 00:00:00.000
0820349 2012-09-04 00:00:00.000 2015-06-30 00:00:00.000
I remembered that Itzik Ben-Gan had provided an article on the "Solid Q" website that explained, in great detail, a very fast method for solving this type of problem. Unfortunately, they took that great article down when they changed websites a couple of years ago. The good news is that the well-documented code (look for "Itzik New 1: 17 Seconds" in the following article) still exists on the SQLMag site at http://sqlmag.com/blog/solutions-packing-date-and-time-intervals-puzzle along with several other solutions that are anywhere from only twice as slow to thousands of times slower.
I applied Itzik's code to this problem with a modifier to allow for up to 1 day of gap and still have temporal intervals be contiguous. When I ran it against the modified data that I provided above, it came up with the expected result (if the expectation is start-to-end overlaps) as follows...
ID Begin_DT End_DT
------- ----------------------- -----------------------
0793319 2011-09-06 00:00:00.000 2015-06-30 00:00:00.000
0820349 2011-12-12 00:00:00.000 2012-07-03 00:00:00.000
0820349 2012-09-04 00:00:00.000 2015-06-30 00:00:00.000
Of course, the correct result is only the first challenge with such code. Temporal Interval Overlap solutions can be quite challenging to get any performance out of. Ipso-facto, I did the normal thing to test and built a million row test table. Here's the code for that test table, which has the same form as the original test data but with random temporal intervals that may or may not overlap, as in real life. Also as normal, you'll find additional details about the data as comments in the code.
--===== If the test table exists, drop it to make reruns in SSMS easier
SET NOCOUNT ON;
IF OBJECT_ID(N'tempdb..#TestTable') IS NOT NULL
DROP TABLE #TestTable
;
GO
--===== Create and populate the test table on the fly.
-- 1 Million rows with 100,000 random IDs with ~10 random date spans of 0 to 5 years each
WITH
cteGenDates AS
(
SELECT TOP 1000000
ID = CAST(RIGHT('0000000'+CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(10)),7) AS VARCHAR(30))
,Begin_DT = DATEADD(dd,ABS(CHECKSUM(NEWID()))%DATEDIFF(dd,'2010','2020'),'2010')
,Span = ABS(CHECKSUM(NEWID()))%(365*5)
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
)
SELECT Row_ID = IDENTITY(INT,1,1)
,ID
,Begin_DT
,End_DT = DATEADD(dd,Span,Begin_DT)
INTO #TestTable
FROM cteGenDates
;
--===== Create the given Clustered Index
ALTER TABLE #TestTable ADD CONSTRAINT PK_#TestTable PRIMARY KEY CLUSTERED (Row_ID);
Here's the Itzik Ben-Gan solution modified to fit the modified requirements of this problem so that ANY overlap, as well as a gap of up to and including 1 day, will qualify as "contiguous" temporal intervals. Note that I also added a final "ORDER BY" to make the output easier to decipher. Surprisingly, that didn't add much to the total duration.
PRINT '========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================';
SET STATISTICS TIME,IO ON;
WITH
C1 AS -- one event row per interval boundary: Type +1 = start, -1 = end; End_DT+1 makes "end + 1 day = next begin" count as contiguous
(
SELECT Row_ID
,ID
,TS = Begin_DT
,Type = +1
,E = NULL
,S = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Begin_DT, Row_ID)
FROM #TestTable
UNION ALL
SELECT Row_ID
,ID
,TS = End_DT+1
,Type = -1
,E = ROW_NUMBER() OVER(PARTITION BY ID ORDER BY End_DT, Row_ID)
,S = NULL
FROM #TestTable
)
,C2 AS -- number all events per ID in time order (SE); at equal TS, starts sort ahead of ends
(
SELECT c1.*
,SE = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY TS, type DESC, ROW_id)
FROM C1 c1
)
,C3 AS -- keep only the events that open or close a packed group (nothing else active), then pair them off two-by-two
(
SELECT ID
,TS
,GrpNum = FLOOR((ROW_NUMBER() OVER(PARTITION BY ID ORDER BY TS)-1)/2+1)
FROM C2
WHERE COALESCE(S-(SE-S)-1, (SE-E)-E) = 0
)
SELECT ID
,Begin_DT = MIN(TS)
,End_DT = MAX(TS-1)
FROM C3
GROUP BY ID, GrpNum
ORDER BY ID,Begin_DT;
;
If you run that code against the million row test table, you find a rather wide and complex execution plan with 3 sorts, 2 of which contain 50% of the total cost, and a ton of parallelism (I'm running a 4 core I5 on my laptop). The statistics show the following resource usage.
========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================
Table '#TestTable__________________________________________________________________________________________________________000000000057'.
Scan count 10, logical reads 9928, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 18403 ms, elapsed time = 7876 ms.
Most impressive in Itzik's method is the relatively low number of reads that were actually done. The elapsed time includes the time to display the approximately 160-170K result rows (since the data is randomly constructed, the number of rows will vary when the test table is rebuilt) in the grid results mode.
Less than 8 seconds for this problem is pretty impressive and I'd likely leave it alone. Still, the sorts bother me, as they did Itzik, so I added the following indexes as he did.
--===== Create the same indexes that Itzik did
CREATE UNIQUE INDEX IX_Begin_Unique ON #TestTable(ID, Begin_DT, ROW_ID);
CREATE UNIQUE INDEX IX_End_Unique ON #TestTable(ID, End_DT , ROW_ID);
I reran the code and all of the parallelism disappeared from the execution plan. The 2 expensive sorts were now gone; only the final sort remained. The number of reads dropped by a very nice 27%, CPU dropped by a whopping 85%, and the elapsed time (which is the only thing that affects user perception) was almost cut in half.
========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#TestTable__________________________________________________________________________________________________________000000000057'.
Scan count 2, logical reads 7208, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2715 ms, elapsed time = 4101 ms.
Removing the final sort (it probably wouldn't be there in production) did decrease the duration by about 4/10ths of a second and the CPU usage by about 1.5/10ths of a second, but that cost is almost trivial compared to the clarity the ordering lends to the result for the human reader.
========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#TestTable__________________________________________________________________________________________________________000000000057'.
Scan count 2, logical reads 7208, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2574 ms, elapsed time = 3751 ms.
Still, the performance on the original table without the additional indexes was (and you won't hear me say this often) probably good enough and, considering that the additional 2 indexes nearly triple the space requirements for the table and add to index maintenance times, insert times, and backup and restore times, you might want to leave them off until the table grows much larger or a new requirement comes up that can only perform well in the presence of those indexes.
{Edit} Spelling corrections in the form of word changes ("anyway" to "anywhere", for example)
--Jeff Moden
Change is inevitable... Change for the better is not.
March 22, 2015 at 1:34 pm
As a bit of a sidebar, storing the "guaranteed to be 7 digits" ID in a VARCHAR(anything) data type is not what I would do. At the very least, VARCHAR adds two bytes to every entry to record the length of the entry. On what is supposed to be a fixed 7 digits, that's a 28% waste of space, not to mention the bit of CPU it takes to read the size of each row during processing. The "ID" column, if the bloody formatting of leading zeros is absolutely required, should be a CHAR(7).
But that brings me to the next subject: storing formatted data in the database. It would appear, in this case, that the ID column will always hold an integer value. Converting the column to an INTEGER produces a major reduction (4 bytes instead of the original 9) in the basic storage requirement. If you include the two additional indexes, that's a savings of (9*3)-(4*3) or 15 bytes per row, or 15 MILLION bytes in a million-row table.
Yeah... I know. Everyone thinks disk space is cheap. Consider this, though... every backup of this table is 15 MILLION bytes bigger than it needs to be, and that's just for this table. I assume that there will be other tables that reference this one and, possibly, have indexes on those columns as well. Since this is some form of "user ID", I can well imagine it being used just about everywhere. How much space difference will all that add up to? And none of that includes what may occur in the log file.
Still, you might have disk space to burn along with the money it takes to power them, cool them, house them, back them up, and maintain them. A couple of the more important considerations for performance in any database engine are "right sizing" and using the correct datatype.
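For reference, here is a sketch of what the INT-keyed version of the million-row build might look like; it uses the same random-data approach as before with only the ID expression changed, and it's a reconstruction rather than necessarily the exact script behind the timings below:

--===== Sketch: same build as before, but with an INT ID instead of a zero-padded VARCHAR(30)
IF OBJECT_ID(N'tempdb..#TestTable') IS NOT NULL
        DROP TABLE #TestTable
;
GO
WITH
 cteGenDates AS
(
 SELECT TOP 1000000
         ID       = ABS(CHECKSUM(NEWID()))%100000+1   -- INT id
        ,Begin_DT = DATEADD(dd,ABS(CHECKSUM(NEWID()))%DATEDIFF(dd,'2010','2020'),'2010')
        ,Span     = ABS(CHECKSUM(NEWID()))%(365*5)
   FROM sys.all_columns ac1
  CROSS JOIN sys.all_columns ac2
)
 SELECT  Row_ID  = IDENTITY(INT,1,1)
        ,ID
        ,Begin_DT
        ,End_DT  = DATEADD(dd,Span,Begin_DT)
   INTO #TestTable
   FROM cteGenDates
;
ALTER TABLE #TestTable ADD CONSTRAINT PK_#TestTable PRIMARY KEY CLUSTERED (Row_ID);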
If we change that nasty ol' zero-padded ID column to an INT, here's the performance we get without the two indexes we add after the fact...
========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================
Table '#TestTable__________________________________________________________________________________________________________000000000059'. Scan count 10, logical reads 8186, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 10878 ms, elapsed time = 5116 ms.
... and here's the performance we get with the added indexes...
========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#TestTable__________________________________________________________________________________________________________000000000058'.
Scan count 2, logical reads 5467, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1826 ms, elapsed time = 2968 ms.
In both cases, we got substantial reductions in the number of Reads, CPU usage, and Elapsed Time. In fact, now that the performance without the extra indexes is so much better, I'd feel even better about leaving them off. Like Dwain Camps has in his signature line, "INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?".
And for those getting ready to recite Knuth's warning that "premature optimization is the root of all evil", I agree with that, except this has absolutely nothing to do with any form of premature optimization. It's just good programming practice, so start practicing the right way. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
March 22, 2015 at 4:53 pm
Just brilliant work as always Jeff!
Your version of Itzik's solution is good, but it still leaves some room for improvement; here is my "adjustment" ;-)
😎
PRINT '========== Eirikur''s version of Itzek Ben-Gan solution modified by Jeff Moden =======================================================';
SET STATISTICS TIME,IO ON;
WITH
C1 AS
(
SELECT Row_ID
,ID
,TS = Begin_DT
,Type = +1
,E = NULL
,S = ROW_NUMBER() OVER (PARTITION BY CHECKSUM(ID) ORDER BY DATEDIFF(DAY,0,Begin_DT), Row_ID)
FROM dbo.TBL_DATE_SAMPLE
UNION ALL
SELECT Row_ID
,ID
,TS = End_DT+1
,Type = -1
,E = ROW_NUMBER() OVER(PARTITION BY CHECKSUM(ID) ORDER BY DATEDIFF(DAY,0,End_DT), Row_ID)
,S = NULL
FROM dbo.TBL_DATE_SAMPLE
)
,C2 AS
(
SELECT c1.*
,SE = ROW_NUMBER() OVER (PARTITION BY CHECKSUM(ID) ORDER BY TS, type DESC, ROW_id)
FROM C1 c1
)
,C3 AS
(
SELECT ID
,TS
,GrpNum = FLOOR((ROW_NUMBER() OVER(PARTITION BY CHECKSUM(ID) ORDER BY TS)-1)/2+1)
FROM C2
WHERE COALESCE(S-(SE-S)-1, (SE-E)-E) = 0
)
SELECT ID
,Begin_DT = MIN(TS)
,End_DT = MAX(TS-1)
FROM C3
GROUP BY ID, GrpNum
ORDER BY ID,Begin_DT;
;
PRINT '========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================';
SET STATISTICS TIME,IO ON;
WITH
C1 AS
(
SELECT Row_ID
,ID
,TS = Begin_DT
,Type = +1
,E = NULL
,S = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Begin_DT, Row_ID)
FROM dbo.TBL_DATE_SAMPLE
UNION ALL
SELECT Row_ID
,ID
,TS = End_DT+1
,Type = -1
,E = ROW_NUMBER() OVER(PARTITION BY ID ORDER BY End_DT, Row_ID)
,S = NULL
FROM dbo.TBL_DATE_SAMPLE
)
,C2 AS
(
SELECT c1.*
,SE = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY TS, type DESC, ROW_id)
FROM C1 c1
)
,C3 AS
(
SELECT ID
,TS
,GrpNum = FLOOR((ROW_NUMBER() OVER(PARTITION BY ID ORDER BY TS)-1)/2+1)
FROM C2
WHERE COALESCE(S-(SE-S)-1, (SE-E)-E) = 0
)
SELECT ID
,Begin_DT = MIN(TS)
,End_DT = MAX(TS-1)
FROM C3
GROUP BY ID, GrpNum
ORDER BY ID,Begin_DT;
;
The output
========== Eirikur's version of Itzek Ben-Gan solution modified by Jeff Moden =======================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'TBL_DATE_SAMPLE'. Scan count 10, logical reads 9996, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 12634 ms, elapsed time = 4204 ms.
========== Itzek Ben-Gan solution modified by Jeff Moden =======================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'TBL_DATE_SAMPLE'. Scan count 10, logical reads 9996, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 23977 ms, elapsed time = 7472 ms.
March 22, 2015 at 7:58 pm
Thanks, Eirikur, but the brilliance is not mine. I'm just translating the work of the truly brilliant Itzik Ben-Gan.
Speaking of brilliant, using CHECKSUM() to convert the VARCHAR(30) ID to an INT, and DATEDIFF() to convert the DATETIME values to integer day numbers, is brilliant in itself for the performance improvement it buys. Well done.
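For anyone following along, here is a tiny illustration of what those two conversions actually produce; it is an example only, not part of Eirikur's code:

SELECT CHECKSUM('0820349')                              AS ID_AsInt          -- integer hash of the VARCHAR id
      ,DATEDIFF(DAY, 0, CAST('2014-09-02' AS DATETIME)) AS Date_AsDayNumber  -- whole days since 1900-01-01
;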
There are, however, some cautions to those techniques to note.
Neither CHECKSUM() nor BINARY_CHECKSUM() is guaranteed to produce unique values even if their operands are unique. For example (and it's just one example... you can find many more on the web), these two different values produce the same result for CHECKSUM().
SELECT CS1 = CHECKSUM('A352KD')
,CS2 = CHECKSUM('A352NT')
;
Result:
CS1 CS2
----------- -----------
141500177 141500177
Yes, I agree that there is little to no chance of this happening when converting a digits-only VARCHAR(), but it's still something to be aware of.
Another thing to be aware of is that the conversions prevent the 2 additional indexes (when deployed) from being totally effective. Although the scans are shifted from the clustered index to the narrower indexes, parallelism is still rampant and the 2 sorts are still present.
Although it would be my preference to not include the 2 indexes in the short term (the "good enough" thing, again), there may come a time when they are desirable, yet they would provide no benefit because of the conversions inside the ROW_NUMBER() calls; the indexing is ignored much as it would be for a non-SARGable query.
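If the indexes ever do become necessary, one possible workaround (a sketch only, untested here, with made-up column and index names) would be to persist the conversions as computed columns and index those instead, which keeps the integer comparisons while giving the optimizer something it can actually use:

-- Hypothetical: materialize the conversions so they can be indexed (names are illustrative)
ALTER TABLE dbo.TBL_DATE_SAMPLE ADD ID_INT       AS CONVERT(INT, ID)           PERSISTED;
ALTER TABLE dbo.TBL_DATE_SAMPLE ADD BEGIN_DAYNUM AS DATEDIFF(DAY, 0, BEGIN_DT) PERSISTED;
CREATE UNIQUE INDEX IX_Begin_Int ON dbo.TBL_DATE_SAMPLE (ID_INT, BEGIN_DAYNUM, ROW_ID);

Of course, that trades back some of the space the INT conversion saved, so it only earns its keep if the sorts actually show up as a problem.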
--Jeff Moden
Change is inevitable... Change for the better is not.
March 23, 2015 at 8:49 am
You Guys are the BEST!!! Many helpful tips and pointers!:) So much detail, thank you all very very much.
Thank you Eirikur Eiriksson, Jeff Moden & ZZartin.
March 23, 2015 at 9:38 am
REYES_MONIKA (3/23/2015)
You Guys are the BEST!!! Many helpful tips and pointers!:) So much detail, thank you all very very much. Thank you Eirikur Eiriksson, Jeff Moden & ZZartin.
Heh... like they say, "The devil is in the details". Thank you very much for the feedback. Let us know how it all works out for you.
--Jeff Moden
Change is inevitable... Change for the better is not.