June 16, 2020 at 2:31 am
Hello all,
I would like to take input scripts and generate output data as given below. output is expected as if member is continuously enrolled without any gap in enrollment get min and max and if there is enrollment gap leave as is. for any given member there are no multiple spans just leave as is. for id 333 there are no multiple spans just leave as is and for id 222 bring one final line as given in the output data. for id 111 merge one span but not other span because one span is continues where as other spans is not continous , please refer output table for reference in the below scripts.
Any help is appreciated.
create table #test1
(id int,startdate datetime ,enddate datetime)
insert into #test1
select 111,'2016-12-01 00:00:00','2020-03-31 00:00:00'
insert into #test1
select 111,'2020-05-01 00:00:00','2020-05-31 00:00:00'
insert into #test1
select 111,'2020-06-01 00:00:00','2078-12-31 00:00:00'
insert into #test1
select 222,'2020-01-01 00:00:00','2020-02-29 00:00:00'
insert into #test1
select 222,'2020-03-01 00:00:00','2078-12-31 00:00:00'
insert into #test1
select 333,'2015-01-01 00:00:00', '2078-12-31 00:00:00'
--input table
select * from #test1
create table #test2
(id int,startdate datetime ,enddate datetime)
insert into #test2
select 111,'2016-12-01 00:00:00','2020-03-31 00:00:00'
insert into #test2
select 111,'2020-05-01 00:00:00','2078-12-31 00:00:00'
insert into #test2
select 222,'2020-01-01 00:00:00','2078-12-31 00:00:00'
insert into #test2
select 333,'2015-01-01 00:00:00', '2078-12-31 00:00:00'
--output table
select * from #test2?
June 16, 2020 at 8:12 am
The first thing i could think of seems to work but its not pretty i guess and im sure someone has a better solution, so lets see:
My Approach:
select id,startdate=min(startdate),enddate=max(enddate)
from (select *,Diff=isnull(DATEDIFF(DD,enddate,Test),1)
from (
select *,Test=LEAD(startdate) OVER(PARTITION BY ID ORDER BY ID,startdate)
from #test1
) Previous_Date
) Date_Difference
group by id,Diff
order by 1,3
I want to be the very best
Like no one ever was
June 16, 2020 at 12:39 pm
https://www.sqlservercentral.com/forums/topic/consolidate-overlapping-date-periods
SELECT ID, StartDate = MIN(ts), EndDate = MAX(ts)
FROM (
SELECT ID, ts, Snapper = (ROW_NUMBER() OVER(ORDER BY ID, ts)+1)/2
FROM (
SELECT *, se = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ts2, Type DESC)
FROM (
SELECT ID, ts = StartDate, ts2 = StartDate, Type = 1
,e = NULL
,s = (2*ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StartDate))-1
FROM #test1
UNION ALL
SELECT ID, ts = EndDate, ts2 = DATEADD(DAY,1,EndDate), Type = -1
,e = (2*ROW_NUMBER() OVER (PARTITION BY ID ORDER BY EndDate))
,s = NULL
FROM #test1
) d
) e
WHERE se IN (e,s)
) f
GROUP BY ID, Snapper
ORDER BY id, StartDate
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
June 16, 2020 at 2:04 pm
Thank you for your reply and help, this is working but I cannot understand why we are multiplying with 2* to the row number for e and s columns and then again /2 row number for snapper columns, it is so confusing.
Also I have 3 miliion rows of actual data, do you think this approach is good to follow and will work fine with any issues. Thanks again
June 16, 2020 at 3:14 pm
Thank you for your reply and help, this is working but I cannot understand why we are multiplying with 2* to the row number for e and s columns and then again /2 row number for snapper columns, it is so confusing.
Also I have 3 miliion rows of actual data, do you think this approach is good to follow and will work fine with any issues. Thanks again
Try testing with say 100,000 rows first to get an idea of performance - select them out into a temp table ordered by ID so that you get all members of each ID partition except perhaps the last. A clustered index on ID will help a little with the query. I can't say how long it will take. The query is explained in the link I posted earlier, but the 2* for a and s columns is self explanatory if you execute the innermost query - I wrote it like this rather than as chained CTE's so you could do this. Any further questions, just holler.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
June 16, 2020 at 5:52 pm
To resolve for arbitrary relationships in the data requires hierarchical recursion, no? The OP's question and Phil's too
drop table if exists dbo.test_1;
go
create table dbo.test_1(id int,startdate datetime ,enddate datetime)
go
insert dbo.test_1(id, startdate, enddate) values
(111,'2016-12-01 00:00:00','2020-03-31 00:00:00'),
(111,'2020-05-01 00:00:00','2020-05-31 00:00:00'),
(111,'2020-06-01 00:00:00','2078-12-31 00:00:00'),
(111,'2079-06-01 00:00:00','2080-12-31 00:00:00'),
(111,'2081-01-01 00:00:00','2083-11-21 00:00:00'),
(111,'2083-11-22 00:00:00','2086-12-31 00:00:00'),
(111,'2089-06-01 00:00:00','2090-12-31 00:00:00'),
(222,'2020-01-01 00:00:00','2020-02-29 00:00:00'),
(222,'2020-03-01 00:00:00','2078-12-31 00:00:00'),
(333,'2015-01-01 00:00:00', '2078-12-31 00:00:00');
drop function if exists dbo.test_downlines;
go
create function dbo.test_downlines(
@idint,
@startdate datetime)
returns table as
return
with
base_cte(parent_id, match_dt, id, startdate, enddate) as (
select
lag(id) over (partition by id order by id, startdate),
iif((lag(id) over (partition by id order by id, startdate) is null), null, dateadd(dd, -1, startdate)),
t1.*
from
dbo.test_1 t1),
recur_cte(parent_id, match_dt, id, startdate, enddate, h_level) as (
select
*,
cast(0 as int)
from
base_cte t1
where
id=@id
and startdate=@startdate
union all
select
t1.*,
rc.h_level+1
from
base_cte t1
join
recur_cte rc on t1.parent_id=rc.id
and t1.match_dt=rc.enddate)
select
max(rc.enddate) enddate
from
recur_cte rc;
go
;with
base_cte(id, startdate, enddate) as (
select t1.*
from
dbo.test_1 t1
where not exists
(select 1 from dbo.test_1 in_t1 where t1.id=in_t1.id and t1.startdate=dateadd(dd, 1, in_t1.enddate)))
select
bc.*, td.enddate range_end_dt
from
base_cte bc
cross apply
dbo.test_downlines(bc.id, bc.startdate) td;
Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können
June 16, 2020 at 8:29 pm
This seems to be a typical gaps & islands issue...here is another method:
Declare @test1 Table (id int, startdate datetime, enddate datetime);
Insert Into @test1 (id, startdate, enddate)
Values (111, '2016-12-01 00:00:00', '2020-03-31 00:00:00')
, (111,'2020-05-01 00:00:00','2020-05-31 00:00:00')
, (111,'2020-06-01 00:00:00','2078-12-31 00:00:00')
, (222,'2020-01-01 00:00:00','2020-02-29 00:00:00')
, (222,'2020-03-01 00:00:00','2020-05-31 00:00:00')
, (222,'2020-07-01 00:00:00','2078-12-31 00:00:00')
, (333,'2015-01-01 00:00:00', '2015-05-31 00:00:00')
, (333,'2015-06-01 00:00:00', '2015-12-31 00:00:00')
, (333,'2016-01-01 00:00:00', '2019-10-31 00:00:00')
, (333,'2020-01-01 00:00:00', '2078-12-31 00:00:00');
Declare @minStartDate datetime = (Select min(t.startdate) From @test1 t)
, @maxEndDate datetime = (Select max(t.enddate) From @test1 t);
With t (n)
As (
Select t.n
From (Values (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)) As t(n)
)
, dates (FromDate)
As (
Select Top (datediff(day, @minStartDate, @maxEndDate) + 1)
dateadd(day, checksum(row_number() over(Order By @@spid)) - 1, @minStartDate)
From t t1, t t2, t t3, t t4
)
, dateGroups
As (
Select *
, dateGroup = dateadd(day, -row_number() over(Partition By t1.id Order By d.FromDate), d.FromDate)
From @test1 t1
Inner Join dates d On d.FromDate Between t1.startdate And t1.enddate
)
Select dg.id
, StartDate = min(dg.startdate)
, EndDate = max(dg.enddate)
From dateGroups dg
Group By
dg.id
, dg.dateGroup
Order By
dg.id
, StartDate
I added a few additional rows for 222 and 333 - for 333 there are 3 rows that are continuous and the 4th row with a gap.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
June 17, 2020 at 9:50 am
Here's a little sample data generator for anyone wishing to play along:
IF OBJECT_ID('TEMPDB..#test1') IS NOT NULL DROP TABLE #test1
SELECT RowID = ROW_NUMBER() OVER(ORDER BY ID, Startdate, Enddate),
ID, Startdate, Enddate
INTO #test1
FROM (VALUES
(3565, '20111101', '20130128'),(3565, '20120101', '20130128'),(3565, '20120405', '20120630'),
(3565, '20120525', '20120630'),(3565, '20130125', '20130128'),(3565, '20140220', '20140313'),
(3577, '20101202', '20150120'),(3612, '20111101', '20130206'),(3612, '20111101', '20130807'),
(3612, '20111101', '20140731'),(3612, '20120823', '20120905'),(3612, '20120914', '20121031'),
(3612, '20121101', '20121128'),(3612, '20121206', '20121224'),(3612, '20121226', '20130109'),
(3612, '20130207', '20140731'),(3724, '20111101', '20130807'),(3724, '20120426', '20120508'),
(3724, '20120515', '20120531'),(3724, '20120524', '20120531'),(3724, '20120607', '20120706'),
(3724, '20120614', '20120713'),(3724, '20120802', '20120831'),(3724, '20120901', '20120930'),
(3730, '20130516', '20130605'),(3730, '20130516', '20130807'),(3730, '20130516', '20131219'),
(3730, '20130620', '20130704'),(3730, '20130711', '20130724'),(3730, '20130808', '20131219'),
(3730, '20131212', '20140612'),(3730, '20140501', '20140528'),(3730, '20140619', '20141217'),
(3730, '20140724', '20140820'),(3730, '20140814', '20140910'),(3730, '20140904', '20141001'),
(3730, '20140911', '20141008'),(3730, '20141002', '20141029'),(3730, '20141106', '20141203'),
(3730, '20141218', '20150617'),(3730, '20150129', '20150225'),(3730, '20150205', '20150304'),
(3730, '20150305', '20150401'),(3730, '20150402', '20150429'),(3730, '20150430', '20150527'),
(3730, '20150618', '20151223'),(3730, '20150723', '20150819'),(3730, '20150820', '20150916'),
(3730, '20150917', '20151014'),(3730, '20151022', '20151118'),(3730, '20160107', '20160203'),
(3730, '20160107', '20160629'),(3730, '20160204', '20160302'),(3730, '20160303', '20160316'),
(3730, '20160303', '20160406'),(3730, '20160317', '20160330'),(3730, '20160414', '20160629'),
(3730, '20160707', '20160803'),(3730, '20160804', '20160817'),(3730, '20160804', '20160831'),
(3730, '20160804', '20160914'),(3730, '20160901', '20160914'),(3730, '20160901', '20160928'),
(3730, '20160915', '20170201'),(3730, '20160915', '20170215'),(3730, '20161027', '20161109'),
(3730, '20161027', '20161123'),(3730, '20161124', '20161207'),(3730, '20161124', '20161221'),
(3730, '20170105', '20170201'),(3730, '20170119', '20170215'),(3730, '20170202', '20170215'),
(3730, '20170216', '20170510'),(3730, '20170427', '20170524'),(3730, '20170511', '20170719'),
(3730, '20170629', '20170630'),(3730, '20170720', '20170802'),(3730, '20170728', '20170729'),
(3730, '20170803', '20171025'),(3730, '20170817', '20171025'),(3730, '20171026', '20171206'),
(3730, '20171207', '20180103'),(3730, '20180104', '20180228'),(3730, '20180301', '20180523'),
(3730, '20180524', '20180801'),(3730, '20180802', '20181024'),(3730, '20181025', '20190619')
) d (ID, Startdate, Enddate)
CREATE CLUSTERED INDEX cx_Stuff ON #test1 (ID, StartDate)
SELECT * FROM #test1
My query generates the following output which looks correct:
IDStartDateEndDate
35652011-11-012013-01-28
35652014-02-202014-03-13
35772010-12-022015-01-20
36122011-11-012014-07-31
37242011-11-012013-08-07
37302013-05-162014-06-12
37302014-06-192015-12-23
37302016-01-072016-06-29
37302016-07-072019-06-19
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
June 17, 2020 at 9:53 am
Thank you for your reply and help, this is working but I cannot understand why we are multiplying with 2* to the row number for e and s columns and then again /2 row number for snapper columns, it is so confusing.
Also I have 3 miliion rows of actual data, do you think this approach is good to follow and will work fine with any issues. Thanks again
A sample data set of 4,000 rows completes in about 40ms using my query, on this arbitrary system.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
June 17, 2020 at 11:36 am
The OP wrote: "output is expected as if member is continuously enrolled without any gap in enrollment get min and max and if there is enrollment gap leave as is." Is the sample data above representative of the OP's question? None of the sample data provided by the OP has overlapping date ranges. Can a "membership" overlap itslef? The OP's definition of "continuous" appears to be the difference between startdate and enddate (between "memberships") is 1 day. If that condition isn't met then leave the row(s) as is. According to that definition my code appears to work and the result is:
idstartdateenddate range_end_dt
35652011-11-012013-01-282013-01-28
35652012-01-012013-01-282013-01-28
35652012-04-052012-06-302012-06-30
35652012-05-252012-06-302012-06-30
35652013-01-252013-01-282013-01-28
35652014-02-202014-03-132014-03-13
35772010-12-022015-01-202015-01-20
36122011-11-012013-02-062014-07-31
36122011-11-012013-08-072014-07-31
36122011-11-012014-07-312014-07-31
36122012-08-232012-09-052012-09-05
36122012-09-142012-10-312012-11-28
36122012-12-062012-12-242012-12-24
36122012-12-262013-01-092013-01-09
37242011-11-012013-08-072013-08-07
37242012-04-262012-05-082012-05-08
37242012-05-152012-05-312012-05-31
37242012-05-242012-05-312012-05-31
37242012-06-072012-07-062012-07-06
37242012-06-142012-07-132012-07-13
37242012-08-022012-08-312012-09-30
37302013-05-162013-06-052013-12-19
37302013-05-162013-08-072013-12-19
37302013-05-162013-12-192013-12-19
37302013-06-202013-07-042013-07-04
37302013-07-112013-07-242013-07-24
37302013-12-122014-06-122014-06-12
37302014-05-012014-05-282014-05-28
37302014-06-192014-12-172015-12-23
37302014-07-242014-08-202014-08-20
37302014-08-142014-09-102014-10-08
37302014-09-042014-10-012014-10-29
37302014-11-062014-12-032014-12-03
37302015-01-292015-02-252015-02-25
37302015-02-052015-03-042015-05-27
37302015-07-232015-08-192015-10-14
37302015-10-222015-11-182015-11-18
37302016-01-072016-02-032016-06-29
37302016-01-072016-06-292016-06-29
37302016-04-142016-06-292016-06-29
37302016-07-072016-08-032019-06-19
37302016-10-272016-11-092016-12-21
37302016-10-272016-11-232016-12-21
37302017-01-052017-02-012019-06-19
37302017-01-192017-02-152019-06-19
37302017-04-272017-05-242017-05-24
37302017-06-292017-06-302017-06-30
37302017-07-282017-07-292017-07-29
37302017-08-172017-10-252019-06-19
Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können
June 17, 2020 at 1:27 pm
The query could be without the cte
select
t1.id, cast(t1.startdate as date) startdate, cast(t1.enddate as date) enddate, cast(td.enddate as date) range_end_dt
from
dbo.test_1 t1
cross apply
dbo.test_downlines(t1.id, t1.startdate) td
where not exists
(select 1 from dbo.test_1 in_t1 where t1.id=in_t1.id and t1.startdate=dateadd(dd, 1, in_t1.enddate))
order by 1, 2;
Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können
June 17, 2020 at 1:42 pm
The query could be without the cte
select
t1.id, cast(t1.startdate as date) startdate, cast(t1.enddate as date) enddate, cast(td.enddate as date) range_end_dt
from
dbo.test_1 t1
cross apply
dbo.test_downlines(t1.id, t1.startdate) td
where not exists
(select 1 from dbo.test_1 in_t1 where t1.id=in_t1.id and t1.startdate=dateadd(dd, 1, in_t1.enddate))
order by 1, 2;
Can you write it without the function?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
June 17, 2020 at 2:27 pm
Can you write it without the function?
In theory it should be possible but maybe not easily. The tvf is parameterized so I think to remove it would require splitting it into two nested CROSS APPLY statements. Yes the OP asked to resolve the example in temp table and my code only works against a physical table. Fudged it because the OP also said "I have 3 miliion rows of actual data" which seems likely 🙂 to be in a physical table somewhere 🙂
Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können
June 23, 2020 at 3:05 pm
Did you know that a table must have a key. by definition? What you posted can never have a key goes all the columns can be NULL! Furthermore, an identifier cannot be numeric because you don't do calculations on it and it is measured on a nominal scale. We also have a date data type now, so there's no need to use the old Sybase datetime from 30 years ago. Nor should you be using the old row at a time. Insertion syntax from Sybase. Finally, I'm going to assume that you don't want an event to end before it starts. So you need a constraint. Remember, SQL is a declarative language. Most of the work is done in the DDL.
In short, you just described a deck of old punchcards written in SQL. Let's fix that:
CREATE TABLE Events
(event_id CHAR(3) NOT NULL,
event_start_date DATE NOT NULL,
PRIMARY KEY (event_id, event_start_date), -- not an option!!
event_end_date DATE NOT NULL,
CHECK(event_start_date <= event_end_date));
INSERT INTO Events
VALUES
('111', '2016-12-01', '2020-03-31'),
('111', '2020-05-01', '2020-05-31'),
('111', '2020-06-01', '2078-12-31'),
('222', '2020-01-01', '2020-02-29'),
('222', '2020-03-01', '2078-12-31')'
('333', '2015-01-01', '2078-12-31'),
Your mindset is still thinking of punchcards or magnetic tape files, where records are inserted one at a time. But SQL is based on tables and rose which are completely different. The table constructor, values, puts them all in at once and the insertion can be optimized.
Please read some of the books by Chris Date. He discusses why you should never have two tables with the same structure in the same schema. What you are doing is mimicking the way we did magnetic tape files back the 1960s. In SQL we can do a view, or update a base table. You are using SQL as if you were hanging a new tape on a new tape drive and doing a file merge. Your mindset is still locked in physical models of data.
My guess is you don't want that second table, but would rather update the existing Events base table. Google around for "islands and gaps" for various solutions. The lead () and lag () functions will be very useful for you. You might want to write a stored procedure that will find an existing event duration and do an update to it. If it qualifies
Please post DDL and follow ANSI/ISO standards when asking for help.
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply