T-SQL : Slicing time Islands and gaps

Question

Post reply

T-SQL : Slicing time Islands and gaps

MrRobot

Old Hand

Points: 368

June 24, 2022 at 9:55 pm

#4054597

Hello,

Consider the following table that gives the validity dates of an item :

 create table #table1(
 DD_S int,
 DF_S int,
 id_prod int,
 is_valid bit
 )
 insert into #table1
 values
 (20190101,20190601,10,0),
 (20190601,20190901,10,0),
 (20190901,20200701,10,0),
 (20200701,20211001,10,0),
 (20211001,20211101,10,0),
 (20211101,20220401,10,1),
 (20220401,20291231,10,0),
 (20190101,20200201,15,1),
 (20200201,20301101,15,0)

Now consider the following table that gives the identifier of a item throughout time:

create table #table2(
  DD int,
  DF int,
  identifier int,
  id_prod int
 )
 
 insert into #table2
 VALUES
  (20190101,20211001,5,10),    
  (20211001,20211101,4,10),
  (20211101,20291231,5,10), 
  (20190101,20190501,6,15),
  (20190501,20400101,7,15)

The goal is to 'slice' the first table to associate to each Item , while taking into account the time periods. The result should look like this:

 DD_S          DF_S      id_Prod   identifier        DD           DF        IS_Valid
 20190101    20190601    10          5             20190101    20211001       0
 20190601    20190901    10          5             20190101    20211001       0
 20190901    20200701    10          5             20190101    20211001  0
 20200701    20211001    10          5             20190101    20211001       0
 20211001    20211101    10          4             20211001    20211101       0
 20211101    20220401    10          5             20211101    20291231       1
 20220401    20291231    10          5             20211101    20291231       0
 20190101    20190501    15          6             20190101    20190501       1
 20190501    20200201    15          6             20190501    20400101       1  
 20200201    20301101    15          7             20190501    20400101       0

The idea is Slicing time periods by related time

Any idea or best solution to do that ?

Thanks

This topic was modified 2 years, 5 months ago by MrRobot.
This topic was modified 2 years, 5 months ago by MrRobot.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

Steve Collins SSCrazy Points: 2169 More actions · Answer 1

Step1: convert the dates from INT to DATE

In #table1 the date ranges overlap. How to get rid of the overlaps? My code updates t1 (my guess is the ranges are inclusive of the DF_S date)

Next: confirm the basic triage of the rows in #table1. Some rows in t1 are completely covered by a row t2. Some rows are undefined by t2. Some rows need to be split

drop table if exists #table1;
go
create table #table1(
 DD_S int,
 DF_S int,
 id_prod int,
 is_valid bit)

 insert into #table1
 values
 (20190101,20190601,10,0),
 (20190601,20190901,10,0),
 (20190901,20200701,10,0),
 (20200701,20211001,10,0),
 (20211001,20211101,10,0),
 (20211101,20220401,10,1),
 (20220401,20291231,10,0),
 (20190101,20200201,15,1),
 (20200201,20301101,15,0);

drop table if exists #table2;
go
 create table #table2(
  DD int,
  DF int,
  identifier int,
  id_prod int)
 
 insert into #table2
 VALUES
  (20190101,20211001,5,10),    
  (20211001,20211101,4,10),  
  (20190101,20190501,6,15),
  (20190501,20400101,7,15);

/* use sql date type */
drop table if exists #table1_dt;
go
create table #table1_dt(
 DD_S date,
 DF_S date,
 id_prod int,
 is_valid bit)

drop table if exists #table2_dt;
go
 create table #table2_dt(
  DD date,
  DF date,
  identifier int,
  id_prod int)

insert #table1_dt
select datefromparts(DD_S/10000, DD_S/100%100, DD_S%100),
       datefromparts(DF_S/10000, DF_S/100%100, DF_S%100),
       id_prod, is_valid
from #table1;

insert #table2_dt
select datefromparts(DD/10000, DD/100%100, DD%100),
       datefromparts(DF/10000, DF/100%100, DF%100),
       identifier, id_prod
from #table2;


/* update t1 to eliminate overlaps per id_prod */
update t
  set DD_S=dateadd(month, 1, t.DD_S)
from #table1_dt t
     join #table1_dt tt on t.id_prod=tt.id_prod
                           and t.DD_S=tt.DF_S

/* triage rows in t1 */
with 
covered_cte as (
    select o.*, t.identifier
    from #table1_dt o
         join #table2_dt t on o.id_prod=t.id_prod
                           and o.DD_S between t.DD and t.DF
                           and o.DF_S between t.DD and t.DF),
undef_cte as (
    select o.*, 0 identifier
    from #table1_dt o
    where not exists (select 1 
                      from covered_cte sc
                      where sc.id_prod=o.id_prod
                            and sc.DD_S=o.DD_S
                            and sc.DF_S=o.DF_S)
          and not exists(select 1
                     from #table2_dt t
                     where t.id_prod=o.id_prod
                           and (t.DD between o.DD_S and o.DF_S
                             or t.DF between o.DD_S and o.DF_S))),
to_be_split_cte as (
    select DD_S, DF_S, id_prod, is_valid
    from #table1_dt
    except
    select DD_S, DF_S, id_prod, is_valid
    from undef_cte
    except
    select DD_S, DF_S, id_prod, is_valid
    from covered_cte)
select 'covered', * from covered_cte
union all
select 'undef', * from undef_cte
union all
select 'to_be_split', *, 0 from to_be_split_cte
order by 4, 2;

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

MrRobot Old Hand Points: 368 More actions · Answer 2

Hi @steve-2 Collins

Thanks for the help.

Your query doesn't resolve my problem i need to cover the date of overlapping for example the id_pord=15 as bellow that i want

 20190101    20190501    15          6             20190101    20190501       1
 20190501 20200201    15          7             20190501    20400101       1  
 20200201 20301101    15          7             20190501    20400101       0

This a query that it give me the right result but i want something dynamically i have a lot of identifier ..

 with t2 as
 (
     select DD, DF, identifier, id_prod
     from #table2
     union all
     select max(DF), 99991231, 5, id_prod
     from #table2
     group by id_prod
 ),
 C as
 (
     select t1.*, t2.identifier,
         iif(t1.DD_S >= t2.DD, t1.DD_S, t2.DD) as DD,
         iif(t1.DF_S <= t2.DF, t1.DF_S, t2.DF) as DF
     from #table1 t1
     inner join t2 on t2.id_prod = t1.id_prod
 )
 select DD_S, DF_S, id_prod, identifier, DD, DF, is_valid
 from C
 where DD < DF
 order by id_prod, DD

Thanks

This reply was modified 2 years, 5 months ago by MrRobot.

Steve Collins SSCrazy Points: 2169 More actions · Answer 3

You're saying "split my data" but you've provided 9 rows of input data of which 8 rows do not (seemingly) require splitting. Does my query correctly separate rows which DO NOT need to be split from rows which DO need to be split? Why in the CTE of the query you just posted would it apply identifier=5 to id_prod=15 when in the sample data provided initially it's associated with id_prod=10?

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

MrRobot Old Hand Points: 368 More actions · Answer 4

Sorry i forgot a line in sample data (just updated)

the query that i shared doesn't work perfectly but just i want to giving an idea i am trying to solve that problem .

if you see the output that i want to problem is caused with id_prod , i want to cover the gap between table 1 and table 2

DD_S DF_S id_Prod identifier DD DF IS_Valid

20190101 20190501 15 6 20190101 20190501 1

20190501 20200201 15 7 20190501 20400101 1

20200201 20301101 15 7 20190501 20400101 0

Thanks

This reply was modified 2 years, 5 months ago by MrRobot.
This reply was modified 2 years, 5 months ago by MrRobot.

Jeff Moden SSC Guru Points: 1004403 More actions · Answer 5

One of the fastest methods ever created for doing this type of thing is documented in the following article.

https://blogs.solidq.com/en/sqlserver/packing-intervals/

It's an older article so solution #3 can be made work in SS 2017 but have never tested that solution. I HAVE tested solution #2 in the past and it's nasty fast. It's also comparatively simple. I just don't have the time today to write code to prove it. 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

MrRobot Old Hand Points: 368 More actions · Answer 6

Hi,

I did the query as bellow i don't know if will perfectly work for another data but with my sample data is work fine just one problem when i do the union the first query doesn't work i don't know why but when i execute the query one by one i have the result.

Can you please tell me why the union doesn't work and if you have any elegant solution it will be great

;with cte as(
     SELECT  a.[DD_S]
           ,a.[DF_S]
           ,a.[id_Prod]
           ,b.[identifier]
           ,b.[DD] 
   , b.[DF]
   ,LAG([DD_S]) over (partition by a.id_prod order by DD_S) as LagDD_S
   ,LAG([DF_S]) over (partition by a.id_prod order by DD_S) as LagDF_S
   ,LEAD([DD_S]) over (partition by a.id_prod order by DD_S) as LeadDD_S
   ,LEAD([DD]) over (partition by a.id_prod order by DD_S) as LeadDD

       FROM table1 a 
       join table2 b
     on a.[id_Prod]=b.[id_Prod] 
 )

   select  DF as DD_S,LagDD_S as DF_S,id_prod,identifier  from cte
  where 
  1=1
   and DF<LagDF_S
   and LeadDD_S is null   /*Doesn't work but when i execute the query alone i have a result*/
 union
 select DD_S,DF_S,id_prod,identifier from cte 
 where  ([DD_S]>=[DD] and [DF_S]<=[DF])
 union
 select DD_S,LeadDD as DF_S,id_prod,identifier  from cte
 where LagDD_S is null 
  and  ([DD_S]<=[DD] and [DF_S]>=[DF])
  order by id_prod,dd_s

Thanks

This reply was modified 2 years, 5 months ago by MrRobot.

Jeff Moden SSC Guru Points: 1004403 More actions · Answer 7

MrRobot wrote:

... and if you have any elegant solution it will be great

I posted a link in my previous response as to what I think "the" elegant solution would be. You just have to adapt it to your needs.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Collins SSCrazy Points: 2169 More actions · Answer 8

MrRobot wrote:

Hi,
I did the query as bellow i don't know if will perfectly work for another data but with my sample data is work fine just one problem when i do the union the first query doesn't work i don't know why but when i execute the query one by one i have the result.
Can you please tell me why the union doesn't work and if you have any elegant solution it will be great
Thanks

The updated code still contains the INTEGER dates. Imo it's mandatory to use SQL dates. Consider the UPDATE statements to make the date ranges unique (DD to DF - wise). Just to add 1 month to an INTEGER date would be a gigantic waste of needless complexity. Imo you should switch ALL dates stored as INTEGER to SQL dates everywhere in the database and never go backwards again

The approach of getting it to work with 1 single case and then expanding the number of inputs seems likely to be frustrating and unlikely to work out well. It's better imo to use fully representative input from the outset. It's a challenging query or so it seems. The article Jeff listed is definitely applicable. On top of the packing issue it appears the 'identifier' value gets "carried through" from the adjacent #table2 range(s)

When I re-run my code with the additional t2 input row it now seems to correctly classify 8/9ths of the #table1 rows. The remaining row needs splitting

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Steve Collins SSCrazy Points: 2169 More actions · Answer 9

To make more realistic (I guess) I added an additional row to t2 so the split requires 3 rows instead of 2. Also, I added a UNIQUE CLUSTERED INDEX on the DD/DF columns in both t1 and t2 tables.

drop table if exists #table1;
go
create table #table1(
 DD_S int,
 DF_S int,
 id_prod int,
 is_valid bit)

 insert into #table1
 values
 (20190101,20190601,10,0),
 (20190601,20190901,10,0),
 (20190901,20200701,10,0),
 (20200701,20211001,10,0),
 (20211001,20211101,10,0),
 (20211101,20220401,10,1),
 (20220401,20291231,10,0),
 (20190101,20200201,15,1),
 (20200201,20301101,15,0);

drop table if exists #table2;
go
 create table #table2(
  DD int,
  DF int,
  identifier int,
  id_prod int)
 
 insert into #table2
 VALUES
  (20190101,20211001,5,10),    
  (20211001,20211101,4,10),
  (20211101,20291231,5,10), 
  (20190101,20190501,6,15),
  (20190501,20190601,7,15), /* added row */  (20190601,20400101,8,15)
  --(20190101,20190501,6,15),
  --(20190501,20400101,7,15);

/* use sql date type */drop table if exists #table1_dt;
go
create table #table1_dt(
 DD_S date,
 DF_S date,
 id_prod int,
 is_valid bit)

drop table if exists #table2_dt;
go
 create table #table2_dt(
  DD date,
  DF date,
  identifier int,
  id_prod int)

insert #table1_dt
select datefromparts(DD_S/10000, DD_S/100%100, DD_S%100),
       datefromparts(DF_S/10000, DF_S/100%100, DF_S%100),
       id_prod, is_valid
from #table1;

insert #table2_dt
select datefromparts(DD/10000, DD/100%100, DD%100),
       datefromparts(DF/10000, DF/100%100, DF%100),
       identifier, id_prod
from #table2;


/* update t1 to eliminate overlaps per id_prod */update t
  set DD_S=dateadd(month, 1, t.DD_S)
from #table1_dt t
     join #table1_dt tt on t.id_prod=tt.id_prod
                           and t.DD_S=tt.DF_S;
create unique clustered index ndx_dd_df on #table1_dt(DD_S, DF_S);

/* update t2 to eliminate overlaps per id_prod */update t
  set DD=dateadd(month, 1, t.DD)
from #table2_dt t
     join #table2_dt tt on t.id_prod=tt.id_prod
                           and t.DD=tt.DF;
create unique clustered index ndx_dd_df on #table2_dt(DD, DF);

This query expands the non-covered DD_S to DF_S range from t1 and OUTER APPLY's a correlated subquery which expands the DD to DS range from t2.

with 
covered_cte as (
    select o.*, t.identifier
    from #table1_dt o
         join #table2_dt t on o.id_prod=t.id_prod
                           and o.DD_S between t.DD and t.DF
                           and o.DF_S between t.DD and t.DF),
undef_cte as (
    select o.*, 0 identifier
    from #table1_dt o
    where not exists (select 1 
                      from covered_cte sc
                      where sc.id_prod=o.id_prod
                            and sc.DD_S=o.DD_S
                            and sc.DF_S=o.DF_S)
          and not exists(select 1
                     from #table2_dt t
                     where t.id_prod=o.id_prod
                           and (t.DD between o.DD_S and o.DF_S
                             or t.DF between o.DD_S and o.DF_S))),
to_be_split_cte as (
    select DD_S, DF_S, id_prod, is_valid
    from #table1_dt
    except
    select DD_S, DF_S, id_prod, is_valid
    from undef_cte
    except
    select DD_S, DF_S, id_prod, is_valid
    from covered_cte)
select 'to_be_split', s.*, 0, calc.dt,
       min_range.start_dd min_start_dd, min_range.end_df min_end_df,
       max_range.start_dd max_start_dd, max_range.end_df max_end_df,
       ott.identifier ott_id
from to_be_split_cte s
     /* locate t2 range starting point */     outer apply (select top(1) tt.DD, tt.DF
                  from #table2_dt tt
                  where tt.id_prod=s.id_prod
                        and tt.DF>=s.DD_S
                  order by tt.DF) min_range(start_dd, end_df) 
     /* locate t2 range ending point */     outer apply (select top(1) tt.DD, tt.DF
                  from #table2_dt tt
                  where tt.id_prod=s.id_prod
                        and tt.DD<=s.DF_S
                  order by tt.DF desc) max_range(start_dd, end_df) 
     /* expand t1 date range */     cross apply dbo.fnTally(0, datediff(month, s.DD_S, s.DF_S)) fn
     cross apply (values (dateadd(month, fn.n, s.DD_S))) calc(dt)
     /* outer join/apply expanded t2 range(s) */
     outer apply (select tt.*, tt_calc.dt
                  from #table2_dt tt
                       /* expand t2 date range(s) */
                       cross apply dbo.fnTally(0, datediff(month, tt.DD, tt.DF)) fn
                       cross apply (values (dateadd(month, fn.n, tt.DD))) tt_calc(dt)
                  where tt.id_prod=s.id_prod
                        and tt_calc.dt=calc.dt) ott;

This query uses GROUP BY to summarize the results from above. UNION ALL with the "covered" rows

with 
covered_cte as (
    select o.*, t.identifier
    from #table1_dt o
         join #table2_dt t on o.id_prod=t.id_prod
                           and o.DD_S between t.DD and t.DF
                           and o.DF_S between t.DD and t.DF),
undef_cte as (
    select o.*, 0 identifier
    from #table1_dt o
    where not exists (select 1 
                      from covered_cte sc
                      where sc.id_prod=o.id_prod
                            and sc.DD_S=o.DD_S
                            and sc.DF_S=o.DF_S)
          and not exists(select 1
                     from #table2_dt t
                     where t.id_prod=o.id_prod
                           and (t.DD between o.DD_S and o.DF_S
                             or t.DF between o.DD_S and o.DF_S))),
to_be_split_cte as (
    select DD_S, DF_S, id_prod, is_valid
    from #table1_dt
    except
    select DD_S, DF_S, id_prod, is_valid
    from undef_cte
    except
    select DD_S, DF_S, id_prod, is_valid
    from covered_cte)
select min(calc.dt) DD_S,
       max(calc.dt) DF_F,
       s.id_prod, s.is_valid, ott.identifier
from to_be_split_cte s
     /* locate t2 range starting point */     outer apply (select top(1) tt.DD, tt.DF
                  from #table2_dt tt
                  where tt.id_prod=s.id_prod
                        and tt.DF>=s.DD_S
                  order by tt.DF) min_range(start_dd, end_df) 
     /* locate t2 range ending point */     outer apply (select top(1) tt.DD, tt.DF
                  from #table2_dt tt
                  where tt.id_prod=s.id_prod
                        and tt.DD<=s.DF_S
                  order by tt.DF desc) max_range(start_dd, end_df) 
     /* expand t1 date range */     cross apply dbo.fnTally(0, datediff(month, s.DD_S, s.DF_S)) fn
     cross apply (values (dateadd(month, fn.n, s.DD_S))) calc(dt)
     /* outer join/apply expanded t2 range(s) */
     outer apply (select tt.*, tt_calc.dt
                  from #table2_dt tt
                       /* expand t2 date range(s) */
                       cross apply dbo.fnTally(0, datediff(month, tt.DD, tt.DF)) fn
                       cross apply (values (dateadd(month, fn.n, tt.DD))) tt_calc(dt)
                  where tt.id_prod=s.id_prod
                        and tt_calc.dt=calc.dt) ott
group by s.id_prod, s.is_valid, ott.identifier
union all
select * from covered_cte
order by id_prod, DD_S;

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Steve Collins SSCrazy Points: 2169 More actions · Answer 10

The function dbo.fnTally can be found and explained here

CREATE FUNCTION [dbo].[fnTally]
/**********************************************************************************************************************
    Jeff Moden Script on SSC: https://www.sqlservercentral.com/scripts/create-a-tally-function-fntally
**********************************************************************************************************************/
        (@ZeroOrOne BIT, @MaxN BIGINT)
RETURNS TABLE WITH SCHEMABINDING AS 
 RETURN WITH
  H2(N) AS ( SELECT 1 
               FROM (VALUES
                     (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
                    )V(N))            --16^2 or 256 rows
, H4(N) AS (SELECT 1 FROM H2 a, H2 b) --16^4 or 65,536 rows
, H8(N) AS (SELECT 1 FROM H4 a, H4 b) --16^8 or 4,294,967,296 rows
            SELECT N = 0 WHERE @ZeroOrOne = 0 UNION ALL
            SELECT TOP(@MaxN)
                   N = ROW_NUMBER() OVER (ORDER BY N)
              FROM H8
;

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

MrRobot Old Hand Points: 368 More actions · Answer 11

Hi,

Just to answer to your question about why i have int date because it's a fact table ,

Thanks for new code. I will test that in my sample data then in my fact table, hope that query doesn't take me a lot of time

Well,i let you know .

Thanks

This reply was modified 2 years, 5 months ago by MrRobot.