Grouping based on contiguous dates

  • Hello All

    Not sure this doesn't qualify as a 'newbie' question, but it certainly is beyond my capabilities!

    I have an inventory-type problem where I need to track start and end dates for the state of particular items.

    For example I get monthy updates of products:

    IF OBJECT_ID('TempDB..#inventory_tbl','U') IS NOT NULL

    DROP TABLE #inventory_tbl

    CREATE TABLE #inventory_tbl(

    [Month] [smalldatetime] NOT NULL,

    [ItemID] [int] NOT NULL,

    [count A] [int] NULL,

    [count B] [int] NULL

    )

    insert into #inventory_tbl

    ([Month],[ItemID],[count A],[count B])

    SELECT 'Jan 1 2011 12:00AM','1','10','20' UNION ALL

    SELECT 'Jan 1 2011 12:00AM','2','100','200' UNION ALL

    SELECT 'Feb 1 2011 12:00AM','1','10','20' UNION ALL

    SELECT 'Feb 1 2011 12:00AM','2','100','200' UNION ALL

    SELECT 'Mar 1 2011 12:00AM','1','15','25' UNION ALL

    SELECT 'Mar 1 2011 12:00AM','2','100','250' UNION ALL

    SELECT 'Apr 1 2011 12:00AM','1','15','25' UNION ALL

    SELECT 'Apr 1 2011 12:00AM','2','200','250' UNION ALL

    SELECT 'May 1 2011 12:00AM','1','15','25' UNION ALL

    SELECT 'May 1 2011 12:00AM','2','200','250'

    so we have

    MonthItemIDCount ACount B

    2011-01-01 00:00:0011020

    2011-01-01 00:00:002100200

    2011-02-01 00:00:0011020

    2011-02-01 00:00:002100200

    2011-03-01 00:00:0011525

    2011-03-01 00:00:002100250

    2011-04-01 00:00:0011525

    2011-04-01 00:00:002200250

    2011-05-01 00:00:0011525

    2011-05-01 00:00:002200250

    So the result I am after would be:

    ItemID[Count A][Count B]Start Month End Month

    1102001/01/2011 00:0001/03/2011 00:00

    1152501/03/2011 00:00NULL

    210020001/01/2011 00:0001/03/2011 00:00

    210025001/03/2011 00:0001/04/2011 00:00

    220025001/04/2011 00:00NULL

    I have found some neat solutions here but 1) I don't know how to generalise to the case with several products and 2) I have more than 100 months of data so recursion seems not to be appropriate in this case (?).

    If anyone has some pointers as to where to go from here I would be most grateful!

    Philippe

  • so we have

    MonthItemIDCount ACount B

    2011-01-01 00:00:0011020

    2011-01-01 00:00:002100200

    2011-02-01 00:00:0011020

    2011-02-01 00:00:002100200

    2011-03-01 00:00:0011525

    2011-03-01 00:00:002100250

    2011-04-01 00:00:0011525

    2011-04-01 00:00:002200250

    2011-05-01 00:00:0011525

    2011-05-01 00:00:002200250

    So the result I am after would be:

    ItemID[Count A][Count B]Start MonthEnd Month

    1102001/01/201101/03/2011

    1152501/03/2011NULL

    210015001/01/201101/04/2011

    220015001/04/2011NULL

    Could you please check and correct your setup, and provide a bit more details:

    1. Your sample data setup contains three entries for Id: 1, Count A: 15, Count b: 25, Why [End Month] expected as NULL?

    2. There is data in setup for for Id: 2, Count A: 100, Count b: 200 and

    2, Count A: 200, Count b: 250.

    But in your expected results you don't have anything for them at all.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Apologies for that, I've edited the post with what the correct results should be.

    The start month is the first month that an item as a particular state (values for count A and Count B)

    The end month is the 1st day of the month AFTER the last appearance of the state. For example item 2 has values 100 & 200 up to and including February, so the end month is March. If there is no 'next month' the the value is NULL (indicating the current state for the item).

    Hope that makes more sense

    Philippe

  • See the link here

    http://www.sqlservercentral.com/articles/T-SQL/71550/

    WITH CTE AS (

    SELECT [Month],[ItemID],[count A],[count B],

    ROW_NUMBER() OVER(PARTITION BY [ItemID] ORDER BY [Month]) -

    ROW_NUMBER() OVER(PARTITION BY [ItemID],[count A],[count B] ORDER BY [Month]) AS rnDiff

    FROM #inventory_tbl),

    Results AS (

    SELECT [ItemID],

    [count A],

    [count B],

    MIN([Month]) AS [Start Month],

    ROW_NUMBER() OVER(PARTITION BY [ItemID] ORDER BY MIN([Month])) AS rn

    FROM CTE

    GROUP BY [ItemID],[count A],[count B],rnDiff)

    SELECT r1.[ItemID],

    r1.[count A],

    r1.[count B],

    r1.[Start Month],

    r2.[Start Month] AS [End Month]

    FROM Results r1

    LEFT OUTER JOIN Results r2 ON r2.[ItemID]=r1.[ItemID]

    AND r2.rn=r1.rn+1;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • If there is no 'next month' the the value is NULL (indicating the current state for the item).

    Which combination of id's in a sample data you gave has no "next month"?

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Brilliant. I don't understand it yet, but still brilliant. Many thanks for your help!

    Philippe

  • philippe-546556 (3/27/2012)


    Brilliant. I don't understand it yet, but still brilliant. Many thanks for your help!

    Philippe

    The "Group Islands of Contiguous Dates" article I referenced explains how this works.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Hi Eugene, to answer your question when records are in chronological order, if there are no more recent records with the same ID then that defines the current state of that ID and that will be used to define the state of that ID in the result set with 'next month' as NULL.

    Thanks Mark, it's a little above what I'm used to but I'll work through it thoroughly.

    Philippe

  • Your sample data 1:

    MonthItemIDCount ACount B

    2011-01-01 00:00:0011020

    2011-02-01 00:00:0011020

    Expected:

    ItemID[Count A][Count B]Start Month End Month

    1102001/01/2011 00:0001/03/2011 00:00

    Your sample data 2:

    MonthItemIDCount ACount B

    2011-04-01 00:00:002200250

    2011-05-01 00:00:002200250

    But Expected:

    ItemID[Count A][Count B]Start Month End Month

    220025001/04/2011 00:00NULL

    Why the second sample should show NULL as End Month?

    Both have two records per Id, Count A and Count B, both have "sequencial" dates! How the second sample is a different to the first one?

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Hi Eugene

    For item 1, there are records after February, so there needs to be an end month. For item 2 there are no more records after May so may is the current record and in the result set we set the last Month to NULL.

    does that make sense?

    Philippe

  • Hi Eugene

    For item 1, there are records after February, so there needs to be an end month. For item 2 there are no more records after May so may is the current record and in the result set we set the last Month to NULL.

    does that make sense?

    Philippe

  • I still couldn't get it from your explanation, but looks like I do see now what you want ...

    Lets create some more test data (around 1.3 millions records)

    declare @i int

    set @i= 2

    while @i < 262144

    begin

    insert #inventory_tbl select [Month],[ItemID]+@i,[count A],[count B] from #inventory_tbl

    set @i = @i*2

    end

    Now, SQL windowed functions are great, but can be slower sometimes, for larger datasets, than other solutions...

    The following code does the same, without use of windowed functions. It does run almost twice as faster:

    ;WITH CTE

    AS

    (

    SELECT [ItemID],[count A],[count B], MIN([Month]) [Start Month]

    FROM #inventory_tbl

    Group BY [ItemID],[count A],[count B]

    )

    SELECT r1.*, MIN(r2.[Start Month]) [End Date]

    FROM CTE r1

    LEFT JOIN CTE r2

    ON r2.ItemID = r1.ItemID AND r2.[Start Month] > r1.[Start Month]

    Group BY r1.[ItemID],r1.[count A],r1.[count B],r1.[Start Month]

    order by r1.[ItemID],r1.[count A],r1.[count B]

    There are couple things about the way with windowed function which do make it quite slower:

    1. To many of windowed functions used

    2. Use of r2.rn=r1.rn+1 in LEFT JOIN.

    Also, with a bit of tuning (eg. adding some indexes) you can get much better performance from both solutions...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Eugene Elutin (3/27/2012)


    I still couldn't get it from your explanation, but looks like I do see now what you want ...

    Lets create some more test data (around 1.3 millions records)

    declare @i int

    set @i= 2

    while @i < 262144

    begin

    insert #inventory_tbl select [Month],[ItemID]+@i,[count A],[count B] from #inventory_tbl

    set @i = @i*2

    end

    Now, SQL windowed functions are great, but can be slower sometimes, for larger datasets, than other solutions...

    The following code does the same, without use of windowed functions. It does run almost twice as faster:

    ;WITH CTE

    AS

    (

    SELECT [ItemID],[count A],[count B], MIN([Month]) [Start Month]

    FROM #inventory_tbl

    Group BY [ItemID],[count A],[count B]

    )

    SELECT r1.*, MIN(r2.[Start Month]) [End Date]

    FROM CTE r1

    LEFT JOIN CTE r2

    ON r2.ItemID = r1.ItemID AND r2.[Start Month] > r1.[Start Month]

    Group BY r1.[ItemID],r1.[count A],r1.[count B],r1.[Start Month]

    order by r1.[ItemID],r1.[count A],r1.[count B]

    There are couple things about the way with windowed function which do make it quite slower:

    1. To many of windowed functions used

    2. Use of r2.rn=r1.rn+1 in LEFT JOIN.

    Also, with a bit of tuning (eg. adding some indexes) you can get much better performance from both solutions...

    Try adding this into the original sample data

    SELECT 'Jun 1 2011 12:00AM','1','10','20' UNION ALL

    The results of the two queries are different.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • ...

    Try adding this into the original sample data

    SELECT 'Jun 1 2011 12:00AM','1','10','20' UNION ALL

    The results of the two queries are different.

    My query will only work if status never returns to something it was before (as per OP sample data).

    If it does, I would try to use "quirky update" method.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Cheers Eugene, yes I'd figured that one out at least, but that was precisely the problem; a simple grouping won't work because the months have to be contiguous, otherwise you can get groupings over non-contiguous periods if states can revert to previous states. I guess the 'quirky update' should be called the 'right solution' instead.

    Can I ask what you call 'windowed functions'?

    Philippe

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply