finding min/max while data remains unchanged

  • What I have: id|data|date

    1|A|'1/28/08'

    1|B|'2/02/08'

    1|C|'2/03/08'

    1|C|'2/09/08'

    1|C|'2/22/08'

    1|C|'5/11/08'

    1|A|'6/01/08'

    1|D|'7/15/08'

    1|C|'7/21/08'

    1|C|'8/19/08'

    1|A|'9/17/08'

    What I want:id|data|date1|date2

    1|A|'1/28/08'|'2/02/08'

    1|B|'2/02/08'|'2/03/08'

    1|C|'2/03/08'|'6/01/08'

    1|A|'6/01/08'|'7/15/08'

    1|D|'7/15/08'|'7/21/08'

    1|C|'7/21/08'|'7/17/08'

    1|A|'9/17/08'|null

    In other words, I want to collapse the data by removing duplicates *only when the data changes from a previous record* and maintain the timeline. The "date2" column is simply "date1" of the next row. MIN() and MAX() seem like logical choices, except GROUP BY does not preserve the date stamp. Any suggestions?

  • Oblio Leitch (12/4/2008)


    What I have: id|data|date

    1|A|'1/28/08'

    1|B|'2/02/08'

    1|C|'2/03/08'

    1|C|'2/09/08'

    1|C|'2/22/08'

    1|C|'5/11/08'

    1|A|'6/01/08'

    1|D|'7/15/08'

    1|C|'7/21/08'

    1|C|'8/19/08'

    1|A|'9/17/08'

    What I want:id|data|date1|date2

    1|A|'1/28/08'|'2/02/08'

    1|B|'2/02/08'|'2/03/08'

    1|C|'2/03/08'|'6/01/08'

    1|A|'6/01/08'|'7/15/08'

    1|D|'7/15/08'|'7/21/08'

    1|C|'7/21/08'|'7/17/08'

    1|A|'9/17/08'|null

    In other words, I want to collapse the data by removing duplicates *only when the data changes from a previous record* and maintain the timeline. The "date2" column is simply "date1" of the next row. MIN() and MAX() seem like logical choices, except GROUP BY does not preserve the date stamp. Any suggestions?

    declare @t table( id int, data char(1), date datetime)

    insert into @t( id,data,date)

    select 1,'A','1/28/08'

    union all select 1,'B','2/02/08'

    union all select 1,'C','2/03/08'

    union all select 1,'C','2/09/08'

    union all select 1,'C','2/22/08'

    union all select 1,'C','5/11/08'

    union all select 1,'A','6/01/08'

    union all select 1,'D','7/15/08'

    union all select 1,'C','7/21/08'

    union all select 1,'C','8/19/08'

    union all select 1,'A','9/17/08'

    --select * from @t order by date

    ;with cte

    AS(

    select * , row_number() over(partition by id order by date, data) rn

    from @t

    )

    select c1.id , c1.data

    , c1.date as date1

    ,(select min(c3.date)

    from cte c3

    where c1.id = c3.id

    and c1.rn < c3.rn

    and c1.data <> c3.data) date2

    from cte c1 left outer join cte c2

    on c2.rn +1 = c1.rn

    where c1.data <> c2.data or c2.data is null

    order by c1.date


    * Noel

  • Excellent! Thank you. It looks like I was on the right track. I had the row_count() in there, but I was trying to use < rather than just stepping back down to the previous row. And, I knew I had to do a subquery to fetch the second date, but I was trying to do it looking forward. Just couldn't get my brain around it. Thank you for the assistance.

    Some changes I made to fit my case - I added the id in the join to scale to multiple ids; I added additional data fields to check against. To do so, I OR'd them in the <> checks, and changed the IS NULL to look at the id rather than the data. Also, I ran into trouble with trying to compare data when it was null, so in my cte I ISNULL'd the data columns to be an empty string.

  • Um... I'm thinking that's not what you want. Look at the ACTUAL execution plan for that... tell me where, with only 11 rows in the original table, you come up with a 121 row scan. That's right... accidental cross join. 11 * 11 = 121. What do you think will happen with 10,000 rows in the original table? Right again... 100,000,000. That's 100 MILLION internal rows being generated just for a lousy 10,000 row table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This will fly... even in the face of a million rows...

    --===== Create the same test data as before.

    -- This is NOT part of the solution.

    declare @t table( id int, data char(1), date datetime)

    insert into @t( id,data,date)

    select 1,'A','1/28/08'

    union all select 1,'B','2/02/08'

    union all select 1,'C','2/03/08'

    union all select 1,'C','2/09/08'

    union all select 1,'C','2/22/08'

    union all select 1,'C','5/11/08'

    union all select 1,'A','6/01/08'

    union all select 1,'D','7/15/08'

    union all select 1,'C','7/21/08'

    union all select 1,'C','8/19/08'

    union all select 1,'A','9/17/08'

    --===== Create the "Work" table

    SELECT ISNULL(ROW_NUMBER() OVER(ORDER BY ID,Date, Data),0) AS lRowNum,

    ISNULL(ROW_NUMBER() OVER(ORDER BY ID,Date, Data)-1,0) AS rRowNum,

    *

    INTO #Work

    FROM @t

    ALTER TABLE #Work

    ADD PRIMARY KEY CLUSTERED (lRowNum)

    --===== "Smear" the dates down in the work table

    DECLARE @PrevId INT, @PrevData CHAR(1), @PrevDate DATETIME

    SELECT @PrevId = 0, @PrevData = '', @PrevDate = 0

    UPDATE #Work

    SET @PrevDate = Date = CASE WHEN ID = @PrevID AND Data = @PrevData THEN @PrevDate ELSE Date END,

    @PrevID = ID,

    @PrevData = Data

    FROM #Work WITH (INDEX(0))

    --===== Select the final desired result set

    SELECT l.ID, l.Data, l.Date AS Date1, r.Date AS Date2

    FROM #Work l

    LEFT OUTER JOIN #Work r

    ON l.lRowNum = r.rRowNum

    WHERE l.Data <> r.Data OR r.Data IS NULL

    ORDER BY l.lRowNum

    --===== Housekeeping for reruns...

    DROP TABLE #Work

    If you want to know how the data "smear" actually works, please see the following article... it uses the same basic principle as the running total...

    [font="Arial Black"]

    Solving the "Running Total" & "Ordinal Rank" Problems in SS 2k/2k5 (click me)[/font][/url]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Actually, a couple of tweaks to noel's version will make it much better:

    Drop table #YourTable

    GO

    Create table #YourTable(

    id int

    ,data char(1)

    ,date datetime

    ,RowNum int Identity(1,1)

    , primary key (id, RowNum, date))

    GO

    insert into #YourTable(id,data,date)

    Select *

    From (Select 1 as [id],'A' as [data],'1/28/08' as [date]

    union all select 1,'B','2/02/08'

    union all select 1,'C','2/03/08'

    union all select 1,'C','2/09/08'

    union all select 1,'C','2/22/08'

    union all select 1,'C','5/11/08'

    union all select 1,'A','6/01/08'

    union all select 1,'D','7/15/08'

    union all select 1,'C','7/21/08'

    union all select 1,'C','8/19/08'

    union all select 1,'A','9/17/08') D

    select c1.id

    , c1.data

    , c1.date as date1

    , (select min(c3.date)

    from #YourTable c3

    where c1.id = c3.id

    and c1.RowNum < c3.RowNum

    and c1.data <> c3.data) date2

    from #YourTable c1

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (12/7/2008)


    Actually, a couple of tweaks to noel's version will make it much better:

    You need to check the output on that... 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    That looks pretty good, but I was just working with a reduced dataset as an example. I need to expand it to use groups of IDs. I need ,ISNULL(ROW_NUMBER() OVER(PARTITION BY [ID] ORDER BY [ID],[Date],[Data]),0) "lRowNum"

    andADD PRIMARY KEY CLUSTERED ([ID],[lRowNum])

    However, from there, I need to add it in the final join as well.

  • Dang, every time I see that "NOT EXISTS using a JOIN" trick, I think that it's a No-Op and I take it out... 🙁

    Anyway, this corrected version gives correct results, while still retaining the improved performance.

    Drop table #YourTable

    GO

    Create table #YourTable(

    id int

    ,data char(1)

    ,date datetime

    ,RowNum int Identity(1,1)

    , primary key (id, RowNum, date))

    GO

    insert into #YourTable(id,data,date)

    Select *

    From (Select 1 as [id],'A' as [data],'1/28/08' as [date]

    union all select 1,'B','2/02/08'

    union all select 1,'C','2/03/08'

    union all select 1,'C','2/09/08'

    union all select 1,'C','2/22/08'

    union all select 1,'C','5/11/08'

    union all select 1,'A','6/01/08'

    union all select 1,'D','7/15/08'

    union all select 1,'C','7/21/08'

    union all select 1,'C','8/19/08'

    union all select 1,'A','9/17/08') D

    select c1.id

    , c1.data

    , c1.date as date1

    , (select min(c3.date)

    from #YourTable c3

    where c1.id = c3.id

    and c1.RowNum < c3.RowNum

    and c1.data <> c3.data) date2

    from #YourTable c1

    left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum

    where c1.data <> c2.data or c2.data is null

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (12/8/2008)


    Dang, every time I see that "NOT EXISTS using a JOIN" trick, I think that it's a No-Op and I take it out... 🙁

    Anyway, this corrected version gives correct results, while still retaining the improved performance.

    Drop table #YourTable

    GO

    Create table #YourTable(

    id int

    ,data char(1)

    ,date datetime

    ,RowNum int Identity(1,1)

    , primary key (id, RowNum, date))

    GO

    insert into #YourTable(id,data,date)

    Select *

    From (Select 1 as [id],'A' as [data],'1/28/08' as [date]

    union all select 1,'B','2/02/08'

    union all select 1,'C','2/03/08'

    union all select 1,'C','2/09/08'

    union all select 1,'C','2/22/08'

    union all select 1,'C','5/11/08'

    union all select 1,'A','6/01/08'

    union all select 1,'D','7/15/08'

    union all select 1,'C','7/21/08'

    union all select 1,'C','8/19/08'

    union all select 1,'A','9/17/08') D

    select c1.id

    , c1.data

    , c1.date as date1

    , (select min(c3.date)

    from #YourTable c3

    where c1.id = c3.id

    and c1.RowNum < c3.RowNum

    and c1.data <> c3.data) date2

    from #YourTable c1

    left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum

    where c1.data <> c2.data or c2.data is null

    Uh, huh... now all ya gotta do is make it account for dupage... 😀

    Drop table #YourTable

    GO

    Create table #YourTable(

    id int

    ,data char(1)

    ,date datetime

    ,RowNum int Identity(1,1)

    , primary key (id, RowNum, date))

    GO

    insert into #YourTable(id,data,date)

    Select *

    From (Select 1 as [id],'A' as [data],'1/28/08' as [date]

    union all select 1,'B','2/02/08'

    union all select 1,'C','2/03/08'

    union all select 1,'C','2/09/08'

    union all select 1,'C','2/22/08'

    union all select 1,'C','5/11/08'

    union all select 1,'A','6/01/08'

    union all select 1,'D','7/15/08'

    union all select 1,'C','7/21/08'

    union all select 1,'C','8/19/08'

    union all select 1,'A','9/17/08') D

    [font="Arial Black"]--===== Add a helmet full of duplicates ;-)

    WHILE @@ROWCOUNT <= 1000

    INSERT INTO #YourTable

    SELECT id,data,date FROM #YourTable[/font]

    select c1.id

    , c1.data

    , c1.date as date1

    , (select min(c3.date)

    from #YourTable c3

    where c1.id = c3.id

    and c1.RowNum < c3.RowNum

    and c1.data <> c3.data) date2

    from #YourTable c1

    left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum

    where c1.data <> c2.data or c2.data is null

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Oblio Leitch (12/8/2008)


    Jeff,

    That looks pretty good, but I was just working with a reduced dataset as an example. I need to expand it to use groups of IDs. I need ,ISNULL(ROW_NUMBER() OVER(PARTITION BY [ID] ORDER BY [ID],[Date],[Data]),0) "lRowNum"

    andADD PRIMARY KEY CLUSTERED ([ID],[lRowNum])

    However, from there, I need to add it in the final join as well.

    Ok... see the link in my signature below for how to post readily consumable data and, using the new data, post what the output should look like. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply