finding min/max while data remains unchanged

Question

finding min/max while data remains unchanged

Oblio Leitch

SSCrazy

Points: 2137
More actions
December 4, 2008 at 1:19 pm

#128331

What I have: id|data|date
1|A|'1/28/08'
1|B|'2/02/08'
1|C|'2/03/08'
1|C|'2/09/08'
1|C|'2/22/08'
1|C|'5/11/08'
1|A|'6/01/08'
1|D|'7/15/08'
1|C|'7/21/08'
1|C|'8/19/08'
1|A|'9/17/08'
What I want:id|data|date1|date2
1|A|'1/28/08'|'2/02/08'
1|B|'2/02/08'|'2/03/08'
1|C|'2/03/08'|'6/01/08'
1|A|'6/01/08'|'7/15/08'
1|D|'7/15/08'|'7/21/08'
1|C|'7/21/08'|'7/17/08'
1|A|'9/17/08'|null
In other words, I want to collapse the data by removing duplicates *only when the data changes from a previous record* and maintain the timeline. The "date2" column is simply "date1" of the next row. MIN() and MAX() seem like logical choices, except GROUP BY does not preserve the date stamp. Any suggestions?

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

noeld SSC Guru Points: 96590 More actions · Answer 1

Oblio Leitch (12/4/2008)
What I have: id|data|date 1|A|'1/28/08' 1|B|'2/02/08' 1|C|'2/03/08' 1|C|'2/09/08' 1|C|'2/22/08' 1|C|'5/11/08' 1|A|'6/01/08' 1|D|'7/15/08' 1|C|'7/21/08' 1|C|'8/19/08'
1|A|'9/17/08'
What I want:id|data|date1|date2
1|A|'1/28/08'|'2/02/08'
1|B|'2/02/08'|'2/03/08'
1|C|'2/03/08'|'6/01/08'
1|A|'6/01/08'|'7/15/08'
1|D|'7/15/08'|'7/21/08'
1|C|'7/21/08'|'7/17/08'
1|A|'9/17/08'|null
In other words, I want to collapse the data by removing duplicates *only when the data changes from a previous record* and maintain the timeline. The "date2" column is simply "date1" of the next row. MIN() and MAX() seem like logical choices, except GROUP BY does not preserve the date stamp. Any suggestions?

declare @t table( id int, data char(1), date datetime)

insert into @t( id,data,date)

select 1,'A','1/28/08'

union all select 1,'B','2/02/08'

union all select 1,'C','2/03/08'

union all select 1,'C','2/09/08'

union all select 1,'C','2/22/08'

union all select 1,'C','5/11/08'

union all select 1,'A','6/01/08'

union all select 1,'D','7/15/08'

union all select 1,'C','7/21/08'

union all select 1,'C','8/19/08'

union all select 1,'A','9/17/08'

--select * from @t order by date

;with cte

AS(

select * , row_number() over(partition by id order by date, data) rn

from @t

)

select c1.id , c1.data

, c1.date as date1

,(select min(c3.date)

from cte c3

where c1.id = c3.id

and c1.rn < c3.rn

and c1.data <> c3.data) date2

from cte c1 left outer join cte c2

on c2.rn +1 = c1.rn

where c1.data <> c2.data or c2.data is null

order by c1.date

* Noel

Oblio Leitch SSCrazy Points: 2137 More actions · Answer 2

Excellent! Thank you. It looks like I was on the right track. I had the row_count() in there, but I was trying to use < rather than just stepping back down to the previous row. And, I knew I had to do a subquery to fetch the second date, but I was trying to do it looking forward. Just couldn't get my brain around it. Thank you for the assistance.

Some changes I made to fit my case - I added the id in the join to scale to multiple ids; I added additional data fields to check against. To do so, I OR'd them in the <> checks, and changed the IS NULL to look at the id rather than the data. Also, I ran into trouble with trying to compare data when it was null, so in my cte I ISNULL'd the data columns to be an empty string.

Jeff Moden SSC Guru Points: 1004405 More actions · Answer 3

Um... I'm thinking that's not what you want. Look at the ACTUAL execution plan for that... tell me where, with only 11 rows in the original table, you come up with a 121 row scan. That's right... accidental cross join. 11 * 11 = 121. What do you think will happen with 10,000 rows in the original table? Right again... 100,000,000. That's 100 MILLION internal rows being generated just for a lousy 10,000 row table.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004405 More actions · Answer 4

This will fly... even in the face of a million rows...

--===== Create the same test data as before.

-- This is NOT part of the solution.

declare @t table( id int, data char(1), date datetime)

insert into @t( id,data,date)

select 1,'A','1/28/08'

union all select 1,'B','2/02/08'

union all select 1,'C','2/03/08'

union all select 1,'C','2/09/08'

union all select 1,'C','2/22/08'

union all select 1,'C','5/11/08'

union all select 1,'A','6/01/08'

union all select 1,'D','7/15/08'

union all select 1,'C','7/21/08'

union all select 1,'C','8/19/08'

union all select 1,'A','9/17/08'

--===== Create the "Work" table

SELECT ISNULL(ROW_NUMBER() OVER(ORDER BY ID,Date, Data),0) AS lRowNum,

ISNULL(ROW_NUMBER() OVER(ORDER BY ID,Date, Data)-1,0) AS rRowNum,

*

INTO #Work

FROM @t

ALTER TABLE #Work

ADD PRIMARY KEY CLUSTERED (lRowNum)

--===== "Smear" the dates down in the work table

DECLARE @PrevId INT, @PrevData CHAR(1), @PrevDate DATETIME

SELECT @PrevId = 0, @PrevData = '', @PrevDate = 0

UPDATE #Work

SET @PrevDate = Date = CASE WHEN ID = @PrevID AND Data = @PrevData THEN @PrevDate ELSE Date END,

@PrevID = ID,

@PrevData = Data

FROM #Work WITH (INDEX(0))

--===== Select the final desired result set

SELECT l.ID, l.Data, l.Date AS Date1, r.Date AS Date2

FROM #Work l

LEFT OUTER JOIN #Work r

ON l.lRowNum = r.rRowNum

WHERE l.Data <> r.Data OR r.Data IS NULL

ORDER BY l.lRowNum

--===== Housekeeping for reruns...

DROP TABLE #Work

If you want to know how the data "smear" actually works, please see the following article... it uses the same basic principle as the running total...

[font="Arial Black"]

Solving the "Running Total" & "Ordinal Rank" Problems in SS 2k/2k5 (click me)[/font][/url]

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RBarryYoung SSC Guru Points: 143329 More actions · Answer 5

Actually, a couple of tweaks to noel's version will make it much better:

Drop table #YourTable

GO

Create table #YourTable(

id int

,data char(1)

,date datetime

,RowNum int Identity(1,1)

, primary key (id, RowNum, date))

GO

insert into #YourTable(id,data,date)

Select *

From (Select 1 as [id],'A' as [data],'1/28/08' as [date]

union all select 1,'B','2/02/08'

union all select 1,'C','2/03/08'

union all select 1,'C','2/09/08'

union all select 1,'C','2/22/08'

union all select 1,'C','5/11/08'

union all select 1,'A','6/01/08'

union all select 1,'D','7/15/08'

union all select 1,'C','7/21/08'

union all select 1,'C','8/19/08'

union all select 1,'A','9/17/08') D

select c1.id

, c1.data

, c1.date as date1

, (select min(c3.date)

from #YourTable c3

where c1.id = c3.id

and c1.RowNum < c3.RowNum

and c1.data <> c3.data) date2

from #YourTable c1

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Jeff Moden SSC Guru Points: 1004405 More actions · Answer 6

rbarryyoung (12/7/2008)
Actually, a couple of tweaks to noel's version will make it much better:

You need to check the output on that... 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Oblio Leitch SSCrazy Points: 2137 More actions · Answer 7

Jeff,

That looks pretty good, but I was just working with a reduced dataset as an example. I need to expand it to use groups of IDs. I need ,ISNULL(ROW_NUMBER() OVER(PARTITION BY [ID] ORDER BY [ID],[Date],[Data]),0) "lRowNum"

andADD PRIMARY KEY CLUSTERED ([ID],[lRowNum])

However, from there, I need to add it in the final join as well.

RBarryYoung SSC Guru Points: 143329 More actions · Answer 8

Dang, every time I see that "NOT EXISTS using a JOIN" trick, I think that it's a No-Op and I take it out... 🙁

Anyway, this corrected version gives correct results, while still retaining the improved performance.

Drop table #YourTable

GO

Create table #YourTable(

id int

,data char(1)

,date datetime

,RowNum int Identity(1,1)

, primary key (id, RowNum, date))

GO

insert into #YourTable(id,data,date)

Select *

From (Select 1 as [id],'A' as [data],'1/28/08' as [date]

union all select 1,'B','2/02/08'

union all select 1,'C','2/03/08'

union all select 1,'C','2/09/08'

union all select 1,'C','2/22/08'

union all select 1,'C','5/11/08'

union all select 1,'A','6/01/08'

union all select 1,'D','7/15/08'

union all select 1,'C','7/21/08'

union all select 1,'C','8/19/08'

union all select 1,'A','9/17/08') D

select c1.id

, c1.data

, c1.date as date1

, (select min(c3.date)

from #YourTable c3

where c1.id = c3.id

and c1.RowNum < c3.RowNum

and c1.data <> c3.data) date2

from #YourTable c1

left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum

where c1.data <> c2.data or c2.data is null

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Jeff Moden SSC Guru Points: 1004405 More actions · Answer 9

rbarryyoung (12/8/2008)
Dang, every time I see that "NOT EXISTS using a JOIN" trick, I think that it's a No-Op and I take it out... 🙁
Anyway, this corrected version gives correct results, while still retaining the improved performance.
Drop table #YourTable
GO
Create table #YourTable(
id int
,data char(1)
,date datetime
,RowNum int Identity(1,1)
, primary key (id, RowNum, date))
GO
insert into #YourTable(id,data,date)
Select *
From (Select 1 as [id],'A' as [data],'1/28/08' as [date]
union all select 1,'B','2/02/08'
union all select 1,'C','2/03/08'
union all select 1,'C','2/09/08'
union all select 1,'C','2/22/08'
union all select 1,'C','5/11/08'
union all select 1,'A','6/01/08'
union all select 1,'D','7/15/08'
union all select 1,'C','7/21/08'
union all select 1,'C','8/19/08'
union all select 1,'A','9/17/08') D
select c1.id
, c1.data
, c1.date as date1
, (select min(c3.date)
from #YourTable c3
where c1.id = c3.id
and c1.RowNum < c3.RowNum
and c1.data <> c3.data) date2
from #YourTable c1
left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum
where c1.data <> c2.data or c2.data is null

Uh, huh... now all ya gotta do is make it account for dupage... 😀

Drop table #YourTable

GO

Create table #YourTable(

id int

,data char(1)

,date datetime

,RowNum int Identity(1,1)

, primary key (id, RowNum, date))

GO

insert into #YourTable(id,data,date)

Select *

From (Select 1 as [id],'A' as [data],'1/28/08' as [date]

union all select 1,'B','2/02/08'

union all select 1,'C','2/03/08'

union all select 1,'C','2/09/08'

union all select 1,'C','2/22/08'

union all select 1,'C','5/11/08'

union all select 1,'A','6/01/08'

union all select 1,'D','7/15/08'

union all select 1,'C','7/21/08'

union all select 1,'C','8/19/08'

union all select 1,'A','9/17/08') D

[font="Arial Black"]--===== Add a helmet full of duplicates ;-)

WHILE @@ROWCOUNT <= 1000

INSERT INTO #YourTable

SELECT id,data,date FROM #YourTable[/font]

select c1.id

, c1.data

, c1.date as date1

, (select min(c3.date)

from #YourTable c3

where c1.id = c3.id

and c1.RowNum < c3.RowNum

and c1.data <> c3.data) date2

from #YourTable c1

left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum

where c1.data <> c2.data or c2.data is null

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004405 More actions · Answer 10

Oblio Leitch (12/8/2008)
Jeff,
That looks pretty good, but I was just working with a reduced dataset as an example. I need to expand it to use groups of IDs. I need ,ISNULL(ROW_NUMBER() OVER(PARTITION BY [ID] ORDER BY [ID],[Date],[Data]),0) "lRowNum"
andADD PRIMARY KEY CLUSTERED ([ID],[lRowNum])
However, from there, I need to add it in the final join as well.

Ok... see the link in my signature below for how to post readily consumable data and, using the new data, post what the output should look like. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)