December 4, 2008 at 1:19 pm
What I have: id|data|date
1|A|'1/28/08'
1|B|'2/02/08'
1|C|'2/03/08'
1|C|'2/09/08'
1|C|'2/22/08'
1|C|'5/11/08'
1|A|'6/01/08'
1|D|'7/15/08'
1|C|'7/21/08'
1|C|'8/19/08'
1|A|'9/17/08'
What I want:id|data|date1|date2
1|A|'1/28/08'|'2/02/08'
1|B|'2/02/08'|'2/03/08'
1|C|'2/03/08'|'6/01/08'
1|A|'6/01/08'|'7/15/08'
1|D|'7/15/08'|'7/21/08'
1|C|'7/21/08'|'7/17/08'
1|A|'9/17/08'|null
In other words, I want to collapse the data by removing duplicates *only when the data changes from a previous record* and maintain the timeline. The "date2" column is simply "date1" of the next row. MIN() and MAX() seem like logical choices, except GROUP BY does not preserve the date stamp. Any suggestions?
December 4, 2008 at 2:16 pm
Oblio Leitch (12/4/2008)
What I have:id|data|date
1|A|'1/28/08'
1|B|'2/02/08'
1|C|'2/03/08'
1|C|'2/09/08'
1|C|'2/22/08'
1|C|'5/11/08'
1|A|'6/01/08'
1|D|'7/15/08'
1|C|'7/21/08'
1|C|'8/19/08'
1|A|'9/17/08'
What I want:
id|data|date1|date2
1|A|'1/28/08'|'2/02/08'
1|B|'2/02/08'|'2/03/08'
1|C|'2/03/08'|'6/01/08'
1|A|'6/01/08'|'7/15/08'
1|D|'7/15/08'|'7/21/08'
1|C|'7/21/08'|'7/17/08'
1|A|'9/17/08'|null
In other words, I want to collapse the data by removing duplicates *only when the data changes from a previous record* and maintain the timeline. The "date2" column is simply "date1" of the next row. MIN() and MAX() seem like logical choices, except GROUP BY does not preserve the date stamp. Any suggestions?
declare @t table( id int, data char(1), date datetime)
insert into @t( id,data,date)
select 1,'A','1/28/08'
union all select 1,'B','2/02/08'
union all select 1,'C','2/03/08'
union all select 1,'C','2/09/08'
union all select 1,'C','2/22/08'
union all select 1,'C','5/11/08'
union all select 1,'A','6/01/08'
union all select 1,'D','7/15/08'
union all select 1,'C','7/21/08'
union all select 1,'C','8/19/08'
union all select 1,'A','9/17/08'
--select * from @t order by date
;with cte
AS(
select * , row_number() over(partition by id order by date, data) rn
from @t
)
select c1.id , c1.data
, c1.date as date1
,(select min(c3.date)
from cte c3
where c1.id = c3.id
and c1.rn < c3.rn
and c1.data <> c3.data) date2
from cte c1 left outer join cte c2
on c2.rn +1 = c1.rn
where c1.data <> c2.data or c2.data is null
order by c1.date
* Noel
December 5, 2008 at 6:27 am
Excellent! Thank you. It looks like I was on the right track. I had the row_count() in there, but I was trying to use < rather than just stepping back down to the previous row. And, I knew I had to do a subquery to fetch the second date, but I was trying to do it looking forward. Just couldn't get my brain around it. Thank you for the assistance.
Some changes I made to fit my case - I added the id in the join to scale to multiple ids; I added additional data fields to check against. To do so, I OR'd them in the <> checks, and changed the IS NULL to look at the id rather than the data. Also, I ran into trouble with trying to compare data when it was null, so in my cte I ISNULL'd the data columns to be an empty string.
December 6, 2008 at 7:30 pm
Um... I'm thinking that's not what you want. Look at the ACTUAL execution plan for that... tell me where, with only 11 rows in the original table, you come up with a 121 row scan. That's right... accidental cross join. 11 * 11 = 121. What do you think will happen with 10,000 rows in the original table? Right again... 100,000,000. That's 100 MILLION internal rows being generated just for a lousy 10,000 row table.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2008 at 9:08 pm
This will fly... even in the face of a million rows...
--===== Create the same test data as before.
-- This is NOT part of the solution.
declare @t table( id int, data char(1), date datetime)
insert into @t( id,data,date)
select 1,'A','1/28/08'
union all select 1,'B','2/02/08'
union all select 1,'C','2/03/08'
union all select 1,'C','2/09/08'
union all select 1,'C','2/22/08'
union all select 1,'C','5/11/08'
union all select 1,'A','6/01/08'
union all select 1,'D','7/15/08'
union all select 1,'C','7/21/08'
union all select 1,'C','8/19/08'
union all select 1,'A','9/17/08'
--===== Create the "Work" table
SELECT ISNULL(ROW_NUMBER() OVER(ORDER BY ID,Date, Data),0) AS lRowNum,
ISNULL(ROW_NUMBER() OVER(ORDER BY ID,Date, Data)-1,0) AS rRowNum,
*
INTO #Work
FROM @t
ALTER TABLE #Work
ADD PRIMARY KEY CLUSTERED (lRowNum)
--===== "Smear" the dates down in the work table
DECLARE @PrevId INT, @PrevData CHAR(1), @PrevDate DATETIME
SELECT @PrevId = 0, @PrevData = '', @PrevDate = 0
UPDATE #Work
SET @PrevDate = Date = CASE WHEN ID = @PrevID AND Data = @PrevData THEN @PrevDate ELSE Date END,
@PrevID = ID,
@PrevData = Data
FROM #Work WITH (INDEX(0))
--===== Select the final desired result set
SELECT l.ID, l.Data, l.Date AS Date1, r.Date AS Date2
FROM #Work l
LEFT OUTER JOIN #Work r
ON l.lRowNum = r.rRowNum
WHERE l.Data <> r.Data OR r.Data IS NULL
ORDER BY l.lRowNum
--===== Housekeeping for reruns...
DROP TABLE #Work
If you want to know how the data "smear" actually works, please see the following article... it uses the same basic principle as the running total...
Solving the "Running Total" & "Ordinal Rank" Problems in SS 2k/2k5 (click me)[/font][/url]
--Jeff Moden
Change is inevitable... Change for the better is not.
December 7, 2008 at 1:07 pm
Actually, a couple of tweaks to noel's version will make it much better:
Drop table #YourTable
GO
Create table #YourTable(
id int
,data char(1)
,date datetime
,RowNum int Identity(1,1)
, primary key (id, RowNum, date))
GO
insert into #YourTable(id,data,date)
Select *
From (Select 1 as [id],'A' as [data],'1/28/08' as [date]
union all select 1,'B','2/02/08'
union all select 1,'C','2/03/08'
union all select 1,'C','2/09/08'
union all select 1,'C','2/22/08'
union all select 1,'C','5/11/08'
union all select 1,'A','6/01/08'
union all select 1,'D','7/15/08'
union all select 1,'C','7/21/08'
union all select 1,'C','8/19/08'
union all select 1,'A','9/17/08') D
select c1.id
, c1.data
, c1.date as date1
, (select min(c3.date)
from #YourTable c3
where c1.id = c3.id
and c1.RowNum < c3.RowNum
and c1.data <> c3.data) date2
from #YourTable c1
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
December 8, 2008 at 5:44 am
rbarryyoung (12/7/2008)
Actually, a couple of tweaks to noel's version will make it much better:
You need to check the output on that... 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
December 8, 2008 at 8:13 am
Jeff,
That looks pretty good, but I was just working with a reduced dataset as an example. I need to expand it to use groups of IDs. I need ,ISNULL(ROW_NUMBER() OVER(PARTITION BY [ID] ORDER BY [ID],[Date],[Data]),0) "lRowNum"
andADD PRIMARY KEY CLUSTERED ([ID],[lRowNum])
However, from there, I need to add it in the final join as well.
December 8, 2008 at 6:33 pm
Dang, every time I see that "NOT EXISTS using a JOIN" trick, I think that it's a No-Op and I take it out... 🙁
Anyway, this corrected version gives correct results, while still retaining the improved performance.
Drop table #YourTable
GO
Create table #YourTable(
id int
,data char(1)
,date datetime
,RowNum int Identity(1,1)
, primary key (id, RowNum, date))
GO
insert into #YourTable(id,data,date)
Select *
From (Select 1 as [id],'A' as [data],'1/28/08' as [date]
union all select 1,'B','2/02/08'
union all select 1,'C','2/03/08'
union all select 1,'C','2/09/08'
union all select 1,'C','2/22/08'
union all select 1,'C','5/11/08'
union all select 1,'A','6/01/08'
union all select 1,'D','7/15/08'
union all select 1,'C','7/21/08'
union all select 1,'C','8/19/08'
union all select 1,'A','9/17/08') D
select c1.id
, c1.data
, c1.date as date1
, (select min(c3.date)
from #YourTable c3
where c1.id = c3.id
and c1.RowNum < c3.RowNum
and c1.data <> c3.data) date2
from #YourTable c1
left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum
where c1.data <> c2.data or c2.data is null
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
December 8, 2008 at 7:08 pm
rbarryyoung (12/8/2008)
Dang, every time I see that "NOT EXISTS using a JOIN" trick, I think that it's a No-Op and I take it out... 🙁Anyway, this corrected version gives correct results, while still retaining the improved performance.
Drop table #YourTable
GO
Create table #YourTable(
id int
,data char(1)
,date datetime
,RowNum int Identity(1,1)
, primary key (id, RowNum, date))
GO
insert into #YourTable(id,data,date)
Select *
From (Select 1 as [id],'A' as [data],'1/28/08' as [date]
union all select 1,'B','2/02/08'
union all select 1,'C','2/03/08'
union all select 1,'C','2/09/08'
union all select 1,'C','2/22/08'
union all select 1,'C','5/11/08'
union all select 1,'A','6/01/08'
union all select 1,'D','7/15/08'
union all select 1,'C','7/21/08'
union all select 1,'C','8/19/08'
union all select 1,'A','9/17/08') D
select c1.id
, c1.data
, c1.date as date1
, (select min(c3.date)
from #YourTable c3
where c1.id = c3.id
and c1.RowNum < c3.RowNum
and c1.data <> c3.data) date2
from #YourTable c1
left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum
where c1.data <> c2.data or c2.data is null
Uh, huh... now all ya gotta do is make it account for dupage... 😀
Drop table #YourTable
GO
Create table #YourTable(
id int
,data char(1)
,date datetime
,RowNum int Identity(1,1)
, primary key (id, RowNum, date))
GO
insert into #YourTable(id,data,date)
Select *
From (Select 1 as [id],'A' as [data],'1/28/08' as [date]
union all select 1,'B','2/02/08'
union all select 1,'C','2/03/08'
union all select 1,'C','2/09/08'
union all select 1,'C','2/22/08'
union all select 1,'C','5/11/08'
union all select 1,'A','6/01/08'
union all select 1,'D','7/15/08'
union all select 1,'C','7/21/08'
union all select 1,'C','8/19/08'
union all select 1,'A','9/17/08') D
[font="Arial Black"]--===== Add a helmet full of duplicates ;-)
WHILE @@ROWCOUNT <= 1000
INSERT INTO #YourTable
SELECT id,data,date FROM #YourTable[/font]
select c1.id
, c1.data
, c1.date as date1
, (select min(c3.date)
from #YourTable c3
where c1.id = c3.id
and c1.RowNum < c3.RowNum
and c1.data <> c3.data) date2
from #YourTable c1
left outer join #YourTable c2 on c2.RowNum +1 = c1.RowNum
where c1.data <> c2.data or c2.data is null
--Jeff Moden
Change is inevitable... Change for the better is not.
December 8, 2008 at 7:10 pm
Oblio Leitch (12/8/2008)
Jeff,That looks pretty good, but I was just working with a reduced dataset as an example. I need to expand it to use groups of IDs. I need
,ISNULL(ROW_NUMBER() OVER(PARTITION BY [ID] ORDER BY [ID],[Date],[Data]),0) "lRowNum"
and
ADD PRIMARY KEY CLUSTERED ([ID],[lRowNum])
However, from there, I need to add it in the final join as well.
Ok... see the link in my signature below for how to post readily consumable data and, using the new data, post what the output should look like. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply