How to get row of first change?

Question

Post reply

How to get row of first change?

tom.w.brannon

Hall of Fame

Points: 3976
More actions
May 12, 2015 at 8:04 am

#398444

I would like to get extract the first row of some time based data where any of the other values changes. For example, in the following 10 rows
ABDTE
112015-01-01
112015-01-02
112015-01-03
212015-01-04
112015-01-05
1NULL2015-01-06
1NULL2015-01-07
212015-01-08
212015-01-09
222015-01-10
I want to remove the rows where A and B are the same as on the previous row. So rows 2, 3, 7 and 9 should be eliminated. Note that A and B can have the same values multiple times, just not in succession in the extract. I've tried ranking but I can't figure out how to keep it from lumping all the values of A and B in the same group. The following incorrectly eliminates rows 5 and 8:
;with data as (
select 1 as A, 1 as B, '2015-01-01' as DTE union
select 1 as A, 1 as B, '2015-01-02' as DTE union
select 1 as A, 1 as B, '2015-01-03' as DTE union
select 2 as A, 1 as B, '2015-01-04' as DTE union
select 1 as A, 1 as B, '2015-01-05' as DTE union
select 1 as A, NULL as B, '2015-01-06' as DTE union
select 1 as A, NULL as B, '2015-01-07' as DTE union
select 2 as A, 1 as B, '2015-01-08' as DTE union
select 2 as A, 1 as B, '2015-01-09' as DTE union
select 2 as A, 2 as B, '2015-01-10' as DTE)
select a, b, dte
from (select a, b, dte, rank() over (partition by A, B order by dte) as rn
from data) a
where rn = 1
ORDER BY dte;
Of course the real data has many columns and multiple data types that can have nulls. I just want get the row when anything changes.
Is there a slick way to do this in SQL?
Thanks

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 247216 More actions · Answer 1

Something like this?

if object_id('tempdb..#date', 'U') is not null

drop table #date;

create table #date (a int, b int, dte date);

insert #date

(a

,b

,dte

)

select 1 as A

,1 as B

,'2015-01-01' as DTE

union

select 1 as A

,1 as B

,'2015-01-02' as DTE

union

select 1 as A

,1 as B

,'2015-01-03' as DTE

union

select 2 as A

,1 as B

,'2015-01-04' as DTE

union

select 1 as A

,1 as B

,'2015-01-05' as DTE

union

select 1 as A

,null as B

,'2015-01-06' as DTE

union

select 1 as A

,null as B

,'2015-01-07' as DTE

union

select 2 as A

,1 as B

,'2015-01-08' as DTE

union

select 2 as A

,1 as B

,'2015-01-09' as DTE

union

select 2 as A

,2 as B

,'2015-01-10' as DTE;

select *

from #date d

order by d.dte;

with ordered

as (select d.a

,d.b

,d.dte

,abefore = lag(a, 1, null) over (order by dte)

,bbefore = lag(b, 1, null) over (order by dte)

from #date d

)

--order by dte;

select *

from ordered o

where not exists ( select o.a

,o.b

intersect

select o.abefore

,o.bbefore );

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 2

tom.w.brannon

Hall of Fame

Points: 3976

May 12, 2015 at 1:25 pm

#1797368

Thank you! That works really well.

Dwain Camps SSC Guru Points: 86978 More actions · Answer 3

Perhaps this will work too?

with data as (

select 1 as A, 1 as B, '2015-01-01' as DTE union

select 1 as A, 1 as B, '2015-01-02' as DTE union

select 1 as A, 1 as B, '2015-01-03' as DTE union

select 2 as A, 1 as B, '2015-01-04' as DTE union

select 1 as A, 1 as B, '2015-01-05' as DTE union

select 1 as A, NULL as B, '2015-01-06' as DTE union

select 1 as A, NULL as B, '2015-01-07' as DTE union

select 2 as A, 1 as B, '2015-01-08' as DTE union

select 2 as A, 1 as B, '2015-01-09' as DTE union

select 2 as A, 2 as B, '2015-01-10' as DTE)

SELECT A, B, DTE

FROM

(

SELECT *

,rn1=ROW_NUMBER() OVER (PARTITION BY rn ORDER BY DTE)

FROM

(

select *

,rn=ROW_NUMBER() OVER (ORDER BY DTE) -

ROW_NUMBER() OVER (PARTITION BY A, B ORDER BY DTE)

FROM data

) a

WHERE rn1=1

ORDER BY dte;

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 4

Thank you for the reply. Unfortunately this one does not work on the real data. Maybe it has to do with some additional information about the data that I did not think to include. Column A represents one or more key columns usually varchar but sometimes int. B represents one or more additional NULLable data columns that are of data types char, varchar, int, float, decimal or datetime. The datetime values are not necessarily unique even within a key.

Dwain Camps SSC Guru Points: 86978 More actions · Answer 5

tom.w.brannon (5/13/2015)
Thank you for the reply. Unfortunately this one does not work on the real data.

Which one?

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jason A. Long SSC-Insane Points: 23790 More actions · Answer 6

Here's a slightly different approach...

IF OBJECT_ID('tempdb..#temp') IS NOT NULL

DROP TABLE #temp

GO

CREATE TABLE #temp (

A CHAR(1) NOT NULL,

B CHAR(1) NULL,

DTE DATE

)

INSERT #temp (A,B,DTE) VALUES

('1','1','2015-01-01'),

('1','1','2015-01-02'),

('1','1','2015-01-03'),

('2','1','2015-01-04'),

('1','1','2015-01-05'),

('1',NULL,'2015-01-06'),

('1',NULL,'2015-01-07'),

('2','1','2015-01-08'),

('2','1','2015-01-09'),

('2','2','2015-01-10')

;WITH BinVals AS (

SELECT

t.A,

t.B,

t.DTE,

CAST(t.A AS BINARY(4)) + CAST(COALESCE(t.B, 'NULL') AS BINARY(4)) BinValCur,

CAST(LAG(t.A, 1) OVER (ORDER BY t.DTE) AS BINARY(4)) + CAST(COALESCE(LAG(t.B, 1) OVER (ORDER BY t.DTE), 'NULL') AS BINARY(4)) BinValLag

FROM

#temp t

)

SELECT

bv.A,

bv.B,

bv.DTE

FROM BinVals bv

WHERE bv.BinValCur <> COALESCE(bv.BinValLag, '')

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 7

Sorry. I thought I was replying to Dwain's posted answer rather than making a general post. The first answer posted by Phil Parkin works with a tweak. The 'over (order by dte)' needs to be 'over (order by a, dte)' to keep the data separated on the key (A). I cannot make the suggestion by Dwain.c work. The data needs to be partitioned by A and ordered by A and DTE and then rows that show any change (including first row) kept. I get confused by the multiple row number functions. I can also make Jason's answer work, again by adding A to the OVER (ORDER BY ) clauses. One disadvantage of Jason's method for me is that I believe I would need to set the VARBINARY length based on the data type and width of the column instead of just using a column name. I might also need to carefully select NULL substitute values.

Thank you all very much for your efforts on this. The solutions are not anywhere near things I was considering and so I am learning things.

Jason A. Long SSC-Insane Points: 23790 More actions · Answer 8

I can also make Jason's answer work, again by adding A to the OVER (ORDER BY ) clauses. One disadvantage of Jason's method for me is that I believe I would need to set the VARBINARY length based on the data type and width of the column instead of just using a column name. I might also need to carefully select NULL substitute values.

The CAST to BINARY is just an easy way to concatenate multiple columns of varying data types together to determine uniqueness across those columns. So, in this case, it may still be a good method to create the "A" keys and "B" values (assuming that each is comprised of multiple columns).

You are correct about setting the BINARY/VARBINARY lengths... At least to the extent that you do want prevent data truncation. Since you don't actually need to decompose the binary values to get the original values, you can get away with using a generic VARBINARY(N)... Where N is large enough to accommodate the larges of the original values.

As to the COALESCE to 'NULL'... That was just a arbitrary value (that didn't seem likely to exist in the actual data) that's used to prevent the "anything + NULL = NULL" scenario and to treat NULL as an actual value that could be repeated from row to row.

In any case, if you test all 3, it would interesting to see if any of them stands out as a clear winner in terms of performance.

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 9

Thank you for the explanations Jason. I compared the first and third methods and have about the same costs (51% and 49% respectively) by very different execution plans. But the first method is consistently about 10% faster (3.7 sec vs 4.1 sec CPU). The table used is unindexed so there is one scan and same number of logical reads for both (13845 for 550K records and checking differences in only 3 columns).

Jason A. Long SSC-Insane Points: 23790 More actions · Answer 10

Thanks for the info Tom. Sounds like Phil's solution has the edge. I've never used the INTERSECT operator in anything before, so it looks like I've got some homework of my own to do.

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 11

Yes, intersect is not normal to me either. I actually changed it where exists and then used except. Still non-obvious logic for me so I am glad to have the model.

Phil Parkin SSC Guru Points: 247216 More actions · Answer 12

tom.w.brannon (5/13/2015)
Yes, intersect is not normal to me either. I actually changed it where exists and then used except. Still non-obvious logic for me so I am glad to have the model.

This article was where I borrowed the technique from. I use it all the time when writing MERGE statements.

Notice (in the comments at the end of the article) the suggestion that NOT EXISTS (...INTERSECT) can produce cleaner execution plans that EXISTS (...EXCEPT) – it is for this reason alone that I suggested doing things this way.

Dwain Camps SSC Guru Points: 86978 More actions · Answer 13

tom.w.brannon (5/13/2015)
Sorry. I thought I was replying to Dwain's posted answer rather than making a general post. The first answer posted by Phil Parkin works with a tweak. The 'over (order by dte)' needs to be 'over (order by a, dte)' to keep the data separated on the key (A). I cannot make the suggestion by Dwain.c work. The data needs to be partitioned by A and ordered by A and DTE and then rows that show any change (including first row) kept. I get confused by the multiple row number functions. I can also make Jason's answer work, again by adding A to the OVER (ORDER BY ) clauses. One disadvantage of Jason's method for me is that I believe I would need to set the VARBINARY length based on the data type and width of the column instead of just using a column name. I might also need to carefully select NULL substitute values.
Thank you all very much for your efforts on this. The solutions are not anywhere near things I was considering and so I am learning things.

Sorry my solution didn't work for your real data. I suspect the issue is that you need to come up with the "just right" PARTITION BY/ORDER BY columns in the third ROW_NUMBER(). Clearly the data you posted suggested that the date would be the proper ORDER BY column, however without data that is closer to your real world example I probably can't suggest better.

It seems you've got a working solution, so I won't belabor the point any more. It is always good to hear it when people say they've learned something here on the SSC forums.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jason A. Long SSC-Insane Points: 23790 More actions · Answer 14

Phil Parkin (5/13/2015)
tom.w.brannon (5/13/2015)
Yes, intersect is not normal to me either. I actually changed it where exists and then used except. Still non-obvious logic for me so I am glad to have the model.
This article was where I borrowed the technique from. I use it all the time when writing MERGE statements.
Notice (in the comments at the end of the article) the suggestion that NOT EXISTS (...INTERSECT) can produce cleaner execution plans that EXISTS (...EXCEPT) – it is for this reason alone that I suggested doing things this way.

Phil - Thank you for posting the link to the "Undocumented Query Plans: Equality Comparisons" article. I gave it a quick read through but it's going to be one that I'll need to spend some real time with before it sinks all the way in.