May 12, 2015 at 8:04 am
I would like to extract the first row of some time-based data whenever any of the other values changes. For example, in the following 10 rows
A  B     DTE
1  1     2015-01-01
1  1     2015-01-02
1  1     2015-01-03
2  1     2015-01-04
1  1     2015-01-05
1  NULL  2015-01-06
1  NULL  2015-01-07
2  1     2015-01-08
2  1     2015-01-09
2  2     2015-01-10
I want to remove the rows where A and B are the same as on the previous row. So rows 2, 3, 7 and 9 should be eliminated. Note that A and B can have the same values multiple times, just not in succession in the extract. I've tried ranking but I can't figure out how to keep it from lumping all the values of A and B in the same group. The following incorrectly eliminates rows 5 and 8:
;with data as (
select 1 as A, 1 as B, '2015-01-01' as DTE union
select 1 as A, 1 as B, '2015-01-02' as DTE union
select 1 as A, 1 as B, '2015-01-03' as DTE union
select 2 as A, 1 as B, '2015-01-04' as DTE union
select 1 as A, 1 as B, '2015-01-05' as DTE union
select 1 as A, NULL as B, '2015-01-06' as DTE union
select 1 as A, NULL as B, '2015-01-07' as DTE union
select 2 as A, 1 as B, '2015-01-08' as DTE union
select 2 as A, 1 as B, '2015-01-09' as DTE union
select 2 as A, 2 as B, '2015-01-10' as DTE)
select a, b, dte
from (select a, b, dte, rank() over (partition by A, B order by dte) as rn
from data) a
where rn = 1
ORDER BY dte;
Of course the real data has many columns and multiple data types that can have nulls. I just want to get the row when anything changes.
Is there a slick way to do this in SQL?
Thanks
May 12, 2015 at 8:27 am
Something like this?
if object_id('tempdb..#date', 'U') is not null
drop table #date;
create table #date (a int, b int, dte date);
insert #date
(a
,b
,dte
)
select 1 as A
,1 as B
,'2015-01-01' as DTE
union
select 1 as A
,1 as B
,'2015-01-02' as DTE
union
select 1 as A
,1 as B
,'2015-01-03' as DTE
union
select 2 as A
,1 as B
,'2015-01-04' as DTE
union
select 1 as A
,1 as B
,'2015-01-05' as DTE
union
select 1 as A
,null as B
,'2015-01-06' as DTE
union
select 1 as A
,null as B
,'2015-01-07' as DTE
union
select 2 as A
,1 as B
,'2015-01-08' as DTE
union
select 2 as A
,1 as B
,'2015-01-09' as DTE
union
select 2 as A
,2 as B
,'2015-01-10' as DTE;
select *
from #date d
order by d.dte;
with ordered
as (select d.a
,d.b
,d.dte
,abefore = lag(a, 1, null) over (order by dte)
,bbefore = lag(b, 1, null) over (order by dte)
from #date d
)
--order by dte;
select *
from ordered o
where not exists ( select o.a
,o.b
intersect
select o.abefore
,o.bbefore );
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 12, 2015 at 1:25 pm
Thank you! That works really well.
May 12, 2015 at 6:36 pm
Perhaps this will work too?
with data as (
select 1 as A, 1 as B, '2015-01-01' as DTE union
select 1 as A, 1 as B, '2015-01-02' as DTE union
select 1 as A, 1 as B, '2015-01-03' as DTE union
select 2 as A, 1 as B, '2015-01-04' as DTE union
select 1 as A, 1 as B, '2015-01-05' as DTE union
select 1 as A, NULL as B, '2015-01-06' as DTE union
select 1 as A, NULL as B, '2015-01-07' as DTE union
select 2 as A, 1 as B, '2015-01-08' as DTE union
select 2 as A, 1 as B, '2015-01-09' as DTE union
select 2 as A, 2 as B, '2015-01-10' as DTE)
SELECT A, B, DTE
FROM
(
SELECT *
,rn1=ROW_NUMBER() OVER (PARTITION BY rn ORDER BY DTE)
FROM
(
select *
,rn=ROW_NUMBER() OVER (ORDER BY DTE) -
ROW_NUMBER() OVER (PARTITION BY A, B ORDER BY DTE)
FROM data
) a
) a
WHERE rn1=1
ORDER BY dte;
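The query above is the classic gaps-and-islands trick: a global ROW_NUMBER() minus a per-(A, B) ROW_NUMBER() is constant within each unbroken run of identical values, so it identifies the "island" each row belongs to. A minimal pure-Python rendering of that arithmetic (with one defensive variation: the island key here includes A and B alongside the difference, whereas the T-SQL partitions on rn alone):

```python
from itertools import groupby

rows = [(1, 1, '2015-01-01'), (1, 1, '2015-01-02'), (1, 1, '2015-01-03'),
        (2, 1, '2015-01-04'), (1, 1, '2015-01-05'), (1, None, '2015-01-06'),
        (1, None, '2015-01-07'), (2, 1, '2015-01-08'), (2, 1, '2015-01-09'),
        (2, 2, '2015-01-10')]

# rn = ROW_NUMBER() OVER (ORDER BY dte)
#    - ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY dte)
per_pair = {}
keyed = []
for i, (a, b, dte) in enumerate(rows, start=1):
    per_pair[(a, b)] = per_pair.get((a, b), 0) + 1
    keyed.append(((i - per_pair[(a, b)], a, b), dte))

# The key is constant within each island; keep the first row of each
# (the rn1 = 1 filter in the T-SQL above).
firsts = [next(iter(grp))[1] for _, grp in groupby(keyed, key=lambda r: r[0])]
print(firsts)
```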
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
May 13, 2015 at 5:35 am
Thank you for the reply. Unfortunately this one does not work on the real data. Maybe it has to do with some additional information about the data that I did not think to include. Column A represents one or more key columns, usually varchar but sometimes int. B represents one or more additional NULLable data columns of type char, varchar, int, float, decimal or datetime. The datetime values are not necessarily unique, even within a key.
May 13, 2015 at 6:52 am
tom.w.brannon (5/13/2015)
Thank you for the reply. Unfortunately this one does not work on the real data.
Which one?
May 13, 2015 at 7:33 am
Here's a slightly different approach...
IF OBJECT_ID('tempdb..#temp') IS NOT NULL
DROP TABLE #temp
GO
CREATE TABLE #temp (
A CHAR(1) NOT NULL,
B CHAR(1) NULL,
DTE DATE
)
INSERT #temp (A,B,DTE) VALUES
('1','1','2015-01-01'),
('1','1','2015-01-02'),
('1','1','2015-01-03'),
('2','1','2015-01-04'),
('1','1','2015-01-05'),
('1',NULL,'2015-01-06'),
('1',NULL,'2015-01-07'),
('2','1','2015-01-08'),
('2','1','2015-01-09'),
('2','2','2015-01-10')
;WITH BinVals AS (
SELECT
t.A,
t.B,
t.DTE,
CAST(t.A AS BINARY(4)) + CAST(COALESCE(t.B, 'NULL') AS BINARY(4)) BinValCur,
CAST(LAG(t.A, 1) OVER (ORDER BY t.DTE) AS BINARY(4)) + CAST(COALESCE(LAG(t.B, 1) OVER (ORDER BY t.DTE), 'NULL') AS BINARY(4)) BinValLag
FROM
#temp t
)
SELECT
bv.A,
bv.B,
bv.DTE
FROM BinVals bv
WHERE bv.BinValCur <> COALESCE(bv.BinValLag, '')
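The CAST-to-BINARY idea, concatenating typed columns into a single comparable value with a sentinel standing in for NULL, can be sketched outside T-SQL as well. A hedged Python rendering (the length prefix and the \x00 sentinel are illustrative choices of mine, not part of the query above; as noted later in the thread, the sentinel must not collide with real data):

```python
# Serialize each column with a length prefix so adjacent values can't
# run together, and substitute a sentinel byte for NULL.
def row_key(*cols):
    parts = []
    for c in cols:
        raw = b"\x00" if c is None else str(c).encode()
        parts.append(len(raw).to_bytes(4, "big") + raw)
    return b"".join(parts)

rows = [(1, 1, '2015-01-01'), (1, 1, '2015-01-02'), (1, 1, '2015-01-03'),
        (2, 1, '2015-01-04'), (1, 1, '2015-01-05'), (1, None, '2015-01-06'),
        (1, None, '2015-01-07'), (2, 1, '2015-01-08'), (2, 1, '2015-01-09'),
        (2, 2, '2015-01-10')]

kept, prev = [], None
for a, b, dte in rows:
    cur = row_key(a, b)
    if cur != prev:  # first row always differs, since prev starts as None
        kept.append(dte)
    prev = cur
print(kept)
```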
May 13, 2015 at 8:52 am
Sorry. I thought I was replying to Dwain's posted answer rather than making a general post.
The first answer, posted by Phil Parkin, works with a tweak: the 'over (order by dte)' needs to be 'over (order by a, dte)' to keep the data separated on the key (A).
I cannot make the suggestion by Dwain.c work. The data needs to be partitioned by A, ordered by A and DTE, and then the rows that show any change (including the first row) kept. I get confused by the multiple row number functions.
I can also make Jason's answer work, again by adding A to the OVER (ORDER BY) clauses. One disadvantage of Jason's method for me is that I believe I would need to set the VARBINARY length based on the data type and width of the column instead of just using a column name. I might also need to carefully select NULL substitute values.
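The point about separating on the key can be sanity-checked with a tiny sqlite3 sketch (the key values and column names here are illustrative, not the real schema; SQLite 3.25+ is assumed for LAG, and SQLite's NULL-safe IS NOT stands in for the T-SQL INTERSECT comparison):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE d (a TEXT, b INT, dte TEXT)")
# Two keys sharing the same dates: without PARTITION BY a, the keys'
# rows would interleave and changes would be detected incorrectly.
con.executemany("INSERT INTO d VALUES (?, ?, ?)", [
    ('x', 1, '2015-01-01'), ('y', 1, '2015-01-01'),
    ('x', 1, '2015-01-02'), ('y', 2, '2015-01-02')])

changed = con.execute("""
    WITH o AS (
        SELECT a, b, dte,
               LAG(b) OVER (PARTITION BY a ORDER BY dte) AS bbefore
        FROM d)
    SELECT a, dte FROM o
    WHERE b IS NOT bbefore
    ORDER BY a, dte""").fetchall()
print(changed)  # key x is unchanged on its second date; key y is not
```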
Thank you all very much for your efforts on this. The solutions are not anywhere near things I was considering and so I am learning things.
May 13, 2015 at 11:26 am
I can also make Jason's answer work, again by adding A to the OVER (ORDER BY ) clauses. One disadvantage of Jason's method for me is that I believe I would need to set the VARBINARY length based on the data type and width of the column instead of just using a column name. I might also need to carefully select NULL substitute values.
The CAST to BINARY is just an easy way to concatenate multiple columns of varying data types together to determine uniqueness across those columns. So, in this case, it may still be a good method to create the "A" keys and "B" values (assuming that each is comprised of multiple columns).
You are correct about setting the BINARY/VARBINARY lengths... at least to the extent that you want to prevent data truncation. Since you don't actually need to decompose the binary values to get the original values back, you can get away with using a generic VARBINARY(N), where N is large enough to accommodate the largest of the original values.
As for the COALESCE to 'NULL'... that was just an arbitrary value (one that didn't seem likely to exist in the actual data) used to prevent the "anything + NULL = NULL" scenario and to treat NULL as an actual value that could be repeated from row to row.
In any case, if you test all 3, it would be interesting to see whether any of them stands out as a clear winner in terms of performance.
May 13, 2015 at 12:07 pm
Thank you for the explanations, Jason. I compared the first and third methods and they have about the same costs (51% and 49% respectively) via very different execution plans, but the first method is consistently about 10% faster (3.7 sec vs 4.1 sec CPU). The table used is unindexed, so there is one scan and the same number of logical reads for both (13845 for 550K records, checking differences in only 3 columns).
May 13, 2015 at 12:34 pm
Thanks for the info Tom. Sounds like Phil's solution has the edge. I've never used the INTERSECT operator in anything before, so it looks like I've got some homework of my own to do.
May 13, 2015 at 1:14 pm
Yes, INTERSECT is not normal to me either. I actually changed it to WHERE EXISTS and then used EXCEPT. Still non-obvious logic for me, so I am glad to have the model.
May 13, 2015 at 2:59 pm
tom.w.brannon (5/13/2015)
Yes, INTERSECT is not normal to me either. I actually changed it to WHERE EXISTS and then used EXCEPT. Still non-obvious logic for me, so I am glad to have the model.
This article was where I borrowed the technique from. I use it all the time when writing MERGE statements.
Notice (in the comments at the end of the article) the suggestion that NOT EXISTS (...INTERSECT) can produce cleaner execution plans than EXISTS (...EXCEPT); it is for this reason alone that I suggested doing things this way.
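For anyone puzzling over why the INTERSECT form is NULL-safe where plain = is not: set operators treat two NULLs as duplicates of each other. A quick sqlite3 demonstration of that semantic difference (SQLite follows the same standard set-operator behavior here as SQL Server):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Plain equality: NULL = NULL evaluates to NULL, i.e. not true.
eq = con.execute("SELECT NULL = NULL").fetchone()[0]

# INTERSECT treats NULLs as matching, so identical rows with NULLs intersect...
hit = con.execute("SELECT NULL, 1 INTERSECT SELECT NULL, 1").fetchall()
# ...while rows differing in a non-NULL column do not.
miss = con.execute("SELECT NULL, 1 INTERSECT SELECT NULL, 2").fetchall()
print(eq, hit, miss)
```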
May 13, 2015 at 5:47 pm
tom.w.brannon (5/13/2015)
Sorry. I thought I was replying to Dwain's posted answer rather than making a general post. The first answer posted by Phil Parkin works with a tweak. The 'over (order by dte)' needs to be 'over (order by a, dte)' to keep the data separated on the key (A). I cannot make the suggestion by Dwain.c work. The data needs to be partitioned by A and ordered by A and DTE and then rows that show any change (including first row) kept. I get confused by the multiple row number functions. I can also make Jason's answer work, again by adding A to the OVER (ORDER BY ) clauses. One disadvantage of Jason's method for me is that I believe I would need to set the VARBINARY length based on the data type and width of the column instead of just using a column name. I might also need to carefully select NULL substitute values. Thank you all very much for your efforts on this. The solutions are not anywhere near things I was considering and so I am learning things.
Sorry my solution didn't work for your real data. I suspect the issue is that you need to come up with the "just right" PARTITION BY/ORDER BY columns in the third ROW_NUMBER(). Clearly the data you posted suggested that the date would be the proper ORDER BY column; however, without data that is closer to your real-world example, I probably can't suggest anything better.
It seems you've got a working solution, so I won't belabor the point any more. It is always good to hear it when people say they've learned something here on the SSC forums.
May 13, 2015 at 8:13 pm
Phil Parkin (5/13/2015)
tom.w.brannon (5/13/2015)
Yes, intersect is not normal to me either. I actually changed it where exists and then used except. Still non-obvious logic for me so I am glad to have the model.This article was where I borrowed the technique from. I use it all the time when writing MERGE statements.
Notice (in the comments at the end of the article) the suggestion that NOT EXISTS (...INTERSECT) can produce cleaner execution plans than EXISTS (...EXCEPT); it is for this reason alone that I suggested doing things this way.
Phil - Thank you for posting the link to the "Undocumented Query Plans: Equality Comparisons" article. I gave it a quick read through but it's going to be one that I'll need to spend some real time with before it sinks all the way in.