July 26, 2010 at 8:57 pm
Comments posted to this topic are about the item Using Ranking Functions to Deduplicate Data
July 26, 2010 at 11:19 pm
Nice article, thanks for sharing.
The same thing can be achieved with just the row_number function as well.
select AlphaKey
from (
select row_number() over(partition by AlphaKey order by AlphaKey) rownum,
AlphaKey
from @AlphaList
) al
where rownum = 1
order by 1;
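For anyone who wants to try this outside SQL Server, here is a small sketch of the same ROW_NUMBER technique run through Python's sqlite3 module (SQLite 3.25+ supports the same window functions); the AlphaKey values are made up to stand in for @AlphaList:

```python
import sqlite3

# Assumed sample data standing in for @AlphaList in the post above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AlphaList (AlphaKey TEXT)")
conn.executemany("INSERT INTO AlphaList VALUES (?)",
                 [("A",), ("A",), ("B",), ("C",), ("C",), ("C",)])

# ROW_NUMBER restarts at 1 in each partition, so keeping rownum = 1
# keeps exactly one row per distinct AlphaKey.
rows = conn.execute("""
    SELECT AlphaKey
    FROM (
        SELECT ROW_NUMBER() OVER (PARTITION BY AlphaKey ORDER BY AlphaKey) AS rownum,
               AlphaKey
        FROM AlphaList
    ) AS al
    WHERE rownum = 1
    ORDER BY 1
""").fetchall()

print([r[0] for r in rows])  # → ['A', 'B', 'C']
```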
July 27, 2010 at 12:50 am
Yep, I too use this query to remove duplicates.
I was wondering which query is faster: this one or the one mentioned in the article?
scottm30 (7/26/2010)
Nice article, thanks for sharing. The same thing can be achieved with just the row_number function as well.
select AlphaKey
from (
select row_number() over(partition by AlphaKey order by AlphaKey) rownum,
AlphaKey
from @AlphaList
) al
where rownum = 1
order by 1;
July 27, 2010 at 1:35 am
With the A, B, C, D, E example, the SELECT DISTINCT syntax would be better, IMHO.
July 27, 2010 at 1:55 am
How about good ol' grouping to remove duplicates and supply any of the minimum, maximum, sum, count, average, etc.?
select ItemNumber, min(UnitCost) as MinCost from MissingItems
group by ItemNumber;
ItemNumber  MinCost
777         10.10
888         13.13
999         16.16
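As a quick sketch of the grouping approach, the same numbers can be reproduced with Python's sqlite3 module; the MissingItems rows below are reconstructed from the result table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MissingItems (ItemNumber INTEGER, UnitCost REAL)")
conn.executemany("INSERT INTO MissingItems VALUES (?, ?)",
                 [(777, 10.10), (777, 11.11), (777, 12.12),
                  (888, 13.13), (888, 14.14), (888, 15.15),
                  (999, 16.16), (999, 17.17), (999, 18.18)])

# GROUP BY collapses the duplicates; MIN picks the lowest cost per item.
rows = conn.execute("""
    SELECT ItemNumber, MIN(UnitCost) AS MinCost
    FROM MissingItems
    GROUP BY ItemNumber
    ORDER BY ItemNumber
""").fetchall()

print(rows)  # → [(777, 10.1), (888, 13.13), (999, 16.16)]
```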
July 27, 2010 at 2:31 am
I was interested to see what the difference in performance between these queries would be, as I would always have used Row_Number with a partition before.
On the Northwind database (SQL 2005) I ran the following two queries:
select
a.ProductID
,a.UnitPrice
from
(select
ProductID
,UnitPrice
,Rank() over (order by ProductID) as Ranking
,Row_Number() over (order by ProductID,UnitPrice) as RowNumber
from
[Order Details]
) as a
where
a.Ranking=a.RowNumber
order by
a.ProductID
select
a.ProductID
,a.UnitPrice
from
(select
ProductID
,UnitPrice
,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
from
[Order Details]
) as a
where
a.RowNumber=1
order by
a.ProductID
In the Actual Execution Plan, the first query showed up as 64% of the total cost and the second as 36%.
Far from exhaustive, but I'll probably stick to Row_Number with Partition by.
In (partial) answer to Phil Wood: if all I wanted was the lowest price for each product, I would use grouping. I'd probably use this method if I wanted to find the customer or order date as well, changing the code to:
select
a.ProductID
,a.UnitPrice
,b.CustomerID
,b.OrderDate
from
(select
ProductID
,UnitPrice
,OrderID
,Row_Number() over (partition by ProductID order by UnitPrice) as RowNumber
from
[Order Details]
) as a
inner join
Orders as b
on
a.OrderID=b.OrderID
where
a.RowNumber=1
order by
a.ProductID
July 27, 2010 at 2:54 am
I agree. The simplest way to deduplicate is using grouping and aggregate functions as needed.
regards
Γron
July 27, 2010 at 4:05 am
I had to remove UnitPrice from the ORDER BY to get deduplicated results.
declare @TempTable table (itemnumber int, unitprice decimal(18, 2));
insert into @TempTable values
(777, 10.10),
(777, 11.11),
(777, 12.12),
(888, 13.13),
(888, 14.14),
(888, 15.15),
(999, 16.16),
(999, 17.17),
(999, 18.18);
with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as
(
select
RANK() over (order by ItemNumber) as Rank,
ROW_NUMBER() over (order by ItemNumber) as RowNumber,
itemnumber,
unitprice
from @TempTable
) select * from GetMissingItems
where Rank = RowNumber;
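The reason UnitPrice has to come out of the ORDER BY: with UnitPrice included, RANK sees no ties, so RANK equals ROW_NUMBER on every row and nothing is filtered. With ItemNumber alone, RANK produces ties (1,1,1,4,4,4,7,7,7) while ROW_NUMBER stays unique (1..9), so they match only on the first row of each group. The sketch below replays the CTE through Python's sqlite3 module (window functions need SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TempTable (ItemNumber INTEGER, UnitPrice REAL)")
conn.executemany("INSERT INTO TempTable VALUES (?, ?)",
                 [(777, 10.10), (777, 11.11), (777, 12.12),
                  (888, 13.13), (888, 14.14), (888, 15.15),
                  (999, 16.16), (999, 17.17), (999, 18.18)])

# RANK ties within each ItemNumber; ROW_NUMBER does not, so
# RANK = ROW_NUMBER holds only for the first row per ItemNumber.
rows = conn.execute("""
    SELECT ItemNumber, UnitPrice
    FROM (
        SELECT RANK()       OVER (ORDER BY ItemNumber) AS rnk,
               ROW_NUMBER() OVER (ORDER BY ItemNumber) AS rn,
               ItemNumber, UnitPrice
        FROM TempTable
    )
    WHERE rnk = rn
""").fetchall()

print(sorted(r[0] for r in rows))  # → [777, 888, 999]
```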
Bhavesh
.NET and SQL Server Blog
July 27, 2010 at 4:32 am
In the example you provided you used one column. Say I have 4 columns and I want to deduplicate based on those. In other words I want to rank against those columns -- how would the SQL statements change?
July 27, 2010 at 4:46 am
messineo,
If you have multiple columns that you want to deduplicate on, you will need to include all of them in whichever method you use.
Here is some sample code using each method that should work (all based on the Northwind database). Note that in all these cases the data will be de-duplicated, but which row is selected from each set of duplicates cannot be predicted (so if you want the first order from each set, you will need different code).
select
a.ProductID
,a.UnitPrice
,a.Quantity
,a.Discount
from
(select
ProductID
,UnitPrice
,Quantity
,Discount
,Rank() over (order by ProductID,UnitPrice,Quantity,Discount) as Ranking
,Row_Number() over (order by ProductID,UnitPrice,Quantity,Discount) as RowNumber
from
[Order Details]
) as a
where
a.Ranking=a.RowNumber
select
a.ProductID
,a.UnitPrice
,a.Quantity
,a.Discount
from
(select
ProductID
,UnitPrice
,Quantity
,Discount
,Row_Number() over (partition by ProductID,UnitPrice,Quantity,Discount order by ProductID) as RowNumber
from
[Order Details]
) as a
where
a.RowNumber=1
select
ProductID
,UnitPrice
,Quantity
,Discount
from
[Order Details]
group by
ProductID
,UnitPrice
,Quantity
,Discount
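To illustrate the multi-column case concretely, here is a sketch using Python's sqlite3 module with a made-up OrderDetails table standing in for Northwind's [Order Details]; it checks that the partitioned ROW_NUMBER method and plain GROUP BY keep the same distinct rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE OrderDetails
                (ProductID INTEGER, UnitPrice REAL, Quantity INTEGER, Discount REAL)""")
# Made-up rows containing exact duplicates across all four columns.
conn.executemany("INSERT INTO OrderDetails VALUES (?, ?, ?, ?)",
                 [(1, 9.99, 2, 0.0), (1, 9.99, 2, 0.0),
                  (1, 9.99, 3, 0.0), (2, 4.50, 1, 0.1),
                  (2, 4.50, 1, 0.1), (2, 4.50, 1, 0.1)])

# Method 1: GROUP BY all four columns.
grouped = conn.execute("""
    SELECT ProductID, UnitPrice, Quantity, Discount
    FROM OrderDetails
    GROUP BY ProductID, UnitPrice, Quantity, Discount
""").fetchall()

# Method 2: ROW_NUMBER partitioned by all four columns, keeping row 1.
partitioned = conn.execute("""
    SELECT ProductID, UnitPrice, Quantity, Discount
    FROM (
        SELECT *, ROW_NUMBER() OVER (
                 PARTITION BY ProductID, UnitPrice, Quantity, Discount
                 ORDER BY ProductID) AS rn
        FROM OrderDetails
    )
    WHERE rn = 1
""").fetchall()

print(len(grouped), len(partitioned))  # → 3 3
```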
July 27, 2010 at 6:11 am
I also generally use a partitioned row_number(), ordering on some selection criteria (highest number of x with provider y, etc).
Is the output of Rank defined by ANSI to always support this method in all implementations? My guess is that it's only guaranteed to be in order, but that the density of the default invocation could vary.
Still, this is clever and hadn't occurred to me before, so thanks for expanding my box.
July 27, 2010 at 6:42 am
Nice article. Gives me another way to de-dupe data. :-)
I use almost the same technique, but I have a time stamp associated with each record, and this time stamp differs by milliseconds, so I just pick the one with the latest time stamp. I should point out that when you use ranking functions on a table with a really large number of records (in my case it was sometimes close to 90 million), query performance can degrade drastically (I didn't even have an index to begin with).
July 27, 2010 at 7:14 am
I personally have used the ranking functions quite a bit lately, not just to remove duplicates but to qualify what is removed. In one case, I needed to rank one field based on the number of characters in a column and look at another column that could contain one of two values - "Reject" or "Reschedule" - where Reject weighed more than Reschedule. The rankings of both columns were added together to produce the total ranking, and then all of the rows that were not the highest rank [based on a single column] were removed.
July 27, 2010 at 7:44 am
The approach is/was unique and creative, and for that, it was an excellent article. But considering there are at least two other well-known approaches to this problem, I feel the article was missing a performance comparison, not only against the other approaches but also comparing all approaches on multiple columns. Perhaps it was an excellent article precisely because it left me, the reader, wanting lots more :-)
July 27, 2010 at 9:06 am
I find the RANK function to be much more useful in situations where I might not be able to group by all fields in the select statement. Another way I have used it is to find not the latest record, but the SECOND latest record, by choosing WHERE RANK = 2.
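A sketch of that "second latest" trick through Python's sqlite3 module, using a hypothetical Readings table (the names and data are made up); ranking descending by timestamp within each partition makes rank 2 the second-latest row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Readings (SensorID INTEGER, ReadAt TEXT, Value REAL)")
conn.executemany("INSERT INTO Readings VALUES (?, ?, ?)",
                 [(1, "2010-07-25", 1.0), (1, "2010-07-26", 2.0),
                  (1, "2010-07-27", 3.0),
                  (2, "2010-07-26", 5.0), (2, "2010-07-27", 6.0)])

# Rank 1 = latest reading per sensor, rank 2 = second latest.
rows = conn.execute("""
    SELECT SensorID, ReadAt, Value
    FROM (
        SELECT *, RANK() OVER (PARTITION BY SensorID
                               ORDER BY ReadAt DESC) AS rnk
        FROM Readings
    )
    WHERE rnk = 2
    ORDER BY SensorID
""").fetchall()

print(rows)  # → [(1, '2010-07-26', 2.0), (2, '2010-07-26', 5.0)]
```

Note that with tied timestamps RANK can skip values, so depending on the data ROW_NUMBER or DENSE_RANK may be a safer way to pick "the second one".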
-------------------------------------------------------------------------------------------------
My SQL Server Blog
Viewing 15 posts - 1 through 15 (of 24 total)