Introduction
Often we come across situations where duplicate rows exist in a table, and a need arises to eliminate the duplicates. The Row_Number() Over(Partition By...Order by...) feature in Microsoft SQL Server 2005 and 2008 can be used efficiently for eliminating such duplicates. In SQL Server 2000, a program to eliminate duplicates used to be a bit long, involving self-joins, temporary tables, and identity columns. A confluence of derived tables and the Row_Number() Over() function in SQL Server 2005 and 2008 makes this elimination process simple. Let's look into it with an example. Please note that the code in this article is compatible with both SQL Server 2005 and 2008.
Building the scenario
Create a table like the one below:
create table Emp_Details ( Emp_Name varchar(10) , Company varchar(15) , Join_Date datetime , Resigned_Date datetime ) go
Insert some sample values into this table using the code below. Ensure that you insert at least one duplicate and one triplicate for an easier understanding of the help that the Row_Number() Over() feature offers us:
insert into Emp_Details (Emp_Name, Company, Join_Date, Resigned_Date) values ('John', 'Software', '20060101', '20061231') ,('John', 'Software', '20060101', '20061231') ,('John', 'Software', '20060101', '20061231') ,('John', 'SuperSoft', '20070101', '20071231') ,('John', 'UltraSoft', '20070201', '20080131') ,('John', 'ImproSoft', '20080201', '20081231') ,('John', 'ImproSoft', '20080201', '20081231') ,('Mary', 'Software', '20060101', '20081231') ,('Mary', 'SuperSoft', '20090101', '20090531') ,('Mary', 'SuperSoft', '20090101', '20090531') ,('Mary', 'UltraSoft', '20090601', '20100531') ,('Mary', 'UltraSoft', '20090601', '20100531')
Now, let's see the duplicates and triplicates using the old method where we count the records and use Group By. Please note that we are looking for an exact replica of a row in order ot call it a duplicate in this example:
select *, COUNT(*) from Emp_Details group by Emp_Name, Company, Join_Date, Resigned_Date
The results:
Emp_Name Company Join_Date Resigned_Date Occurrences
-------- ----------- ------------ --------------- ------------
John ImproSoft 2008-02-01 2008-12-31 2
John Software 2006-01-01 2006-12-31 3
John SuperSoft 2007-01-01 2007-12-31 1
John UltraSoft 2007-02-01 2008-01-31 1
Mary Software 2006-01-01 2008-12-31 1
Mary SuperSoft 2009-01-01 2009-05-31 2
Mary UltraSoft 2009-06-01 2010-05-31 2
Although the table has twelve rows, only seven are good ones as seen in the result set above. The remaining are duplicates or triplicates as we see from the Occurrences column above.
A better way of seeing the duplicates & triplicates is the query below where Row_Number() Over() has been used with the Partition By clause. The Row_Number() Over() function is looking for rows with the same values of Emp_Name, Company, Join_Date and Resigned_Date columns in the Emp_Details table. The first occurrence of this combination of columns is being allocated a RowNumber of 1. The subsequent occurrences of the same combination of data are being allocated RowNumber of 2, 3 and so on. When a new combination of Emp_Name, Company, Join_Date and Resigned_Date columns is encountered, that set is treated as a new partition and the RowNumber starts from 1 again thanks to the Partition By clause. In essence, the columns in the Partition By clause are being grouped together as per the Partition BY clause and then ordered using the Order By clause:
select Emp_Name ,Company ,Join_Date ,Resigned_Date ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date ,Resigned_Date order by Emp_Name, Company, Join_Date ,Resigned_Date) RowNumber from Emp_Details
The results:
Emp_Name Company Join_Date Resigned_Date RowNumber
-------- ------------ -------------- -------------- ----------
John ImproSoft 2008-02-01 2008-12-31 1
John ImproSoft 2008-02-01 2008-12-31 2
John Software 2006-01-01 2006-12-31 1
John Software 2006-01-01 2006-12-31 2
John Software 2006-01-01 2006-12-31 3
John SuperSoft 2007-01-01 2007-12-31 1
John UltraSoft 2007-02-01 2008-01-31 1
Mary Software 2006-01-01 2008-12-31 1
Mary SuperSoft 2009-01-01 2009-05-31 1
Mary SuperSoft 2009-01-01 2009-05-31 2
Mary UltraSoft 2009-06-01 2010-05-31 1
Mary UltraSoft 2009-06-01 2010-05-31 2
Wherever the column RowNumber is greater than 1 in the result set above, it is a duplicate row. For example, the second row in the result set above with RowNumber 2 is a duplicate of the first row with RowNumber 1. Similarly, the fifth row with RowNumber value of 3 and the fourth row with RowNumber value of 2 are triplicate and duplicate respectively of the third row with RowNumber value of 1.
So let us add a WHERE clause to the query above and execute it to get the actual duplicates and triplicates:
select a.Emp_Name, a.Company, a.Join_Date, a.Resigned_Date, a.RowNumber from (select Emp_Name ,Company ,Join_Date ,Resigned_Date ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date ,Resigned_Date order by Emp_Name, Company, Join_Date ,Resigned_Date) RowNumber from Emp_Details) a where a.RowNumber > 1
Here is the result set that shows the duplicates and triplicates:
Emp_Name Company Join_Date Resigned_Date RowNumber -------- ----------- ----------- ------------- --------- John ImproSoft 2008-02-01 2008-12-31 2 John Software 2006-01-01 2006-12-31 2 John Software 2006-01-01 2006-12-31 3 Mary SuperSoft 2009-01-01 2009-05-31 2 Mary UltraSoft 2009-06-01 2010-05-31 2
Removing Duplicates
Now lets remove the duplicates/triplicates in one query in an efficient way using Row_Number() Over() with the Partition By clause. Since we have identified the duplicates/triplicates as the rows where RowNumber is greater than 1 above, all we need to do is delete such records.
The TSQL code below has three parts. The first part is the derived table, a. It assigns a value to all the rows for the column RowNumber in the table Emp_Details. The first occurrence of a combination of Emp_Name, Company, Join_Date and Resigned_Date is allocated a RowNumber value of 1. Subsequent occurrences of the same combination of Emp_Name, Company, Join_Date and Resigned_Date are allocated subsequently higher numbers for the column RowNumber. This is ensured by the Order By clause in the Row_Number() Over() feature. The role of the Partition By clause is to ensure that the value in the column RowNumber starts afresh from 1 for each new combination of the columns, Emp_Name, Company, Join_Date and Resigned_Date.
Next comes the WHERE clause which identifies those records in the result set where the the RowNumber is greater than 1. These are the duplicate rows. The third part is the DELETE statement which deletes such records with a RowNumber value greater than 1. Let us execute this query to remove the duplications:
delete from a from (select Emp_Name, Company, Join_Date, Resigned_Date ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date ,Resigned_Date order by Emp_Name, Company, Join_Date ,Resigned_Date) RowNumber from Emp_Details) a where a.RowNumber > 1
Let's execute the code below to verify if the duplications have been removed:
select *, COUNT(*) Occurrences from Emp_Details group by Emp_Name, Company, Join_Date, Resigned_Date
Here is the result set:
Emp_Name Company Join_Date Resigned_Date Occurrences
-------- ----------- ------------- ------------- ------------
John ImproSoft 2008-02-01 2008-12-31 1
John Software 2006-01-01 2006-12-31 1
John SuperSoft 2007-01-01 2007-12-31 1
John UltraSoft 2007-02-01 2008-01-31 1
Mary Software 2006-01-01 2008-12-31 1
Mary SuperSoft 2009-01-01 2009-05-31 1
Mary UltraSoft 2009-06-01 2010-05-31 1
Please note that the duplicates/triplicates are gone.
Alternate way
There are several alternative methods to eliminate the duplicate records. I would like to explore the Row_Number() Over() function without using the Partition BY clause. This exercise is intended to bring out the efficiency of the Partition BY clause. A slightly more involved way of eliminating duplicates and triplicates would have been using Row_Number() Over() without the Partition By clause. Please remember to completely reload the table, Emp_Details before proceeding further using the TSQL code below:
truncate table Emp_Details insert into Emp_Details (Emp_Name, Company, Join_Date, Resigned_Date) values ('John', 'Software', '20060101', '20061231') ,('John', 'Software', '20060101', '20061231') ,('John', 'Software', '20060101', '20061231') ,('John', 'SuperSoft', '20070101', '20071231') ,('John', 'UltraSoft', '20070201', '20080131') ,('John', 'ImproSoft', '20080201', '20081231') ,('John', 'ImproSoft', '20080201', '20081231') ,('Mary', 'Software', '20060101', '20081231') ,('Mary', 'SuperSoft', '20090101', '20090531') ,('Mary', 'SuperSoft', '20090101', '20090531') ,('Mary', 'UltraSoft', '20090601', '20100531') ,('Mary', 'UltraSoft', '20090601', '20100531')
Before we proceed further, let me explain the Row_Number() Over() function when used without the Partition By clause as in the code below. It basically assigns a monotonically increasing value for the column RowNumber based on the ordering of the results by Emp_Name, Company, Join_Date and Resigned_Date. Unlike Partition BY, which we saw above, here the RowNumber column looks more like an identity column that does not get re-seeded with changing data in the columns because there is no Partition BY clause here. Please note that the column RowNumber is not an identity column.
select Emp_Name, Company, Join_Date, Resigned_Date ,ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details
Here is the result set:
Emp_Name Company Join_Date Resigned_Date RowNumber -------- ---------- ------------ --------------- --------- John ImproSoft 2008-02-01 2008-12-31 1 John ImproSoft 2008-02-01 2008-12-31 2 John Software 2006-01-01 2006-12-31 3 John Software 2006-01-01 2006-12-31 4 John Software 2006-01-01 2006-12-31 5 John SuperSoft 2007-01-01 2007-12-31 6 John UltraSoft 2007-02-01 2008-01-31 7 Mary Software 2006-01-01 2008-12-31 8 Mary SuperSoft 2009-01-01 2009-05-31 9 Mary SuperSoft 2009-01-01 2009-05-31 10 Mary UltraSoft 2009-06-01 2010-05-31 11 Mary UltraSoft 2009-06-01 2010-05-31 12
Now, lets use this Row_Number() Over() function without the Partition BY clause to search for duplications. The code below shows the extra rows in the table, Emp_Details, that are either duplicates or triplicates. Note that in this query, the Partition By clause has not been used within the Row_Number() Over() function. Here we have two derived tables, a and b. Both the derived tables have the same purpose i.e. they generate a column called RowNumber for a given combination of the columns Emp_Name, Company, Join_Date and Resigned_Date after ordering them by Emp_Name, Company, Join_Date and Resigned_Date using the Row_Number() Over(Order By..) feature.
The next step joins these two derived tables on the same columns Emp_Name, Company, Join_Date and Resigned_Date. Then, the WHERE clause checks for existence of a RowNumber in the first derived table that is greater than the one in the second derived table given that the values of Emp_Name, Company, Join_Date and Resigned_Date are same in both the tables. In other words it is checking for existence of duplications in the first derived table. And finally, the SELECT statement below displays such duplications. Let's execute the TSQL code below:
select a.* from (select Emp_Name, Company, Join_Date, Resigned_Date ,ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a inner join (select Emp_Name, Company, Join_Date, Resigned_Date ,ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) b on a.Emp_Name = b.Emp_Name and a.Company = b.Company and a.Join_Date = b.Join_Date and a.Resigned_Date = b.Resigned_Date where a.RowNumber = b.RowNumber + 1
Here is the result set:
Emp_Name Company Join_Date Resigned_Date RowNumber
-------- ----------- ------------ --------------- ---------
John ImproSoft 2008-02-01 2008-12-31 2
John Software 2006-01-01 2006-12-31 4
John Software 2006-01-01 2006-12-31 5
Mary SuperSoft 2009-01-01 2009-05-31 10
Mary UltraSoft 2009-06-01 2010-05-31 12
The result set above shows the records which are duplications of other records.
It may be noted that the code to check for the existence of duplicates without using the Partition BY clause is much longer when compared to the code using the Partition BY clause apart from being more expensive as well. Please see the attached execution plans for both the statements, one with the Partition BY clause (Query Cost: 29%) and one using only the Order By clause (Query Cost: 71%).
The result set above shows the records which are duplications of other records.
We can delete these duplicates/triplicates using the following code (without the Partition BY clause). In the statement below, we just replace the "SELECT *" with DELETE. Please execute the code below:
delete from a from (select Emp_Name, Company, Join_Date, Resigned_Date ,ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a inner join (select Emp_Name, Company, Join_Date, Resigned_Date ,ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) b on a.Emp_Name = b.Emp_Name and a.Company = b.Company and a.Join_Date = b.Join_Date and a.Resigned_Date = b.Resigned_Date where a.RowNumber = b.RowNumber + 1
Lets execute the code below to verify if the duplications have been removed:
select *, COUNT(*) Occurrences from Emp_Details group by Emp_Name, Company, Join_Date, Resigned_Date
Here is the result set:
Emp_Name Company Join_Date Resigned_Date Occurrences
-------- ------------ ----------- -------------- -----------
John ImproSoft 2008-02-01 2008-12-31 1
John Software 2006-01-01 2006-12-31 1
John SuperSoft 2007-01-01 2007-12-31 1
John UltraSoft 2007-02-01 2008-01-31 1
Mary Software 2006-01-01 2008-12-31 1
Mary SuperSoft 2009-01-01 2009-05-31 1
Mary UltraSoft 2009-06-01 2010-05-31 1
You may note that the duplicates/triplicates are gone.
Conclusion
In this particular scenario where duplications had to be removed, we used the Row_Number() Over() feature in SQL Server 2005/2008. In the first method, we used it with the Partition BY clause. In the second, we used it without the Partition By clause. Although our purpose of removing duplications was served by both the methods, usage of the Partition BY clause in the Row_Number () OVER() feature comes out as a more efficient method in the following respects:
- Need to scan the Emp_Details table only once.
- Need to have only one derived table instead of two.
- No need to join two derived tables.
- Consequently less CPU & Memory utilization.
Solving the problem of removing duplicate rows in Microsoft SQL Server 2000 and earlier is a reasonably lengthy code involving usage of a temporary table, an explicirtly created identity column in it and a join with itself. However, in this article I wanted to stress the contribution of the Partition By clause within the Row_Number() Over() function in helping keep the code shorter , faster and less demanding on resources while eliminating unwanted duplicate rows.