Overview
This article presents an approach to showing and deleting duplicate records that is concise, generic and efficient. It uses some features that were introduced in SQL Server 2005.
History and Background
Dealing with duplicate records in a table is a subject that has been covered many times in SQLServerCentral and elsewhere. Four good articles are listed below.
- Removing Duplicate Rows By Neil Boyle, 2001/05/14. This covers a number of different methods for removing duplicates with an examples of each method.
- Finding and Deleting Duplicate Data By Chris Cubley, 2003/07/25. This is a great article that covers identifying and efficiently removing duplicate data using an example from a Payments table.
- Remove Duplicate Records By Syed Iqbal, 2008/05/26. This article presents a novel method for removing duplicates that involves a WHILE loop that deletes one record at a time.
- Handle Duplicate Records By Erik Andersen, 2009/01/07 presents a nested cursor based solution that gives the DBA a chance to insert logic to combine records or audit deletions.
Why did I want to re-visit this subject again when it has been covered so many times?
There are two features that were introduced with SQL Server 2005 that can be applied to the duplicates problem and allows an efficient piece of code that does not need much modification to deal with each different real table.
- The row_number() function can allow a flexible identification of different records within a group.
- SQLCMD mode with SETVAR allows table names and column lists to be replaced in scripts without the use of dynamic SQL
The row_number() function allows records in a duplicate group to be distinguished in a generic fashion that does not rely on cursors or any particular distinguishing column. This is illustrated in the examples below.
The use of SETVAR in SQLCMD mode allows SQL statements to be written generically, with variable table and column names but without using dynamic SQL. This means the script can be used without needing to re-write parts to handle different situations. Using the script is just a matter of changing the SETVAR statements. The script is easier to read than dynamic SQL, especially in Management Studio because the query editor still color-codes the keywords.
Note that SETVAR cannot replace dynamic SQL in all situations because SETVAR can only accept a constant value, not a variable or expression on the right hand side,
Let's get to the script without more ado.
The Script Sections
The following script does four things:
- Set the SQLCMD variables using SETVAR.
- Counts the duplicates in $(TableName) based on $(UniqueColumnList)
- Displays a sample of the duplicates.
The sampling is controlled by $(SampleSize) and $(MaxRowsPerGroup). - Deletes the duplicates.
Generally I leave this commented out until I'm sure that I want to delete.
Section 1: The SETVARs
:SETVAR TableName Person.Contact -- Name of table to de-duplicate |
The SETVAR parameters above are designed to be used in the AdventureWorks database to find cases where the Person.Contact table has records with duplicate combinations of the FirstName and Phone columns. When displaying duplicates a sample of 20 duplicates is displayed with only 2 records in each duplicate group.
TableName is the name of the table containing duplicates. It can be any table name even using 3 or 4 part names. Square brackets are needed if the table name contains special characters.
The UniqueColumnList is the list is the list of columns whose values should be unique .Note that when using SETVAR, any values containing spaces must be surrounded by double quotes.
The JoinCondition is the UniqueColumnList translated into the form required to use in a self join between two aliases, T1 and T2. This is used when displaying a sample of duplicates.It is possible to derive the JoinCondition from the UniqueColumnList, but that would mean using dynamic SQL so it makes it harder to read and present.
The SampleSize and MaxRowsPerGroup parameters control how many duplicates are displayed - (in case you get thousands).
When you use SETVAR in Management Studio you need to use SQLCMD mode. In MS 2005 it is a toolbar button. In MS 2008 it is under the Query menu. If you don't do this you will get error messages because the :SETVAR and $(varname) syntax is not understood.
Section 2: Count the Duplicates
SET NOCOUNT ON; PRINT 'Count / show / delete duplicates records from $(TableName) based on ($(UniqueColumnList))'; -- 1. Count the duplicated records -- This is the number of records that will be deleted -- the same values for $(UniqueColumnList) this counts four |
Section 3: Show a Sample of the Duplicates
-- 3. Show a sample of the duplicated records WITH DupCounts AS ( SELECT _RowNum_ = row_number() OVER ( PARTITION BY $(UniqueColumnList) ORDER BY $(UniqueColumnList) ),* FROM $(TableName) ) SELECT TOP ($(SampleSize)) T1.* FROM DupCounts T1 WHERE T1._RowNum_ <= $(MaxRowsPerGroup) AND EXISTS (SELECT * FROM $(TableName) T2 WHERE $(JoinCondition) GROUP BY $(UniqueColumnList) HAVING COUNT(*) >1) ORDER BY $(UniqueColumnList), T1._RowNum_; |
Section 4: Delete the Duplicates
-- Delete duplicates if you need to: -- I leave this commented to avoid disasters WITH DupCounts AS ( SELECT _RowNum_ = row_number() OVER ( PARTITION BY $(UniqueColumnList) ORDER BY $(UniqueColumnList) ) FROM $(TableName) ) DELETE FROM DupCounts WHERE DupCounts._RowNum_ > 1; SELECT RowsDeleted = @@rowcount; |
Note that the delete statement as presented does not control which of the duplicate records in a group are deleted. It will delete all but the first record in each group (WHERE DupCounts._RowNum_ > 1). However, the row_number function is really returning a random ordering, because the ORDER BY columns are the same as the PARTITION BY columns. You would need to use another column in the ORDER BY clause if you needed to specify which records from the duplicate list to keep. For example, to ensure all but the lowest ContactID in each group is deleted, use this:
WITH DupCounts AS
( SELECT _RowNum_ = row_number()
OVER (
PARTITION BY $(UniqueColumnList)
ORDER BY ContactID )
FROM $(TableName)
)
DELETE FROM DupCounts WHERE DupCounts._RowNum_ > 1;