December 20, 2005 at 7:37 am
/* Description: UTILITY - Locate in MASTER
   Syntax:      EXEC sp_RemoveDups TableName, DupQualifierFieldNameList, DeleteDups, UniqueColName, CreateIdentColIfNeeded, StoredProcedureResult
                Only the first two arguments are required.
                For HELP, enter the stored procedure's name without any arguments, or see the PRINT statements below.
                NO DELETION WILL OCCUR by default - only the duplicate recordset is returned. To delete records, pass in a 0 for the DeleteDups argument.
   Example:     EXEC sp_RemoveDups 'MyTable','LastName,FirstName,HomePhone'
   Purpose:     Allow removal of duplicate rows where
                1. We define which fields qualify the duplicate
                2. We select the unique row ID, or it is detected automatically; else no action takes place
   Method:      Delete by RowID all duplicate rows except the highest RowID (in alpha-sort sequence) of each group of duplicates.

   DATE        BY                        CHANGE
   09-23-2002  Frank                     Original v1.0
   09-23-2002  Frank                     Changed the name from sp_RemoveDupsByRowID to sp_RemoveDups
   10-08-2002  Sean P. O. MacCath-Moran  Made @UniqueColName optional
                                         Added logic to auto-detect RowGUID and Identity columns
                                         Added logic to check for a unique-value column if no RowGUID or Identity column exists
                                         Added logic to create a temporary ID field as a last resort if no unique column could be located
                                         Added HELP
*/
CREATE PROCEDURE sp_RemoveDups
    @TableName                 as varchar(50)  = NULL,
    @DupQualifierFieldNameList as varchar(200) = NULL,
    @DeleteDups                as bit          = NULL,
    @UniqueColName             as varchar(50)  = NULL,
    @CreateIdentColIfNeeded    as bit          = NULL,
    @StoredProcedureResult     int             = NULL OUTPUT
AS
SET NOCOUNT ON

DECLARE @SQL nvarchar(2000)
DECLARE @SQL_DetermineUniqueTemplate nvarchar(2000)
DECLARE @TempIdentColName varchar(20)
DECLARE @HostName varchar(50)
DECLARE @ActionText varchar(10)
DECLARE @SUM int
DECLARE @COUNT int
DECLARE @NextColumn varchar(50)

/*==================================================================================*/
/*========================VARIABLE INITIALIZATION AND SETUP========================*/
/*==================================================================================*/

/* If no unique column is located then a temporary identity column will be created
   using the name specified in this TempIdentColName string */
SET @TempIdentColName = 'TempIdentityColXY123'

SET @SQL_DetermineUniqueTemplate = 'SELECT @COUNT = COUNT(Count), @SUM = SUM(Count) FROM '
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + '(SELECT TOP 100 PERCENT <ColName>, COUNT(*) as Count FROM ' + @TableName
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + ' GROUP BY <ColName> ORDER BY <ColName>) a'

/* Retrieve the host name. This will be used later in this SP as a test to determine
   if the user is making this call from within SQL Query Analyzer */
SELECT @HostName = hostname
FROM master..sysprocesses
WHERE spid = @@SPID AND program_name = 'SQL Query Analyzer'

/* Set ActionText to be used in message output */
IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
    SET @ActionText = 'Selection'
ELSE
    SET @ActionText = 'Deletion'

/* If a value is specified for UniqueColName it cannot exist in the columns from the DupQualifierFieldNameList */
IF CHARINDEX(@UniqueColName, @DupQualifierFieldNameList) > 0
BEGIN
    /* The value in UniqueColName was detected in DupQualifierFieldNameList. */
    IF NOT (@HostName IS NULL)
        PRINT 'The UniqueColName provided (' + @UniqueColName + ') must not exist in DupQualifierFieldNameList (' + @DupQualifierFieldNameList + '). Other solutions will be sought automatically.'
    SET @UniqueColName = NULL
END

/* If UniqueColName is provided then perform a check to ensure that all the values in that column are, in fact, unique. */
IF NOT (@UniqueColName IS NULL)
BEGIN
    SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate, '<ColName>', @UniqueColName)

    /* Perform a check of this column to determine if all of its values are unique */
    EXEC sp_executesql @SQL, N'@SUM as int OUTPUT, @COUNT as int OUTPUT', @SUM OUTPUT, @COUNT OUTPUT

    /* Test to determine if this column contains unique values */
    IF @SUM <> @COUNT
    BEGIN
        /* The column specified by UniqueColName does not contain unique values. */
        IF NOT (@HostName IS NULL)
            PRINT 'The UniqueColName provided (' + @UniqueColName + ') does not contain unique values. Other solutions will be sought automatically.'
        SET @UniqueColName = NULL
    END
END

/*==============================================================*/
/*=======================HELP OUTPUT TEXT======================*/
/*==============================================================*/

IF (@TableName IS NULL) OR (@TableName = '/?') OR (@TableName = '?')
   OR (@DupQualifierFieldNameList IS NULL) OR (@DupQualifierFieldNameList = '/?') OR (@DupQualifierFieldNameList = '?')
BEGIN
    IF NOT (@HostName IS NULL)
    BEGIN
        PRINT 'sp_RemoveDups ''TableName'', ''DupQualifierFieldNameList'', [''DeleteDups''], [''UniqueColName''], [''CreateIdentColIfNeeded''], [StoredProcedureResult OUTPUT]'
        PRINT '====================================================================================================================================================='
        PRINT 'TableName:                 Required - String - Name of the table to detect duplicate records in.'
        PRINT 'DupQualifierFieldNameList: Required - String - Comma-separated list of columns that make up the unique record within TableName.'
        PRINT 'DeleteDups:                Optional - Bit - Set to 0 to delete duplicate records. A value of NULL or 1 will return the duplicate records to be deleted.'
        PRINT 'UniqueColName:             Optional - String - A table must have a unique column value in it to perform the deletion logic. If no UniqueColName is provided then an attempt will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the Identity column. If that fails then all of the columns of the table will be examined and the first one with all unique values will be selected.'
        PRINT 'CreateIdentColIfNeeded:    Optional - Bit - By default this SP will create an identity column if no unique column can be located. Pass in a 1 here to turn this feature off.'
        PRINT 'StoredProcedureResult:     Optional - OUTPUT - Int - Returns a 3 if an error occurred, otherwise returns a 0.'
    END
    SET @StoredProcedureResult = 3
    RETURN
END

/*========================================================================*/
/*======================DETECT USABLE UniqueColName======================*/
/*========================================================================*/

IF @UniqueColName IS NULL
BEGIN
    /* Check for a RowGUID or Identity column in this table.
       If one exists, then utilize it as the unique value for the purposes of this deletion */
    IF EXISTS(SELECT * FROM syscolumns WHERE id = OBJECT_ID(@TableName) AND COLUMNPROPERTY(id, name, 'IsRowGUIDCol') = 1)
        SET @UniqueColName = 'RowGUIDCol'
    IF EXISTS(SELECT * FROM syscolumns WHERE id = OBJECT_ID(@TableName) AND COLUMNPROPERTY(id, name, 'IsIdentity') = 1)
        SET @UniqueColName = 'IdentityCol'

    /* If no RowGUID or Identity column was found then check all of the columns in this table
       to see if one of them can be utilized as a unique value column */
    IF @UniqueColName IS NULL
    BEGIN
        /* Select all of the columns from the table in question... */
        DECLARE MyCursor CURSOR LOCAL SCROLL STATIC FOR
            SELECT name FROM syscolumns WHERE OBJECT_ID(@TableName) = id

        OPEN MyCursor
        FETCH NEXT FROM MyCursor INTO @NextColumn
        WHILE @@FETCH_STATUS = 0
        BEGIN
            /* Create the SQL string with the correct column name in place. */
            SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate, '<ColName>', @NextColumn)

            /* Perform a check of this column to determine if all of its values are unique */
            EXEC sp_executesql @SQL, N'@SUM as int OUTPUT, @COUNT as int OUTPUT', @SUM OUTPUT, @COUNT OUTPUT

            /* Test to determine if this column contains unique values */
            IF @SUM = @COUNT
            BEGIN
                /* A unique-values column was detected. Use it and break out of the loop
                   UNLESS the column is specified in DupQualifierFieldNameList */
                IF CHARINDEX(@NextColumn, @DupQualifierFieldNameList) = 0
                BEGIN
                    /* NextColumn was NOT detected in DupQualifierFieldNameList, so this is the column we will use. */
                    SET @UniqueColName = @NextColumn
                    BREAK
                END
            END
            FETCH NEXT FROM MyCursor INTO @NextColumn
        END
        CLOSE MyCursor
        DEALLOCATE MyCursor
    END
END

/* If no UniqueColName has been found then create one UNLESS @CreateIdentColIfNeeded = 1 */
IF (@UniqueColName IS NULL) AND ((@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0))
BEGIN
    /* Add a sequence column to the table... */
    IF NOT (@HostName IS NULL)
        PRINT 'Creating temporary identity column in the ' + @TableName + ' table named ' + @TempIdentColName + ' for use in this ' + LOWER(@ActionText) + ' process...'
    EXEC('ALTER TABLE ' + @TableName + ' ADD ' + @TempIdentColName + ' [int] IDENTITY (1, 1)')
    SET @UniqueColName = @TempIdentColName
END

/*===========================================================================*/
/*======================EXECUTE DELETION OR SELECTION=======================*/
/*===========================================================================*/

IF @UniqueColName IS NULL
BEGIN
    /* No UniqueColName was provided by the user and none was detected by the script.
       This deletion algorithm cannot run. */
    IF NOT (@HostName IS NULL)
        PRINT 'Could not perform ' + LOWER(@ActionText) + ' process. No unique columns were located and the CreateIdentColIfNeeded flag is set to 1 (False).'
    SET @StoredProcedureResult = 3
    RETURN
END
ELSE
BEGIN
    IF NOT (@HostName IS NULL)
        PRINT 'Performing ' + LOWER(@ActionText) + ' utilizing the unique values in the ' + @UniqueColName + ' column as a reference...'

    /* Create and execute an SQL statement in the form of:
       SELECT * (or DELETE) FROM TableName WHERE UniqueColName IN
           (
           SELECT UniqueColName FROM TableName WHERE UniqueColName NOT IN
               (
               SELECT MAX(CAST(UniqueColName AS varchar(36))) FROM TableName
               GROUP BY DupQualifierFieldNameList, DupQualifierFieldNameList, etc.
               )
           )
    */
    /* Select or delete all duplicate records using @UniqueColName as a unique ID column */
    IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
        SET @SQL = 'SELECT * '
    ELSE
        SET @SQL = 'DELETE '
    SET @SQL = @SQL + 'FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' IN '
    SET @SQL = @SQL + CHAR(13) + CHAR(9) + '(' + CHAR(13) + CHAR(9)
    SET @SQL = @SQL + 'SELECT ' + @UniqueColName + ' FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' NOT IN '
    SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + '(' + CHAR(13) + CHAR(9) + CHAR(9)
    SET @SQL = @SQL + 'SELECT MAX(CAST(' + @UniqueColName + ' AS varchar(36))) FROM '
    SET @SQL = @SQL + @TableName + ' GROUP BY ' + @DupQualifierFieldNameList
    SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + ')' + CHAR(13) + CHAR(9) + ')'

    EXEC (@SQL)

    IF @@ERROR <> 0
    BEGIN
        IF NOT (@HostName IS NULL)
            PRINT @ActionText + ' process failed.'
        SET @StoredProcedureResult = 3
    END
    ELSE
    BEGIN
        IF NOT (@HostName IS NULL)
            PRINT @ActionText + ' completed successfully with this SQL: ' + CHAR(13) + @SQL
        SET @StoredProcedureResult = 0
    END
END

IF (@UniqueColName = @TempIdentColName) AND ((@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0))
BEGIN
    /* Remove the sequence column from the table... */
    IF NOT (@HostName IS NULL)
        PRINT 'Removing temporary identity column named ' + @TempIdentColName + ' from the ' + @TableName + ' table...'
    EXEC('ALTER TABLE ' + @TableName + ' DROP COLUMN ' + @TempIdentColName)
END
GO

/* USAGE
   sp_RemoveDups 'TableName', 'DupQualifierFieldNameList', ['DeleteDups'], ['UniqueColName'], ['CreateIdentColIfNeeded'], [StoredProcedureResult OUTPUT]
   ================================================================================================================================================
   TableName:                 Required - String - Name of the table to detect duplicate records in.
   DupQualifierFieldNameList: Required - String - Comma-separated list of columns that make up the unique record within TableName.
   DeleteDups:                Optional - Bit - Set to 0 to delete duplicate records. A value of NULL or 1 will return the duplicate records to be deleted.
   UniqueColName:             Optional - String - A table must have a unique column value in it to perform the deletion logic. If no UniqueColName is provided then an attempt will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the Identity column. If that fails then all of the columns of the table will be examined and the first one with all unique values will be selected.
   CreateIdentColIfNeeded:    Optional - Bit - By default this SP will create an identity column if no unique column can be located. Pass in a 1 here to turn this feature off.
   StoredProcedureResult:     Optional - OUTPUT - Int - Returns a 3 if an error occurred, otherwise returns a 0.

   Simple Syntax:
   USE [databaseName]
   sp_RemoveDups '[tableName]', '[uniqueColName]', '[Optional bit flag - set to zero to remove dups, omit to report only]'
*/
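A minimal usage sketch of the procedure's documented interface (MyTable and the column list are the example values from the header comment; @rc is just a hypothetical local variable):

DECLARE @rc int

-- Report only: the duplicate recordset is returned but nothing is deleted (the default behaviour).
EXEC sp_RemoveDups 'MyTable', 'LastName,FirstName,HomePhone'

-- Actually delete the duplicates (DeleteDups = 0) and capture the result code (0 = success, 3 = error).
EXEC sp_RemoveDups 'MyTable', 'LastName,FirstName,HomePhone', 0, NULL, NULL, @rc OUTPUT

IF @rc = 0
    PRINT 'Duplicates removed.'
ELSE
    PRINT 'Removal failed.'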
Artificial Intelligence stands no chance against Natural Stupidity.
December 20, 2005 at 11:36 pm
Appreciate your alternative method to delete duplicate records.
March 10, 2006 at 3:18 am
My two cents about database design and avoiding duplicates: if you have unstructured, flat data (say 20 years' worth of monitoring system data in various Excel, text and other forms of electronic storage, about 27 million lines, where the only identifier might be four out of five columns: timestamp, status1, status2, measurandID), then the only import option you really have is DTS... And because you have to import the data as fast as possible, you might want to avoid having any kind of indexing in place, let alone constraint checks. So you end up with about 10% duplicates or even more, but you would not care too much, except that your database blows up to 16 GB once you finish indexing it in order to be able to use the data at all.
March 10, 2006 at 4:57 am
Excellent script and a very good explanation of it!
Thanks!
March 10, 2006 at 7:10 am
I totally agree with avoiding cursors; they are fun only if you have a table with a limited number of rows. Try running a cursor over millions of records and you will be waiting forever for it to complete, and the resources dedicated to running the process could be put to much better use elsewhere.
Jules Bui
IT Operations DBA
Backup and Restore Administrator
March 10, 2006 at 7:40 am
The script is interesting, but I agree with the other people who have posted here that a non-cursor method is best. In fact, I believe that cursors should be hidden in BOL in an advanced section with huge warnings around it that say things like "USE AT YOUR OWN RISK" or something like that. We definitely don't want newbies being shown solutions that use cursors, which appears to advocate their use. I have done work like this for years with billions of records and found that the following is the fastest method to remove the dupes, if there is a substantial percentage of dupes.
1. Move the unique rows into another table (either created ahead of time, or using select...into); however the new table should have no indexes yet.
2. Drop the original table.
3. Rename the new table to the same name as the old and create any needed indexes, etc.
Moving the unique rows into a new table is substantially faster than doing seek operations on the original table (which is what the engine does internally) to delete rows, unless there is a very low percentage of duplicates (say a few percent; this can vary slightly). Also, having no indexes makes the inserts go much faster and eliminates the index fragmentation that would otherwise occur and have to be cleaned up later. Next, dropping the original table and the metadata-only action of renaming the new one is much faster than copying the records from a temp table or table variable back into another table. Finally, creating any needed indexes at the end (make sure you do the clustered index first) means they start out in a non-fragmented state.
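For what it's worth, here is a minimal sketch of that copy / drop / rename sequence, using the article's example employee(id, name, salary) table; the new table and index names are made up:

-- 1. Copy only the unique rows into a new, index-free table.
SELECT DISTINCT id, name, salary
INTO employee_dedup
FROM employee

-- 2. Drop the original table.
DROP TABLE employee

-- 3. Rename the new table, then rebuild indexes (clustered index first) so they start unfragmented.
EXEC sp_rename 'employee_dedup', 'employee'
-- CREATE CLUSTERED INDEX IX_employee_id ON employee (id)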
A couple of things about the replies concerning how the data got this way to begin with. Usually this type of scenario is one that is done during data cleansing for things like combining data from different sources, for example a data warehouse load from multiple source systems. Many times operations like this are not only necessary, but can constitute a substantial amount of the data cleansing work. Personally, I call operations like this "homogenizing the data" or making the data homogeneous. If you have multiple source systems from say, multiple sub companies that are run under a single holding company and each sub company has autonomous operations, then it is very difficult (or impossible) to get the data clean to begin with.
However, the two most difficult aspects of removing duplicates are not discussed in the author's article. One is deciding which duplicate is the right one. In the real world, duplicates are not usually dupes across all columns; rather, they are two rows with the same key where the rest of the columns differ. Using the author's example "non-dupe" table, record id=2, name='joe' has two rows, one with a salary of 1000 and the other 2000. Since name is the business key, this is a dupe, but which one do we get rid of? A wrong decision one way or the other would either short-change joe or double his salary. This is one of the real difficulties in removing dupes.
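As a first step, a simple query can at least surface those conflicting groups so a person (or a business rule) can decide which row wins; this sketch assumes the article's example employee(id, name, salary) table:

SELECT name
FROM employee
GROUP BY name
HAVING COUNT(*) > 1                    -- more than one row for the business key
   AND COUNT(DISTINCT salary) > 1      -- ...and they disagree on salary, so it is not a clean dupe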
The second difficulty is removing records that are duplicates but have differing business keys. For example, we are bringing together data from two sub-companies that both buy from 3COM, but the companies spell it differently. For one company it is spelled '3COM' and the other has multiple entries: '3 COM', '3.COM', and 'Three Com'. So how do we make it so that all of these records are linked to a single entity? I have methods that can do most of this type of matching using the older DTS and the newer SSIS fuzzy matching capabilities. They are a good start, but be careful, as you can inadvertently miss matches or make wrong ones. If people are interested, I could probably write something up about this. However, I just wanted to make the point that simple constraints will not always work. You have the old principle of GIGO (Garbage In, Garbage Out), but using certain techniques you can at least clean up the trash a little and make it more presentable.
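This is not the DTS/SSIS fuzzy matching the poster describes, just a crude illustration of why exact joins miss these cases: a normalized match key (uppercase, spaces and periods stripped) folds '3COM', '3 COM' and '3.COM' together but still misses 'Three Com'. The vendors table and column names are hypothetical:

SELECT vendor_name,
       UPPER(REPLACE(REPLACE(vendor_name, ' ', ''), '.', '')) AS match_key
FROM vendors
-- Group or join on match_key to see which spellings collapse into one entity.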
I hope this was helpful to people.
March 10, 2006 at 7:42 am
I was totally looking for something like this. The fact that it ended up being so elegant and simple made me feel ignorant for a moment. This is a very helpful script. It's funny that after thinking about how to do something like this for so long, you end up overthinking it. Then when you provide such a simple solution, it's like 'wow, how could I not have seen this before!'
Thanks again,
Brian
March 10, 2006 at 7:56 am
--OR!!
declare @affected tinyint
SET @affected = 1
SET ROWCOUNT 1
WHILE @affected > 0
BEGIN
DELETE employee
FROM employee
INNER JOIN
(
    SELECT id, name, salary
    FROM employee
    GROUP BY id, name, salary
    HAVING COUNT(1) > 1
) dupes
    ON dupes.id = employee.id
    AND dupes.name = employee.name
    AND dupes.salary = employee.salary
SET @affected = @@ROWCOUNT
END
SET ROWCOUNT 0
SELECT * FROM employee
ORDER BY id
DROP TABLE employee
March 10, 2006 at 11:34 am
I'll typically use a DELETE FROM / GROUP BY combination in conjunction with a temporary table:
DELETE FROM employee
WHERE id NOT IN (
SELECT MIN(id)
FROM employee
GROUP BY name, salary
)
...assuming that id is an identity column (if your table doesn't have an identity column then you're on your own!).
I'll often adapt this approach when there are, for example, timestamp fields, by leaving them out of the GROUP BY (if you'd still want to consider the record a dupe regardless of the timestamp).
If you're after all those records with the highest timestamp etc. then a temporary table with an identity column works wonders.
To kill off all records with dupe'd emp name, leaving the record with the highest salary...
CREATE TABLE #emps (
myID INTEGER IDENTITY(1, 1),
id INTEGER,
name NCHAR(100), -- whatever
salary MONEY
)
INSERT INTO #emps (id, name, salary)
SELECT id, name, salary
FROM employee
ORDER BY salary DESC
DELETE FROM employee
WHERE id NOT IN (
SELECT id FROM #emps
WHERE myID IN (
SELECT MIN(myID)
FROM #emps
GROUP BY name
)
)
...I realise that NOT IN is not a particularly expedient operation, but have had great success with this in the past.
March 10, 2006 at 12:29 pm
My personal experience is that duplicate records in a database mean poor validation on both the client and DB side, as well as poor database design. I agree with those who have gone the way of 'nip the evil in the bud' and not allowed duplicates into the database in the first place.
The author mentioned that this article is for novice SQL Server users; I would rather train novice users to avoid the problem in the first place than have them react to it and realize that some things were left wanting in the DB design.
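For the 'nip it in the bud' approach, a constraint is usually all it takes. A minimal sketch against the article's example employee table, treating name as the business key (which is an assumption):

ALTER TABLE employee
    ADD CONSTRAINT UQ_employee_name UNIQUE (name)
-- Any INSERT or UPDATE that would create a second 'joe' now fails instead of silently adding a duplicate.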
Just my 2 cents
March 10, 2006 at 12:34 pm
How about a new admin taking over for a careless admin?
March 11, 2006 at 6:13 am
I'm surprised that anyone working with SQL Server needs lessons in removing duplicates, and puzzled at the complexity of the cursor-based solution originally published. A better solution has already been pointed out in the forum. As to the opinion that databases should never contain duplicate entries: yes, but somehow they sometimes creep in. Like the Kernighan and Ritchie exception error message, 'this is impossible', which one should never see and often does. Commercial databases are often of frightening complexity and one should always be on guard for duplicates. Yes, constraints are one way of doing it, but there are times when the overhead is too great.
March 12, 2006 at 12:29 pm
Why not just do a SELECT DISTINCT into a new table, delete the old one, and then rename the new to the old?
March 13, 2006 at 7:58 am
That was mentioned above
March 14, 2006 at 9:30 pm
Theoretically speaking, SELECT DISTINCT is the best way, but which fields you want to be distinct may change in each case.
In practice these are the problems of the people who deal with data porting/migration, not of DBAs.