December 20, 2005 at 7:37 am
/* Description: UTILITY - Locate in MASTER
   Syntax:      EXEC sp_RemoveDups TableName, DupQualifierFieldNameList, DeleteDups, UniqueColName, CreateIdentColIfNeeded, StoredProcedureResult
                Only the first two arguments are required.
                For HELP, enter the stored procedure's name without any arguments, or see the PRINT statements below.
                NO DELETION WILL OCCUR by default - only the duplicate recordset is returned. To delete records, pass in a 0 for the DeleteDups argument.
   Example:     EXEC sp_RemoveDups 'MyTable','LastName,FirstName,HomePhone'
   Purpose:     Allow removal of duplicate rows where
                1. We define which fields qualify the duplicate
                2. We select the unique row ID, or it is detected automatically; else no action takes place
   Method:      Delete by RowID all duplicate rows except the highest RowID (in alpha-sort sequence) of each group of duplicates.

   DATE        BY                        CHANGE
   09-23-2002  Frank                     Original v1.0
   09-23-2002  Frank                     Changed the name from sp_RemoveDupsByRowID to sp_RemoveDups
   10-08-2002  Sean P. O. MacCath-Moran  Made @UniqueColName optional
                                         Added logic to auto-detect RowGUID and Identity columns
                                         Added logic to check for a unique-value column if no RowGUID or Identity column exists
                                         Added logic to create a temporary ID field as a last resort if no unique column could be located
                                         Added HELP
*/
CREATE PROCEDURE sp_RemoveDups
    @TableName                 as varchar(50)  = NULL,
    @DupQualifierFieldNameList as varchar(200) = NULL,
    @DeleteDups                as bit          = NULL,
    @UniqueColName             as varchar(50)  = NULL,
    @CreateIdentColIfNeeded    as bit          = NULL,
    @StoredProcedureResult     int             = NULL OUTPUT
AS
SET NOCOUNT ON

DECLARE @SQL nvarchar(2000)
DECLARE @SQL_DetermineUniqueTemplate nvarchar(2000)
DECLARE @TempIdentColName varchar(20)
DECLARE @HostName varchar(50)
DECLARE @ActionText varchar(10)
DECLARE @SUM int
DECLARE @COUNT int
DECLARE @NextColumn varchar(50)

/*==================================================================================*/
/*========================VARIABLE INITIALIZATION AND SETUP========================*/
/*==================================================================================*/

/* If no unique column is located then a temporary identity column will be created
   using the name specified in this TempIdentColName string */
SET @TempIdentColName = 'TempIdentityColXY123'

SET @SQL_DetermineUniqueTemplate = 'SELECT @COUNT = COUNT(Count), @SUM = SUM(Count) FROM '
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + '(SELECT TOP 100 PERCENT <ColName>, COUNT(*) as Count FROM ' + @TableName
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + ' GROUP BY <ColName> ORDER BY <ColName>) a'

/* Retrieve the host name. This will be used later in this SP as a test to determine
   if the user is making this call from within SQL Query Analyzer */
SELECT @HostName = hostname
FROM master..sysprocesses
WHERE spid = @@SPID AND program_name = 'SQL Query Analyzer'

/* Set ActionText to be used in message output */
IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
    SET @ActionText = 'Selection'
ELSE
    SET @ActionText = 'Deletion'

/* If a value is specified for UniqueColName it cannot exist in the columns from the DupQualifierFieldNameList */
IF CHARINDEX(@UniqueColName, @DupQualifierFieldNameList) > 0
BEGIN
    /* The value in UniqueColName was detected in DupQualifierFieldNameList. */
    IF NOT (@HostName IS NULL)
        PRINT 'The UniqueColName provided (' + @UniqueColName + ') must not exist in DupQualifierFieldNameList (' + @DupQualifierFieldNameList + '). Other solutions will be sought automatically.'
    SET @UniqueColName = NULL
END

/* If UniqueColName is provided then perform a check to ensure that all the values in that column are, in fact, unique. */
IF NOT (@UniqueColName IS NULL)
BEGIN
    SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate, '<ColName>', @UniqueColName)

    /* Perform a check of this column to determine if all of its values are unique */
    EXEC sp_executesql @SQL, N'@SUM as int OUTPUT, @COUNT as int OUTPUT', @SUM OUTPUT, @COUNT OUTPUT

    /* Test to determine if this column contains unique values */
    IF @SUM <> @COUNT
    BEGIN
        /* The column specified by UniqueColName does not contain unique values. */
        IF NOT (@HostName IS NULL)
            PRINT 'The UniqueColName provided (' + @UniqueColName + ') does not contain unique values. Other solutions will be sought automatically.'
        SET @UniqueColName = NULL
    END
END

/*==============================================================*/
/*=======================HELP OUTPUT TEXT======================*/
/*==============================================================*/

IF (@TableName IS NULL) OR (@TableName = '/?') OR (@TableName = '?')
   OR (@DupQualifierFieldNameList IS NULL) OR (@DupQualifierFieldNameList = '/?') OR (@DupQualifierFieldNameList = '?')
BEGIN
    IF NOT (@HostName IS NULL)
    BEGIN
        PRINT 'sp_RemoveDups ''TableName'', ''DupQualifierFieldNameList'', [''DeleteDups''], [''UniqueColName''], [''CreateIdentColIfNeeded''], [StoredProcedureResult OUTPUT]'
        PRINT '====================================================================================================================================================='
        PRINT 'TableName:                 Required - String - Name of the table to detect duplicate records in.'
        PRINT 'DupQualifierFieldNameList: Required - String - Comma-separated list of columns that make up the unique record within TableName.'
        PRINT 'DeleteDups:                Optional - Bit - Set to 0 to delete duplicate records. A value of NULL or 1 will return the duplicate records to be deleted.'
        PRINT 'UniqueColName:             Optional - String - A table must have a unique column value in it to perform the deletion logic. If no UniqueColName is provided then an attempt will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the Identity column. If that fails then all of the columns of the table will be examined and the first one with all unique values will be selected.'
        PRINT 'CreateIdentColIfNeeded:    Optional - Bit - By default this SP will create an identity column if no unique column can be located. Pass in a 1 here to turn this feature off.'
        PRINT 'StoredProcedureResult:     Optional - OUTPUT - Int - Returns a 3 if an error occurred, otherwise returns a 0.'
    END
    SET @StoredProcedureResult = 3
    RETURN
END

/*========================================================================*/
/*======================DETECT USABLE UniqueColName======================*/
/*========================================================================*/

IF @UniqueColName IS NULL
BEGIN
    /* Check for a RowGUID or Identity column in this table.
       If one exists, then utilize it as the unique value for the purposes of this deletion */
    IF EXISTS(SELECT * FROM syscolumns WHERE id = OBJECT_ID(@TableName) AND COLUMNPROPERTY(id, name, 'IsRowGUIDCol') = 1)
        SET @UniqueColName = 'RowGUIDCol'
    IF EXISTS(SELECT * FROM syscolumns WHERE id = OBJECT_ID(@TableName) AND COLUMNPROPERTY(id, name, 'IsIdentity') = 1)
        SET @UniqueColName = 'IdentityCol'

    /* If no RowGUID or Identity column was found then check all of the columns in this table
       to see if one of them can be utilized as a unique value column */
    IF @UniqueColName IS NULL
    BEGIN
        /* Select all of the columns from the table in question... */
        DECLARE MyCursor CURSOR LOCAL SCROLL STATIC FOR
            SELECT name FROM syscolumns WHERE OBJECT_ID(@TableName) = id

        OPEN MyCursor
        FETCH NEXT FROM MyCursor INTO @NextColumn
        WHILE @@FETCH_STATUS = 0
        BEGIN
            /* Create the SQL string with the correct column name in place. */
            SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate, '<ColName>', @NextColumn)

            /* Perform a check of this column to determine if all of its values are unique */
            EXEC sp_executesql @SQL, N'@SUM as int OUTPUT, @COUNT as int OUTPUT', @SUM OUTPUT, @COUNT OUTPUT

            /* Test to determine if this column contains unique values */
            IF @SUM = @COUNT
            BEGIN
                /* A unique-values column was detected. Use it and break out of the loop
                   UNLESS the column is specified in DupQualifierFieldNameList */
                IF CHARINDEX(@NextColumn, @DupQualifierFieldNameList) = 0
                BEGIN
                    /* NextColumn was NOT detected in DupQualifierFieldNameList, so this is the column we will use. */
                    SET @UniqueColName = @NextColumn
                    BREAK
                END
            END
            FETCH NEXT FROM MyCursor INTO @NextColumn
        END
        CLOSE MyCursor
        DEALLOCATE MyCursor
    END
END

/* If no UniqueColName has been found then create one UNLESS @CreateIdentColIfNeeded = 1 */
IF (@UniqueColName IS NULL) AND ((@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0))
BEGIN
    /* Add a sequence column to the table... */
    IF NOT (@HostName IS NULL)
        PRINT 'Creating temporary identity column in the ' + @TableName + ' table named ' + @TempIdentColName + ' for use in this ' + LOWER(@ActionText) + ' process...'
    EXEC('ALTER TABLE ' + @TableName + ' ADD ' + @TempIdentColName + ' [int] IDENTITY (1, 1)')
    SET @UniqueColName = @TempIdentColName
END

/*===========================================================================*/
/*======================EXECUTE DELETION OR SELECTION=======================*/
/*===========================================================================*/

IF @UniqueColName IS NULL
BEGIN
    /* No UniqueColName was provided by the user and none was detected by the script.
       This deletion algorithm cannot run. */
    IF NOT (@HostName IS NULL)
        PRINT 'Could not perform ' + LOWER(@ActionText) + ' process. No unique columns were located and the CreateIdentColIfNeeded flag is set to 1 (False).'
    SET @StoredProcedureResult = 3
    RETURN
END
ELSE
BEGIN
    IF NOT (@HostName IS NULL)
        PRINT 'Performing ' + LOWER(@ActionText) + ' utilizing the unique values in the ' + @UniqueColName + ' column as a reference...'

    /* Create and execute an SQL statement in the form of:
       SELECT * (or DELETE) FROM TableName WHERE UniqueColName IN
           (
           SELECT UniqueColName FROM TableName WHERE UniqueColName NOT IN
               (
               SELECT MAX(CAST(UniqueColName AS varchar(36))) FROM TableName
               GROUP BY DupQualifierFieldNameList, DupQualifierFieldNameList, etc.
               )
           )
    */
    /* Select or delete all duplicate records using @UniqueColName as a unique ID column */
    IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
        SET @SQL = 'SELECT * '
    ELSE
        SET @SQL = 'DELETE '
    SET @SQL = @SQL + 'FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' IN '
    SET @SQL = @SQL + CHAR(13) + CHAR(9) + '(' + CHAR(13) + CHAR(9)
    SET @SQL = @SQL + 'SELECT ' + @UniqueColName + ' FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' NOT IN '
    SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + '(' + CHAR(13) + CHAR(9) + CHAR(9)
    SET @SQL = @SQL + 'SELECT MAX(CAST(' + @UniqueColName + ' AS varchar(36))) FROM '
    SET @SQL = @SQL + @TableName + ' GROUP BY ' + @DupQualifierFieldNameList
    SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + ')' + CHAR(13) + CHAR(9) + ')'

    EXEC (@SQL)

    IF @@ERROR <> 0
    BEGIN
        IF NOT (@HostName IS NULL)
            PRINT @ActionText + ' process failed.'
        SET @StoredProcedureResult = 3
    END
    ELSE
    BEGIN
        IF NOT (@HostName IS NULL)
            PRINT @ActionText + ' completed successfully with this SQL: ' + CHAR(13) + @SQL
        SET @StoredProcedureResult = 0
    END
END

IF (@UniqueColName = @TempIdentColName) AND ((@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0))
BEGIN
    /* Remove the sequence column from the table... */
    IF NOT (@HostName IS NULL)
        PRINT 'Removing temporary identity column named ' + @TempIdentColName + ' from the ' + @TableName + ' table...'
    EXEC('ALTER TABLE ' + @TableName + ' DROP COLUMN ' + @TempIdentColName)
END
GO

/* USAGE
   sp_RemoveDups 'TableName', 'DupQualifierFieldNameList', ['DeleteDups'], ['UniqueColName'], ['CreateIdentColIfNeeded'], [StoredProcedureResult OUTPUT]
   ================================================================================================================================================
   TableName:                 Required - String - Name of the table to detect duplicate records in.
   DupQualifierFieldNameList: Required - String - Comma-separated list of columns that make up the unique record within TableName.
   DeleteDups:                Optional - Bit - Set to 0 to delete duplicate records. A value of NULL or 1 will return the duplicate records to be deleted.
   UniqueColName:             Optional - String - A table must have a unique column value in it to perform the deletion logic. If no UniqueColName is provided then an attempt will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the Identity column. If that fails then all of the columns of the table will be examined and the first one with all unique values will be selected.
   CreateIdentColIfNeeded:    Optional - Bit - By default this SP will create an identity column if no unique column can be located. Pass in a 1 here to turn this feature off.
   StoredProcedureResult:     Optional - OUTPUT - Int - Returns a 3 if an error occurred, otherwise returns a 0.

   Simple Syntax:
   USE [databaseName]
   sp_RemoveDups '[tableName]', '[uniqueColName]', '[Optional bit flag - set to zero to remove dups, omit to report only]'
*/
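A minimal usage sketch of the procedure's documented interface (MyTable and the column list are the example values from the header comment; @rc is just a hypothetical local variable):

DECLARE @rc int

-- Report only: the duplicate recordset is returned but nothing is deleted (the default behaviour).
EXEC sp_RemoveDups 'MyTable', 'LastName,FirstName,HomePhone'

-- Actually delete the duplicates (DeleteDups = 0) and capture the result code (0 = success, 3 = error).
EXEC sp_RemoveDups 'MyTable', 'LastName,FirstName,HomePhone', 0, NULL, NULL, @rc OUTPUT

IF @rc = 0
    PRINT 'Duplicates removed.'
ELSE
    PRINT 'Removal failed.'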
Artificial Intelligence stands no chance against Natural Stupidity.
December 20, 2005 at 11:36 pm
Appreciate your alternative method to delete duplicate records.
March 10, 2006 at 3:18 am
My two cents about database design and avoiding duplicates: if you have unstructured, flat data (say 20 years' worth of monitoring system data in various Excel, text and other forms of electronic storage, about 27 million lines, where the only identifier might be four out of five columns: timestamp, status1, status2, measurandID), then the only import option you really have is DTS... And because you have to import the data as fast as possible, you might want to avoid having any kind of indexing in place, let alone constraint checks. So you end up with about 10% duplicates or even more, but you would not care too much, except that your database blows up to 16 GB once you finish indexing it in order to be able to use the data at all.
March 10, 2006 at 4:57 am
Excellent script and a very good explanation of it!
Thanks!
March 10, 2006 at 7:10 am
I totally agree with avoiding cursors; they are fun only if you have a table with a limited number of rows. Try running a cursor over millions of records and you will be waiting forever for it to complete, and the resources dedicated to running the process could be put to much better use elsewhere.
Jules Bui
IT Operations DBA
Backup and Restore Administrator
March 10, 2006 at 7:40 am
The script is interesting, but I agree with the other people who have posted here that a non-cursor method is best. In fact, I believe that cursors should be hidden in BOL in an advanced section with huge warnings around it that say things like "USE AT YOUR OWN RISK" or something like that. We definitely don't want newbies being shown solutions that use cursors, which appears to advocate their use. I have done work like this for years with billions of records and found that the following is the fastest method to remove the dupes, if there is a substantial percentage of dupes.
1. Move the unique rows into another table (either created ahead of time, or using select...into); however the new table should have no indexes yet.
2. Drop the original table.
3. Rename the new table to the same name as the old and create any needed indexes, etc.
Moving the unique rows into a new table is substantially faster than doing seek operations on the original table (which is what the engine does internally) to delete rows, unless there is a very low percentage of duplicates (say a few percent; this can vary slightly). Also, having no indexes makes the inserts go much faster and eliminates the index fragmentation that would otherwise occur and have to be cleaned up later. Next, dropping the original table and the metadata-only action of renaming the new one is much faster than copying the records from a temp table or table variable back into another table. Finally, creating any needed indexes at the end (make sure you do the clustered index first) means they start out in a non-fragmented state.
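For what it's worth, here is a minimal sketch of that copy / drop / rename sequence, using the article's example employee(id, name, salary) table; the new table and index names are made up:

-- 1. Copy only the unique rows into a new, index-free table.
SELECT DISTINCT id, name, salary
INTO employee_dedup
FROM employee

-- 2. Drop the original table.
DROP TABLE employee

-- 3. Rename the new table, then rebuild indexes (clustered index first) so they start unfragmented.
EXEC sp_rename 'employee_dedup', 'employee'
-- CREATE CLUSTERED INDEX IX_employee_id ON employee (id)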
A couple of things about the replies concerning how the data got this way to begin with. Usually this type of scenario is one that is done during data cleansing for things like combining data from different sources, for example a data warehouse load from multiple source systems. Many times operations like this are not only necessary, but can constitute a substantial amount of the data cleansing work. Personally, I call operations like this "homogenizing the data" or making the data homogeneous. If you have multiple source systems from say, multiple sub companies that are run under a single holding company and each sub company has autonomous operations, then it is very difficult (or impossible) to get the data clean to begin with.
However, the two most difficult aspects of removing duplicates are not discussed in the author's article. One is deciding which duplicate is the right one. In the real world, duplicates are not usually dupes across all columns; rather, they are two rows with the same key where the rest of the columns differ. Using the author's example "non-dupe" table, record id=2, name='joe' has two rows, one with a salary of 1000 and the other 2000. Since name is the business key, this is a dupe, but which one do we get rid of? A wrong decision one way or the other would either short-change joe or double his salary. This is one of the real difficulties in removing dupes.
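As a first step, a simple query can at least surface those conflicting groups so a person (or a business rule) can decide which row wins; this sketch assumes the article's example employee(id, name, salary) table:

SELECT name
FROM employee
GROUP BY name
HAVING COUNT(*) > 1                    -- more than one row for the business key
   AND COUNT(DISTINCT salary) > 1      -- ...and they disagree on salary, so it is not a clean dupe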
The second difficulty is removing records that are duplicates but have differing business keys. For example, we are bringing together data from two sub-companies that both buy from 3COM, but the companies spell it differently. For one company it is spelled '3COM' and the other has multiple entries: '3 COM', '3.COM', and 'Three Com'. So how do we make it so that all of these records are linked to a single entity? I have methods that can do most of this type of matching using the older DTS and the newer SSIS fuzzy matching capabilities. They are a good start, but be careful, as you can inadvertently miss matches or make wrong ones. If people are interested, I could probably write something up about this. However, I just wanted to make the point that simple constraints will not always work. You have the old principle of GIGO (Garbage In, Garbage Out), but using certain techniques you can at least clean up the trash a little and make it more presentable.
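This is not the DTS/SSIS fuzzy matching the poster describes, just a crude illustration of why exact joins miss these cases: a normalized match key (uppercase, spaces and periods stripped) folds '3COM', '3 COM' and '3.COM' together but still misses 'Three Com'. The vendors table and column names are hypothetical:

SELECT vendor_name,
       UPPER(REPLACE(REPLACE(vendor_name, ' ', ''), '.', '')) AS match_key
FROM vendors
-- Group or join on match_key to see which spellings collapse into one entity.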
I hope this was helpful to people.
March 10, 2006 at 7:42 am
I was totally looking for something like this. The fact that it ended up being so elegant and simple made me feel ignorant for a moment. This is a very helpful script. It's funny that after thinking about how to do something like this for so long, you end up overthinking it. Then when you provide such a simple solution, it's like 'wow, how could I not have seen this before!'
Thanks again,
Brian
March 10, 2006 at 7:56 am
--OR!!
declare @affected tinyint
SET @affected = 1
SET ROWCOUNT 1
WHILE @affected > 0
BEGIN
DELETE employee
FROM employee
INNER JOIN
(
    SELECT id, name, salary
    FROM employee
    GROUP BY id, name, salary
    HAVING COUNT(1) > 1
) dupes
    ON dupes.id = employee.id
    AND dupes.name = employee.name
    AND dupes.salary = employee.salary
SET @affected = @@ROWCOUNT
END
SET ROWCOUNT 0
SELECT * FROM employee
ORDER BY id
DROP TABLE employee
March 10, 2006 at 11:34 am
I'll typically use a DELETE FROM / GROUP BY combination in conjunction with a temporary table:
DELETE FROM employee
WHERE id NOT IN (
SELECT MIN(id)
FROM employee
GROUP BY name, salary
)
...assuming that id is an identity column (if your table doesn't have an identity column then you're on your own!).
I'll often adapt this approach when there are, for example, timestamp fields, by leaving them out of the GROUP BY (if you'd still want to consider the record a dupe regardless of the timestamp).
If you're after all those records with the highest timestamp etc. then a temporary table with an identity column works wonders.
To kill off all records with dupe'd emp name, leaving the record with the highest salary...
CREATE TABLE #emps (
myID INTEGER IDENTITY(1, 1),
id INTEGER,
name NCHAR(100), -- whatever
salary MONEY
)
INSERT INTO #emps (id, name, salary)
SELECT id, name, salary
FROM employee
ORDER BY salary DESC
DELETE FROM employee
WHERE id NOT IN (
SELECT id FROM #emps
WHERE myID IN (
SELECT MIN(myID)
FROM #emps
GROUP BY name
)
)
...I realise that NOT IN is not a particularly expedient operation, but have had great success with this in the past.
March 10, 2006 at 12:29 pm
My personal experience is that duplicate records in a database mean poor validation on both the client and DB side, as well as poor database design. I agree with those who have gone the way of 'nip the evil in the bud' and not allowed duplicates into the database in the first place.
The author mentioned that this article is for novice SQL Server users; I would rather train novice users to avoid the problem in the first place than have them react to it and realize that some things were left wanting in the DB design.
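For the 'nip it in the bud' approach, a constraint is usually all it takes. A minimal sketch against the article's example employee table, treating name as the business key (which is an assumption):

ALTER TABLE employee
    ADD CONSTRAINT UQ_employee_name UNIQUE (name)
-- Any INSERT or UPDATE that would create a second 'joe' now fails instead of silently adding a duplicate.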
Just my 2 cents
March 10, 2006 at 12:34 pm
How about a new admin taking over for a careless admin?
March 11, 2006 at 6:13 am
I'm surprised that anyone working with SQL Server needs lessons in removing duplicates, and puzzled at the complexity of the cursor-based solution originally published. A better solution has already been pointed out in the forum. As to the opinion that databases should never contain duplicate entries: yes, but somehow they sometimes creep in. Like the Kernighan and Ritchie exception error message, 'this is impossible', which one should never see and often does. Commercial databases are often of frightening complexity and one should always be on guard for duplicates. Yes, constraints are one way of doing it, but there are times when the overhead is too great.
March 12, 2006 at 12:29 pm
Why not just do a SELECT DISTINCT into a new table, delete the old one, and then rename the new to the old?
March 13, 2006 at 7:58 am
That was mentioned above
March 14, 2006 at 9:30 pm
Theoretically speaking, SELECT DISTINCT is the best way, but which fields you want to be distinct may change in each case.
In practice these are the problems of the people who deal with data porting/migration, not of DBAs.