Removing Duplicate Rows
Introduction
"How do I remove duplicate rows from a table" is a common question
in the newsgroups, and here we will look at a few different techniques for
achieving this.
Databases can easily be designed to disallow duplicates (which I will discuss
at the bottom of the article) but the issue can still arise. For example,
duplicate data might arrive at your database via an interface to another system.
The code examples in this article were all developed on SQL 2000, but should
work fine in versions 6.5 and 7.
The "Distinct" technique
This is the easiest way of de-duplicating a table, and is at it’s best with
fairly small rowsets, and where all the columns in a table are duplicates. We
simply run a SELECT DISTINCT on the offending table, storing the results in a
temporary table, then delete the contents of the permanent table and re-populate
it from the temporary one.
The downside of this is that you need enough free data space in your database
(or in tempdb) to store the entire table again, plus plenty of log space when
you are de-duplicating large tables. This simple example loads a table with
duplicate data then removes the duplicates:
create table marxBrothers (
Number int,
Name varchar(32)
)
go
insert marxBrothers
select 1, 'Groucho Marx' UNION ALL
select 2, 'Harpo Marx' UNION ALL
select 3, 'Chico Marx' UNION ALL
select 1, 'Groucho Marx' UNION ALL
select 2, 'Harpo Marx' UNION ALL
select 3, 'Chico Marx' UNION ALL
select 6, 'Zeppo Marx' UNION ALL
select 7, 'Gummo Marx' UNION ALL
select 6, 'Zeppo Marx'
begin tran deduplicate
select DISTINCT *
into #temp
from marxBrothers
truncate table marxBrothers
insert marxBrothers
select *
from #temp
select * from marxBrothers
drop table #temp
commit tran deduplicate
drop table marxBrothers
Using derived tables
For tables where duplication is defined by a subset of the columns in the
table, you can use one of the other columns in the table to identify which rows
to keep and which to delete. Here is a simple example – it is similar to the
previous one, but note that only the names are duplicated, because the "ident"
column has been defined as an Identity column. Here we only want to keep one
instance of each name, regardless of the value in the Ident column. In the
example I have arbitrarily decided to keep the name with the lowest Ident value
create table marxBrothers (
ident int IDENTITY,
Name varchar(32)
)
go
insert marxBrothers (Name)
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Zeppo Marx' UNION ALL
select 'Gummo Marx' UNION ALL
select 'Zeppo Marx'
select * from marxBrothers
delete marxBrothers
from marxBrothers,
(
select min(ident) as minIdent, name
from marxBrothers m
group by name
having count(1) > 1
) as derived
where marxBrothers.name = derived.name
and ident > minIdent
select * from marxBrothers
drop table marxBrothers
The inner query selects all the names in the table that are duplicated, and
the minimum value of the Ident column for each name. These results are compared
against the main table in order to select the records for deletion
With this technique you may be able to get away with having much less free
room than the earlier technique, but this depends on the ratio of duplicated to
unique records – the more records are duplicated, the more space SQL Server
needs to temporarily store, and work with, the results that makeup the derived
table. The derived table technique is the one I usually try first.
Using correlated subqueries
This example uses a correlated sub-query to achieve de-duplication
create table marxBrothers (
ident int IDENTITY,
Name varchar(32)
)
go
insert marxBrothers (Name)
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Zeppo Marx' UNION ALL
select 'Gummo Marx' UNION ALL
select 'Zeppo Marx'
select * from marxBrothers
delete marxBrothers
where ident > (
select min(ident)
from marxBrothers m
where m.name = marxBrothers.name
)
select * from marxBrothers
drop table marxBrothers
Correlated sub-queries can be pretty slow and inefficient to run against
large tables.
De-duplicating using dynamic SQL
This technique uses a cursor and the "SET ROWCOUNT" feature to work
through the table one name at a time, deleting all but one of the rows for each
name. Cursors and Dynamic SQL both have their place, but I generally try to
avoid using them for reasons of efficiency, As this example uses both a cursor
and dynamic SQL, it can bet pretty inefficient with large numbers of duplicated
rows, and I would prefer to use one of the previous techniques. However, the
technique does work, so I think it is still worth a mention.
create table marxBrothers (
ident int IDENTITY,
Name varchar(32)
)
insert marxBrothers (Name)
select 'Groucho Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Groucho Marx' UNION ALL
select 'Harpo Marx' UNION ALL
select 'Chico Marx' UNION ALL
select 'Zeppo Marx' UNION ALL
select 'Gummo Marx' UNION ALL
select 'Zeppo Marx' UNION ALL
select 'Zeppo Marx'
go
declare @sql varchar(255),
@name varchar(32),
@count int
select * from marxBrothers
declare CI insensitive cursor for
select name, count(1) – 1 from marxBrothers m
group by name
having count(1) > 1
open CI
fetch next from CI into @name, @count
while @@fetch_status = 0
begin
select @sql =
'set rowcount ' + convert(varchar(9), @count) + '
delete marxBrothers where name = "' + @name + '"'
exec (@sql)
fetch next from CI into @name, @count
end-- while fetch
close CI
deallocate CI
select * from marxBrothers
go
drop table marxBrothers
Preventing duplication
Prevention is, after all, better than cure, so It would be pointless
discussing de-duplicating a table without mentioning how to prevent duplicate
rows appearing again afterwards. There are three main declarative methods for
doing this in SQL:
The Primary Key constraint
This is implemented as an index in SQL Server, and so it can also help speed
up select and join operations from the table. One or more columns may make up a
primary key, and SQL Server will automatically ensure that duplicates do not
exist in a table’s primary key. An error will be returned if you try to insert
duplicates.
create table marxBrothers (
ident intIDENTITY,
name varchar(32),
constraint PK_marxBrothers PRIMARY KEY
(name)
)
go
The Unique constraint
This is also Implemented as an index in SQL Server. The main difference
between a primary key and a unique constraint is that the Primary
Key can never contain a NULL value.
create table marxBrothers (
ident intIDENTITY,
name varchar(32),
constraint U_marxBrothers UNIQUE (name)
)
go
The IGNORE_DUP_KEY option
This is an option on the create index command, which will cause SQL
Server to discard rows where a duplicate of the columns in the specified index
already exists. Use this option with care – you have to be certain that you do
really need to do this. Also remember that the results of using this option can
easily be misinterpreted as an error by end users.
create table marxBrothers (
ident intIDENTITY,
name varchar(32),
)
go
create unique index UI_marxBrothers on marxBrothers (name) with ignore_dup_key
About the author
Neil Boyle is an independant SQL Server consultant working out of London,
England. Neil's free SQL Server guide is available on-line at
http://www.impetus-sql.co.uk