August 10, 2011 at 1:31 am
What I am looking for is a simple way of querying whether an entire row is duplicated,
i.e.
for this table, are there two rows where every single column value is identical between those two rows?
Is there any simple syntax where the user does not have to explicitly list all of the table's columns?
I know of this tightly coupled, laborious way:
select a,b,c,d,...............................................zzz, count(*)
from tbltest
group by a,b,c,d,.......................................zzz
The reason I ask for something without column names is that I want to create a loop driven by object_id, and it cannot be tightly coupled to each table if it is to run against the entire database.
Or, at the least, I'd like to just swap in a new table name without having to reconstruct the query for each table in the entire DB.
About 250 tables in total.
August 10, 2011 at 1:38 am
August 10, 2011 at 1:59 am
Thanks for the response.
Yep, as I showed in my query, that is how it is done: a particular query for a particular table.
What I am looking for is a way to query a table without having to hard-code column names into the query and laboriously list them all.
This should allow the user to replace only the table name (as could be done in a loop).
Specifically, I would like to avoid rebuilding the SELECT and GROUP BY each time for a new table.
Imagine writing 250 unique queries for 250 unique tables just to see which of them contain duplicate rows.
That is what I want to avoid.
Some tables have 501 columns 🙁
August 10, 2011 at 2:15 am
sp_msforeachtable '
with cte as (
    select row_number() over (partition by CHECKSUM(*) order by CHECKSUM(*)) as rowcount1, *
    from ?
)
select * from cte
where rowcount1 > 1'
I think this should work for you; however, there are limitations to using CHECKSUM on tables with a very large number of rows. Please look this up in MSDN before you run this. Also, depending on the table and data, there could be a performance hit.
August 10, 2011 at 7:31 am
August 10, 2011 at 6:14 pm
Gosh... I wouldn't use CHECKSUM for such a thing. It's nothing but a simple Exclusive OR and getting duplicate CHECKSUM values for different data is much more likely than folks would think. My recommendation would be to build a little dynamic SQL that would write the classic GROUP BY method for finding duplicate rows for each table.
As a sidebar, I obviously don't know the condition of your tables but, at the risk of sounding a bit like Celko: unless it's a staging table, you have a much bigger problem if you have identical rows in your tables.
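To illustrate how easily CHECKSUM collides (a quick sketch; the literal values are just examples): CHECKSUM on strings respects the column collation, so under the default case-insensitive collation two visibly different strings hash identically.

```sql
-- Under a case-insensitive collation, these two different strings
-- produce the same CHECKSUM value, so a CHECKSUM(*)-based duplicate
-- check would wrongly treat the rows as identical:
SELECT CHECKSUM(N'duplicate') AS c1,
       CHECKSUM(N'DUPLICATE') AS c2;
```

BINARY_CHECKSUM avoids the collation issue but is still a weak hash with easy collisions, which is why the GROUP BY-on-all-columns approach is the safe one.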
--Jeff Moden
Change is inevitable... Change for the better is not.
August 10, 2011 at 8:10 pm
It's very easy to construct an SQL string containing the column list from system views and execute it.
For example, if your database were called "playpen", this stored procedure would do the job (call it as
exec playpen.dbo.finddups 'sch', 'table' to list the rows occurring multiple times in the table playpen.sch.table):
use playpen
go
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.finddups') AND type in (N'P', N'PC'))
DROP PROC dbo.finddups
go
create procedure dbo.finddups
@schema sysname,
@table sysname
as
declare @sql varchar(max) = ''
select @sql = @sql+','+C.name from playpen.sys.columns C
inner join
(select name, object_id, schema_id from playpen.sys.tables where name = @table) T
on T.object_id = C.object_id
inner join (select name, schema_id from playpen.sys.schemas where name = @schema) S
on S.schema_id = T.schema_id
option (MAXDOP 1)
set @sql = 'select count(*), * from '+@schema+'.'+@table+
' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
exec(@sql)
go
Obviously if you wanted to handle all tables in one call the table name and schema would be output instead of input parameters, and the thing could loop over sys.tables. The code above is not what you would use - but it shows how to do it.
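A minimal sketch of that loop (the cursor name is illustrative, and it assumes dbo.finddups from above has been created):

```sql
-- Loop over every user table and call finddups for each one.
declare @schema sysname, @table sysname;
declare tabs cursor local fast_forward for
    select s.name, t.name
    from sys.tables t
    inner join sys.schemas s on s.schema_id = t.schema_id;
open tabs;
fetch next from tabs into @schema, @table;
while @@FETCH_STATUS = 0
begin
    print 'Checking ' + @schema + '.' + @table;
    exec dbo.finddups @schema, @table;
    fetch next from tabs into @schema, @table;
end
close tabs;
deallocate tabs;
```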
Tom
August 10, 2011 at 9:01 pm
My only concern would be if any of the tables contained columns defined as text or ntext.
August 10, 2011 at 9:31 pm
Lynn Pettis (8/10/2011)
My only concern would be if any of the tables contained columns defined as text or ntext.
Well spotted, Lynn! If there are such columns, spotting duplicates is much more difficult.
In fact it's quite nasty. One can exclude those columns from the grouping and eliminate some of the rows from the possible-duplicates list that way (anything unique on the remaining columns is unique on the whole set). But CHECKSUM (or BINARY_CHECKSUM) can't be used to eliminate more rows because they don't support text/ntext, so I guess the next step would be to use HASHBYTES on initial substrings of the long columns and hope that eliminates some more rows, and then write code to compare the things byte by agonising byte to eliminate any remaining false positives. I wouldn't want to have to do that. No indeed. Too much like hard work, and I can imagine that running it would be tedious, somewhere between watching paint dry and watching grass grow, if the earlier steps had eliminated almost all rows as possible duplicates.
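A rough sketch of that HASHBYTES step (the table and column names here are made up; SUBSTRING on a text column returns varchar, which HASHBYTES accepts):

```sql
-- Hypothetical table with a legacy text column.
-- Group on the hash of the first 8000 characters of the text column
-- alongside the ordinary columns. Matching hashes are only *candidate*
-- duplicates and would still need a byte-by-byte comparison.
select col1, col2,
       HASHBYTES('SHA1', SUBSTRING(bigtextcol, 1, 8000)) as texthash,
       count(*) as cnt
from dbo.sometable
group by col1, col2, HASHBYTES('SHA1', SUBSTRING(bigtextcol, 1, 8000))
having count(*) > 1;
```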
Tom
August 11, 2011 at 2:26 am
Thanks Tom
This is the direction I wanted to go in.
At worst, I would then replace table name and run your dynamic query together with information schema.
At best, I can use your dynamic query with a sp_MSforeachtable.
Off to play around with your suggestion!! :w00t:
August 11, 2011 at 2:54 am
a method i've always used and works for me is this:
with t (rn, col1, col2) as
(
select
rn= row_number() over (partition by col1, col2, etc order by col1, col2, etc),
col1, col2, etc
from tablename
)
select * from t where rn > 1
--delete t where rn > 1
This is especially useful if you don't have any unique identifier on the row and need to remove any duplicates.
_________________________________________________________________________________
SQLGeordie
Web:- Jarrin Consultancy
Blog:- www.chrisjarrintaylor.co.uk
Twitter:- @SQLGeordie
August 11, 2011 at 3:02 am
Thanks Tom
Only really had to add
SET @sql = ''
select @sql = @sql+','+'['+C.name+']' from sys.columns C
to take care of columns with spaces in their names, and now I'm good to go.
I have checked the results against tables which I had already drudged through manually.
So, for anyone who needs this, here is Tom's solution, which should work immediately.
You need to pass the schema and table name as the parameters.
Use MyDataBase_Here
Go
create procedure dbo.finddups
@schema sysname,
@table sysname
as
declare @sql varchar(max)
SET @sql = ''
select @sql = @sql+','+'['+C.name+']' from sys.columns C
inner join
(select name, object_id, schema_id from sys.tables where name = @table) T
on T.object_id = C.object_id
inner join (select name, schema_id from sys.schemas where name = @schema) S
on S.schema_id = T.schema_id
option (MAXDOP 1)
set @sql = 'select count(*),* from '+@schema+'.'+@table+
' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
exec(@sql)
GO
August 11, 2011 at 3:53 am
In addition to Tom's solution, here is a modification that excludes the table's identity column from the comparison, so rows that differ only in the identity value are still reported as duplicates.
create procedure dbo.finddups
@schema sysname,
@table sysname
as
declare @sql varchar(max)
SET @sql = ''
select @sql = @sql+','+'['+C.name+']' from sys.columns C
inner join
(select name, object_id, schema_id from sys.tables where name = @table) T
on T.object_id = C.object_id AND c.is_identity = 0
inner join (select name, schema_id from sys.schemas where name = @schema) S
on S.schema_id = T.schema_id
option (MAXDOP 1)
set @sql = 'select count(*),'+SUBSTRING(@sql,2,len(@sql))+' from '+@schema+'.'+@table+
' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
exec(@sql)
GO
August 11, 2011 at 8:16 am
ChrisTaylor (8/11/2011)
a method i've always used and works for me is this:
with t (rn, col1, col2) as
(
select
rn= row_number() over (partition by col1, col2, etc order by col1, col2, etc),
col1, col2, etc
from tablename
)
select * from t where rn > 1
--delete t where rn > 1
This is especially useful if you don't have any unique identifier on the row and need to remove any duplicates.
I would try this with ORDER BY (SELECT NULL) instead of ordering by each column. I wonder a) whether you CAN do that and b) whether it makes a difference. I would expect it would, but I could be wrong, or the benefit may be minimal.
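For reference, the variant described above looks like this (column and table names are placeholders). ORDER BY (SELECT NULL) is legal inside an OVER clause and signals that no particular ordering is required within each partition:

```sql
-- Same duplicate-detection CTE, but without forcing a sort order
-- within each partition of identical rows.
with t as
(
    select rn = row_number() over (partition by col1, col2
                                   order by (select null)),
           col1, col2
    from dbo.tablename
)
select * from t where rn > 1;
--delete t where rn > 1;
```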
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
August 11, 2011 at 3:28 pm
MadTester (8/11/2011)
In addition to Tom's solution, there is another modification here to return rows where the Identity for the table has been excluded from the comparison.
Nice change. You could also do something similar to exclude any columns that are in a key constraint.
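A sketch of that idea, assuming you want to exclude any column that participates in a primary key or unique constraint (untested; the filter is written as a NOT EXISTS against sys.index_columns):

```sql
-- Inside finddups, build the column list while skipping identity
-- columns and any column that belongs to a primary key or unique
-- constraint on the table:
select @sql = @sql + ',' + '[' + C.name + ']'
from sys.columns C
inner join sys.tables T
    on T.object_id = C.object_id and T.name = @table
inner join sys.schemas S
    on S.schema_id = T.schema_id and S.name = @schema
where C.is_identity = 0
  and not exists (
      select 1
      from sys.indexes i
      inner join sys.index_columns ic
          on ic.object_id = i.object_id and ic.index_id = i.index_id
      where i.object_id = C.object_id
        and ic.column_id = C.column_id
        and (i.is_primary_key = 1 or i.is_unique_constraint = 1)
  );
```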