Any easy way to find if duplicate rows exist in a table?

  • What I am looking for is an easy, simple way of querying whether an entire row is duplicated,

    i.e.

    For this table, are there 2 rows where every single column value is duplicated between those 2 rows?

    Is there any simple syntax where the user does not have to explicitly list all column values of the table?

    I know of this tightly coupled, laborious way:

    select a, b, c, d, ..., zzz, count(*)
    from tbltest
    group by a, b, c, d, ..., zzz

    The reason I ask for something without column names is I want to create a loop which uses objectid, and it cannot be tightly coupled to each table if I want it to run against the entire database.

    Or at least, just replace the query name with a new table name without having to reconstruct the query for each table in the entire db.

    About 250 tables in total.

    Catch-all queries done right
    Gail Shaw's Performance Blog

  • Look up GROUP BY and HAVING.

    It should look like:

    select a, b, c
    from tb1
    where something
    group by a, b, c
    having count(*) > 1

    Jayanth Kurup

  • Thanks for the response.

    Yep, as I specified in my query, that is how it is done, with a particular query for a particular table.

    What I am looking for is a way to query a table without having to hard-code column names into the query and laboriously list every column name.

    This should allow the user to replace the table name only. (as could be done in a loop)

    Specifically, I would like to avoid rebuilding the select and group by each time for a new table.

    Imagine writing 250 unique queries for 250 unique tables to see which of those unique 250 tables contain duplicate rows.

    That is what I want to avoid.

    Some tables have 501 columns 🙁


  • sp_msforeachtable '
    with cte as (
        select row_number() over (partition by checksum(*) order by checksum(*)) as rowcount1, *
        from ?
    )
    select * from cte
    where rowcount1 > 1'

    I think this should work for you; however, there are limitations to using CHECKSUM on tables with a very large number of rows. Please look this up in MSDN before you run this. Also, depending on the table and data, there could be a performance hit.
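    The CHECKSUM limitation is easy to see in isolation. A minimal sketch (assuming a case-insensitive default collation such as SQL_Latin1_General_CP1_CI_AS; the string values are made up for illustration):

```sql
-- CHECKSUM respects collation, so under a case-insensitive collation two
-- different strings can produce the same checksum. CHECKSUM(*) can therefore
-- partition genuinely different rows together and report false duplicates.
SELECT CHECKSUM('MadTester') AS c1,
       CHECKSUM('MADTESTER') AS c2;   -- equal under a CI collation

-- BINARY_CHECKSUM distinguishes case, but it is still a weak hash and
-- unrelated values can collide, so it cannot prove rows are duplicates either.
SELECT BINARY_CHECKSUM('MadTester') AS b1,
       BINARY_CHECKSUM('MADTESTER') AS b2;
```

    In either case a matching checksum only means "possible duplicate"; the rows still need a real column-by-column comparison.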


  • Thanks

    Somehow this query is returning records which don't have duplicates, but what will help me a lot is the

    sp_msforeachtable

    Thanks for the hint!


  • Gosh... I wouldn't use CHECKSUM for such a thing. It's nothing but a simple Exclusive OR and getting duplicate CHECKSUM values for different data is much more likely than folks would think. My recommendation would be to build a little dynamic SQL that would write the classic GROUP BY method for finding duplicate rows for each table.
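    A minimal sketch of that dynamic-SQL idea, generating the classic GROUP BY duplicate check for every user table in one pass (this uses STRING_AGG, which needs SQL Server 2017+; on older versions the FOR XML PATH('') concatenation trick serves the same purpose, and all names here come straight from the catalog views):

```sql
-- Build one GROUP BY duplicate-check statement per user table, then run them.
-- QUOTENAME guards against spaces and odd characters in identifiers.
DECLARE @sql nvarchar(max), @nl nchar(1) = CHAR(10);

SELECT @sql = STRING_AGG(CAST(
        'SELECT ''' + s.name + '.' + t.name + ''' AS tbl, COUNT(*) AS copies, ' + c.cols +
        ' FROM ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) +
        ' GROUP BY ' + c.cols + ' HAVING COUNT(*) > 1;' AS nvarchar(max)), @nl)
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
CROSS APPLY (SELECT STRING_AGG(QUOTENAME(name), ',') AS cols
             FROM sys.columns
             WHERE object_id = t.object_id) c;

EXEC (@sql);
```

    One caveat: any table containing a column type that cannot appear in a GROUP BY (text, ntext, image, xml) will make its generated statement fail, so those columns would need to be filtered out or handled separately.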

    As a side bar, I obviously don't know the condition of your tables but, at the risk of sounding a bit like Celko, unless it's a staging table, you have a much bigger problem if you have identical rows in your tables.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It's very easy to construct an SQL string containing the column list from system views and execute it.

    For example, if your database were called "playpen", this stored procedure would do the job (called by

    exec playpen.dbo.finddups 'sch', 'table'

    to list the rows occurring multiple times in the table playpen.sch.table):

    use playpen
    go
    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.finddups') AND type in (N'P', N'PC'))
        DROP PROC dbo.finddups
    go
    create procedure dbo.finddups
        @schema sysname,
        @table sysname
    as
    declare @sql varchar(max) = ''
    select @sql = @sql+','+C.name
    from playpen.sys.columns C
    inner join (select name, object_id, schema_id from playpen.sys.tables where name = @table) T
        on T.object_id = C.object_id
    inner join (select name, schema_id from playpen.sys.schemas where name = @schema) S
        on S.schema_id = T.schema_id
    option (MAXDOP 1)
    set @sql = 'select count(*), * from '+@schema+'.'+@table+
        ' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
    exec(@sql)
    go

    Obviously if you wanted to handle all tables in one call the table name and schema would be output instead of input parameters, and the thing could loop over sys.tables. The code above is not what you would use - but it shows how to do it.
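    That loop-over-sys.tables variant might look like this (a sketch that calls the dbo.finddups procedure above once per table via a cursor; the cursor name is arbitrary):

```sql
-- Walk every user table in the current database and run the duplicate
-- check procedure against each one in turn.
DECLARE @s sysname, @t sysname;

DECLARE tab_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT sch.name, tbl.name
    FROM sys.tables tbl
    JOIN sys.schemas sch ON sch.schema_id = tbl.schema_id;

OPEN tab_cur;
FETCH NEXT FROM tab_cur INTO @s, @t;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC dbo.finddups @schema = @s, @table = @t;
    FETCH NEXT FROM tab_cur INTO @s, @t;
END
CLOSE tab_cur;
DEALLOCATE tab_cur;
```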

    Tom

  • My only concern would be if any of the tables contained columns defined as text or ntext.

  • Lynn Pettis (8/10/2011)


    My only concern would be if any of the tables contained columns defined as text or ntext.

    Well spotted, Lynn! If there are such columns, spotting duplicates is much more difficult.

    In fact it's quite nasty. One can exclude those columns from the grouping, and eliminate some of the rows from the possible duplicates list that way (anything unique on those columns is unique on the whole set). But checksum (or binary checksum) can't be used to eliminate more rows because they don't support text/ntext, so I guess the next step would be to use hashbytes on initial substrings of the long columns and hope that eliminated some more rows, and write code to compare the things byte by agonising byte to eliminate any remaining false positives. I wouldn't want to have to do that. No indeed. Too much like hard work, and I can imagine that running it would be tedious, somewhere in between watching paint dry and watching grass grow, if the earlier steps had eliminated almost all rows as possible duplicates.
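    The HASHBYTES step Tom describes might be sketched like this, against a hypothetical table dbo.Docs with a text column [Body] (the table and column names are invented for illustration):

```sql
-- Group on a hash of the first 8000 bytes of the text column instead of the
-- column itself (text/ntext cannot appear in GROUP BY, and pre-2016 HASHBYTES
-- only accepts up to 8000 bytes of input anyway).
SELECT HASHBYTES('SHA1', SUBSTRING(CAST(Body AS varchar(max)), 1, 8000)) AS body_hash,
       COUNT(*) AS copies
FROM dbo.Docs
GROUP BY HASHBYTES('SHA1', SUBSTRING(CAST(Body AS varchar(max)), 1, 8000))
HAVING COUNT(*) > 1;
```

    As Tom says, matching hashes only narrow the field: rows that agree on the hash of the prefix still need a full comparison to rule out false positives.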

    Tom

  • Thanks Tom

    This is the direction I wanted to go in.

    At worst, I would then replace table name and run your dynamic query together with information schema.

    At best, I can use your dynamic query with a sp_MSforeachtable.

    Off to play around with your suggestion!!:w00t:


  • A method I've always used, and which works for me, is this:

    with t (rn, col1, col2) as
    (
        select
            rn = row_number() over (partition by col1, col2, etc order by col1, col2, etc),
            col1, col2, etc
        from tablename
    )
    select * from t where rn > 1
    --delete t where rn > 1

    This is especially useful if you don't have any unique identifier on the row and need to remove any duplicates.
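    A concrete, self-contained run of that pattern, using a hypothetical #dupes temp table with no unique key (deleting through the CTE deletes from the underlying table):

```sql
-- Two copies of (1,'a'); the second copy gets rn = 2 and is removed.
CREATE TABLE #dupes (col1 int, col2 varchar(10));
INSERT INTO #dupes VALUES (1, 'a'), (1, 'a'), (2, 'b');

WITH t AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2
                              ORDER BY col1, col2) AS rn,
           col1, col2
    FROM #dupes
)
DELETE FROM t WHERE rn > 1;

SELECT * FROM #dupes;   -- one (1,'a') row and one (2,'b') row remain
DROP TABLE #dupes;
```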

    SQLGeordie
    Web: Jarrin Consultancy
    Blog: www.chrisjarrintaylor.co.uk
    Twitter: @SQLGeordie

  • Thanks, Tom.

    I only really had to add

    SET @sql = ''

    select @sql = @sql+','+'['+C.name+']' from sys.columns C

    to take care of columns with spaces in the name, and now I'm good to go.

    I have checked the results against tables which I have already drudged through.

    So to all that need this, here is Tom's solution, which should work immediately.

    You need to pass schema and table name as the parameters.

    Use MyDataBase_Here
    Go
    create procedure dbo.finddups
        @schema sysname,
        @table sysname
    as
    declare @sql varchar(max)
    SET @sql = ''
    select @sql = @sql+','+'['+C.name+']'
    from sys.columns C
    inner join (select name, object_id, schema_id from sys.tables where name = @table) T
        on T.object_id = C.object_id
    inner join (select name, schema_id from sys.schemas where name = @schema) S
        on S.schema_id = T.schema_id
    option (MAXDOP 1)
    set @sql = 'select count(*),* from '+@schema+'.'+@table+
        ' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
    exec(@sql)
    GO


  • In addition to Tom's solution, here is a further modification that excludes the table's identity column from the comparison.

    create procedure dbo.finddups
        @schema sysname,
        @table sysname
    as
    declare @sql varchar(max)
    SET @sql = ''
    select @sql = @sql+','+'['+C.name+']'
    from sys.columns C
    inner join (select name, object_id, schema_id from sys.tables where name = @table) T
        on T.object_id = C.object_id AND C.is_identity = 0
    inner join (select name, schema_id from sys.schemas where name = @schema) S
        on S.schema_id = T.schema_id
    option (MAXDOP 1)
    set @sql = 'select count(*),'+SUBSTRING(@sql,2,len(@sql))+' from '+@schema+'.'+@table+
        ' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
    exec(@sql)
    GO


  • ChrisTaylor (8/11/2011)


    A method I've always used, and which works for me, is this:

    with t (rn, col1, col2) as
    (
        select
            rn = row_number() over (partition by col1, col2, etc order by col1, col2, etc),
            col1, col2, etc
        from tablename
    )
    select * from t where rn > 1
    --delete t where rn > 1

    This is especially useful if you don't have any unique identifier on the row and need to remove any duplicates.

    I would try this with ORDER BY (SELECT NULL) instead of by each column. Wonder if a) you CAN do that and b) if it makes a difference. I would expect it would, but could be wrong, or the benefit is minimal.
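    For what it's worth, the syntax is legal: (SELECT NULL) is accepted as the ORDER BY of a window function when row order within a partition doesn't matter. A sketch against a hypothetical dbo.SomeTable:

```sql
-- With an arbitrary order, which row of each duplicate group becomes rn = 1
-- is nondeterministic, which is fine when any one copy may be kept.
SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2
                          ORDER BY (SELECT NULL)) AS rn,
       col1, col2
FROM dbo.SomeTable;
```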

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • MadTester (8/11/2011)


    In addition to Tom's solution, here is a further modification that excludes the table's identity column from the comparison.

    Nice change. You could also do something similar to exclude any columns that are in a key constraint.
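    That column-building SELECT from the procedure above might be extended like this to skip key-constraint columns as well (a sketch in the same spirit as the is_identity = 0 filter):

```sql
-- Exclude identity columns and any column participating in a PRIMARY KEY
-- or UNIQUE constraint from the generated column list.
select @sql = @sql + ',' + '[' + C.name + ']'
from sys.columns C
inner join sys.tables T on T.object_id = C.object_id
inner join sys.schemas S on S.schema_id = T.schema_id
where T.name = @table
  and S.name = @schema
  and C.is_identity = 0
  and not exists (select 1
                  from sys.index_columns ic
                  inner join sys.indexes i
                          on i.object_id = ic.object_id
                         and i.index_id  = ic.index_id
                  where ic.object_id = C.object_id
                    and ic.column_id = C.column_id
                    and (i.is_primary_key = 1 or i.is_unique_constraint = 1));
```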
