Any easy way to find if duplicate rows exist in a table?

  • Bruce W Cassidy (8/11/2011)


    MadTester (8/11/2011)


    In addition to Tom's solution, there is another modification here to return rows where the Identity for the table has been excluded from the comparison.

    Nice change. You could also do something similar to exclude any columns that are in a key constraint.

    A multi-column key is usually a natural key, and if you have a natural key it probably isn't very interesting to look for duplicates in the remaining columns. I think the motivation for excluding an identity column is likely to be that there is a single column primary key which is neither a natural key nor a surrogate for a natural key but just a meaningless number based on how many rows were inserted before this one - that's been the case every time I've seen someone trying to do this before.

    But in the unusual case that you do want to look for duplicates in projections not including a complete candidate key, you might want to try excluding just one column from each candidate key, rather than all the columns in candidate keys (taking candidate key as any minimal set of columns each not permitting null and together subject to a unique constraint or a primary key constraint) - but of course it depends what column sets you are interested in.
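    As a concrete sketch of that idea (table and column names invented for illustration): suppose a table has a composite candidate key on (LastName, FirstName); excluding just FirstName from the grouping can still surface duplicates in the remaining columns.

    ```sql
    -- Hypothetical table dbo.Person with candidate key (LastName, FirstName).
    -- Grouping on a projection that drops only one column of the composite key
    -- can still expose duplicate combinations.
    SELECT LastName, City, COUNT(*) AS Cnt
    FROM dbo.Person
    GROUP BY LastName, City
    HAVING COUNT(*) > 1;
    ```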

    Tom

  • Tom.Thomson (8/11/2011)


    A multi-column key is usually a natural key

    In my own designs, composite keys tend to occur on many-to-many relationship bridging tables, so not necessarily a natural key at all. Anyway, a bit of a side-track from the post.

    Tom.Thomson (8/11/2011)


    I think the motivation for excluding an identity column is likely to be that there is a single column primary key which is neither a natural key nor a surrogate for a natural key but just a meaningless number based on how many rows were inserted before this one

    Sure. Excluding any identity field makes sense, whether it's a key or not.

    I guess I didn't think I needed to spell it out. If a record comparison includes a unique key within the record, there aren't going to be any duplicates. So to look for duplicates, excluding keys would make sense.

  • Bruce W Cassidy (8/11/2011)


    Tom.Thomson (8/11/2011)


    A multi-column key is usually a natural key

    In my own designs, composite keys tend to occur on many-to-many relationship bridging tables, so not necessarily a natural key at all. Anyway, a bit of a side-track from the post.

    In my designs, too, but I ignored that case because excluding the whole key in one of my bridge tables would usually exclude all columns, so there would certainly be no duplicates.

    I guess I didn't think I needed to spell it out. If a record comparison includes a unique key within the record, there aren't going to be any duplicates. So to look for duplicates, excluding keys would make sense.

    That didn't need spelling out.

    I intended to spell this out, because it's something that needs to be remembered once you start looking for duplicates produced by ignoring some columns: in the case of a composite key you may get duplicates by excluding a part of the key, you don't have to exclude the whole key; while if there are additional candidate keys (properly enforced with NOT NULL and UNIQUE constraints) as well as the primary key you won't get duplicates unless you exclude part of each key. I guess I failed to express it clearly.

    Tom

  • Tom.Thomson (8/11/2011)


    in the case of a composite key you may get duplicates by excluding a part of the key, you don't have to exclude the whole key

    Absolutely. The classic "last name, first name" would be a good example of where excluding a part of the key might well result in duplicates.

    One thing that would be useful to check for would be natural key migration across surrogate keys. So for example, the same person having multiple records under the same name, but different "identity" values. In such a case, excluding the identity field (or business key) but including the natural key fields in the check would make sense. Let's not even consider the complication of multiple people having the same name!
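    A hand-written check along those lines might look like this (table and column names are invented; the assumption is an identity column PersonID alongside natural-key columns):

    ```sql
    -- Group on the natural-key columns only, excluding the identity.
    -- Any group with more than one row suggests the same person was
    -- inserted more than once under different identity values.
    SELECT LastName, FirstName, DateOfBirth,
           COUNT(*) AS Copies, MIN(PersonID) AS LowestID
    FROM dbo.Person
    GROUP BY LastName, FirstName, DateOfBirth
    HAVING COUNT(*) > 1;
    ```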

    I'm hard pressed to think of how you might do that in generated code though, unless you can somehow encode the logic of the keys in metadata.

    What makes me somewhat wary is the idea that someone needs a code generator to check for duplicates across multiple tables. That would raise red flags on the data model/design for me.

    All the same, if you're beginning the process of repairing such a model, having a code generator to check for the duplicates would make sense. I'd still exclude keys, as checking for duplicates between/within natural keys would require manual intervention with some understanding of the logic forming the keys, far beyond the scope of a code generator.

  • Bruce W Cassidy (8/11/2011)


    All the same, if you're beginning the process of repairing such a model, having a code generator to check for the duplicates would make sense. I'd still exclude keys, as checking for duplicates between/within natural keys would require manual intervention with some understanding of the logic forming the keys, far beyond the scope of a code generator.

    I certainly agree with that, and that needing a code generator because multiple tables may have duplicates raises red flags for the schema design. But although understanding the logic of the keys is beyond the scope of a code generator, sometimes a code generator can be used to suggest questions about what is or is not a business rule. That happens when you are trying to fix someone's disastrously undesigned database and you can't elicit the business rules from the business experts, because they are not prepared to put the effort into explaining things rather than just answering yes/no questions, even though you don't know enough about the business to formulate those questions.

    What usually happens is that someone clueless has thrown together a database. They've heard that tables should have primary keys, so they add an identity column (or a GUID, or something else like that) to some tables to be the primary key (but not to other tables, because they think they don't need them); they don't declare any constraints at all for natural keys (no "not null" anywhere except the automatic ones on those fairly meaningless primary keys, no unique constraints at all, usually no foreign key constraints, often no check constraints); and they write fairly clueless application code that does nothing for data integrity. After 6 months of intensive use the thing has grown into something quite horrid, performance is hopeless (there aren't any indexes either), results are unreliable, lots of what should be keys have duplicates, and someone gets brought in to sort out the mess - first thing he writes is a code generator for finding duplicates in various column sets.

    If he manages to form a mental model of the business rules (they were never written down, that would have been too bureaucratic an approach for something as simple as a mission-critical database; and if there's a requirements document it's all at arm-waving level and is anyway the first draft before any adjustments were made to fit the constraints of reality, because the adjustments were all documented by word of mouth only, at unminuted meetings), he may even be able to guess what the natural keys are or ought to be. But of course the schema isn't actually in 2NF or remotely near it (it's pure good luck if it's even in 1NF), because the original "design" never paid any attention to the concept of keys, let alone 3NF or EKNF, so he has a lot of normalising to do as well as sorting out key constraints before he has anything resembling a sane database. If he can't discover the business rules by asking people, he may even want to use a code generator to generate queries to discover things that look as if they may be candidate keys and ask whether they are implied by the needs of the business (if they are, he's discovered a business rule), and maybe another code generator to find things that look like functional dependencies and ask if they are real business needs - to discover business rules that demonstrate that the tables are far from normalised, that undesirable partitioning has occurred, or (most likely) both. But if things are that bad he has a nightmare on his hands.

    I suspect that the OP on this thread found himself in something like that position - maybe not an extreme case of it, but at the least a mild case - perhaps even so mild that the only bad thing that's happened is the insertion of unwanted duplicates.

    Tom

  • Hi Tom, nice to hear my environment is far better than I thought. The picture you paint is bleak.

    My circumstance is legacy columns, too many indexes on high-transaction tables (one table has 17), not enough on static tables, and many unused indexes (on the table with 17, 5 are well used and 3 are rarely used).

    I was trying to figure out why a certain index was not being used by the optimiser (after checking stats) and, lo and behold, there were copies of what should have been a unique non-clustered index having 2 columns as a key. So I think the optimiser decided that the total returned was more effort than a scan.

    The index was not set as unique, and hence the app was free to do bad things.

    Since I am a tester, it led me to ask another question.

    How many duplicates are there across the database that we can do anything about?

    Where do the dup's come from?

    The forum has been extremely helpful in getting answers.

    Catch-all queries done right
    Gail Shaw's Performance Blog

  • MadTester (8/12/2011)


    Hi Tom, nice to hear my environment is far better than I thought. The picture you paint is bleak.

    My circumstance is legacy columns, too many indexes on high-transaction tables (one table has 17), not enough on static tables, and many unused indexes (on the table with 17, 5 are well used and 3 are rarely used).

    I'm glad you don't have one of the total nightmares to deal with - I've been landed with one of those, and I wouldn't wish it on anyone. They are not totally bleak if you can convince management the mess needs fixing, but that may be difficult.

    If you have a problem with non-clustered indexes being rarely used, it may be that they are missing something that would make them useful to the optimiser. I suggest you read through the thread beginning here where someone else had exactly that problem with a non-clustered index. Of course most of your indexes won't be like that - 17 indexes on one table sounds crazy (as if someone made a practise of adding an index whenever he had to try to solve a performance problem, but if they turned out not to help neither removed them nor modified them so that they did help).

    I was trying to figure out why a certain index was not being used by the optimiser (after checking stats) and, lo and behold, there were copies of what should have been a unique non-clustered index having 2 columns as a key. So I think the optimiser decided that the total returned was more effort than a scan.

    The index was not set as unique, and hence the app was free to do bad things.

    That of course is why every UNIQUE constraint that is implied by the business rules should always be implemented by a constraint in the database. Similarly, in tables that define surrogate keys, the natural key should be the primary key (perhaps non-clustered) and the surrogate key should have unique and not null constraints (and perhaps that unique constraint should be a clustered index) to enable it to be the target of a foreign key (and perhaps to allow efficient joins on it). Just getting that much of the constraints right prevents a lot of data corruption that could otherwise be caused by application bugs. (Advertisement: read my series of articles on normalisation if you are not familiar with looking at normalisation as a means of having the database schema enforce rules and prevent bugs instead of trying to do it in database and/or application code; the first three articles are available so far, more to come.)
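    A minimal sketch of that constraint layout (all names invented): the natural key is the primary key, and the surrogate identity is NOT NULL and UNIQUE so it can still be the target of foreign keys.

    ```sql
    CREATE TABLE dbo.Product
    (
        ProductID   int IDENTITY(1,1) NOT NULL,  -- surrogate key
        ProductCode varchar(20)       NOT NULL,  -- natural key
        ProductName varchar(100)      NOT NULL,
        -- natural key as the primary key, surrogate constrained separately
        CONSTRAINT PK_Product      PRIMARY KEY NONCLUSTERED (ProductCode),
        CONSTRAINT UQ_Product_ID   UNIQUE CLUSTERED (ProductID)
    );
    ```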

    Since I am a tester, it led me to ask another question.

    How many duplicates are there across the database that we can do anything about?

    Where do the dup's come from?

    The forum has been extremely helpful in getting answers.

    It might be useful to have some code to log when duplicates are created - if it logs enough information you can maybe find exactly which app is doing it. Then it may be possible to identify the app bug and get the app fixed. Usually changing the database to forbid duplicates without first fixing the app will be unacceptable, because if you do that then until the app is fixed the production system doesn't work.
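    One hedged way to do that logging, assuming a hypothetical Person table whose natural key is (LastName, FirstName), is an AFTER INSERT trigger that records who created a duplicate without blocking the app:

    ```sql
    CREATE TABLE dbo.DupLog
    (
        LoggedAt  datetime      NOT NULL DEFAULT GETDATE(),
        HostName  nvarchar(128) NOT NULL DEFAULT HOST_NAME(),
        AppName   nvarchar(128) NOT NULL DEFAULT APP_NAME(),
        KeyValues nvarchar(400) NULL
    );
    GO
    CREATE TRIGGER trg_Person_DupLog ON dbo.Person AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Record any inserted row whose natural-key values now occur
        -- more than once in the table.
        INSERT INTO dbo.DupLog (KeyValues)
        SELECT i.LastName + N'|' + i.FirstName
        FROM inserted i
        JOIN dbo.Person p
          ON p.LastName = i.LastName AND p.FirstName = i.FirstName
        GROUP BY i.LastName, i.FirstName
        HAVING COUNT(*) > 1;
    END;
    ```

    The defaults on HostName and AppName capture enough context to trace the offending connection back to a specific application.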

    Getting rid of existing duplication may be easy or hard, depending on how far the effects of the duplications have spread. For example, duplication in simple link tables can almost always just be got rid of without any worries; duplication (over all columns excluding a meaningless identity column) in tables that are the target of foreign key relationships is easy to eliminate, it's just a matter of updating the sources of those foreign keys to point at one of the set of duplicates and deleting the other rows in the set; duplicates that include what should have been the primary key but exclude other columns as well as the meaningless id are hard to deal with, because two things that point into that table may see different properties (corresponding to the different identities) for what should be the same entity, and you have to work out which properties are correct.
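    For the easy case, a simple link table with full-row duplicates (invented names), the standard pattern is to number the copies in each duplicate set and delete all but one:

    ```sql
    -- Keep one row per (OrderID, ProductID) pair, delete the rest.
    ;WITH Numbered AS
    (
        SELECT ROW_NUMBER() OVER
               (PARTITION BY OrderID, ProductID
                ORDER BY (SELECT NULL)) AS rn
        FROM dbo.OrderProduct
    )
    DELETE FROM Numbered
    WHERE rn > 1;
    ```

    Deleting through the CTE works because each CTE row maps to exactly one row of the underlying table.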

    Unwanted duplication should probably be removed as far as possible even if the app can't be fixed immediately, despite it then being necessary to do it again when the app has added more unwanted duplicates. This is because it probably has a high cost both in performance and in storage (think storage of backups as well as the database itself), which can turn into a high cost in dollars (or in pounds or euros or dirhams or rupees).

    Tom

  • For anyone wanting a complete solution, this one was customised to avoid looking at tables containing identities, as well as Unique Indexes.

    This saves a huge chunk of time in skipping tables which are obviously void of duplicates.

    create procedure dbo.finddups
        @schema sysname,
        @table  sysname
    as
    declare @sql varchar(max)
    set @sql = ''

    if (    (select top 1 is_identity
             from sys.tables
             inner join sys.columns on sys.tables.object_id = sys.columns.object_id
             where sys.tables.name = @table
             order by is_identity desc) = 0
        and (select top 1 is_unique
             from sys.tables
             inner join sys.indexes on sys.tables.object_id = sys.indexes.object_id
             where sys.tables.name = @table
             order by is_unique desc) = 0)
    begin
        select @sql = @sql + ',[' + C.name + ']'
        from sys.columns C
        inner join (select name, object_id, schema_id from sys.tables where name = @table) T
            on T.object_id = C.object_id and C.is_identity = 0
        inner join (select name, schema_id from sys.schemas where name = @schema) S
            on S.schema_id = T.schema_id
        option (maxdop 1)

        set @sql = 'select count(*),' + substring(@sql, 2, len(@sql))
                 + ' from ' + @schema + '.' + @table
                 + ' group by ' + substring(@sql, 2, len(@sql))
                 + ' having count(*) > 1'
        exec(@sql)
    end
    GO


  • Jeff Moden (8/10/2011)


    As a side bar, I obviously don't know the condition of your tables but, at the risk of sounding a bit like Celko, unless it's a staging table, you have a much bigger problem if you have identical rows in your tables.

    That was my first thought. No primary keys set on tables? I bet if you just studied the table designs for a bit, you'd be able to tell the likely candidates for duplicates, given the nature of the data. Not a definitive answer, but one that puts you on a path to understanding the data well enough to set good primary keys to prevent this from happening in the future.

    Please don't go. The drones need you. They look up to you.
    Connect to me on LinkedIn

  • Thomas Abraham (8/17/2011)


    Jeff Moden (8/10/2011)


    As a side bar, I obviously don't know the condition of your tables but, at the risk of sounding a bit like Celko, unless it's a staging table, you have a much bigger problem if you have identical rows in your tables.

    That was my first thought. No primary keys set on tables? I bet if you just studied the table designs for a bit, you'd be able to tell the likely candidates for duplicates, given the nature of the data. Not a definitive answer, but one that puts you on a path to understanding the data well enough to set good primary keys to prevent this from happening in the future.

    I wonder why I even bothered to post this meager offering after reading Tom's posts. Didn't see his until after I had posted mine.:hehe:


  • Thomas Abraham (8/17/2011)


    I wonder why I even bothered to post this meager offering after reading Tom's posts. Didn't see his until after I had posted mine.:hehe:

    If it was worth saying once it was worth saying twice.

    And saying it so that it can be seen as something simple is always a useful contribution alongside the complex view.

    Definitely you were right to bother.

    Tom

  • Thomas Abraham (8/17/2011)


    Thomas Abraham (8/17/2011)


    Jeff Moden (8/10/2011)


    As a side bar, I obviously don't know the condition of your tables but, at the risk of sounding a bit like Celko, unless it's a staging table, you have a much bigger problem if you have identical rows in your tables.

    That was my first thought. No primary keys set on tables? I bet if you just studied the table designs for a bit, you'd be able to tell the likely candidates for duplicates, given the nature of the data. Not a definitive answer, but one that puts you on a path to understanding the data well enough to set good primary keys to prevent this from happening in the future.

    I wonder why I even bothered to post this meager offering after reading Tom's posts. Didn't see his until after I had posted mine.:hehe:

    Amplification of an extremely valid point is always appreciated. Well done.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Excluded tables with primary keys, since they enforce a unique constraint even if the index is not set as unique.

    Additionally, this will run on the entire database.

    The table name has been assigned to the count(*) column so that we know what table yields the duplicates.

    I have a feeling that Is_Identity is a subset of PK, so removing those joins may yield the same results. Anyway, I was too tired of flogging this dead horse to make the script perfect.

    Use with care, i.e. not on prod, or if on prod, when no users are active.

    -----------------------------------------------------

    DECLARE @sql varchar(max), @table varchar(100)

    DECLARE table_cur CURSOR FOR
    SELECT table_name
    FROM INFORMATION_SCHEMA.TABLES t
    LEFT OUTER JOIN sys.columns c ON c.object_id = OBJECT_ID(t.table_name)
        AND c.is_identity = 1               -- get all identity columns
    LEFT OUTER JOIN sys.indexes i ON i.object_id = OBJECT_ID(t.table_name)
        AND i.is_unique = 1                 -- get all unique indexes
    LEFT OUTER JOIN sysobjects o ON o.id = OBJECT_ID(t.table_name)
        AND o.xtype = 'PK'                  -- get all primary keys
    WHERE c.object_id IS NULL               -- i.e. no identity column exists
      AND i.object_id IS NULL               -- i.e. no unique index exists
      AND o.id IS NULL                      -- i.e. no primary key exists
      AND t.table_type = 'BASE TABLE'
      AND t.table_name NOT LIKE '%temp%'    -- these would be uninteresting
      AND t.table_name NOT LIKE '%tmp%'
    ORDER BY table_name DESC

    OPEN table_cur
    FETCH NEXT FROM table_cur INTO @table

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = 'COUNT(*) ' + @table
        SELECT @sql = RTRIM(@sql) + ', [' + c.column_name + ']'
        FROM INFORMATION_SCHEMA.COLUMNS c
        WHERE table_name = RTRIM(@table)

        SET @sql = 'SELECT ' + RTRIM(@sql) + ' FROM [' + RTRIM(@table) + '] '
                 + 'GROUP BY ' + REPLACE(@sql, 'COUNT(*) ' + @table + ',', '')
                 + ' HAVING COUNT(*) > 1'

        PRINT @sql
        EXEC (@sql)

        FETCH NEXT FROM table_cur INTO @table
    END

    CLOSE table_cur
    DEALLOCATE table_cur


  • MadTester (8/19/2011)


    I have a feeling that Is_Identity is a subset of PK, so removing those joins may yield the same results.

    It usually is, since an identity is typically used to create an artificial key, and identity values make a good choice for a unique clustered index (due to the way SQL Server stores the data). But there is nothing that says it has to be, so it's worthwhile doing the due diligence of excluding them both.
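    That due diligence can itself be checked from the catalog views. This sketch lists identity columns that are not part of any primary key, i.e. exactly the cases where dropping the identity joins would change the results:

    ```sql
    -- Identity columns not covered by a primary key constraint.
    SELECT t.name AS table_name, c.name AS column_name
    FROM sys.columns c
    JOIN sys.tables t ON t.object_id = c.object_id
    WHERE c.is_identity = 1
      AND NOT EXISTS
    (
        SELECT 1
        FROM sys.indexes i
        JOIN sys.index_columns ic
          ON ic.object_id = i.object_id AND ic.index_id = i.index_id
        WHERE i.is_primary_key = 1
          AND i.object_id = c.object_id
          AND ic.column_id = c.column_id
    );
    ```

    If this returns any rows, Is_Identity is not a subset of PK in that database, and both exclusions are needed.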

Viewing 14 posts - 16 through 28 (of 28 total)
