Delete duplicate rows from ANY table.

  • Is there a 'fairly' simple query to delete duplicate rows from ANY table?

    A script which removes duplicates from a table, when a table_name is supplied.

    Ben

  • Hmm... I guess you could write a script that would do a group by on every column in every table in your database and find the duplicates, but why do you have so many tables without primary keys?
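
    A minimal sketch of that idea against a single, hypothetical table dbo.SomeTable with columns Col1, Col2 and Col3 (the table and column names are only placeholders):

    -- Find duplicate rows by grouping on every column of the (hypothetical) table.
    -- dbo.SomeTable, Col1, Col2 and Col3 are placeholder names.
    SELECT Col1, Col2, Col3, COUNT(*) AS DupeCount
    FROM dbo.SomeTable
    GROUP BY Col1, Col2, Col3
    HAVING COUNT(*) > 1;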

  • The only thing that comes to my mind would require using Dynamic SQL. The difficult part would be knowing how to determine what is duplicated data: what columns are or are not used in the determination.

  • Lynn Pettis (7/31/2015)


    The only thing that comes to my mind would require using Dynamic SQL. The difficult part would be knowing how to determine what is duplicated data: what columns are or are not used in the determination.

    I don't see how you could do this without either also passing in the columns to be used with the table name, or having some sort of lookup table listing the columns to check for each table.



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • Alvin Ramard (7/31/2015)


    Lynn Pettis (7/31/2015)


    The only thing that comes to my mind would require using Dynamic SQL. The difficult part would be knowing how to determine what is duplicated data: what columns are or are not used in the determination.

    I don't see how you could do this without either also passing in the columns to be used with the table name, or having some sort of lookup table listing the columns to check for each table.

    Unless the criterion is every column that is not part of the primary key.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (7/31/2015)


    Alvin Ramard (7/31/2015)


    Lynn Pettis (7/31/2015)


    The only thing that comes to my mind would require using Dynamic SQL. The difficult part would be knowing how to determine what is duplicated data: what columns are or are not used in the determination.

    I don't see how you could do this without either also passing in the columns to be used with the table name, or having some sort of lookup table listing the columns to check for each table.

    Unless the criterion is every column that is not part of the primary key.

    True.



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • I think a much faster algorithm would be to select a few identifying columns for the duplicate rows plus binary_checksum(*), and group by those few columns having count(*) > 1 (a sketch follows at the end of this post). Be careful with binary_checksum since it skips xml columns. The function can take a subset of columns, which might come in handy.

    Ultimately you will need to generate table-specific code.

    I have generators for everything and just went through a process to merge two databases of 500 tables. There will always be a few tables that need hand coding.

    I push for PKs on every table so as to avoid this dupe mess.
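
    A rough sketch of that checksum idea, assuming a hypothetical table dbo.SomeTable where Col1 and Col2 are the identifying columns (all names are illustrative only):

    -- Group on a few identifying columns plus BINARY_CHECKSUM(*) to find likely duplicates.
    -- BINARY_CHECKSUM skips noncomparable types such as xml, and different rows can
    -- occasionally share a checksum, so verify the candidates before deleting anything.
    -- dbo.SomeTable, Col1 and Col2 are placeholder names.
    SELECT Col1, Col2, RowChecksum, COUNT(*) AS DupeCount
    FROM ( SELECT Col1, Col2, BINARY_CHECKSUM(*) AS RowChecksum
           FROM dbo.SomeTable ) AS src
    GROUP BY Col1, Col2, RowChecksum
    HAVING COUNT(*) > 1;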

  • ben.brugman (7/31/2015)


    Is there a 'fairly' simple query to delete duplicate rows from ANY table?

    A script which removes duplicates from a table, when a table_name is supplied.

    Ben

    To answer the original question: NO, there is no simple query for doing this.

    It can be done, but the script(s)/procedure(s) required are something that you would have to develop to match your needs.



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • I agree: the table name alone is not enough information to eliminate duplicates; the columns defining the key are necessary.

    A lot of things can make a row unique even if it's supposed to be a duplicate; that's why you need to define the key.

    That said, here's some code to give you an idea of what you could do. I included a couple of validations, but you might need to add more and tweak the code to know what it did. I wouldn't implement something like this as is, as it might cause more problems than it solves.

    DECLARE @Table_Name sysname = 'Production.ProductDescription'
          , @Columns nvarchar(4000) = 'Description,ModifiedDate';
    DECLARE @sql nvarchar(max);

    SET @sql = 'WITH CTE AS(
    SELECT *, ROW_NUMBER() OVER( PARTITION BY ' +
        STUFF( (
            SELECT ', ' + QUOTENAME( name)
            FROM sys.columns c
            --JOIN DelimitedSplit8k(@Columns, ',') s ON c.name = s.name
            WHERE object_id = OBJECT_ID(@Table_Name)
            AND is_identity = 0
            AND system_type_id <> 36 -- uniqueidentifiers
            FOR XML PATH(''),TYPE
        ).value('.','varchar(max)'), 1, 2, '') + ' ORDER BY (SELECT NULL)) rn
    FROM ' + ISNULL( QUOTENAME( PARSENAME( @Table_Name, 2)) + '.', '') --Schema
           + QUOTENAME( PARSENAME( @Table_Name, 1)) --Table
        + '
    )
    DELETE FROM CTE
    WHERE rn > 1;';

    PRINT @sql;

    EXECUTE sp_executesql @sql;

    I used AdventureWorks2012 to test, and this returns an error due to a foreign key constraint, which is something you would want to validate before starting to delete data.
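
    As an illustration of that kind of validation (just a sketch, reusing the @Table_Name variable from the script above), something along these lines would list the foreign keys that reference the target table before any delete is attempted:

    -- List foreign keys that point at the target table; if any rows come back,
    -- deleting its duplicates may fail or break referential integrity.
    SELECT fk.name AS ForeignKeyName,
           OBJECT_NAME(fk.parent_object_id) AS ReferencingTable
    FROM sys.foreign_keys AS fk
    WHERE fk.referenced_object_id = OBJECT_ID(@Table_Name);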

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Thanks to all for your contributions.

    Luis Cazares, thanks for your solution. Next Monday I'll study it and try it.

    Although not simple, I probably could wrap this up in a stored procedure.

    If there is a relational constraint, this table cannot be the 'master', because for implemented constraints there must be a unique key, and if there is an implemented unique key then there can be no duplicates.

    At least I will study your solution for a column list closely to see what I can learn from that.

    (I built an sp_column_list stored procedure which can generate all types of column lists, mainly for selection/update/insert with or without an identity key. But the SP is generic, so it can generate all kinds of column lists, for example all columns of three tables to join so that they all have their own unique name, for use in joins and/or in CTEs.)

    There were some questions; I'll try to answer them.

    Why so many tables without primary keys?

    Not many, and most of them have primary keys (but not implemented primary keys).

    Sometimes somebody enters data into a just-created table using Management Studio, creates doubles, and does not want to redo the entry.

    Within Management Studio you cannot delete a double. With knowledge of the table it is often fairly easy to remove the doubles, but if there are a large number of rows that often have the same values in their columns, it is difficult to select the correct columns and selecting all columns is cumbersome.
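
    (As an illustration of that 'fairly easy' case, a sketch for a table whose defining columns are known; dbo.ImportTable, Col1 and Col2 are placeholders only.)

    -- With a known table, duplicates can be removed with ROW_NUMBER over the defining columns.
    -- dbo.ImportTable, Col1 and Col2 are placeholder names.
    WITH Dupes AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY (SELECT NULL)) AS rn
        FROM dbo.ImportTable
    )
    DELETE FROM Dupes
    WHERE rn > 1;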

    Another situation is when using import tables.

    Sometimes the imports are doubled for different reasons: importing the same set twice, or a row being doubled in the original dataset. So as not to hinder the import, these doubles are removed after the import.

    Group by on every column of a table.

    Possible, but cumbersome.

    I don't see how you could do this without either also passing in the columns

    Luis Cazares has shown that it is possible. (Just not simple).

    duplicate row and the binary_checksum(*)

    This is an idea, but performance is not an issue.

    And it is not for a large bulk operation, but for an occasional operation; some examples are mentioned above.

    Thanks everybody.

    I asked the question because sometimes there is a far simpler solution than the one I came up with.

    In this case I can just wrap this up in a stored procedure. I have a procedure sp_column_list that can generate the needed list of columns. The stored procedure call will be something like:

    exec SP_Remove_duplicates @table_name ='example_table'

    Thanks to all for your time and attention.

    If anybody is interested in the SP_Columnlist stored procedure, please tell me.

    (The coding is a bit rough at the moment).

    Ben

  • You should avoid the sp_ prefix on your procedures. It is reserved for MS and denotes a system procedure. The performance issues are mostly corrected, but ambiguity and potential issues are still there. My preference would be to drop the prefix entirely, but if you insist on one, something like usp_ would be better.

    That being said, I would be interested in seeing the script you came up with.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • ben.brugman (7/31/2015)


    Why so many tables without primary keys?

    Not many, and most of them have primary keys (but not implemented primary keys).

    Not sure what that means. Your table either does or does not have a primary key. Are you saying they are only logical primary keys and not defined as primary keys? Do those tables at least have a clustered index?

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Technically "usp_" is illogical, since "sp_" stands for "special", but "usp_" wouldn't be meant as "user special".

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Just a couple of clarifications.

    To change the code into a stored procedure, you just change the variable declarations into your SP header.

    The query does not use the column list provided. To use it, you must uncomment the JOIN line and use DelimitedSplitN4K (I wrote the wrong function name out of habit), which you can find, along with a great explanation, here: http://www.sqlservercentral.com/articles/Tally+Table/72993/

    The code is concatenating the columns that will be used as the key to identify duplicates. The explanation of how that technique works is here: http://www.sqlservercentral.com/articles/comma+separated+list/71700/

    Other than that, it shouldn't be that difficult to understand. And yes, when you delete from a CTE, you're actually deleting rows from the underlying table.
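
    For what it's worth, a rough sketch of what that stored-procedure version might look like. This is only an outline, not a drop-in implementation: the procedure name is arbitrary (and drops the sp_ prefix, per the advice earlier in the thread), and it assumes the DelimitedSplitN4K splitter from the article above is installed and returns an Item column like its 8K counterpart.

    -- Sketch of a wrapper procedure around the dynamic SQL shown earlier.
    -- Assumes dbo.DelimitedSplitN4K (ItemNumber, Item) from the splitter article exists.
    CREATE PROCEDURE dbo.Remove_Duplicates
        @Table_Name sysname,
        @Columns    nvarchar(4000)   -- comma-separated list of columns that define a duplicate
    AS
    BEGIN
        DECLARE @sql nvarchar(max);

        SET @sql = 'WITH CTE AS(
        SELECT *, ROW_NUMBER() OVER( PARTITION BY ' +
            STUFF( ( SELECT ', ' + QUOTENAME( c.name)
                     FROM sys.columns c
                     JOIN dbo.DelimitedSplitN4K(@Columns, ',') s ON c.name = s.Item
                     WHERE c.object_id = OBJECT_ID(@Table_Name)
                     FOR XML PATH(''),TYPE
                   ).value('.','nvarchar(max)'), 1, 2, '')
            + ' ORDER BY (SELECT NULL)) rn
        FROM ' + ISNULL( QUOTENAME( PARSENAME( @Table_Name, 2)) + '.', '')  -- schema
               + QUOTENAME( PARSENAME( @Table_Name, 1))                     -- table
            + '
        )
        DELETE FROM CTE
        WHERE rn > 1;';

        EXECUTE sp_executesql @sql;
    END;

    It would then be called along the same lines as the exec example earlier in the thread, e.g. EXEC dbo.Remove_Duplicates @Table_Name = 'Production.ProductDescription', @Columns = 'Description,ModifiedDate';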

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • ScottPletcher (7/31/2015)


    Technically "usp_" is illogical, since "sp_" stands for "special", but "usp_" wouldn't be meant as "user special".

    I would offer that a prefix of any kind is illogical, since it provides no benefit. I would argue your point, though: an acronym doesn't have to mean the same thing as another acronym just because they are spelled similarly.

    I have seen people using the usp_ prefix stating it means "user defined stored procedure".

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
