Delete duplicate rows from ANY table.

  • ben.brugman (8/4/2015)


    Gazareth (8/4/2015)

    Either of these should work, but you're back to the problem of determining column names:

    Yes, back to the problem of determining the column names. And then the solution becomes 'far' worse than a number of the other presented solutions. (A row_number partitioned by the column names is then superior in all (or most) aspects).

    But because it is rare for a table with an identity column to have duplicates, I am happy enough with the solution that does not cover identity situations.

    Ben

    And we're back to the point of defining duplicates. An identity can prevent a whole row from being a duplicate even when it should be considered one.

    This code will generate 10 rows that are exactly the same except for the value in the identity column. Are they duplicates or not?

    CREATE TABLE dbo.EmployeeDupTest(
        Employee int IDENTITY
        ,Title varchar(16)
        ,FirstName varchar(100)
        ,MiddleName varchar(100)
        ,LastName varchar(100)
        ,Suffix varchar(20)
        ,JobTitle varchar(100)
        ,PhoneNumber varchar(50)
        ,PhoneNumberType varchar(100)
        ,EmailAddress varchar(100)
        ,EmailPromotion int
        ,AddressLine1 varchar(120)
        ,AddressLine2 varchar(120)
        ,City varchar(60)
        ,StateProvinceName varchar(100)
        ,PostalCode varchar(30)
        ,CountryRegionName varchar(100)
    )

    INSERT INTO dbo.EmployeeDupTest(
        Title
        ,FirstName
        ,MiddleName
        ,LastName
        ,Suffix
        ,JobTitle
        ,PhoneNumber
        ,PhoneNumberType
        ,EmailAddress
        ,EmailPromotion
        ,AddressLine1
        ,AddressLine2
        ,City
        ,StateProvinceName
        ,PostalCode
        ,CountryRegionName
    )
    SELECT TOP(10)
        NULL
        ,'Ben'
        ,'T'
        ,'Miller'
        ,NULL
        ,'Buyer'
        ,'151-555-0113'
        ,'Work'
        ,'ben0@adventure-works.com'
        ,0
        ,'101 Candy Rd.'
        ,NULL
        ,'Redmond'
        ,'Washington'
        ,'98052'
        ,'United States'
    FROM sys.columns

    You might not see it coded this way, but without the appropriate constraints this kind of duplicate can happen.
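
    If you do decide to treat those rows as duplicates despite the identity, here is a minimal sketch (my own addition, not part of the post above) of the ROW_NUMBER approach mentioned earlier: partition by every column except the identity and delete all but one row. Keeping the lowest Employee value is an assumption.

    -- Sketch: remove rows that are identical in everything except the identity column.
    ;WITH Dups AS
    (
        SELECT Employee,
               ROW_NUMBER() OVER(
                   PARTITION BY Title, FirstName, MiddleName, LastName, Suffix, JobTitle,
                                PhoneNumber, PhoneNumberType, EmailAddress, EmailPromotion,
                                AddressLine1, AddressLine2, City, StateProvinceName,
                                PostalCode, CountryRegionName
                   ORDER BY Employee) AS RowNo   -- keep the lowest Employee per group
        FROM dbo.EmployeeDupTest
    )
    DELETE FROM Dups      -- deleting through the CTE deletes from the base table
    WHERE RowNo > 1;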

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • ben.brugman (8/4/2015)


    ChrisM@Work (8/4/2015)


    ScottPletcher (8/3/2015)


    You can cancel the identity property by using a union or union all:

    select top (0) * into #D from D1 union all select top (0) * from d1

    insert into D1 select distinct * from #D

    Nice tip, thanks Scott.

    But the code results in a

    'An explicit value for the identity column in table 'D1' can only be specified when a column list is used and IDENTITY_INSERT is ON.'

    Ben

    See:

    IF OBJECT_ID('tempdb..#D') IS NOT NULL DROP TABLE #D
    IF OBJECT_ID('dbo.D1') IS NOT NULL DROP TABLE D1
    go

    Create table D1 (
        text varchar(300),
        number1 int,
        OK INT IDENTITY(1,1))

    insert into D1 values('A', 123)
    insert into D1 values('B', 234)
    insert into D1 values('C', 345)
    insert into D1 values('A', 234)
    insert into D1 values('b', 234)

    select * into #D from D1 union select * from d1

    insert into D1 select distinct * from #D -- This results in a 'can only be specified when a column list is used' error

    Ben

    That's because you're inserting into the base table, D1, and not the newly created temp table, #D 😉
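
    For completeness, a sketch of the corrected sequence as I read it (the reload step with IDENTITY_INSERT and the explicit column list are my additions, assuming whole rows, including the OK value, are duplicated):

    -- The UNION ALL of two empty SELECTs builds #D without the IDENTITY property.
    select top (0) * into #D from D1 union all select top (0) * from D1;

    -- Copy everything into the temp table; no identity there, so SELECT * is fine.
    insert into #D select * from D1;

    -- Reload D1 with the distinct rows, keeping the original OK values.
    truncate table D1;
    set identity_insert D1 on;
    insert into D1 ([text], number1, OK) select distinct [text], number1, OK from #D;
    set identity_insert D1 off;
    drop table #D;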

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Let's assume you start with a table called OldEmployees. It contains (4) rows; (2) of which are considered duplicates based on a requirement that EmailAddress be unique for each employee.

    create table OldEmployees
    (
        EmailAddress varchar(80) not null,
        EmployeeName varchar(80) not null,
        CreateDateTime date not null
    );

    insert into OldEmployees values
    ('jsmith@mycorp.com','John Smith','2015/06/02'),
    ('jsmith@mycorp.com','John C Smith','2015/06/05'),
    ('sblack@mycorp.com','Sue Black','2015/06/05'),
    ('sblack@mycorp.com','Susan Black','2015/06/11');

    Here we create a new table called NewEmployees with a primary key on EmailAddress. We also add the option (ignore_dup_key = on) so that any duplicate insert will be ignored.

    create table NewEmployees
    (
        EmailAddress varchar(80) not null
            primary key with (ignore_dup_key = on),
        EmployeeName varchar(80) not null,
        CreateDateTime date not null
    );

    insert into NewEmployees
    select * from OldEmployees order by CreateDateTime desc;

    Inserting from OldEmployees into NewEmployees inserts the (2) distinct rows. Which specific rows are kept can be influenced with the ORDER BY clause; in this example, we keep the rows with the latest CreateDateTime.

    Duplicate key was ignored.

    (2 row(s) affected)

    sblack@mycorp.com   Susan Black    2015-06-11
    jsmith@mycorp.com   John C Smith   2015-06-05
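
    If the de-duplicated table is meant to take the place of the original, a possible final step (my addition, not part of the post above) is to drop the old table and rename the new one; the primary key with ignore_dup_key comes along with it:

    drop table OldEmployees;
    exec sp_rename 'NewEmployees', 'OldEmployees';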

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Gazareth (8/4/2015)

    For cases where there are children, you can avoid violating constraints by deleting in child -> parent order as I specified.

    No data gets deleted. All data has to move from one client to the other client.

    All data in the tables gets moved from one client to the 'other' same client.

    Except for some 'top' tables where the data is not removed, but moved to a historical table.

    (One of the addresses and such is moved to a historical table and removed from the 'actual' set).

    So no deletes.

    It's perfectly clear, and a fairly common problem.

    Perhaps my explanation of a solution wasn't clear enough. With some demo tables & data it would be easy enough to write a worked solution for you.

    Edited.

    The data is contained in a large number of tables that form a network.

    One client can have a large number of rows in this network.

    The dependencies are a number of levels deep.

    The merge of two clients should be reversible.

    No information is deleted or taken from the database at any time.

    Most client tables have parents, most client tables have children.

    Tables can have multiple parents.

    Some tables reference themselves (directly or indirectly).

    Some clients are merged multiple times.

    The implementation is designed to work efficiently.

    Merging is considered a special situation, not a normal part of operation.

    End edit.

    Yes, it is a fairly common problem. We did build a solution, but again, it was not simple.

    (The unique row and delete question has nothing to do with this part of the thread.)

    Ben

  • Hello all, I have had a rough week *) and have not had the opportunity to reply.

    Sorry for that.

    The original question of how to delete a duplicate row has been answered extensively enough. In general that problem occurred in fairly simple situations, so the given answers were sufficient.

    During the thread a second discussion was 'created'. (I am partly at fault for that.)

    This had nothing to do with the original question of deleting rows, but was a totally separate issue.

    Luis Cazares (8/3/2015)


    ben.brugman (8/3/2015)


    Enforcing constraints often prevents actions. Most of the time it prevents 'illegal' actions, but sometimes it prevents actions which are functional.

    In general I am in favor of enforcing constraints in the database whenever feasible.

    Ben

    If a constraint prevents actions that are valid, it's a design fault. Can you give an example of a constraint that prevents actions which are functional?

    Two unrelated discussions in a single thread might lead to confusion.

    (For my part, I had to follow the answers to see which discussion each answer belonged to).

    Sorry for the confusion.

    I could start up a new thread to create an example for the second discussion.

    But I only want to put in the time and effort if people are interested in this.

    Because of the events of the past week I am fairly occupied.

    Thanks for your time and attention.

    Ben

    *)

    This week I have had two funerals.

    (My mother and my brother in law).

  • That discussion should be interesting. However, there's no hurry on creating it. Leave it for a better moment.

    My condolences go out to you and your family. I hope that you find comfort and peace.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (8/13/2015)


    My condolences go out to you and your family. I hope that you find comfort and peace.

    Thank you, Ben

  • A separate thread has been created to 'continue' the discussion about how constraints can prevent certain actions from being performed.

    See:

    Merge Two Clients

    Ben

  • I've used the following resource to remove duplicates:

    http://www.sqlservercentral.com/articles/duplicate+data/102383/


  • DECLARE @T VARCHAR(300) = 'A_Table_Name' ; DECLARE @C NVARCHAR(4000) = '''--''' SELECT @C = @C + '
    ,'+COLUMN_NAME FROM information_schema.COLUMNS where TABLE_NAME = @T
    DECLARE @S nvarchar(max) = 'WITH E AS(SELECT *, ROW_NUMBER() OVER(PARTITION BY '+@C+'
    ORDER BY (SELECT null)) R FROM '+@T+')
    DELETE FROM E WHERE R > 1;'

    PRINT @S
    EXECUTE sp_executesql @S;

    Revisiting the thread (after some time).
    Luis Cazares came up with a fairly generic solution. 
    What I did was 'simplify and shorten' his solution. This should work in most situations. **)

    Ben

    **) Not all datatypes are supported by this solution (XML columns, for example, are not).
    But it should work in most situations, and the PRINTed statement can be adapted for specific cases, for example to exclude specific columns from the 'compare'. Luis Cazares's version is more robust because names are quoted, etc.

    @T contains the table_name of the table.
    @C contains the column_names of the table. (It starts off with '--' to avoid having to remove the first generated comma; this is a lazy trick.)
    @S contains the SQL query to be processed.
    E is the name of the common table expression; this is the same table with a rownumber R over identical rows.
    New lines within the code are placed to give a 'fairly' readable result. (Column names are on separate lines.)
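
    As an illustration of the quoting point above, here is my own sketch (not Luis's code) of the same idea with QUOTENAME() around the table and column names, so unusual identifiers don't break the generated statement:

    DECLARE @T sysname = N'A_Table_Name';   -- table to de-duplicate (hypothetical name)
    DECLARE @C nvarchar(max) = N'';

    -- Build the quoted column list; the CASE avoids a leading comma.
    SELECT @C = @C + CASE WHEN @C = N'' THEN N'' ELSE N'
    ,' END + QUOTENAME(COLUMN_NAME)
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @T;

    DECLARE @S nvarchar(max) = N'WITH E AS(SELECT *, ROW_NUMBER() OVER(PARTITION BY ' + @C + N'
    ORDER BY (SELECT null)) R FROM ' + QUOTENAME(@T) + N')
    DELETE FROM E WHERE R > 1;';

    PRINT @S;
    EXECUTE sp_executesql @S;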

  • That's cool.  Clever idea on both parts.

    The one thing that's still missing is the suggestion that someone needs to create PKs or AKs on the tables that had duplicates so that they don't have to go through this fire drill again. 😀
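
    For example, something along these lines (a sketch only; the OldEmployees table and the choice of EmailAddress as the business key are borrowed from earlier in the thread, not Jeff's code):

    -- Once the duplicates are gone, an alternate key keeps them from coming back.
    ALTER TABLE OldEmployees
        ADD CONSTRAINT AK_OldEmployees_EmailAddress UNIQUE (EmailAddress);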

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Monday, March 18, 2019 10:13 AM

    That's cool.  Clever idea on both parts.

    The one thing that's still missing is the suggestion that someone needs to create PKs or AKs on the tables that had duplicates so that they don't have to go through this fire drill again. 😀

    create PKs or AKs 
    Yes, true. The times this problem occurred in 'my' surroundings, it was always in temporary tables, test machines, etc., never in a production database. During conversions, testing, importing/exporting, one just 'plods' along to get the work done. Getting your hands dirty first and then coming up with a good solution often works far better than starting off with a 'good' solution.

    And it is always nice to come up with a solution first and then correct somebody's working method. **)

    Thanks for your answer,
    Ben

    **)
    It is so much more fulfilling to solve somebody's problem and then educate them. This is much nicer for the 'culprit' and for the educator as well. And because you solved their problem, they take in your advice. Telling somebody to do it differently, although it may be excellent advice, is not always well received.
    So give a solution this time and prevent it the next time.
