Best way to eliminate duplicate sets of data

  • I have a table that has accidently had 2 sets of the same data uploaded onto it. Unfortunately, there is no timestamp that can I can set a WHERE clause on to delete the extra set. What's the most efficient way to get rid of the extra data?

  • Some people use cursors, but I prefer the ROW_NUMBER solution:

    ;

    WITH RowNumbers AS

    (

    SELECT columnA, columnB, columnC,

    ROW_NUMBER() OVER (PARTITION BY columnA, columnB, columnC ORDER BY columnA, columnB, columnC) AS rn

    FROM tableName

    )

    DELETE FROM RowNumbers WHERE rn > 1

    Jared
    CE - Microsoft

  • Works great. Is there a good article about how the row_number() function works? I did a google search, but the pages that are listed don't really explain how the function does its job

  • Did you see this one?http://msdn.microsoft.com/en-us/library/ms186734.aspx I think this is pretty good.

    Jared
    CE - Microsoft

  • Now that the deletion of duplicates has been resolved, you might want to take a look at the structure of your table and figure out what would make a good primary key so that this doesn't happen in the future. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • guerillaunit (2/13/2012)


    What's the most efficient way to get rid of the extra data?

    You can make the process more efficient still by using minimally-logged operations rather than the fully-logged DELETE statement:

    CREATE TABLE #Test (columnA int NULL, columnB int NULL, columnC int NULL);

    -- 100,000 rows

    INSERT #Test

    SELECT TOP (100000)

    CHECKSUM(NEWID()),

    CHECKSUM(NEWID()),

    CHECKSUM(NEWID())

    FROM

    master.sys.columns AS c,

    master.sys.columns AS c2,

    master.sys.columns AS c3;

    -- Duplicate every row

    INSERT #Test

    SELECT * FROM #Test AS t

    -- Minimally-logged SELECT...INTO

    -- Uses the internal ANY aggregate

    SELECT

    Numbered.columnA,

    Numbered.columnB,

    Numbered.columnC

    INTO #DeDuplicated

    FROM

    (

    SELECT *, rn =

    ROW_NUMBER() OVER (

    PARTITION BY columnA, columnB, columnC

    ORDER BY columnA, columnB, columnC)

    FROM #Test AS t

    ) AS Numbered

    WHERE

    Numbered.rn = 1;

    -- Minimally-logged TRUNCATE TABLE

    TRUNCATE TABLE #Test;

    -- Metadata-only SWITCH

    ALTER TABLE #DeDuplicated SWITCH TO #Test;

    -- Finished with this

    DROP TABLE #DeDuplicated

    -- Show duplicates have been removed

    CREATE UNIQUE CLUSTERED INDEX cuq ON #Test (columnA, columnB, columnC);

    SELECT * FROM #Test AS t;

    -- Tidy up

    DROP TABLE #Test;

  • Nicely done, Paul. Will that preserve keys, constraints, and indexes? And, yes, I realize the TRUNCATE won't work if FK's are present.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/13/2012)


    Nicely done, Paul. Will that preserve keys, constraints, and indexes? And, yes, I realize the TRUNCATE won't work if FK's are present.

    One of the conditions for SWITCH is that things like keys, constraints and indexes must match, so in that sense, yes they are preserved. Given that the source table apparently has no primary key, I thought it reasonable to assume not much in the way of constraints or indexes existed either. Anyway, after such a large DELETE, one would probably want to rebuild all indexes anyway, and this could also be minimally-logged.

  • SQL Kiwi (2/13/2012)


    Jeff Moden (2/13/2012)


    Nicely done, Paul. Will that preserve keys, constraints, and indexes? And, yes, I realize the TRUNCATE won't work if FK's are present.

    One of the conditions for SWITCH is that things like keys, constraints and indexes must match, so in that sense, yes they are preserved. Given that the source table apparently has no primary key, I thought it reasonable to assume not much in the way of constraints or indexes existed either. Anyway, after such a large DELETE, one would probably want to rebuild all indexes anyway, and this could also be minimally-logged.

    No problems on this so far as the OP goes. I know there isn't even a PK. I was more interested in something I might take on at work. And, dammit, I did read that the tables had to have the same "gizmos" but plumb forgot. Thanks for the reminder.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply