Best way to eliminate duplicate sets of data

Question

Post reply

Best way to eliminate duplicate sets of data

guerillaunit

SSCarpal Tunnel

Points: 4249
More actions
February 13, 2012 at 9:41 am

#252858

I have a table that has accidently had 2 sets of the same data uploaded onto it. Unfortunately, there is no timestamp that can I can set a WHERE clause on to delete the extra set. What's the most efficient way to get rid of the extra data?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Jared SSC Guru Points: 61793 More actions · Answer 1

Some people use cursors, but I prefer the ROW_NUMBER solution:

;

WITH RowNumbers AS

(

SELECT columnA, columnB, columnC,

ROW_NUMBER() OVER (PARTITION BY columnA, columnB, columnC ORDER BY columnA, columnB, columnC) AS rn

FROM tableName

)

DELETE FROM RowNumbers WHERE rn > 1

Jared
CE - Microsoft

guerillaunit SSCarpal Tunnel Points: 4249 More actions · Answer 2

Works great. Is there a good article about how the row_number() function works? I did a google search, but the pages that are listed don't really explain how the function does its job

Jared SSC Guru Points: 61793 More actions · Answer 3

Did you see this one?http://msdn.microsoft.com/en-us/library/ms186734.aspx I think this is pretty good.

Jared
CE - Microsoft

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 4

Now that the deletion of duplicates has been resolved, you might want to take a look at the structure of your table and figure out what would make a good primary key so that this doesn't happen in the future. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150468 More actions · Answer 5

guerillaunit (2/13/2012)
What's the most efficient way to get rid of the extra data?

You can make the process more efficient still by using minimally-logged operations rather than the fully-logged DELETE statement:

CREATE TABLE #Test (columnA int NULL, columnB int NULL, columnC int NULL);

-- 100,000 rows

INSERT #Test

SELECT TOP (100000)

CHECKSUM(NEWID()),

CHECKSUM(NEWID())

FROM

master.sys.columns AS c,

master.sys.columns AS c2,

master.sys.columns AS c3;

-- Duplicate every row

INSERT #Test

SELECT * FROM #Test AS t

-- Minimally-logged SELECT...INTO

-- Uses the internal ANY aggregate

SELECT

Numbered.columnA,

Numbered.columnB,

Numbered.columnC

INTO #DeDuplicated

FROM

(

SELECT *, rn =

ROW_NUMBER() OVER (

PARTITION BY columnA, columnB, columnC

ORDER BY columnA, columnB, columnC)

FROM #Test AS t

) AS Numbered

WHERE

Numbered.rn = 1;

-- Minimally-logged TRUNCATE TABLE

TRUNCATE TABLE #Test;

-- Metadata-only SWITCH

ALTER TABLE #DeDuplicated SWITCH TO #Test;

-- Finished with this

DROP TABLE #DeDuplicated

-- Show duplicates have been removed

CREATE UNIQUE CLUSTERED INDEX cuq ON #Test (columnA, columnB, columnC);

SELECT * FROM #Test AS t;

-- Tidy up

DROP TABLE #Test;

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 6

Nicely done, Paul. Will that preserve keys, constraints, and indexes? And, yes, I realize the TRUNCATE won't work if FK's are present.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150468 More actions · Answer 7

Jeff Moden (2/13/2012)
Nicely done, Paul. Will that preserve keys, constraints, and indexes? And, yes, I realize the TRUNCATE won't work if FK's are present.

One of the conditions for SWITCH is that things like keys, constraints and indexes must match, so in that sense, yes they are preserved. Given that the source table apparently has no primary key, I thought it reasonable to assume not much in the way of constraints or indexes existed either. Anyway, after such a large DELETE, one would probably want to rebuild all indexes anyway, and this could also be minimally-logged.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 8

SQL Kiwi (2/13/2012)
Jeff Moden (2/13/2012)
Nicely done, Paul. Will that preserve keys, constraints, and indexes? And, yes, I realize the TRUNCATE won't work if FK's are present.
One of the conditions for SWITCH is that things like keys, constraints and indexes must match, so in that sense, yes they are preserved. Given that the source table apparently has no primary key, I thought it reasonable to assume not much in the way of constraints or indexes existed either. Anyway, after such a large DELETE, one would probably want to rebuild all indexes anyway, and this could also be minimally-logged.

No problems on this so far as the OP goes. I know there isn't even a PK. I was more interested in something I might take on at work. And, dammit, I did read that the tables had to have the same "gizmos" but plumb forgot. Thanks for the reminder.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)