DELETE DUPLICATE RECORDS

Question

DELETE DUPLICATE RECORDS

Viewing 12 posts - 16 through 26 (of 26 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 1

tosscrosby (5/23/2008)
Not a problem. I thought maybe you were calling SQL ORACLE out for a challenge! I don't have any issues here with duplicates. I just thought if you already had something, I'd like to see it. Like I said, all the different exposures help me in the long run.

Heh... yeaup... indeed I was... not very much like me.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 2

Jeff Moden (5/23/2008)
I'll crank out an example tonight...

Wow! I think I just figured out what to write about in an article!

Terry,

IF there's and IDENTITY column on the table and all the the columns, including the date and, of course, excluding the IDENTITY column, and in-place DELETE of all but the "first" dupe would look like this...

DELETE t1

FROM dbo.JBMTest t1,

dbo.JBMTest t2

WHERE t1.ProductID = t2.ProductID

AND t1.CustomerID = t2.CustomerID

AND t1.OrderAmount = t2.OrderAmount

AND t1.OrderDate = t2.OrderDate

AND t1.JBMTestID > t2.JBMTestID

Yep... you see it... a Triangular Join. Like I said in the article about them, they're not all bad...

The neat thing about that code is it's fast (Deletes ~2700 dupes from a million row table in about 19 seconds) and it keeps the transaction log from ballooning. Only the rows that actually get deleted are logged.

Without the IDENTITY column (or some other chronological tie breaker), something like the following is a bit tough on the transaction log... especially if you do like I do and test this stuff on a million rows.

--===== Copy distinct rows to a temp table

SELECT DISTINCT

ProductID, CustomerID, OrderAmount, OrderDate

INTO #MyHead

FROM dbo.JBMTest

--====== Truncate the original table

TRUNCATE TABLE dbo.JBMTest

--===== Copy the now distinct/unique rows back to the original

INSERT INTO dbo.JBMTest

(ProductID, CustomerID, OrderAmount, OrderDate)

SELECT ProductID, CustomerID, OrderAmount, OrderDate

FROM #MyHead

--===== Drop the temp table (not required in a proc)

DROP TABLE #MyHead

That only takes about 24 seconds... but imagine the load on TempDB and imagine what all those inserts back to the original table do to the transaction log, not to mention the extreme IO.

I haven't found a faster method for either situation... but the moral of the story is that if the table isn't protected from dupes by some "natural" Primary Key or Unique Index, you really need to have an IDENTITY column to break the ties on dupes. All of my ETL tables are always built to have one just for this purpose. Sure, I could have a Unique Index with and IGNORE DUPES option, but that's slower than importing dupes and deleting them.

By the way, the code above that deletes based on the IDENTITY column deletes the dupes having the largest ID's leaving the "first" occurance. To keep the "last" occurance, instead, simply change the > to a <.

Oh yeah... here's the test table I used in case you want to play... it always generates about 2,700 or 2,800 dupes...

DROP TABLE JBMTest

GO

--===== Create and populate a 1,000,000 row test table.

SELECT TOP 1000000

JBMTestID = IDENTITY(INT,1,1),

ProductID = ABS(CHECKSUM(NEWID()))%100+1, --1 to 100

CustomerID = ABS(CHECKSUM(NEWID()))%500+1, --1 to 500

OrderAmount = ABS(CHECKSUM(NEWID()))%10+1, --1 to 10

OrderDate = CAST(ABS(CHECKSUM(NEWID())%364)+36524 AS DATETIME) --Some whole date in 2000

INTO dbo.JBMTest

FROM Master.dbo.SysColumns t1,

Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

GO

--===== A table is not properly formed unless a Primary Key has been assigned

-- Takes about 1 second to execute.

ALTER TABLE dbo.JBMTest

ADD PRIMARY KEY CLUSTERED (JBMTestID)

GO

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 3

Sorry about that... BitBucket's method for deleting dupes when an IDENTITY column is present is just as fast as what I posted. Here's his code "Modenized" for the test table I used...

;WITH

Numbered AS

(

SELECT RowNum = ROW_NUMBER() OVER

(PARTITION BY ProductID,CustomerID,OrderAmount,OrderDate

ORDER BY ProductID,CustomerID,OrderAmount,OrderDate),

JBMTestID

FROM dbo.JBMTest

)

DELETE Numbered

WHERE RowNum > 1

It's pretty easy on the I/O, as well and protects the log file from unnecessary growth in a manner similar to mine.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

tfifield SSCrazy Eights Points: 9655 More actions · Answer 4

Jeff,

Wouldn't something like this work faster? I normally use this technique to delete duplicates and I've never had a performance problem:

DELETE T1

FROM dbo.JBMTest T1

INNER JOIN

(SELECT ProductID, CustomerID, OrderAmount, OrderDate

, MIN(JBMTestID) AS JBMTestID

FROM dbo.JBMTest

GROUP BY ProductID, CustomerID, OrderAmount, OrderDate

HAVING COUNT(*) > 1

) AS X ON

T1.ProductID = X.ProductID

AND T1.CustomerID = X.CustomerID

AND T1.OrderAmount = X.OrderAmount

AND T1.OrderDate = X.OrderDate

AND T1.JBMTestID > X.JBMTestID

Todd Fifield

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 5

Heh... TODD!!!??? Why are you having me figure it out... run it against the table generator I provided and report your findings! 😀

The answer to your question is "No", your way is not faster... your code took 00:06:43 on my machine to dupe check/dupe delete the million row example I gave. That's a wee bit more than the 00:00:19 than the triangular join or the CTE that Bit Bucket wrote. I haven't done a full analysis, but I think the GROUP BY is what is killing the performance on yours.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

tfifield SSCrazy Eights Points: 9655 More actions · Answer 6

Jeff,

You got me on that one! I didn't check it against a million rows - oops!

I'll try the triangular join next time.

Todd

saby SSCommitted Points: 1681 More actions · Answer 7

;)– add temporarily identity column.

alter table tName add tid int identity(1,1)

–query to delete duplicate rows from table

delete from tName where tid not in

(select min(tid) from tName a where a.firstname = tName.firstname)

– drop temporarily added identity column

alter table tName drop column tid

vinurajr SSCrazy Points: 2404 More actions · Answer 8

this link is more than enough...

http://www.sqlservercentral.com/scripts/TSQL/62866/

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 9

vinuraj (6/13/2008)
this link is more than enough...

... to bring your server right to it's knee's. It took over 10 minutes to find and delete a lousy 156 rows from a million row test table.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 10

Other than the variable about whether it's faster to select out what you want to keep, or delete out the stuff you don't want, I'd say most of the workable solutions are already covered... (and some less workable too)

Of course if you're defining duplicates as all of the copies of some data AFTER the original one, you might care to slightly adjust the "Modenized CTE" just a smidgen:

;WITH

Numbered AS

(

SELECT RowNum = ROW_NUMBER() OVER

(PARTITION BY ProductID,CustomerID,OrderAmount,OrderDate

ORDER BY ProductID,CustomerID,OrderAmount,OrderDate,JBMTestID ),--order the dupes by ID

JBMTestID

FROM dbo.JBMTest

)

DELETE Numbered

WHERE RowNum > 1

Depending on the data distribution, this should edge out the triangular join (i.e if the dupe groups are large - this should win out).

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

rishi03011980 Newbie Points: 1 More actions · Answer 11

Dear Sir,

i m new in this site.

you can apply this simple script for deleting duplicate record from table

Regards

Rishi Kumar

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 12

Which script?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)