May 23, 2008 at 5:36 pm
tosscrosby (5/23/2008)
Not a problem. I thought maybe you were calling SQL ORACLE out for a challenge! I don't have any issues here with duplicates. I just thought if you already had something, I'd like to see it. Like I said, all the different exposures help me in the long run.
Heh... yeaup... indeed I was... not very much like me.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 23, 2008 at 7:25 pm
Jeff Moden (5/23/2008)
I'll crank out an example tonight...
Wow! I think I just figured out what to write about in an article!
Terry,
IF there's and IDENTITY column on the table and all the the columns, including the date and, of course, excluding the IDENTITY column, and in-place DELETE of all but the "first" dupe would look like this...
DELETE t1
FROM dbo.JBMTest t1,
dbo.JBMTest t2
WHERE t1.ProductID = t2.ProductID
AND t1.CustomerID = t2.CustomerID
AND t1.OrderAmount = t2.OrderAmount
AND t1.OrderDate = t2.OrderDate
AND t1.JBMTestID > t2.JBMTestID
Yep... you see it... a Triangular Join. Like I said in the article about them, they're not all bad...
The neat thing about that code is it's fast (Deletes ~2700 dupes from a million row table in about 19 seconds) and it keeps the transaction log from ballooning. Only the rows that actually get deleted are logged.
Without the IDENTITY column (or some other chronological tie breaker), something like the following is a bit tough on the transaction log... especially if you do like I do and test this stuff on a million rows.
--===== Copy distinct rows to a temp table
SELECT DISTINCT
ProductID, CustomerID, OrderAmount, OrderDate
INTO #MyHead
FROM dbo.JBMTest
--====== Truncate the original table
TRUNCATE TABLE dbo.JBMTest
--===== Copy the now distinct/unique rows back to the original
INSERT INTO dbo.JBMTest
(ProductID, CustomerID, OrderAmount, OrderDate)
SELECT ProductID, CustomerID, OrderAmount, OrderDate
FROM #MyHead
--===== Drop the temp table (not required in a proc)
DROP TABLE #MyHead
That only takes about 24 seconds... but imagine the load on TempDB and imagine what all those inserts back to the original table do to the transaction log, not to mention the extreme IO.
I haven't found a faster method for either situation... but the moral of the story is that if the table isn't protected from dupes by some "natural" Primary Key or Unique Index, you really need to have an IDENTITY column to break the ties on dupes. All of my ETL tables are always built to have one just for this purpose. Sure, I could have a Unique Index with and IGNORE DUPES option, but that's slower than importing dupes and deleting them.
By the way, the code above that deletes based on the IDENTITY column deletes the dupes having the largest ID's leaving the "first" occurance. To keep the "last" occurance, instead, simply change the > to a <.
Oh yeah... here's the test table I used in case you want to play... it always generates about 2,700 or 2,800 dupes...
DROP TABLE JBMTest
GO
--===== Create and populate a 1,000,000 row test table.
SELECT TOP 1000000
JBMTestID = IDENTITY(INT,1,1),
ProductID = ABS(CHECKSUM(NEWID()))%100+1, --1 to 100
CustomerID = ABS(CHECKSUM(NEWID()))%500+1, --1 to 500
OrderAmount = ABS(CHECKSUM(NEWID()))%10+1, --1 to 10
OrderDate = CAST(ABS(CHECKSUM(NEWID())%364)+36524 AS DATETIME) --Some whole date in 2000
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1,
Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN
GO
--===== A table is not properly formed unless a Primary Key has been assigned
-- Takes about 1 second to execute.
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (JBMTestID)
GO
--Jeff Moden
Change is inevitable... Change for the better is not.
May 23, 2008 at 7:37 pm
Sorry about that... BitBucket's method for deleting dupes when an IDENTITY column is present is just as fast as what I posted. Here's his code "Modenized" for the test table I used...
;WITH
Numbered AS
(
SELECT RowNum = ROW_NUMBER() OVER
(PARTITION BY ProductID,CustomerID,OrderAmount,OrderDate
ORDER BY ProductID,CustomerID,OrderAmount,OrderDate),
JBMTestID
FROM dbo.JBMTest
)
DELETE Numbered
WHERE RowNum > 1
It's pretty easy on the I/O, as well and protects the log file from unnecessary growth in a manner similar to mine.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 24, 2008 at 11:41 am
Jeff,
Wouldn't something like this work faster? I normally use this technique to delete duplicates and I've never had a performance problem:
DELETE T1
FROM dbo.JBMTest T1
INNER JOIN
(SELECT ProductID, CustomerID, OrderAmount, OrderDate
, MIN(JBMTestID) AS JBMTestID
FROM dbo.JBMTest
GROUP BY ProductID, CustomerID, OrderAmount, OrderDate
HAVING COUNT(*) > 1
) AS X ON
T1.ProductID = X.ProductID
AND T1.CustomerID = X.CustomerID
AND T1.OrderAmount = X.OrderAmount
AND T1.OrderDate = X.OrderDate
AND T1.JBMTestID > X.JBMTestID
Todd Fifield
May 24, 2008 at 4:13 pm
Heh... TODD!!!??? Why are you having me figure it out... run it against the table generator I provided and report your findings! π
The answer to your question is "No", your way is not faster... your code took 00:06:43 on my machine to dupe check/dupe delete the million row example I gave. That's a wee bit more than the 00:00:19 than the triangular join or the CTE that Bit Bucket wrote. I haven't done a full analysis, but I think the GROUP BY is what is killing the performance on yours.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 25, 2008 at 11:31 am
Jeff,
You got me on that one! I didn't check it against a million rows - oops!
I'll try the triangular join next time.
Todd
June 13, 2008 at 12:06 am
;)β add temporarily identity column.
alter table tName add tid int identity(1,1)
βquery to delete duplicate rows from table
delete from tName where tid not in
(select min(tid) from tName a where a.firstname = tName.firstname)
β drop temporarily added identity column
alter table tName drop column tid
June 13, 2008 at 12:30 pm
this link is more than enough...
June 13, 2008 at 1:15 pm
vinuraj (6/13/2008)
this link is more than enough...
... to bring your server right to it's knee's. It took over 10 minutes to find and delete a lousy 156 rows from a million row test table.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 13, 2008 at 1:32 pm
Other than the variable about whether it's faster to select out what you want to keep, or delete out the stuff you don't want, I'd say most of the workable solutions are already covered... (and some less workable too)
Of course if you're defining duplicates as all of the copies of some data AFTER the original one, you might care to slightly adjust the "Modenized CTE" just a smidgen:
;WITH
Numbered AS
(
SELECT RowNum = ROW_NUMBER() OVER
(PARTITION BY ProductID,CustomerID,OrderAmount,OrderDate
ORDER BY ProductID,CustomerID,OrderAmount,OrderDate,JBMTestID ),--order the dupes by ID
JBMTestID
FROM dbo.JBMTest
)
DELETE Numbered
WHERE RowNum > 1
Depending on the data distribution, this should edge out the triangular join (i.e if the dupe groups are large - this should win out).
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
June 14, 2008 at 7:11 am
Dear Sir,
i m new in this site.
you can apply this simple script for deleting duplicate record from table
Regards
Rishi Kumar
June 14, 2008 at 5:31 pm
Which script?
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 12 posts - 16 through 26 (of 26 total)
You must be logged in to reply to this topic. Login to reply