May 16, 2008 at 11:08 am
Hi,
we have a fatally flawed 3rd-party app which has unfortunately populated a table with duplicates. To confuse me even more, only one column is unique (the clustered index) and ALL other columns have been duplicated a variable number of times.
Has anyone got an efficient delete statement for this, please? I keep ending up with a query plan that does a hash join on two work tables, which I think is the whole table joined to itself (cartesian join?), and it runs VERY slowly.
The table has 6.3 million rows and is 80MB in size. I suspect about 3/4 of the table is duplicates, so selecting the good rows out to a new table may be the best way to go.
schema:
CREATE TABLE [dbo].[AssociationLinks] (
[RecordId] [int] IDENTITY (1, 1) NOT NULL ,
[WorkspaceEntityId] [int] NOT NULL ,
[SourceEntityId] [int] NOT NULL ,
[TargetEntityId] [int] NOT NULL ,
[Name] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[RecordStatus] [int] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[AssociationLinks] WITH NOCHECK ADD
CONSTRAINT [PK_AssociationLinks] PRIMARY KEY CLUSTERED
(
[RecordId]
) WITH FILLFACTOR = 80 ON [PRIMARY]
GO
CREATE INDEX [AssociationLinks8] ON [dbo].[AssociationLinks]([SourceEntityId], [WorkspaceEntityId], [TargetEntityId]) ON [PRIMARY]
GO
CREATE INDEX [AssociationLinks2] ON [dbo].[AssociationLinks]([TargetEntityId], [WorkspaceEntityId], [SourceEntityId], [Name]) ON [PRIMARY]
GO
---------------------------------------------------------------------
May 16, 2008 at 9:59 pm
Which record ID are you going to keep when dupes are present?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 17, 2008 at 2:31 am
Jeff, hi. This is the delete code the vendors sent through, which just does not perform; I hope it can be improved upon:
delete from T1
from AssociationLinks T1, AssociationLinks T2
where T1.workspaceentityid = T2.workspaceentityid
and T1.sourceentityid = T2.sourceentityid
and T1.targetentityid = T2.targetentityid
and T1.name = T2.name
and T1.recordstatus = T2.recordstatus
and T1.recordid > T2.recordid
so it looks like the min(recordid) for each set of duplicates is to be kept. There are no foreign keys referencing this table.
george
---------------------------------------------------------------------
May 17, 2008 at 8:17 am
Have a go with this.
delete from AssociationLinks
where RecordId not in
(
select min(RecordId)
from AssociationLinks
group by WorkspaceEntityId,SourceEntityId,TargetEntityId,Name,RecordStatus
having count(*)>1
)
I tried this with a million randomly generated rows, of which 90% are duplicates, and it takes about 40 seconds. You can lower this by putting the select into a temporary table and creating a unique index on RecordId.
May 17, 2008 at 9:15 am
The problem is that both pieces of code will work very fast when there's only a couple thousand dupes, and that's not the case here... George said that there's 6.3 million rows and about 3/4 of them are dupes, which means that a little over 4.5 million rows must be deleted.
There's a point where SQL Server just loses its mind on Updates and Deletes, and it seems to occur at a little over 2 million rows on my machine. I call it the "tipping" point. Its symptoms are that it might only take 40 seconds to update or delete a million rows and only 80 seconds to update or delete 2 million rows, but try 3 million rows (for example) and the code suddenly takes several days to run. I've not been able to narrow down the cause, but I think the computer gets driven into a "swap file" mode of sorts.
The key to deleting that many rows is not in the detection of the rows (although that is certainly important), but in the number of rows being deleted per statement. With that in mind, I recommend a "Delete Crawler" that deletes the rows in chunks... it's one of the very few places where I'll use a loop.
Here's an example "Delete Crawler" from another time I had to do such a thing... you might also want to truncate the transaction log at the end of each iteration using Backup with the Truncate_Only option... the code also builds its own test data so you can "play" until you think you have it right... PLEASE... READ THE COMMENTS AND DON'T TEST THIS (or anything, for that matter) ON A PRODUCTION BOX! Change the "DELETE" near the end of the code to simulate your requirements.
--===== If the test table exists, drop it
IF OBJECT_ID('dbo.JBMTestDetail','U') IS NOT NULL
DROP TABLE dbo.JBMTestDetail
GO
--===== Create and populate a 5,000,000 row test table.
-- Column "ConnID" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(30)
-- Column "Key1" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(30)
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "Time_Stamp" has a range of >=01/01/2005 and <01/01/2015 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
SELECT TOP 5000000 --Takes about 5 minutes to build this... start at a million, if you want
ConnID = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(30)),''), --(10 rows per connection)
Key1 = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(20)),''), --just to test index with
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
Time_Stamp = ISNULL(CAST(RAND(CHECKSUM(NEWID()))*3652.0+38351.0 AS DATETIME),0),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTestDetail
FROM Master.dbo.SysColumns t1,
Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN
--===== Create indexes similar to Troy's
CREATE CLUSTERED INDEX IXC_JBMTestDetail_Time_Stamp_ConnID ON dbo.JBMTestDetail (Time_Stamp,ConnID)
CREATE NONCLUSTERED INDEX IX_JBMTestDetail_ConnID_Key1 ON dbo.JBMTestDetail (ConnID,Key1)
GO
--===== Setup to measure performance...
SET STATISTICS TIME ON
--===== Define the cutoff date with a time of "midnight" or, if you will,
-- define the cutoff date with no time so we only delete whole days.
DECLARE @CutoffDate DATETIME
SELECT @CutoffDate = DATEADD(dd,DATEDIFF(dd,0,GETDATE()),-94)
--===== Limit all further queries, including deletes, to 25,000 rows
-- (about 1 second worth of deletes, like I said before)
SET ROWCOUNT 25000 --You can make this bigger, but I wouldn't do more than a million
--===== See if any rows qualify for deletion. If even just one exists,
-- then there's work to do and @@ROWCOUNT will be > 0.
-- DO NOT PUT ANY CODE BETWEEN THIS SELECT AND THE WHILE LOOP OR
-- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP
SELECT TOP 1 1 FROM dbo.JBMTestDetail WHERE Time_Stamp < @CutoffDate
--===== If the rowcount from the above is greater than 0,
-- then delete 25,000 rows at a time until there's nothing
-- left to delete
WHILE @@ROWCOUNT > 0
BEGIN
--===== Just a "marker" to separate the loop in the output
PRINT REPLICATE('=',78)
--===== This delay gives other processes breathing room
WAITFOR DELAY '00:00:10'
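--===== Optional, per the note above: keep the log small between chunks.
--      "YourDatabaseName" is a placeholder and this syntax is only valid on SQL Server 2000/2005.
--      It sits here, rather than right after the DELETE, so it doesn't disturb the @@ROWCOUNT
--      the WHILE loop depends on.
-- BACKUP LOG YourDatabaseName WITH TRUNCATE_ONLY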
--===== Do the delete. Will be limited by the SET ROWCOUNT above.
-- DO NOT PUT ANY CODE BETWEEN THIS DELETE AND THE "END" OR
-- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP.
-- Change this part to match your requirements
DELETE dbo.JBMTestDetail WITH (TABLOCKX)
WHERE Time_Stamp < @CutoffDate
END
--===== Restore the ability to process more than 25,000 rows
SET ROWCOUNT 0
--===== Show how many rows were deleted (assumes the full 5,000,000-row build above)
SELECT 5000000 - COUNT(*) AS RowsDeleted FROM dbo.JBMTestDetail
--Jeff Moden
Change is inevitable... Change for the better is not.
May 17, 2008 at 9:17 am
Oh yeah... almost forgot... Like you said, George, it will likely be cheaper to insert the good rows into a new table because you won't have indexes and the like to update as you would with a delete. The sub-query in Michael's code would be very good for determining which rows to copy.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 17, 2008 at 10:37 am
Thank you both.
Just to re-cap, Jeff: you reckon the best option is to use Michael's subquery to select out to a new table, and that will be faster than the delete? That will give me a temp table (say) with all the correct RecordIds, and then I can select rows out from the table where RecordId is in (select RecordId from the temp table)?
Happy to try both and get back with figures; the practice will be good for me. I will be testing on our UAT box.
BTW, any top tips on how to think set-based? My focus is production DBA and my coding heyday was in the days of COBOL batch file processing, which I was good at, but I never really cut it as an OLTP programmer. I need to do more of this stuff, as there is more credit in curing problems, and you can only go so far with improving indexing strategy and physical design.
---------------------------------------------------------------------
May 17, 2008 at 12:57 pm
george sibbald (5/17/2008)
Thank you both. Just to re-cap, Jeff: you reckon the best option is to use Michael's subquery to select out to a new table, and that will be faster than the delete? That will give me a temp table (say) with all the correct RecordIds, and then I can select rows out from the table where RecordId is in (select RecordId from the temp table)?
If you do it right, the new table will have full records. Michael's subquery will give you all the correct RecordIds... use that as a driver for an outer query to do a SELECT/INTO on another table.
BTW, any top tips on how to think set-based?
Yes... the first step is to start thinking about what you want to do to a column instead of what you want to do to each row. It's a huge paradigm shift for most people, but if you just forget about the rows and start thinking about columns, someday soon the lightbulb will go on and you'll ask yourself, "This is easy... why haven't I done it this way all along?"
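A tiny illustration of the shift (the table and column names here are just made up for the example): instead of cursoring through the rows and updating one at a time, describe what the whole column should become in a single statement.
-- Row-by-row habit: open a cursor over dbo.Orders and UPDATE one row per loop pass.
-- Set-based habit: say what the column should be for every qualifying row, all at once.
UPDATE o
SET    OrderTotal = o.Quantity * o.UnitPrice
FROM   dbo.Orders o
WHERE  o.OrderTotal IS NULL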
--Jeff Moden
Change is inevitable... Change for the better is not.
May 17, 2008 at 1:03 pm
... like this...
SELECT al.*
INTO new_table_name
FROM (--=== Michael's subquery with a slight mod (alias the MIN so it can be joined on)
      select min(RecordId) as RecordId
      from dbo.AssociationLinks
      group by WorkspaceEntityId,SourceEntityId,TargetEntityId,Name,RecordStatus
      -- having count(*)>1 don't need this anymore
     ) d
INNER JOIN dbo.AssociationLinks al
        ON al.RecordId = d.RecordId
If you don't have any foreign keys on the original table, you should just be able to truncate it after the new table has been verified, and copy the rows back in. Or, drop the old table and rename the new table and add the indexes/constraints back to it.
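For the drop-and-rename route, here's a rough sketch only, using the names from the DDL at the top of the thread; verify the row counts in new_table_name before dropping anything. Note also that a SELECT/INTO that contains a join doesn't carry the IDENTITY property across, so RecordId in the new table may need that re-establishing if you still want it.
-- Hedged sketch: run only after the new table has been verified
DROP TABLE dbo.AssociationLinks
EXEC sp_rename 'dbo.new_table_name', 'AssociationLinks'

-- Re-add the constraint and indexes from the original DDL
ALTER TABLE dbo.AssociationLinks ADD
    CONSTRAINT PK_AssociationLinks PRIMARY KEY CLUSTERED (RecordId) WITH FILLFACTOR = 80
CREATE INDEX AssociationLinks8 ON dbo.AssociationLinks (SourceEntityId, WorkspaceEntityId, TargetEntityId)
CREATE INDEX AssociationLinks2 ON dbo.AssociationLinks (TargetEntityId, WorkspaceEntityId, SourceEntityId, Name)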
--Jeff Moden
Change is inevitable... Change for the better is not.
May 17, 2008 at 1:37 pm
Jeff, cheers mate, and code as well!
I like it, all in one step with no temp-table stage.
I think I get it: select the min RecordId for each group of duplicates, join it back to the table itself, and output that row to the new table (the inner join is needed because I want all the COLUMNS associated with the min RecordId COLUMN in the output).
I'll try this out when I get back to the office on Monday, and try to think about code problems from a column perspective.
george
...and thanks, Michael, for the find-duplicates input.
---------------------------------------------------------------------
May 18, 2008 at 3:42 am
First of all, the delete statement I posted has a slight problem. It handles the duplicates correctly, deleting them and leaving only the one row with the lowest RecordId, but at the same time it also deletes all the unique rows. So the HAVING clause must be removed from that delete statement - as Jeff noted, sort of.
Jeff, as for the performance issue when deleting a large subset of rows from a large table, I cannot confirm this at all, and I really don't understand why. I ran your code and generated 5 million rows, followed by doing a
delete from JBMTestDetail where Time_Stamp>='20060101' and Time_Stamp<='20131231'
which deletes 80% of the rows in about 6 minutes.
I also generated random data in George's table with 6 million rows, of which 87% are duplicates, and my delete statement takes 3.5 minutes. By changing the sub-select to a real table with a unique index I can get it down to 2.5 minutes.
I tried this on SS2000PE/SP4 and SS2005SE.
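For reference, a minimal sketch of that "real table with a unique index" variant (shown here with a temp table; the #KeepIds and index names are just illustrative, and the HAVING clause is left off per the correction above so unique rows are kept):
-- Materialise the RecordIds to keep (one per group, duplicated or not)
SELECT MIN(RecordId) AS RecordId
INTO   #KeepIds
FROM   dbo.AssociationLinks
GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus

-- A unique index helps the optimiser treat the NOT IN as an efficient anti-join
CREATE UNIQUE INDEX UX_KeepIds ON #KeepIds (RecordId)

-- Delete every row whose RecordId is not in the keep list
DELETE FROM dbo.AssociationLinks
WHERE RecordId NOT IN (SELECT RecordId FROM #KeepIds)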
By the way, on SS2005 you must not have a unique clustered index on your table, otherwise generating data using newid() produces the same value for all columns of each row. Try the following:
if object_id('dbo.t','U') is not null drop table t
create table t(
i int identity(1,1),
f1 int,
f2 int)
create unique clustered index t_1 on t(i)
go
insert into t(f1,f2)
select top 1000 abs(checksum(newid())),abs(checksum(newid()))
from master.dbo.syscolumns t1,master.dbo.syscolumns t2
go
select * from t
May 18, 2008 at 4:38 am
Michael,
Thanks for the info. So the delete should be:
delete from AssociationLinks
where RecordId not in
(
select min(RecordId)
from AssociationLinks
group by WorkspaceEntityId,SourceEntityId,TargetEntityId,Name,RecordStatus
)
and the having count(*)>1 is not required because the condition is WHERE RecordId NOT IN, so rows with a count of 1 must also be in the keep list, correct?
I can try all 3 methods: the select out, this delete, and the crawler delete. Wouldn't any large delete need to be put into a loop to batch it up, though, to avoid blowing the tran log?
BTW, I just noticed a small typo in my original post: the table size is 800MB, not 80 (inc. indexes).
---------------------------------------------------------------------
May 18, 2008 at 10:05 am
If your transaction log is 'going to blow' it will do so with or without the delete crawler. Nevertheless, check your db recovery mode setting.
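For example, on SQL 2000/2005 you can check it like this (the database name is a placeholder):
SELECT DATABASEPROPERTYEX('YourDatabaseName', 'Recovery') AS RecoveryModel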
May 18, 2008 at 10:36 am
Michael Meierruth (5/18/2008)
If your transaction log is 'going to blow' it will do so with or without the delete crawler. Nevertheless, check your db recovery mode setting.
Good catch... I should have said that it will keep the log file from "blowing" if the recovery model is set to simple. My apologies for the omission. Thanks, Michael.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 19, 2008 at 6:10 am
Just my thoughts...
-- Get lowest ID's (whether dups exist or not)
SELECT MIN(RecordID) AS RecordID
INTO   #KeptIDs
FROM   AssociationLinks
GROUP BY
       WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus

-- Snapshot all cols for these ID's
SELECT *
INTO   #KeptData
FROM   AssociationLinks
WHERE  RecordID IN (SELECT RecordID FROM #KeptIDs)

-- Clear table
TRUNCATE TABLE AssociationLinks

-- Re-populate table with "clean" data
-- nb you'll need to toggle IDENTITY_INSERT depending on table settings
-- (a column list is required when inserting explicit identity values)
INSERT INTO AssociationLinks
       (RecordID, WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus)
SELECT RecordID,
       WorkspaceEntityId,
       SourceEntityId,
       TargetEntityId,
       Name,
       RecordStatus
FROM   #KeptData
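Since RecordId is an identity column in the original DDL, the reload step needs the IDENTITY_INSERT toggle mentioned in the comment; a minimal wrapper around the INSERT above:
SET IDENTITY_INSERT dbo.AssociationLinks ON
-- ...run the INSERT INTO AssociationLinks ... SELECT ... FROM #KeptData shown above...
SET IDENTITY_INSERT dbo.AssociationLinks OFF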