Help with deleting duplicates, please

  • Hi,

we have a fatally flawed 3rd-party app which has unfortunately populated a table with duplicates. To confuse me even more, only one column is unique (the clustered index) and ALL other columns have been duplicated a variable number of times.

Anyone got an efficient delete statement for this, please? I keep ending up with a query plan that does a hash join on two work tables, which I think is the whole table twice (cartesian join?), and it runs VERY slowly.

Table has 6.3 million rows and is 80MB in size. I suspect about 3/4 of the table is duplicates, so a select out may be the best way to go.

    schema:

CREATE TABLE [dbo].[AssociationLinks] (
    [RecordId] [int] IDENTITY (1, 1) NOT NULL ,
    [WorkspaceEntityId] [int] NOT NULL ,
    [SourceEntityId] [int] NOT NULL ,
    [TargetEntityId] [int] NOT NULL ,
    [Name] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
    [RecordStatus] [int] NOT NULL
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[AssociationLinks] WITH NOCHECK ADD
    CONSTRAINT [PK_AssociationLinks] PRIMARY KEY CLUSTERED
    (
        [RecordId]
    ) WITH FILLFACTOR = 80 ON [PRIMARY]
GO

CREATE INDEX [AssociationLinks8] ON [dbo].[AssociationLinks]([SourceEntityId], [WorkspaceEntityId], [TargetEntityId]) ON [PRIMARY]
GO

CREATE INDEX [AssociationLinks2] ON [dbo].[AssociationLinks]([TargetEntityId], [WorkspaceEntityId], [SourceEntityId], [Name]) ON [PRIMARY]
GO

    ---------------------------------------------------------------------

  • Which record ID are you going to keep when dupes are present?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Jeff, hi. This is the code the vendor sent through for the delete; it just does not perform, and I hope it can be improved upon:

delete from T1
from AssociationLinks T1, AssociationLinks T2
where T1.workspaceentityid = T2.workspaceentityid
  and T1.sourceentityid = T2.sourceentityid
  and T1.targetentityid = T2.targetentityid
  and T1.name = T2.name
  and T1.recordstatus = T2.recordstatus
  and T1.recordid > T2.recordid

So it looks like the min(recordid) for each set of duplicates is to be kept. There are no foreign keys referencing this table.

    george

    ---------------------------------------------------------------------

  • Have a go with this.

delete from AssociationLinks
where RecordId not in
(
    select min(RecordId)
    from AssociationLinks
    group by WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus
    having count(*) > 1
)

    I tried this with a million randomly generated rows of which 90% are duplicates and it takes about 40 seconds. You can lower this by putting the select into a temporary table and generating a unique index on RecordId.
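Something like the following is a minimal sketch of that temp-table variant (the #KeepIds and IX_KeepIds names are just placeholders I made up):

-- Capture the RecordId to keep for every group of identical values
select min(RecordId) as RecordId
into #KeepIds
from AssociationLinks
group by WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus

-- A unique index on the keeper list speeds up the NOT IN check
create unique index IX_KeepIds on #KeepIds (RecordId)

-- Note: no HAVING clause here, so unique rows keep their own RecordId in the list and survive the delete
delete from AssociationLinks
where RecordId not in (select RecordId from #KeepIds)

drop table #KeepIds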

The problem is that both pieces of code will work very fast when there's only a couple thousand dupes, and that's not the case here... George said that there are 6.3 million rows and about 3/4 of them are dupes, which means that a little over 4.5 million rows must be deleted.

There's a point where SQL Server just loses its mind on updates and deletes, and it seems to occur at a little over 2 million rows on my machine. I call it the "tipping" point. Its symptoms are that it might only take 40 seconds to update or delete a million rows and only 80 seconds to update or delete 2 million rows, but try 3 million rows (for example) and the code suddenly takes several days to run. I've not been able to narrow down the cause, but I think the computer gets driven into a "swap file" mode of sorts.

    The key to deleting that many rows is not in the detection of the rows (although that is certainly important), but in the number of rows being deleted. With that in mind, I recommend a "Delete Crawler" that deletes the rows in chunks... it's one of the very few places where I'll use a loop.

Here's an example "Delete Crawler" from another time I had to do such a thing... you might also want to truncate the transaction log at the end of each iteration using BACKUP with the TRUNCATE_ONLY option... the code also builds its own test data so you can "play" until you think you have it right... PLEASE... READ THE COMMENTS AND DON'T TEST THIS (or anything, for that matter) ON A PRODUCTION BOX! Change the "DELETE" near the end of the code to match your requirements.

--===== If the test table exists, drop it
IF OBJECT_ID('dbo.JBMTestDetail','U') IS NOT NULL
    DROP TABLE dbo.JBMTestDetail
GO
--===== Create and populate a 5,000,000 row test table.
     -- Column "ConnID" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(30)
     -- Column "Key1" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(30)
     -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
     -- Column "Time_Stamp" has a range of >=01/01/2005 and <01/01/2015 non-unique date/times
     -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
     -- for all rows.
     -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
 SELECT TOP 5000000 --Takes about 5 minutes to build this... start at a million, if you want
        ConnID       = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(30)),''), --(10 rows per connection)
        Key1         = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(20)),''), --just to test index with
        SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                     + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
        SomeCSV      = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
        SomeMoney    = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
        Time_Stamp   = ISNULL(CAST(RAND(CHECKSUM(NEWID()))*3652.0+38351.0 AS DATETIME),0),
        SomeHex12    = RIGHT(NEWID(),12)
   INTO dbo.JBMTestDetail
   FROM Master.dbo.SysColumns t1,
        Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

--===== Create indexes similar to Troy's
CREATE CLUSTERED INDEX IXC_JBMTestDetail_Time_Stamp_ConnID ON dbo.JBMTestDetail (Time_Stamp,ConnID)
CREATE NONCLUSTERED INDEX IX_JBMTestDetail_ConnID_Key1 ON dbo.JBMTestDetail (ConnID,Key1)
GO

--===== Setup to measure performance...
SET STATISTICS TIME ON

--===== Define the cutoff date with a time of "midnight" or, if you will,
     -- define the cutoff date with no time so we only delete whole days.
DECLARE @CutoffDate DATETIME
SELECT @CutoffDate = DATEADD(dd,DATEDIFF(dd,0,GETDATE()),-94)

--===== Limit all further queries, including deletes, to 25,000 rows
     -- (about 1 second worth of deletes, like I said before)
SET ROWCOUNT 25000 --You can make this bigger, but I wouldn't do more than a million

--===== See if any rows qualify for deletion. If even just one exists,
     -- then there's work to do and @@ROWCOUNT will be > 0.
     -- DO NOT PUT ANY CODE BETWEEN THIS SELECT AND THE WHILE LOOP OR
     -- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP
 SELECT TOP 1 1 FROM dbo.JBMTestDetail WHERE Time_Stamp < @CutoffDate

--===== If the rowcount from the above is greater than 0,
     -- then delete 25,000 rows at a time until there's nothing
     -- left to delete
  WHILE @@ROWCOUNT > 0
  BEGIN
        --===== Just a "marker" to separate the loop in the output
         PRINT REPLICATE('=',78)

        --===== This delay gives other processes breathing room
        WAITFOR DELAY '00:00:10'

        --===== Do the delete. Will be limited by the SET ROWCOUNT above.
             -- DO NOT PUT ANY CODE BETWEEN THIS DELETE AND THE "END" OR
             -- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP.
             -- Change this part to match your requirements
        DELETE dbo.JBMTestDetail WITH (TABLOCKX)
         WHERE Time_Stamp < @CutoffDate
    END

--===== Restore the ability to process more than 25,000 rows
SET ROWCOUNT 0

--===== How many rows were deleted (the test table started with 5,000,000 rows)
 SELECT 5000000 - COUNT(*) FROM dbo.JBMTestDetail
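For the duplicate problem in this thread, the detection-and-delete part of the crawler might look something like this... just a sketch, and it assumes a #KeepIds temp table holding the MIN(RecordId) of every group has already been built from Michael's sub-query:

SET ROWCOUNT 25000

--===== Seed @@ROWCOUNT: is there at least one duplicate left to delete?
 SELECT TOP 1 1
   FROM dbo.AssociationLinks
  WHERE RecordId NOT IN (SELECT RecordId FROM #KeepIds)

  WHILE @@ROWCOUNT > 0
  BEGIN
        --===== Give other processes breathing room between chunks
        WAITFOR DELAY '00:00:10'

        --===== Delete up to 25,000 duplicate rows per pass
        DELETE dbo.AssociationLinks WITH (TABLOCKX)
         WHERE RecordId NOT IN (SELECT RecordId FROM #KeepIds)
    END

SET ROWCOUNT 0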

--Jeff Moden

Oh yeah... almost forgot... Like you said, George, it will likely be cheaper to insert the good rows into a new table because you won't have indexes and the like to update as you would with a delete. The sub-query in Michael's code would be very good for determining which rows to copy.

--Jeff Moden

Thank you both.

Just to recap, Jeff: you reckon the best option is to use Michael's subquery to select out to a new table, and that it will be faster than the delete? That will give me a temp table (say) with all the correct recordids, and then I can select rows out from the table where recordid in (select recordid from temp table)?

    Happy to try both and get back with figures, practice will be good for me. I will be testing out on our UAT box.

BTW, any top tips on how to think set-based? My focus is production DBA, and my coding heyday was in the days of COBOL batch file processing, which I was good at, but I never really cut it as an OLTP programmer. I need to do more of this stuff, as there is more credit in curing problems, and you can only go so far with improving indexing strategy and physical design.

    ---------------------------------------------------------------------

  • george sibbald (5/17/2008)


Thank you both.

Just to recap, Jeff: you reckon the best option is to use Michael's subquery to select out to a new table, and that it will be faster than the delete? That will give me a temp table (say) with all the correct recordids, and then I can select rows out from the table where recordid in (select recordid from temp table)?

If you do it right, the new table will have full records. Michael's subquery will give you all the correct RecordIds... use that as a driver for an outer query to do a SELECT/INTO into another table.

BTW, any top tips on how to think set-based?

Yes... the first step is to start thinking about what you want to do to a column instead of what you want to do to each row. It's a huge paradigm shift for most people, but if you just forget about the rows and start thinking about columns, someday soon the lightbulb will come on and you'll ask yourself, "This is easy... why haven't I done it this way all along?"

--Jeff Moden

  • ... like this...

SELECT al.*
  INTO new_table_name
  FROM (--=== Michael's subquery with a slight mod: alias the MIN so it can be joined on
        SELECT MIN(RecordId) AS RecordId
          FROM dbo.AssociationLinks
         GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus
        -- HAVING COUNT(*) > 1 ...don't need this anymore
       ) d
 INNER JOIN dbo.AssociationLinks al
    ON al.RecordId = d.RecordId

    If you don't have any foreign keys on the original table, you should just be able to truncate it after the new table has been verified, and copy the rows back in. Or, drop the old table and rename the new table and add the indexes/constraints back to it.
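If you go the drop-and-rename route, a rough sketch would be the following (it uses the key and index names from George's schema; verify the new table first, and keep in mind that a SELECT/INTO containing a join does not carry the IDENTITY property across to the new table):

DROP TABLE dbo.AssociationLinks
EXEC sp_rename 'dbo.new_table_name', 'AssociationLinks'

-- Re-create the original key and indexes on the renamed table
ALTER TABLE dbo.AssociationLinks
  ADD CONSTRAINT PK_AssociationLinks PRIMARY KEY CLUSTERED (RecordId) WITH FILLFACTOR = 80

CREATE INDEX AssociationLinks8 ON dbo.AssociationLinks (SourceEntityId, WorkspaceEntityId, TargetEntityId)
CREATE INDEX AssociationLinks2 ON dbo.AssociationLinks (TargetEntityId, WorkspaceEntityId, SourceEntityId, Name)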

--Jeff Moden

  • Jeff, cheers mate, code as well!

    I like it, all in one step with no temp table stage.

I think I get it: select the min recordid for each group of duplicates, join it back to the table itself, and output that row to the new table (the inner join is needed because I want all the COLUMNS output that are associated with the min recordid COLUMN).

    I'll try this out when I get back to the office monday, and try to think about code problems from a column perspective.

    george

...and thanks, Michael, for the find-duplicates input.

    ---------------------------------------------------------------------

First of all, the delete statement I posted has a slight problem. It works correctly for the duplicates: it deletes them and leaves only the one row with the lowest RecordId. But at the same time it also deletes all the unique rows. Thus the HAVING clause must be removed from this delete statement - as Jeff noted, sort of.

Jeff, as for the performance issue when deleting a large subset of rows from a large table, I cannot confirm it at all, and I really don't understand why. I ran your code and generated 5 million rows, followed by doing a

    delete from JBMTestDetail where Time_Stamp>='20060101' and Time_Stamp<='20131231'

    which deletes 80% of the rows in about 6 minutes.

    I also generated random data in George's table with 6 million rows of which 87% are duplicates and my delete statement takes 3.5 minutes. By changing the sub-select to a real table with a unique index I can get it down to 2.5 minutes.

    I tried this on SS2000PE/SP4 and SS2005SE.

By the way, on SS2005 you must not have a unique clustered index on your table; otherwise, generating data using newid() produces the same value for all columns of each row. Try the following to see what I mean:

drop table t
create table t(
    i int identity(1,1),
    f1 int,
    f2 int)
create unique clustered index t_1 on t(i)
go
insert into t(f1,f2)
select top 1000 abs(checksum(newid())),abs(checksum(newid()))
from master.dbo.syscolumns t1,master.dbo.syscolumns t2
go
select * from t

  • Michael,

Thanks for the info. So the delete should be:

delete from AssociationLinks
where RecordId not in
(
    select min(RecordId)
    from AssociationLinks
    group by WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus
)

and the having count(*)>1 is not required because the condition is where recordid NOT IN, so rows whose group has a count of 1 are still returned by the subquery and therefore kept, correct?

I can try all 3 methods: the select out, this delete, and the crawler delete. Wouldn't any large delete need to be put into a loop to batch it up though, to avoid blowing the tran log?

BTW, just noticed a small typo in my original post: the table size is 800MB, not 80MB (inc. indexes).

    ---------------------------------------------------------------------

  • If your transaction log is 'going to blow' it will do so with or without the delete crawler. Nevertheless, check your db recovery mode setting.
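Something like this will show it (replace 'YourDatabase' with the real database name):

-- Returns SIMPLE, FULL or BULK_LOGGED
SELECT DATABASEPROPERTYEX('YourDatabase', 'Recovery')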

  • Michael Meierruth (5/18/2008)


    If your transaction log is 'going to blow' it will do so with or without the delete crawler. Nevertheless, check your db recovery mode setting.

Good catch... I should have said that it will keep the log file from "blowing" if the recovery model is set to SIMPLE. My apologies for the omission. Thanks, Michael.

--Jeff Moden

  • Just my thoughts...

-- Get lowest ID's (whether dups exist or not)
SELECT MIN(RecordID) AS 'RecordID'
  INTO #KeptIDs
  FROM AssociationLinks
 GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus

-- Snapshot all cols for these ID's
SELECT *
  INTO #KeptData
  FROM AssociationLinks
 WHERE RecordID IN (SELECT RecordID FROM #KeptIDs)

-- Clear table
TRUNCATE TABLE AssociationLinks

-- Re-populate table with "clean" data
-- nb you'll need to toggle IDENTITY INSERT depending on table settings
INSERT INTO AssociationLinks
SELECT RecordID,
       WorkspaceEntityId,
       SourceEntityId,
       TargetEntityId,
       Name,
       RecordStatus
  FROM #KeptData
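For the IDENTITY INSERT note above, the re-populate step would look roughly like this... the explicit column list is required while IDENTITY_INSERT is ON:

SET IDENTITY_INSERT AssociationLinks ON

INSERT INTO AssociationLinks
       (RecordID, WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus)
SELECT RecordID,
       WorkspaceEntityId,
       SourceEntityId,
       TargetEntityId,
       Name,
       RecordStatus
  FROM #KeptData

SET IDENTITY_INSERT AssociationLinks OFF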
