Deleting Duplicate Data

  • This is probably a simple question, but how do I delete duplicate data in a single table.

    I used this query to identify the duplicates but how do I safely remove the "duplicate" without deleting the "original" so to speak.

    SELECT (tblVDN_BUCKET.VDN) AS [VDN Field], Count(tblVDN_BUCKET.VDN) AS NumberOfDups

    FROM tblVDN_BUCKET

    GROUP BY tblVDN_BUCKET.VDN

    HAVING (((Count(tblVDN_BUCKET.VDN))>1))

    Thanks,

    Brian

  • If the entire row is duplicate you can do:

    select distinct * into tblVDN_BUCKET2 from tblVDN_BUCKET

    drop table tblVDN_BUCKET

    Be sure to drop/create keys/indexes/etc.

    If you have another unique key field, or could add an identity UniqueKey, you could also do:

    delete tblVDN_BUCKET

    from tblVDN_BUCKET a left outer join

    (SELECT tblVDN_BUCKET.VDN, min(UniqueKey)

    FROM tblVDN_BUCKET

    GROUP BY tblVDN_BUCKET.VDN

    HAVING Count(1)>1) b a.vdn = b.vdn and a.UniqueKey = b.UniqueKey

    where b.UniqueKey is null

  • Create table Dups (pk int identity, Foo char(1))

    Insert into Dups (foo)

    Select 'a'

    union all

    Select 'b'

    union all

    Select 'c'

    union all

    Select 'd'

    union all

    Select 'e'

    union all

    Select 'a'

    union all

    Select 'c'

    union all

    Select 'e'

    -- Shows each combination of a duplicate

    select *

    from dups a

    join dups b on a.foo = b.foo

    where a.pk <> b.pk

    -- Get the max pk of each dup

    select max(a.pk)

    from dups a

    join dups b on a.foo = b.foo

    where a.pk <> b.pk

    group by a.foo

    -- Put in in clause of delete statement

    Delete Dups

    where pk in (-- Get the max pk of each dup

                 select max(a.pk)

                 from dups a

                 join dups b on a.foo = b.foo

                 where a.pk <> b.pk

                 group by a.foo

                 )

    select * from dups

    drop table dups

  • Unfortunately this query

    Delete Dups

    where pk in (-- Get the max pk of each dup

                 select max(a.pk)

                 from dups a

                 join dups b on a.foo = b.foo

                 where a.pk <> b.pk

                 group by a.foo

                 )

    assumes that you will have only 2 repeated records for 3 or more you will need to repeat the statement several times

     


    * Noel

  • Or use "not in" instead of "in"

    My suggestion would be the following:

    delete dups where pk != (select min(dups2.pk) from dups dups2 where dups2.foo = dups.foo)

    but only because I think it is easier to read

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply