Removing Duplicate Records Efficiently

  • I periodically get 12 million records with names, addresses and phone numbers. In the data there are some duplicate phone numbers. Where there are duplicates, I only want one of the records - which one does not matter. Is there an efficient way to get rid of the dups without running a cursor?

    Thanks,

    Dan

  • Depending on how you get the records and what fields are duplicated you have multiple options. How many columns must have the same data to be considered a duplicate. If all then SELECT DISTINCT will handle. If less then is there any existing field such as a date which determines which to keep or is it just keep one. If keep one then setup a processing table with an IDENTITY column to key off of.

    Then query something like this.

    SELECT * FROM tblProc oq WHERE IDX = (SELECT TOP 1 IDX FROM tblProc iq WHERE iq.Col1 = oq.Col1 AND iq.Col2 = oq.Col2 AND iq.Col3 = oq.Col3 ORDER BY IDX)

    -- Basically where the duplicating fields are equal.

    Now if you have a date or other unique column and want to keep the last record query like this on the data (again if coming from another server may not need processing table)

    SELECT * FROM tblProc oq WHERE ColDate = (SELECT Max(ColDate) FROM tblProc iq WHERE iq.Col1 = oq.Col1 AND iq.Col2 = oq.Col2 AND iq.Col3 = oq.Col3)

    It all boils down to how you decide what to keep and what to throw away.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • A quick and dirty alternative is to make a copy of the table structure and create a unique index with ignore dupes on your key field(s). Then append the data, applying an order by that makes sense (keep newest, keep oldest, etc).

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • Hi,

    Maybe you can try this if you want to avoid creating any temp table or using any cursor -->

    SET ROWCOUNT 1

    SELECT 'Start Removing Duplicate Record...' -- Don't remove, seed record

    WHILE (@@ROWCOUNT>=1)

    BEGIN

    DELETE TABLE_1 WHERE COL_1 in ( SELECT COL_1 FROM TABLE_1 GROUP BY COL_1 HAVING COUNT(*) > 1)

    END

    SET ROWCOUNT 0

    SELECT * FROM TABLE_1

    Basically it just a WHILE..LOOP process to keep scanning TABLE_1, until there is not more duplicate record, @@ROWCOUNT will return 0 and the process will stop.

    Another alternative/dirty way which I used to do that before I came out with this script is :

    SELECT DISTINCT * INTO TEMP_TABLE FROM TABLE_1

    TRUNCATE TABLE TABLE_1

    INSERT INTO TABLE_1 SELECT * FROM TEMP_TABLE

    I find this still the simplest way, but it required high IO resources.

    Hope this help!!!

    or you can reach me at :

    ywmeng@rocketmail.com

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply