Deleting duplicates from 37 million records

  • Hi All,

    We have a table of 37 million rows with 20+ columns of person information. Almost 7 million of those records have duplicate emails, though the other fields differ.

    Below is the approach we have in mind:

    Insert into New
    select * from [37million]
    where Rid in (select max(Rid) from [37million] group by email)

    Please suggest a better way than this.

    Thanks in advance.
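The MAX(Rid)-per-email idea is easy to sanity-check on a toy dataset. Below is a minimal sketch using Python's built-in sqlite3 as a stand-in for SQL Server; the `people` table and its `Rid`/`Email` columns are invented for the demo, and real T-SQL at this scale would use `INSERT ... SELECT` or `SELECT ... INTO` rather than `CREATE TABLE AS`:

```python
import sqlite3

# Toy version of "copy one row per email (the highest Rid) into a fresh table".
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (Rid INTEGER PRIMARY KEY, Name TEXT, Email TEXT)")
cur.executemany(
    "INSERT INTO people (Rid, Name, Email) VALUES (?, ?, ?)",
    [(1, "Ann", "a@x.com"), (2, "Ann B", "a@x.com"),
     (3, "Bob", "b@x.com"), (4, "Cat", "c@x.com")],
)

# Keep only the row with the highest Rid for each email.
cur.execute("CREATE TABLE people_dedup AS "
            "SELECT * FROM people "
            "WHERE Rid IN (SELECT MAX(Rid) FROM people GROUP BY Email)")

rows = cur.execute("SELECT Rid, Email FROM people_dedup ORDER BY Rid").fetchall()
print(rows)  # [(2, 'a@x.com'), (3, 'b@x.com'), (4, 'c@x.com')]
```

Note that which version of the other 20+ columns survives is arbitrary beyond "the row with the highest Rid".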

  • You'll need to do a select for distinct emails and a count where the same address appears more than once...

    select email
    from [37million]
    group by email
    having count(*) > 1

    Note: this will identify all the emails appearing more than once; you will need an extra filter when deleting or inserting, or else you'll move all the qualifying rows, not only the duplicates.

    Use your RID to identify the rows to move...

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    This thing is addressing problems that don't exist. It's solutionism at its worst. We are dumbing down machines that are inherently superior. - Gilfoyle
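The extra filtering step matters because `GROUP BY ... HAVING COUNT(*) > 1` returns the duplicated email *values*, not the rows to delete. A small sqlite3 sketch (toy table and column names, assuming the highest-Rid row is the one kept) shows both steps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (Rid INTEGER PRIMARY KEY, Email TEXT)")
cur.executemany("INSERT INTO people VALUES (?, ?)",
                [(1, "a@x.com"), (2, "a@x.com"), (3, "b@x.com")])

# Step 1: find the email addresses that appear more than once.
dups = cur.execute(
    "SELECT Email FROM people GROUP BY Email HAVING COUNT(*) > 1"
).fetchall()
print(dups)  # [('a@x.com',)] -- the email value, not the rows to delete

# Step 2: the extra Rid filter. Without it, deleting by email alone would
# remove BOTH a@x.com rows instead of leaving one survivor.
cur.execute("""
    DELETE FROM people
    WHERE Email IN (SELECT Email FROM people GROUP BY Email HAVING COUNT(*) > 1)
      AND Rid NOT IN (SELECT MAX(Rid) FROM people GROUP BY Email)
""")
kept = cur.execute("SELECT Rid FROM people ORDER BY Rid").fetchall()
print(kept)  # [(2,), (3,)]
```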

  • Deleting the duplicates in place will take a long time, I guess, so I want to create a fresh table containing every record with the duplicates removed. Once the rows are inserted, I'll drop the original table.

    Please suggest the best way to do that.

  • Insert into New
    select * from [37million]
    where Rid in (select max(Rid) from [37million] group by email)

    or

    DELETE FROM [Table]
    WHERE [RID] IN
    (
        SELECT a.[RID]
        FROM [Table] a,
             [Table] b
        WHERE a.[RID] != b.[RID]
          AND a.[RID] < b.[RID]
          AND a.[Email] = b.[Email]
    )

    Please tell me which one is better among the above two.
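Assuming the goal is to keep the highest Rid per email, the two options leave the same survivors: the self-join DELETE removes every row that has a duplicate with a higher Rid. A sqlite3 sketch (toy data, illustrative names) of the DELETE variant:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (Rid INTEGER PRIMARY KEY, Email TEXT)")
cur.executemany("INSERT INTO people VALUES (?, ?)",
                [(1, "a@x.com"), (2, "a@x.com"), (3, "b@x.com"), (4, "a@x.com")])

# Self-join DELETE: drop every row that shares an email with a higher-Rid row,
# leaving exactly the MAX(Rid) row for each email.
cur.execute("""
    DELETE FROM people
    WHERE Rid IN (
        SELECT a.Rid
        FROM people a JOIN people b
          ON a.Email = b.Email AND a.Rid < b.Rid
    )
""")
kept = cur.execute("SELECT Rid, Email FROM people ORDER BY Rid").fetchall()
print(kept)  # [(3, 'b@x.com'), (4, 'a@x.com')]
```

The practical difference at 37 million rows is logging: a single DELETE of ~7 million rows in one transaction can bloat the transaction log, which is one reason inserting the survivors into a new table (which, depending on recovery model, can be minimally logged) is often preferred at this scale.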

  • Can you provide a partial DDL for the table (primary key(s) and the e-mail column) and some sample data? Also, how do you want to identify which record of each duplicate e-mail address to keep?

  • Sample data:

    Id FirstName LastName Address1 Address2 City State Zip ... Emailaddress (20+ columns)

    All columns have datatype nvarchar(255) except the Id column.

    Any email address is OK; there's no priority like that.

  • with EmailRecords as (
        select
            row_number() over (partition by Emailaddress order by Id desc) as RowNumber, -- Assuming highest Id is most recent
            Id,
            FirstName,
            LastName[,...]
        from
            dbo.[37MillionRecTable]
    )
    insert into dbo.[NewTable]
    select
        Id,
        FirstName[,...] -- Note: do not include RowNumber
    from
        EmailRecords
    where
        RowNumber = 1;

    /*
    After running this, you can drop the original table and then rename [NewTable] to the original table name.
    */

    The above code is untested and truly incomplete as you didn't really provide us with anything to allow testing.
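The ROW_NUMBER() pattern can be exercised on a toy table. SQLite (3.25+) also supports window functions, so this Python sqlite3 sketch mirrors the CTE with invented columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (Id INTEGER PRIMARY KEY, FirstName TEXT, Emailaddress TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [(1, "Ann", "a@x.com"), (2, "Anne", "a@x.com"), (3, "Bob", "b@x.com")])

# ROW_NUMBER() numbers the rows within each email, newest Id first;
# keeping RowNumber = 1 keeps exactly one row per email.
cur.execute("""
    CREATE TABLE dst AS
    WITH EmailRecords AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY Emailaddress ORDER BY Id DESC) AS RowNumber,
               Id, FirstName, Emailaddress
        FROM src
    )
    SELECT Id, FirstName, Emailaddress   -- RowNumber deliberately excluded
    FROM EmailRecords
    WHERE RowNumber = 1
""")
result = cur.execute("SELECT Id, Emailaddress FROM dst ORDER BY Id").fetchall()
print(result)  # [(2, 'a@x.com'), (3, 'b@x.com')]
```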

  • How many duplicates do you have?

    This might do it for you. Set @rcnt to the number of duplicates.

    I know this is effectively a cursor, but I do come from the Oracle world.

    You will need to change it for your tables, but I just used this to delete over 600 duplicates in 600,000 records. You could use @@ROWCOUNT and not use the @rcnt counter.

    declare @rcnt int
    select @rcnt = 700

    while @rcnt > 0
    begin
        delete top (1)
        from tmp_industry_stock
        where customer in (select Customer
                           from tmp_industry_stock
                           group by Country, Sector, Industry, Customer
                           having count(*) > 1);
        select @rcnt = @rcnt - 1;
    end
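A set-based way to get a similar effect, without deleting one row per loop iteration, is to delete duplicates in chunks until none remain. This sqlite3 sketch (toy table and names, tiny batch size) keeps the highest-Rid row per email; in T-SQL the loop body would use `DELETE TOP (n)` rather than the `LIMIT` shown here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (Rid INTEGER PRIMARY KEY, Email TEXT)")
# 10 rows over 5 distinct emails, so 5 duplicate rows need removing.
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, f"u{i % 5}@x.com") for i in range(10)])

batch = 2  # tiny for the demo; real batches would be thousands of rows
while True:
    # Delete up to `batch` rows that have a duplicate with a higher Rid.
    cur.execute("""
        DELETE FROM t
        WHERE Rid IN (
            SELECT a.Rid
            FROM t a JOIN t b ON a.Email = b.Email AND a.Rid < b.Rid
            LIMIT ?
        )
    """, (batch,))
    if cur.rowcount == 0:  # nothing left to delete
        break

remaining = cur.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(remaining)  # 5
```

Small batches keep each transaction (and its log usage) bounded, at the cost of rescanning for duplicates on every pass.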

  • LOL, I had the same issue (~100M rows)... BUT do you have a proper way to say WHICH of those records are the VALID ones? It might not be MAX(ID); take that into consideration...

    The solution I used was to delete the duplicates in batches (not sure what servers other people have, but mine would just crawl if I tried to move such big amounts in one transaction on a live system, and the log file would just blow up). For starters, try removing the duplicates within every X million records (note that duplicate emails won't be too far apart, ID-wise: people tend to double-register within a short period of time; this assumes no index on the email column). When you're done with this kind of removal, either go bigger (XX million) or try the whole table. Watch your performance indicators before deciding to go bigger 🙂


    Kindest Regards,

    Vasc

  • bwilliams:

    1) Welcome to SQL Server and the SQLServerCentral forums.

    2) PLEASE PLEASE PLEASE do yourself a favor and learn set-based operations. That code you posted would be an unbelievable DOG on SQL Server, massively inefficient compared to how SQL Server operates best. You are in a different world now, and when in Rome... 🙂

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
