I need to de-duplicate a large table (lots of full-row dups)

  • table has 500 million rows. There are 21 million rows that have one ore more fully duplicate ones, totalling in 70 million duplicate rows altogether.

    Can someone recommend (T-SQL) a good way to DEDUPLICATE THIS TABLE without having to create another huge table (reloading dup data into it..) ? Is there a way to do it at the ONE TABLE level,?

    not creating other perm tables to hold data etc?

    Thanks!

    Likes to play Chess

  • Take a look at this post

    http://jasonbrimhall.info/2011/03/21/dedupe-data-cte/

    You could easily set this up to batch through your table at 10,000 records at a time. I'd probably just set it up in a job and set the job to run every few minutes.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • I would use something like this: with DupTable as (

    select ID, ROW_NUMBER() over (partition by ID order by ID) as rn

    from Table

    group by ID

    having count(*) > 1)

    delete DupTable

    where rn > 1

    I wouldn't do the whole table at once since that isn't very tlog friendly. You can add a select top n to the cte.



    Microsoft Certified Master - SQL Server 2008
    Follow me on twitter: @keith_tate

    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Jason was too quick for me. Guess that's what I get for typing out my code instead of having a blog post ready to go. 😀



    Microsoft Certified Master - SQL Server 2008
    Follow me on twitter: @keith_tate

    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Groundhog Day, Keith. He's seen this once or twice before 😉


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Keith Tate (2/11/2014)


    Jason was too quick for me. Guess that's what I get for typing out my code instead of having a blog post ready to go. 😀

    Hahaha

    I still had to go find the URL 😉

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • ChrisM@home (2/11/2014)


    Groundhog Day, Keith. He's seen this once or twice before 😉

    Yeah something like that.

    If memory serves, we had this same discussion with this OP just last week.:Whistling:

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply