Best way to remove duplicates when no primary key or date is present

  • I would like to remove duplicates from a table where neither a primary key nor a date column is present. You can use the following code to create some sample data:

    create table SC (CityName varchar(10), Zip Varchar(10), Description varchar(20))

    insert into SC values ('Bank','22222','N/A')

    insert into SC values ('Bank','22222','Here in town')

    insert into SC values ('k','22222','N/A')

    insert into SC values ('k','22222','N/A')

    insert into SC values ('k','22222','Here in town')

    insert into SC values ('Karachi','22222','Here in town')

    insert into SC values ('Lahore','22222','Here in town')

    thanks

  • So, which columns constitute a dupe? Just the first two or all of them?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The SOP way is something like:

    Select distinct * into #tmp From SC

    Truncate Table SC

    Insert into SC Select * from #tmp

    Drop table #tmp

    You also might want to wrap this in a transaction.
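
    A minimal sketch of that, wrapped in a transaction as suggested (this assumes the SC sample table from the original post and is only a starting point):

    BEGIN TRANSACTION;

        -- keep one copy of each fully identical row
        SELECT DISTINCT * INTO #tmp FROM SC;

        -- TRUNCATE is fine here as long as nothing references SC via a foreign key
        TRUNCATE TABLE SC;

        INSERT INTO SC SELECT * FROM #tmp;

        DROP TABLE #tmp;

    COMMIT TRANSACTION;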

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • The only actual duplicates I see are:

    insert into SC values ('k','22222','N/A')

    insert into SC values ('k','22222','N/A')

    RBarry's method will handle those two. The rest aren't strict duplicates. If you want to treat:

    insert into SC values ('Bank','22222','N/A')

    insert into SC values ('Bank','22222','Here in town')

    as duplicates (for example), you'll have to define what constitutes a "duplicate" and decide what to do with the slightly different data.
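
    If, for instance, (CityName, Zip) were the definition of a duplicate and you wanted the 'N/A' descriptions to lose to the more descriptive ones, one hedged sketch would be the following (the tie-break rule here is purely an assumption for illustration):

    ;WITH d AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY CityName, Zip
                                  ORDER BY CASE WHEN Description = 'N/A' THEN 1 ELSE 0 END,
                                           Description) AS rn
        FROM SC
    )
    -- keep the first row per (CityName, Zip); everything else is treated as a dupe
    DELETE FROM d
    WHERE rn > 1;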

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • You can use a GROUP BY statement to find your dupes and, based on that GROUP BY, write a DELETE statement to nuke them.

    You can also use the OUTPUT clause to write your deleted rows to a temp table so that you can review them later. That makes it easier to recover data if your code is fubarred.
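
    A sketch of both ideas against the SC sample table (the delete borrows the ROW_NUMBER approach, since a plain GROUP BY key cannot tell the identical rows apart here, and the #removed table name is just an illustrative choice):

    -- find the combinations that occur more than once
    SELECT CityName, Zip, Description, COUNT(*) AS cnt
    FROM SC
    GROUP BY CityName, Zip, Description
    HAVING COUNT(*) > 1;

    -- audit table for whatever gets deleted
    CREATE TABLE #removed (CityName varchar(10), Zip varchar(10), Description varchar(20));

    ;WITH d AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY CityName, Zip, Description
                                  ORDER BY CityName) AS rn
        FROM SC
    )
    DELETE FROM d
    OUTPUT deleted.CityName, deleted.Zip, deleted.Description
    INTO #removed (CityName, Zip, Description)
    WHERE rn > 1;

    -- review (and, if necessary, re-insert) what was removed
    SELECT * FROM #removed;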

    -----------------------------
    www.cbtr.net
    .: SQL Backup Admin Tool :.

  • Not if the dupes encompass the entire key for the GROUP BY, like they do in this case...

    --Jeff Moden

  • Another option...

    delete a from

    (select *, row_number() over (partition by CityName, Zip, Description order by CityName) as x from SC) a

    where x > 1

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • The problem with the phrase "Remove Duplicates" is that what the speaker *usually* means is "one of them is 'real', so the others that are like it are the duplicates, so I want you to remove the (additional) duplicates and leave the original". This is confusing because in both SQL and Logic, they are ALL duplicates.

    Which means that the obvious {DELETE .. Where Group By Count > 1} won't work because that also deletes the "originals" along with the "duplicates".
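
    For instance, this naive version (shown only to illustrate why it fails) removes both copies of ('k','22222','N/A') from the sample data, the "original" right along with its "duplicate":

    DELETE s
    FROM SC AS s
    WHERE ( SELECT COUNT(*)
            FROM SC AS d
            WHERE d.CityName = s.CityName
              AND d.Zip = s.Zip
              AND d.Description = s.Description ) > 1;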

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (4/30/2008)


    The problem with the phrase "Remove Duplicates" is that what the speaker *usually* means is "one of them is 'real', so the others that are like it are the duplicates, so I want you to remove the (additional) duplicates and leave the original". This is confusing because in both SQL and Logic, they are ALL duplicates.

    Which means that the obvious {DELETE .. Where Group By Count > 1} won't work because that also deletes the "originals" along with the "duplicates".

    Is this aimed at my suggestion?

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • RyanRandall (4/30/2008)


    rbarryyoung (4/30/2008)


    ...

    Is this aimed at my suggestion?

    Oberion's, I think.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (4/30/2008)


    RyanRandall (4/30/2008)


    rbarryyoung (4/30/2008)


    ...

    Is this aimed at my suggestion?

    Oberion's, I think.

    OK - just checking 🙂

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Add a column that creates a unique id per row. Select the minimum value into a table for this new column based on the columns that will make the row unique. Delete the rest of the rows doing an inner join to the minimum value table.

    Alter table table1

    add newcol int identity(1,1)

    Select col1, col2, col3, ..., min(newcol) as minvalues

    into #temp

    from table1

    group by col1, col2, col3, ...

    Delete from table1

    from #temp

    where #temp.minvalues <> table1.newcol

    and #temp.col1 = table1.col1

    and #temp.col2 = table1.col2

    and #temp.col3 = table1.col3 ...
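
    A concrete version of the same idea against the SC sample table (the #keep name and the clean-up at the end are illustrative additions, not part of the original suggestion):

    -- give every row a surrogate key so duplicates become distinguishable
    -- (run the ALTER as its own batch so the new column is visible to the statements below)
    ALTER TABLE SC ADD newcol int IDENTITY(1,1);
    GO

    -- one keeper (the lowest newcol) per duplicate group
    SELECT CityName, Zip, Description, MIN(newcol) AS minvalues
    INTO #keep
    FROM SC
    GROUP BY CityName, Zip, Description;

    -- remove every row that is not the keeper for its group
    DELETE SC
    FROM SC
    JOIN #keep
      ON #keep.CityName = SC.CityName
     AND #keep.Zip = SC.Zip
     AND #keep.Description = SC.Description
    WHERE #keep.minvalues <> SC.newcol;

    -- optional clean-up
    DROP TABLE #keep;
    ALTER TABLE SC DROP COLUMN newcol;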

    Q

    Please take a number. Now serving emergency 1,203,894

  • Q (5/1/2008)


    Add a column that creates a unique id per row. Select the minimum value into a table for this new column based on the columns that will make the row unique. Delete the rest of the rows doing an inner join to the minimum value table.

    Alter table table1

    add newcol int identity(1,1)

    Select col1, col2, col3, ..., min(newcol) as minvalues

    into #temp

    from table1

    group by col1, col2, col3, ...

    Delete from table1

    from #temp

    where #temp.minvalues <> table1.newcol

    and #temp.col1 = table1.col1

    and #temp.col2 = table1.col2

    and #temp.col3 = table1.col3 ...

    Why do all that when you can just do this? Is there some advantage that you know of?

    delete a from

    (select *, row_number() over (partition by CityName, Zip, Description order by CityName) as x from SC) a

    where x > 1

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Ryan,

    I like your solution. I am not sure that my approach has any particular advantage; it was just my preference. I have been using similar code for years so that others can perform this task. It is simple, it sets up a column so that every row becomes unique, and someone with very little T-SQL can read it.

    Yours is simple and does not require a physical change to the table.

    Q

    Please take a number. Now serving emergency 1,203,894

  • Okay - thanks. I was just checking I wasn't missing something 🙂

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.
