Best way to remove duplicates when no primary key or date is present

  • I would like to remove duplicates from a table where neither a primary key nor a date column is present. You can use the following code to create some sample data:

    create table SC (CityName varchar(10), Zip Varchar(10), Description varchar(20))

    insert into SC values ('Bank','22222','N/A')

    insert into SC values ('Bank','22222','Here in town')

    insert into SC values ('k','22222','N/A')

    insert into SC values ('k','22222','N/A')

    insert into SC values ('k','22222','Here in town')

    insert into SC values ('Karachi','22222','Here in town')

    insert into SC values ('Lahore','22222','Here in town')

    thanks

  • So, which columns constitute a dupe? Just the first two or all of them?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The SOP way is something like:

    Select distinct * into #tmp From SC

    Truncate Table SC

    Insert into SC Select * from #tmp

    Drop table #tmp

    You also might want to wrap this in a transaction.
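
    A minimal sketch of that, wrapped in a transaction as suggested (this assumes the SC sample table from the original post and is only a starting point):

    BEGIN TRANSACTION;

        -- keep one copy of each fully identical row
        SELECT DISTINCT * INTO #tmp FROM SC;

        -- TRUNCATE is fine here as long as nothing references SC via a foreign key
        TRUNCATE TABLE SC;

        INSERT INTO SC SELECT * FROM #tmp;

        DROP TABLE #tmp;

    COMMIT TRANSACTION;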

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • The only actual duplicates I see are:

    insert into SC values ('k','22222','N/A')

    insert into SC values ('k','22222','N/A')

    RBarry's method will handle those two. The rest aren't strict duplicates. If you want to treat:

    insert into SC values ('Bank','22222','N/A')

    insert into SC values ('Bank','22222','Here in town')

    as duplicates (for example), you'll have to define what constitutes a "duplicate" and decide what to do with the slightly different data.
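
    If, for instance, (CityName, Zip) were the definition of a duplicate and you wanted the 'N/A' descriptions to lose to the more descriptive ones, one hedged sketch would be the following (the tie-break rule here is purely an assumption for illustration):

    ;WITH d AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY CityName, Zip
                                  ORDER BY CASE WHEN Description = 'N/A' THEN 1 ELSE 0 END,
                                           Description) AS rn
        FROM SC
    )
    -- keep the first row per (CityName, Zip); everything else is treated as a dupe
    DELETE FROM d
    WHERE rn > 1;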

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • You can use a GROUP BY statement to find your dupes and, based on that GROUP BY, write a DELETE statement to nuke them.

    You can also use the OUTPUT clause to write your deleted rows to a temp table so that you can review them later. That makes it easier to recover data if your code is fubarred.
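
    A sketch of both ideas against the SC sample table (the delete borrows the ROW_NUMBER approach, since a plain GROUP BY key cannot tell the identical rows apart here, and the #removed table name is just an illustrative choice):

    -- find the combinations that occur more than once
    SELECT CityName, Zip, Description, COUNT(*) AS cnt
    FROM SC
    GROUP BY CityName, Zip, Description
    HAVING COUNT(*) > 1;

    -- audit table for whatever gets deleted
    CREATE TABLE #removed (CityName varchar(10), Zip varchar(10), Description varchar(20));

    ;WITH d AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY CityName, Zip, Description
                                  ORDER BY CityName) AS rn
        FROM SC
    )
    DELETE FROM d
    OUTPUT deleted.CityName, deleted.Zip, deleted.Description
    INTO #removed (CityName, Zip, Description)
    WHERE rn > 1;

    -- review (and, if necessary, re-insert) what was removed
    SELECT * FROM #removed;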

    -----------------------------
    www.cbtr.net
    .: SQL Backup Admin Tool :.

  • Not if the dupes encompass the entire key for the GROUP BY, like they do in this case...

    --Jeff Moden

  • Another option...

    delete a from

    (select *, row_number() over (partition by CityName, Zip, Description order by CityName) as x from SC) a

    where x > 1

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • The problem with the phrase "Remove Duplicates" is that what the speaker *usually* means is "one of them is 'real', so the others that are like it are the duplicates, so I want you to remove the (additional) duplicates and leave the original". This is confusing because in both SQL and Logic, they are ALL duplicates.

    Which means that the obvious {DELETE .. Where Group By Count > 1} won't work because that also deletes the "originals" along with the "duplicates".
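
    For instance, this naive version (shown only to illustrate why it fails) removes both copies of ('k','22222','N/A') from the sample data, the "original" right along with its "duplicate":

    DELETE s
    FROM SC AS s
    WHERE ( SELECT COUNT(*)
            FROM SC AS d
            WHERE d.CityName = s.CityName
              AND d.Zip = s.Zip
              AND d.Description = s.Description ) > 1;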

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (4/30/2008)


    The problem with the phrase "Remove Duplicates" is that what the speaker *usually* means is "one of them is 'real', so the others that are like it are the duplicates, so I want you to remove the (additional) duplicates and leave the original". This is confusing because in both SQL and Logic, they are ALL duplicates.

    Which means that the obvious {DELETE .. Where Group By Count > 1} won't work because that also deletes the "originals" along with the "duplicates".

    Is this aimed at my suggestion?

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • RyanRandall (4/30/2008)


    rbarryyoung (4/30/2008)


    ...

    Is this aimed at my suggestion?

    Oberion's, I think.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (4/30/2008)


    RyanRandall (4/30/2008)


    rbarryyoung (4/30/2008)


    ...

    Is this aimed at my suggestion?

    Oberion's, I think.

    OK - just checking 🙂

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Add a column that creates a unique id per row. Select the minimum value into a table for this new column based on the columns that will make the row unique. Delete the rest of the rows doing an inner join to the minimum value table.

    Alter table table1

    add newcol int identity(1,1)

    Select col1, col2, col3, ..., min(newcol) as minvalues

    into #temp

    from table1

    group by col1, col2, col3, ...

    Delete from table1

    from #temp

    where #temp.minvalues <> table1.newcol

    and #temp.col1 = table1.col1

    and #temp.col2 = table1.col2

    and #temp.col3 = table1.col3 ...
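
    A concrete version of the same idea against the SC sample table (the #keep name and the clean-up at the end are illustrative additions, not part of the original suggestion):

    -- give every row a surrogate key so duplicates become distinguishable
    -- (run the ALTER as its own batch so the new column is visible to the statements below)
    ALTER TABLE SC ADD newcol int IDENTITY(1,1);
    GO

    -- one keeper (the lowest newcol) per duplicate group
    SELECT CityName, Zip, Description, MIN(newcol) AS minvalues
    INTO #keep
    FROM SC
    GROUP BY CityName, Zip, Description;

    -- remove every row that is not the keeper for its group
    DELETE SC
    FROM SC
    JOIN #keep
      ON #keep.CityName = SC.CityName
     AND #keep.Zip = SC.Zip
     AND #keep.Description = SC.Description
    WHERE #keep.minvalues <> SC.newcol;

    -- optional clean-up
    DROP TABLE #keep;
    ALTER TABLE SC DROP COLUMN newcol;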

    Q

    Please take a number. Now serving emergency 1,203,894

  • Q (5/1/2008)


    Add a column that creates a unique id per row. Select the minimum value into a table for this new column based on the columns that will make the row unique. Delete the rest of the rows doing an inner join to the minimum value table.

    Alter table table1

    add newcol int identity(1,1)

    Select col1, col2, col3, ..., min(newcol) as minvalues

    into #temp

    from table1

    group by col1, col2, col3, ...

    Delete from table1

    from #temp

    where #temp.minvalues <> table1.newcol

    and #temp.col1 = table1.col1

    and #temp.col2 = table1.col2

    and #temp.col3 = table1.col3 ...

    Why do all that when you can just do this? Is there some advantage that you know of?

    delete a from

    (select *, row_number() over (partition by CityName, Zip, Description order by CityName) as x from SC) a

    where x > 1

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Ryan,

    I like your solution. I am not sure that my approach has any particular advantage; it was just my preference. I have been using similar code for years so that others can perform this task. It is simple, it sets up a column so that every row becomes unique, and someone with very little T-SQL can read it.

    Yours is simple and does not require a physical change to the table.

    Q

    Please take a number. Now serving emergency 1,203,894

  • Okay - thanks. I was just checking I wasn't missing something 🙂

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.
