Removing duplicates and keeping 1 record with no PK

Question

Removing duplicates and keeping 1 record with no PK

olie480

SSC-Addicted

Points: 443
More actions
April 2, 2010 at 3:03 pm

#220031

I have an issue with my database with multiple duplicate customers that I need to get rid of. Basically, it would be easy to do a query based on a Primary Key if the primary keys were duplicates, but the problem I am facing is that the duplicates are being originated based on two columns, FirstName and LastName.
I was thinking that since the City column is pretty much identical to the persons FirstName and LastName I can probably use that as some sort of key to delete the duplicates successfully, but I haven't the know how on how to do this.
So for example:
CustNumber | FirstName | LastName | City
345321 | John | Doe | Kellogg
459978 | John | Doe | Kellogg
127642 | John | Doe | Kellogg
I would like to just keep one of these, it doesn't matter which one, but just keep one. Do I somehow Union columns together? or use AS clauses?
I was just going to go through the 7,000 records, and just place a number 5 on the column that doesn't really matter, then do a DELETE * FROM Customer WHERE TaxNumber = 5;
But I figured I'd ask the professionals how I can get away with this.
Thanks in advance for your help!

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply

Steve Cullen SSCertifiable Points: 5598 More actions · Answer 1

Lots of reasons to have keys, this being another good one.

What you could do is save your 1 record you wish to preserve to a temp table, then delete all the dups, then insert the keepers back into the table. You can use something like:

select firstname, lastname, address from table1, group by firstname, lastname, address having count(*) > 1

Then you delete those duplicates, then re-insert from your temp table into your original table.

Converting oxygen into carbon dioxide, since 1955.

LutzM SSC Guru Points: 107049 More actions · Answer 2

You could use Row_Number() to find the dupes...

Something like this:

DECLARE @tbl TABLE

(

CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)

)

INSERT INTO @tbl

SELECT 345321 ,'John','Doe','Kellogg' UNION ALL

SELECT 459978 ,'John','Doe','Kellogg' UNION ALL

SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL

SELECT 127642 ,'John','Doe','Kellogg'

;WITH cte AS

(

SELECT

CustNumber,

ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW

FROM @tbl

)

SELECT t.* -- replace with DELETE once the query is checked to give the correct results

FROM @tbl t

INNER JOIN cte ON t.CustNumber=cte.CustNumber

WHERE ROW<>1

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

SQLRNNR SSC Guru Points: 281334 More actions · Answer 3

lmu92 (4/2/2010)
You could use Row_Number() to find the dupes...
Something like this:
DECLARE @tbl TABLE
(
CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)
)
INSERT INTO @tbl
SELECT 345321 ,'John','Doe','Kellogg' UNION ALL
SELECT 459978 ,'John','Doe','Kellogg' UNION ALL
SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL
SELECT 127642 ,'John','Doe','Kellogg'
;WITH cte AS
(
SELECT
CustNumber,
ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW
FROM @tbl
)
SELECT t.* -- replace with DELETE once the query is checked to give the correct results
FROM @tbl t
INNER JOIN cte ON t.CustNumber=cte.CustNumber
WHERE ROW<>1

Quite similar to what I would have suggested.

This is a good method.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

olie480 SSC-Addicted Points: 443 More actions · Answer 4

lmu92 (4/2/2010)
You could use Row_Number() to find the dupes...
Something like this:
DECLARE @tbl TABLE
(
CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)
)
INSERT INTO @tbl
SELECT 345321 ,'John','Doe','Kellogg' UNION ALL
SELECT 459978 ,'John','Doe','Kellogg' UNION ALL
SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL
SELECT 127642 ,'John','Doe','Kellogg'
;WITH cte AS
(
SELECT
CustNumber,
ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW
FROM @tbl
)
SELECT t.* -- replace with DELETE once the query is checked to give the correct results
FROM @tbl t
INNER JOIN cte ON t.CustNumber=cte.CustNumber
WHERE ROW<>1

I tried using this in the query manager, but all it says it 4 rows affected, and no rows show up in the status window. Just to let you know too, that the only things that I delete were the " -- replace with DELETE once the query is checked to give the correct results"

So the code I used was:

DECLARE @tbl TABLE

(

CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)

)

INSERT INTO @tbl

SELECT 345321 ,'John','Doe','Kellogg' UNION ALL

SELECT 459978 ,'John','Doe','Kellogg' UNION ALL

SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL

SELECT 127642 ,'John','Doe','Kellogg'

;WITH cte AS

(

SELECT

CustNumber,

ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW

FROM @tbl

)

SELECT t.*

FROM @tbl t

INNER JOIN cte ON t.CustNumber=cte.CustNumber

WHERE ROW<>1

I will have you know that I have more than just those 3 fields within the table too. I think there are about 30 more fields or so. I just wanted to let you know what are the duplicates.

LutzM SSC Guru Points: 107049 More actions · Answer 5

That's weird...

Here's the result in the Results window when running the modified query as you posted:

CustNumberFirstNameLastNameCity

345321JohnDoeKellogg

459978JohnDoeKellogg

Message window output:

(4 row(s) affected)

(2 row(s) affected)

The total number of columns doesn't really matter as long as all the "dupe candidates" are included in the PARTITION BY clause of the ROW_NUMBER function and CustNumber identifies one row.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

LutzM SSC Guru Points: 107049 More actions · Answer 6

Btw: What do you mean by "query manager"?

If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.

On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.

Edit: table variable are available in SS2K. So the code would run partially in SS2K.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

SQLRNNR SSC Guru Points: 281334 More actions · Answer 7

lmu92 (4/2/2010)
Btw: What do you mean by "query manager"?
If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.
On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.

I think it is SQL 2K.

4 rows inserted and then probably an error message.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

olie480 SSC-Addicted Points: 443 More actions · Answer 8

lmu92 (4/2/2010)
Btw: What do you mean by "query manager"?
If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.
On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.
Edit: table variable are available in SS2K. So the code would run partially in SS2K.

I am using a SQL Utilities program, similar to Management Studio. I am just use to it, and thats why I use it. We also have EMS Sql Manager as well, and I ran the Query in that, with the same results.

When I get back on Monday, I will try to run the query in Management Studio to see what the effects are.

I do have a question though. The query that you had me run, is that just making a psudeo table, and doing the work behind the scenes? or is it actually supposed to make a table?

Thanks for the help again guys!

Jeff Moden SSC Guru Points: 1004356 More actions · Answer 9

Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

olie480 SSC-Addicted Points: 443 More actions · Answer 10

Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

I know... I am stuck in a dilema because the store did not have centralized data between the 2 stores, therefore we had to add certain customers to both of the stores database for customer loyalty, and other things as well... big headache...

Now that we have all the stores centralized, I had to import the data into the headquarters Customer database, and since the primary keys were different, I am stuck in a very unfortunate situation...

I tried already exporting to CSV and importing into excel, and doing the remove duplicates, but excel removes EVERYTHING that is duplicate e.x. John Smith, Sue Smith = both would get deleted.

I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.

I can do simple queries, but doing things like this has been a learning curve I am trying to drive straight.

olie480 SSC-Addicted Points: 443 More actions · Answer 11

Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

I know... I am stuck in a dilema because the store did not have centralized data between the 2 stores, therefore we had to add certain customers to both of the stores database for customer loyalty, and other things as well... big headache...

Now that we have all the stores centralized, I had to import the data into the headquarters Customer database, and since the primary keys were different, I am stuck in a very unfortunate situation...

I tried already exporting to CSV and importing into excel, and doing the remove duplicates, but excel removes EVERYTHING that is duplicate e.x. John Smith, Sue Smith = both would get deleted.

I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.

I can do simple queries, but doing things like this has been a learning curve I am trying to drive straight.

SQLRNNR SSC Guru Points: 281334 More actions · Answer 12

olie480 (4/2/2010)
lmu92 (4/2/2010)
Btw: What do you mean by "query manager"?
If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.
On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.
Edit: table variable are available in SS2K. So the code would run partially in SS2K.
I am using a SQL Utilities program, similar to Management Studio. I am just use to it, and thats why I use it. We also have EMS Sql Manager as well, and I ran the Query in that, with the same results.
When I get back on Monday, I will try to run the query in Management Studio to see what the effects are.
I do have a question though. The query that you had me run, is that just making a psudeo table, and doing the work behind the scenes? or is it actually supposed to make a table?
Thanks for the help again guys!

There was a table variable and a CTE. I think you should probably go look those up in BOL.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

LutzM SSC Guru Points: 107049 More actions · Answer 13

olie480 (4/2/2010)
Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.
...
I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.
...

The method or software you're using isn't relevant nor is the amount of money you spent on it.

What Jeff is trying to tell you (I think) is that the most important issue to get this job done is a valid (and approved!) business rule on how to identify duplicates.

Let's assume you have two people named "John Smith" in each of your stores. How are you going to identify whether you have 1,2,3, or 4 physically identical person?

Once this business rule is set in stone, the SQL stuff becomes rather easy...

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004356 More actions · Answer 14

lmu92 (4/3/2010)
olie480 (4/2/2010)
Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.
...
I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.
...
The method or software you're using isn't relevant nor is the amount of money you spent on it.
What Jeff is trying to tell you (I think) is that the most important issue to get this job done is a valid (and approved!) business rule on how to identify duplicates.
Let's assume you have two people named "John Smith" in each of your stores. How are you going to identify whether you have 1,2,3, or 4 physically identical person?
Once this business rule is set in stone, the SQL stuff becomes rather easy...

Exactly.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)