April 2, 2010 at 3:03 pm
I have an issue with my database with multiple duplicate customers that I need to get rid of. Basically, it would be easy to do a query based on a Primary Key if the primary keys were duplicates, but the problem I am facing is that the duplicates are being originated based on two columns, FirstName and LastName.
I was thinking that since the City column is pretty much identical to the persons FirstName and LastName I can probably use that as some sort of key to delete the duplicates successfully, but I haven't the know how on how to do this.
So for example:
CustNumber | FirstName | LastName | City
345321 | John | Doe | Kellogg
459978 | John | Doe | Kellogg
127642 | John | Doe | Kellogg
I would like to just keep one of these, it doesn't matter which one, but just keep one. Do I somehow Union columns together? or use AS clauses?
I was just going to go through the 7,000 records, and just place a number 5 on the column that doesn't really matter, then do a DELETE * FROM Customer WHERE TaxNumber = 5;
But I figured I'd ask the professionals how I can get away with this.
Thanks in advance for your help!
April 2, 2010 at 3:31 pm
Lots of reasons to have keys, this being another good one.
What you could do is save your 1 record you wish to preserve to a temp table, then delete all the dups, then insert the keepers back into the table. You can use something like:
select firstname, lastname, address from table1, group by firstname, lastname, address having count(*) > 1
Then you delete those duplicates, then re-insert from your temp table into your original table.
April 2, 2010 at 3:31 pm
You could use Row_Number() to find the dupes...
Something like this:
DECLARE @tbl TABLE
(
CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)
)
INSERT INTO @tbl
SELECT 345321 ,'John','Doe','Kellogg' UNION ALL
SELECT 459978 ,'John','Doe','Kellogg' UNION ALL
SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL
SELECT 127642 ,'John','Doe','Kellogg'
;WITH cte AS
(
SELECT
CustNumber,
ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW
FROM @tbl
)
SELECT t.* -- replace with DELETE once the query is checked to give the correct results
FROM @tbl t
INNER JOIN cte ON t.CustNumber=cte.CustNumber
WHERE ROW<>1
April 2, 2010 at 3:39 pm
lmu92 (4/2/2010)
You could use Row_Number() to find the dupes...Something like this:
DECLARE @tbl TABLE
(
CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)
)
INSERT INTO @tbl
SELECT 345321 ,'John','Doe','Kellogg' UNION ALL
SELECT 459978 ,'John','Doe','Kellogg' UNION ALL
SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL
SELECT 127642 ,'John','Doe','Kellogg'
;WITH cte AS
(
SELECT
CustNumber,
ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW
FROM @tbl
)
SELECT t.* -- replace with DELETE once the query is checked to give the correct results
FROM @tbl t
INNER JOIN cte ON t.CustNumber=cte.CustNumber
WHERE ROW<>1
Quite similar to what I would have suggested.
This is a good method.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
April 2, 2010 at 4:38 pm
lmu92 (4/2/2010)
You could use Row_Number() to find the dupes...Something like this:
DECLARE @tbl TABLE
(
CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)
)
INSERT INTO @tbl
SELECT 345321 ,'John','Doe','Kellogg' UNION ALL
SELECT 459978 ,'John','Doe','Kellogg' UNION ALL
SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL
SELECT 127642 ,'John','Doe','Kellogg'
;WITH cte AS
(
SELECT
CustNumber,
ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW
FROM @tbl
)
SELECT t.* -- replace with DELETE once the query is checked to give the correct results
FROM @tbl t
INNER JOIN cte ON t.CustNumber=cte.CustNumber
WHERE ROW<>1
I tried using this in the query manager, but all it says it 4 rows affected, and no rows show up in the status window. Just to let you know too, that the only things that I delete were the " -- replace with DELETE once the query is checked to give the correct results"
So the code I used was:
DECLARE @tbl TABLE
(
CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)
)
INSERT INTO @tbl
SELECT 345321 ,'John','Doe','Kellogg' UNION ALL
SELECT 459978 ,'John','Doe','Kellogg' UNION ALL
SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL
SELECT 127642 ,'John','Doe','Kellogg'
;WITH cte AS
(
SELECT
CustNumber,
ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW
FROM @tbl
)
SELECT t.*
FROM @tbl t
INNER JOIN cte ON t.CustNumber=cte.CustNumber
WHERE ROW<>1
I will have you know that I have more than just those 3 fields within the table too. I think there are about 30 more fields or so. I just wanted to let you know what are the duplicates.
April 2, 2010 at 5:07 pm
That's weird...
Here's the result in the Results window when running the modified query as you posted:
CustNumberFirstNameLastNameCity
345321JohnDoeKellogg
459978JohnDoeKellogg
Message window output:
(4 row(s) affected)
(2 row(s) affected)
The total number of columns doesn't really matter as long as all the "dupe candidates" are included in the PARTITION BY clause of the ROW_NUMBER function and CustNumber identifies one row.
April 2, 2010 at 5:12 pm
Btw: What do you mean by "query manager"?
If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.
On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.
Edit: table variable are available in SS2K. So the code would run partially in SS2K.
April 2, 2010 at 5:15 pm
lmu92 (4/2/2010)
Btw: What do you mean by "query manager"?If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.
On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.
I think it is SQL 2K.
4 rows inserted and then probably an error message.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
April 2, 2010 at 8:23 pm
lmu92 (4/2/2010)
Btw: What do you mean by "query manager"?If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.
On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.
Edit: table variable are available in SS2K. So the code would run partially in SS2K.
I am using a SQL Utilities program, similar to Management Studio. I am just use to it, and thats why I use it. We also have EMS Sql Manager as well, and I ran the Query in that, with the same results.
When I get back on Monday, I will try to run the query in Management Studio to see what the effects are.
I do have a question though. The query that you had me run, is that just making a psudeo table, and doing the work behind the scenes? or is it actually supposed to make a table?
Thanks for the help again guys!
April 2, 2010 at 8:27 pm
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 2, 2010 at 11:20 pm
Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.
I know... I am stuck in a dilema because the store did not have centralized data between the 2 stores, therefore we had to add certain customers to both of the stores database for customer loyalty, and other things as well... big headache...
Now that we have all the stores centralized, I had to import the data into the headquarters Customer database, and since the primary keys were different, I am stuck in a very unfortunate situation...
I tried already exporting to CSV and importing into excel, and doing the remove duplicates, but excel removes EVERYTHING that is duplicate e.x. John Smith, Sue Smith = both would get deleted.
I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.
I can do simple queries, but doing things like this has been a learning curve I am trying to drive straight.
April 2, 2010 at 11:20 pm
Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.
I know... I am stuck in a dilema because the store did not have centralized data between the 2 stores, therefore we had to add certain customers to both of the stores database for customer loyalty, and other things as well... big headache...
Now that we have all the stores centralized, I had to import the data into the headquarters Customer database, and since the primary keys were different, I am stuck in a very unfortunate situation...
I tried already exporting to CSV and importing into excel, and doing the remove duplicates, but excel removes EVERYTHING that is duplicate e.x. John Smith, Sue Smith = both would get deleted.
I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.
I can do simple queries, but doing things like this has been a learning curve I am trying to drive straight.
April 3, 2010 at 2:03 am
olie480 (4/2/2010)
lmu92 (4/2/2010)
Btw: What do you mean by "query manager"?If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.
On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.
Edit: table variable are available in SS2K. So the code would run partially in SS2K.
I am using a SQL Utilities program, similar to Management Studio. I am just use to it, and thats why I use it. We also have EMS Sql Manager as well, and I ran the Query in that, with the same results.
When I get back on Monday, I will try to run the query in Management Studio to see what the effects are.
I do have a question though. The query that you had me run, is that just making a psudeo table, and doing the work behind the scenes? or is it actually supposed to make a table?
Thanks for the help again guys!
There was a table variable and a CTE. I think you should probably go look those up in BOL.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
April 3, 2010 at 3:47 am
olie480 (4/2/2010)
Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names....
I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.
...
The method or software you're using isn't relevant nor is the amount of money you spent on it.
What Jeff is trying to tell you (I think) is that the most important issue to get this job done is a valid (and approved!) business rule on how to identify duplicates.
Let's assume you have two people named "John Smith" in each of your stores. How are you going to identify whether you have 1,2,3, or 4 physically identical person?
Once this business rule is set in stone, the SQL stuff becomes rather easy...
April 3, 2010 at 11:19 am
lmu92 (4/3/2010)
olie480 (4/2/2010)
Jeff Moden (4/2/2010)
Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names....
I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.
...
The method or software you're using isn't relevant nor is the amount of money you spent on it.
What Jeff is trying to tell you (I think) is that the most important issue to get this job done is a valid (and approved!) business rule on how to identify duplicates.
Let's assume you have two people named "John Smith" in each of your stores. How are you going to identify whether you have 1,2,3, or 4 physically identical person?
Once this business rule is set in stone, the SQL stuff becomes rather easy...
Exactly.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 22 total)
You must be logged in to reply to this topic. Login to reply