Removing duplicates and keeping 1 record with no PK

  • I have an issue with my database with multiple duplicate customers that I need to get rid of. Basically, it would be easy to do a query based on a Primary Key if the primary keys were duplicates, but the problem I am facing is that the duplicates are being originated based on two columns, FirstName and LastName.

    I was thinking that since the City column is pretty much identical to the persons FirstName and LastName I can probably use that as some sort of key to delete the duplicates successfully, but I haven't the know how on how to do this.

    So for example:

    CustNumber | FirstName | LastName | City

    345321 | John | Doe | Kellogg

    459978 | John | Doe | Kellogg

    127642 | John | Doe | Kellogg

    I would like to just keep one of these, it doesn't matter which one, but just keep one. Do I somehow Union columns together? or use AS clauses?

    I was just going to go through the 7,000 records, and just place a number 5 on the column that doesn't really matter, then do a DELETE * FROM Customer WHERE TaxNumber = 5;

    But I figured I'd ask the professionals how I can get away with this.

    Thanks in advance for your help!

  • Lots of reasons to have keys, this being another good one.

    What you could do is save your 1 record you wish to preserve to a temp table, then delete all the dups, then insert the keepers back into the table. You can use something like:

    select firstname, lastname, address from table1, group by firstname, lastname, address having count(*) > 1

    Then you delete those duplicates, then re-insert from your temp table into your original table.

    Converting oxygen into carbon dioxide, since 1955.
  • You could use Row_Number() to find the dupes...

    Something like this:

    DECLARE @tbl TABLE

    (

    CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)

    )

    INSERT INTO @tbl

    SELECT 345321 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459978 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL

    SELECT 127642 ,'John','Doe','Kellogg'

    ;WITH cte AS

    (

    SELECT

    CustNumber,

    ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW

    FROM @tbl

    )

    SELECT t.* -- replace with DELETE once the query is checked to give the correct results

    FROM @tbl t

    INNER JOIN cte ON t.CustNumber=cte.CustNumber

    WHERE ROW<>1



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • lmu92 (4/2/2010)


    You could use Row_Number() to find the dupes...

    Something like this:

    DECLARE @tbl TABLE

    (

    CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)

    )

    INSERT INTO @tbl

    SELECT 345321 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459978 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL

    SELECT 127642 ,'John','Doe','Kellogg'

    ;WITH cte AS

    (

    SELECT

    CustNumber,

    ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW

    FROM @tbl

    )

    SELECT t.* -- replace with DELETE once the query is checked to give the correct results

    FROM @tbl t

    INNER JOIN cte ON t.CustNumber=cte.CustNumber

    WHERE ROW<>1

    Quite similar to what I would have suggested.

    This is a good method.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • lmu92 (4/2/2010)


    You could use Row_Number() to find the dupes...

    Something like this:

    DECLARE @tbl TABLE

    (

    CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)

    )

    INSERT INTO @tbl

    SELECT 345321 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459978 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL

    SELECT 127642 ,'John','Doe','Kellogg'

    ;WITH cte AS

    (

    SELECT

    CustNumber,

    ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW

    FROM @tbl

    )

    SELECT t.* -- replace with DELETE once the query is checked to give the correct results

    FROM @tbl t

    INNER JOIN cte ON t.CustNumber=cte.CustNumber

    WHERE ROW<>1

    I tried using this in the query manager, but all it says it 4 rows affected, and no rows show up in the status window. Just to let you know too, that the only things that I delete were the " -- replace with DELETE once the query is checked to give the correct results"

    So the code I used was:

    DECLARE @tbl TABLE

    (

    CustNumber INT, FirstName VARCHAR(30), LastName VARCHAR(30), City VARCHAR(30)

    )

    INSERT INTO @tbl

    SELECT 345321 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459978 ,'John','Doe','Kellogg' UNION ALL

    SELECT 459979 ,'Jane','Doe','Kellogg' UNION ALL

    SELECT 127642 ,'John','Doe','Kellogg'

    ;WITH cte AS

    (

    SELECT

    CustNumber,

    ROW_NUMBER() OVER(PARTITION BY FirstName,LastName,City ORDER BY CustNumber ) AS ROW

    FROM @tbl

    )

    SELECT t.*

    FROM @tbl t

    INNER JOIN cte ON t.CustNumber=cte.CustNumber

    WHERE ROW<>1

    I will have you know that I have more than just those 3 fields within the table too. I think there are about 30 more fields or so. I just wanted to let you know what are the duplicates.

  • That's weird...

    Here's the result in the Results window when running the modified query as you posted:

    CustNumberFirstNameLastNameCity

    345321JohnDoeKellogg

    459978JohnDoeKellogg

    Message window output:

    (4 row(s) affected)

    (2 row(s) affected)

    The total number of columns doesn't really matter as long as all the "dupe candidates" are included in the PARTITION BY clause of the ROW_NUMBER function and CustNumber identifies one row.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Btw: What do you mean by "query manager"?

    If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.

    On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.

    Edit: table variable are available in SS2K. So the code would run partially in SS2K.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • lmu92 (4/2/2010)


    Btw: What do you mean by "query manager"?

    If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.

    On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.

    I think it is SQL 2K.

    4 rows inserted and then probably an error message.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • lmu92 (4/2/2010)


    Btw: What do you mean by "query manager"?

    If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.

    On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.

    Edit: table variable are available in SS2K. So the code would run partially in SS2K.

    I am using a SQL Utilities program, similar to Management Studio. I am just use to it, and thats why I use it. We also have EMS Sql Manager as well, and I ran the Query in that, with the same results.

    When I get back on Monday, I will try to run the query in Management Studio to see what the effects are.

    I do have a question though. The query that you had me run, is that just making a psudeo table, and doing the work behind the scenes? or is it actually supposed to make a table?

    Thanks for the help again guys!

  • Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/2/2010)


    Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

    I know... I am stuck in a dilema because the store did not have centralized data between the 2 stores, therefore we had to add certain customers to both of the stores database for customer loyalty, and other things as well... big headache...

    Now that we have all the stores centralized, I had to import the data into the headquarters Customer database, and since the primary keys were different, I am stuck in a very unfortunate situation...

    I tried already exporting to CSV and importing into excel, and doing the remove duplicates, but excel removes EVERYTHING that is duplicate e.x. John Smith, Sue Smith = both would get deleted.

    I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.

    I can do simple queries, but doing things like this has been a learning curve I am trying to drive straight.

  • Jeff Moden (4/2/2010)


    Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

    I know... I am stuck in a dilema because the store did not have centralized data between the 2 stores, therefore we had to add certain customers to both of the stores database for customer loyalty, and other things as well... big headache...

    Now that we have all the stores centralized, I had to import the data into the headquarters Customer database, and since the primary keys were different, I am stuck in a very unfortunate situation...

    I tried already exporting to CSV and importing into excel, and doing the remove duplicates, but excel removes EVERYTHING that is duplicate e.x. John Smith, Sue Smith = both would get deleted.

    I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.

    I can do simple queries, but doing things like this has been a learning curve I am trying to drive straight.

  • olie480 (4/2/2010)


    lmu92 (4/2/2010)


    Btw: What do you mean by "query manager"?

    If you're using SQL Server 2000, the query shouldn't run at all since table variable as well as ROW_NUMBER function that I've used aren't available in SS2K.

    On the other side, there is no "query manager" starting with SS2K5. It's called Management Studio. Please clarify the SQL Server version you're using.

    Edit: table variable are available in SS2K. So the code would run partially in SS2K.

    I am using a SQL Utilities program, similar to Management Studio. I am just use to it, and thats why I use it. We also have EMS Sql Manager as well, and I ran the Query in that, with the same results.

    When I get back on Monday, I will try to run the query in Management Studio to see what the effects are.

    I do have a question though. The query that you had me run, is that just making a psudeo table, and doing the work behind the scenes? or is it actually supposed to make a table?

    Thanks for the help again guys!

    There was a table variable and a CTE. I think you should probably go look those up in BOL.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • olie480 (4/2/2010)


    Jeff Moden (4/2/2010)


    Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

    ...

    I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.

    ...

    The method or software you're using isn't relevant nor is the amount of money you spent on it.

    What Jeff is trying to tell you (I think) is that the most important issue to get this job done is a valid (and approved!) business rule on how to identify duplicates.

    Let's assume you have two people named "John Smith" in each of your stores. How are you going to identify whether you have 1,2,3, or 4 physically identical person?

    Once this business rule is set in stone, the SQL stuff becomes rather easy...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • lmu92 (4/3/2010)


    olie480 (4/2/2010)


    Jeff Moden (4/2/2010)


    Deleting data that is supposedly duplicated by FirstName and LastName is a very good way to have a sudden and very unwelcome career change. There must be some other form of identification other than just first and last names.

    ...

    I tried researching programs that do this automatically, but none worked out, and after spending about $100 on software, I came here to get some help, and it's really beating me down.

    ...

    The method or software you're using isn't relevant nor is the amount of money you spent on it.

    What Jeff is trying to tell you (I think) is that the most important issue to get this job done is a valid (and approved!) business rule on how to identify duplicates.

    Let's assume you have two people named "John Smith" in each of your stores. How are you going to identify whether you have 1,2,3, or 4 physically identical person?

    Once this business rule is set in stone, the SQL stuff becomes rather easy...

    Exactly.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply