Removing duplicates and keeping 1 record with no PK

  • What I do know is that if I have 4 'John Smiths' and if they were all different people, the Address column would be different for each of those entities. So if I know that I can use the Address column as a Psudeo Primary Key column, what steps could I take to delete records according to the Address column?

    The only problem that will arise is an inconsistant value because of sales reps, ie. 2205 Vine St. or 2205 Vine St (without period) which really is not a problem, as long as I get the majority of records deleted, I would much rather sift through the ones that did not get deleted, then going through the whole database and manually delete every single duplicate.

  • olie480 (4/5/2010)


    What I do know is that if I have 4 'John Smiths' and if they were all different people, the Address column would be different for each of those entities. So if I know that I can use the Address column as a Psudeo Primary Key column, what steps could I take to delete records according to the Address column?

    The only problem that will arise is an inconsistant value because of sales reps, ie. 2205 Vine St. or 2205 Vine St (without period) which really is not a problem, as long as I get the majority of records deleted, I would much rather sift through the ones that did not get deleted, then going through the whole database and manually delete every single duplicate.

    That is an assumption. Some people do maintain several addresses, thus address is not always a good unique identifier. Also, what if the person moved? Both addresses would be valid, but you would not know which was the best address to use.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • CirquedeSQLeil (4/5/2010)


    olie480 (4/5/2010)


    What I do know is that if I have 4 'John Smiths' and if they were all different people, the Address column would be different for each of those entities. So if I know that I can use the Address column as a Psudeo Primary Key column, what steps could I take to delete records according to the Address column?

    The only problem that will arise is an inconsistant value because of sales reps, ie. 2205 Vine St. or 2205 Vine St (without period) which really is not a problem, as long as I get the majority of records deleted, I would much rather sift through the ones that did not get deleted, then going through the whole database and manually delete every single duplicate.

    That is an assumption. Some people do maintain several addresses, thus address is not always a good unique identifier. Also, what if the person moved? Both addresses would be valid, but you would not know which was the best address to use.

    Which is a total valid point. Luckily for us, addresses are not really a concern for our stores, since we do not have any mailers or anything to that nature. I did just run a query on my sample database, and it came out fine. It deleted duplicates and kept 1 record. Although like I said, I will have to sift through the records, and just delete SOME duplicates, but it's not like I have many now to delete.

    Here is the following code I ran.

    WHILE EXISTS (SELECT COUNT(*) FROM __table__ GROUP BY Zip, FirstName, Address HAVING COUNT(*) > 1)

    BEGIN

    DELETE FROM __table__ WHERE AccountNumber IN

    (

    SELECT MIN(AccountNumber) as [DeleteID]

    FROM __table__

    GROUP BY Zip, FirstName, Address

    HAVING COUNT(*) > 1

    )

    END

    Thanks for everyones input. In time I will get on the right track with SQL programming, but first I need someone to lay the tracks first, and I appreciate every minute of it.

  • You're welcome.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • I've seen so many "duplicate rows" postings lately that I feel somebody has to tell people Microsoft is not charging a license fee per unique index.

    Once you have a SQL Server license you can create as many unique indexes as needed... it's free! 😀

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (4/5/2010)


    I've seen so many "duplicate rows" postings lately that I feel somebody has to tell people Microsoft is not charging a license fee per unique index.

    Once you have a SQL Server license you can create as many unique indexes as needed... it's free! 😀

    LOL

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • PaulB-TheOneAndOnly (4/5/2010)


    I've seen so many "duplicate rows" postings lately that I feel somebody has to tell people Microsoft is not charging a license fee per unique index.

    Once you have a SQL Server license you can create as many unique indexes as needed... it's free! 😀

    Lol. I understand, it's just the logical code to try to remove duplicates based on no unique index key.

  • olie480 (4/5/2010)


    CirquedeSQLeil (4/5/2010)


    olie480 (4/5/2010)


    What I do know is that if I have 4 'John Smiths' and if they were all different people, the Address column would be different for each of those entities. So if I know that I can use the Address column as a Psudeo Primary Key column, what steps could I take to delete records according to the Address column?

    The only problem that will arise is an inconsistant value because of sales reps, ie. 2205 Vine St. or 2205 Vine St (without period) which really is not a problem, as long as I get the majority of records deleted, I would much rather sift through the ones that did not get deleted, then going through the whole database and manually delete every single duplicate.

    That is an assumption. Some people do maintain several addresses, thus address is not always a good unique identifier. Also, what if the person moved? Both addresses would be valid, but you would not know which was the best address to use.

    Which is a total valid point. Luckily for us, addresses are not really a concern for our stores, since we do not have any mailers or anything to that nature. I did just run a query on my sample database, and it came out fine. It deleted duplicates and kept 1 record. Although like I said, I will have to sift through the records, and just delete SOME duplicates, but it's not like I have many now to delete.

    Here is the following code I ran.

    WHILE EXISTS (SELECT COUNT(*) FROM __table__ GROUP BY Zip, FirstName, Address HAVING COUNT(*) > 1)

    BEGIN

    DELETE FROM __table__ WHERE AccountNumber IN

    (

    SELECT MIN(AccountNumber) as [DeleteID]

    FROM __table__

    GROUP BY Zip, FirstName, Address

    HAVING COUNT(*) > 1

    )

    END

    Thanks for everyones input. In time I will get on the right track with SQL programming, but first I need someone to lay the tracks first, and I appreciate every minute of it.

    Nope... you need to lay your own tracks. It's all up to you.

    I will help you lay one track though... the proverbial "3rd rail" is to delete data that you can't recover. Touch that track and, just like the real 3rd rail, your career can come to a very sudden end. I hope you squirreled the data you deleted away somewhere.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 8 posts - 16 through 22 (of 22 total)

You must be logged in to reply to this topic. Login to reply