Eliminating Duplicates

  • Hi experts,

    I have a question that usually come up in the sql community but with a different twist (i think). I have a table that I am trying to eliminate duplicate data from. The only twist here is that the lngCompanyID, shtCompanyID and dtmModified must all be the same in different rows before it is considered duplicate. If 2 of the columns are the same and 1 is different, it won't be duplicate.

    Is there a different way to go about this other than the regular eliminating duplicate "having count / rownumber() > 1" routine?

    Below is my test data

    CREATE TABLE tDupData

    (

    lngCompanyID INT,

    shtCompanyID INT,

    strCompanyName VARCHAR(20),

    strAddress VARCHAR(10),

    dtmModified DATETIME

    )

    --Create test data

    INSERT INTO tDupData VALUES (1,10,'CompanyOne','Address1','01/15/2003')

    INSERT INTO tDupData VALUES (2,20,'CompanyTwo','Address2','01/15/2003')

    INSERT INTO tDupData VALUES (3,30,'CompanyThree','Address3','01/15/2003')

    INSERT INTO tDupData VALUES (2,40,'CompanyTwo','Address','01/16/2003')

    INSERT INTO tDupData VALUES (3,50,'CompanyThree','Address','01/16/2003')

    -- Duplicate Data

    INSERT INTO tDupData VALUES (1,10,'CompanyOne','Address1','01/15/2003')

    GO

    Thanks for reading

  • Try this:

    ;with numbered as(SELECT rowno=row_number() over

    (partition by lngCompanyID,shtCompanyID,dtmModified order by lngCompanyID)

    , lngCompanyID,shtCompanyID,dtmModified,strAddress from #tDupData)

    select * from numbered

    Resulting in:

    rownolngCompanyIDshtCompanyIDdtmModified strAddress

    11 10 2003-01-15 00:00:00.000Address1

    21 10 2003-01-15 00:00:00.000Address1

    12 20 2003-01-15 00:00:00.000Address2

    12 40 2003-01-16 00:00:00.000Address

    13 30 2003-01-15 00:00:00.000Address3

    13 50 2003-01-16 00:00:00.000Address

    If after checking you can replace select * from numbered with a DELETE FROM #TdupData WHERE Rowno > 1 .. Check results again before moving to production.

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • Right on the money. I really appreciate your input. So just pretty much the same thing but include all the 3 columns in the "partition by". Sweeet. You are the best !!!

  • olsentimothy03

    So just pretty much the same thing

    No not at all close to the same thing. Using a group by with the HAVING clause

    SELECT lngCompanyID,shtCompanyID,dtmModified, COUNT(*) AS '#Dup Entries' FROM #tDupData

    GROUP BY lngCompanyID,shtCompanyID,dtmModified

    HAVING COUNT(*) > 1;

    Returned:

    lngCompanyIDshtCompanyIDdtmModified #Dup Entries

    1 102003-01-15 00:00:00.000 2

    Here is where the difference makes life easy. With above result, how do indicate which single row to delete? Easy enough to delete both rows, but as your table was construced not a easy task to code to delete just one row, and think of the difficulty, if you had said 100 or 150 duplicate rows with some individnual rows being duplicated more than once.

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • Understood. I have definitely learnt a lot from you today. Thanks again

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply