Find duplicate rows from multiple tables

  • Hi All,

    Recently we ran into a problem of having duplicated records in our prod database. Below is the scenario:

Business Objects generates a report file, and that file is used by a third-party ETL tool to populate SQL Server database tables. The problem is that the ETL tool does not overwrite the existing records but instead appends to the tables.

For example: the report file contains 75 records. The ETL tool takes the file and loads it into the SQL Server database tables. The next week the report is generated again with 5 additional new rows, so the report file generated by Business Objects now contains 75 + 5 = 80 rows. The ETL tool takes those 80 records and appends them to the respective tables depending on the fields, so various tables now have duplicate records.

How can I find out which tables have duplicate records? There are 300 tables, and I have the list of tables and fields that Business Objects uses to create its report.

The environment is SQL Server 2005 Standard Edition on Windows Server 2003 Standard, both on SP2.

I am new to SQL programming.

    Please help as soon as possible,

    Thanks,

    \\K

    ______________________________________________________________________________________________________________________________________________________________________________________
    HTH !
    Kin
    MCTS : 2005, 2008
    Active SQL Server Community Contributor 🙂

This sounds like a bit of a complex problem. The issue will be that each table has a different column that will identify it as being duplicated. Is there some more information that might be useful in formulating rules for identifying duplicate values across all of the tables?

Here is an example of what might be a rule: every table has the column [ID] which identifies it. If this column has two rows with the same value, then those rows are considered duplicates.
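As a minimal sketch of that rule (the table name dbo.SomeTable is hypothetical), a simple GROUP BY/HAVING query would surface the offending [ID] values:

-- Hypothetical table name; lists every [ID] value that appears more than once
SELECT [ID], COUNT(*) AS DupeCount
FROM dbo.SomeTable
GROUP BY [ID]
HAVING COUNT(*) > 1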

    Cheers,

    Brian

  • beezell (3/4/2009)


This sounds like a bit of a complex problem. The issue will be that each table has a different column that will identify it as being duplicated. Is there some more information that might be useful in formulating rules for identifying duplicate values across all of the tables?

Here is an example of what might be a rule: every table has the column [ID] which identifies it. If this column has two rows with the same value, then those rows are considered duplicates.

    Cheers,

    Brian

    Hi,

The problem is that the PK is not duplicated. Other fields are being duplicated. For example:

CustomerID  refID  paymentcode  paymentdate
1           112    VCFX21       2/2/2000
2           113    VCFX21       2/2/2000
3           114    BCTN30       2/3/2000
4           119    BCTN30       2/3/2000

    So I want to remove the duplicates from the table.
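(A minimal sketch, not from the original thread: on SQL Server 2005, ROW_NUMBER() is one way to flag such rows. The table name dbo.Payments is hypothetical and matches the sample columns above; rows with rn > 1 are the extra copies.)

;WITH numbered AS
(
    SELECT CustomerID, refID, paymentcode, paymentdate,
           ROW_NUMBER() OVER (PARTITION BY paymentcode, paymentdate
                              ORDER BY CustomerID) AS rn
    FROM dbo.Payments
)
SELECT *            -- lists only the duplicate copies
FROM numbered
WHERE rn > 1
-- To remove them instead: DELETE FROM numbered WHERE rn > 1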

    Hope I am clear in my question.

    Thanks for any help,

    \\K 🙂

    Commit to CAN I - Constant And Never ending Improvement

    ______________________________________________________________________________________________________________________________________________________________________________________
    HTH !
    Kin
    MCTS : 2005, 2008
    Active SQL Server Community Contributor 🙂

Your example shows 4 columns, none of which intuitively seems to be the PK...

Is the PK of each table ALWAYS an identity field, or is it always the first column of the table (or both)?

If it is, you could do something like the following to at least identify dupes; if not, you have to find the commonalities somehow.

    [font="Courier New"]

    SELECT TABLE_NAME,

      'SELECT DISTINCT ' + STUFF(

              (

              SELECT ', ' + B.COLUMN_NAME

              FROM information_schema.columns B

              WHERE A.TABLE_NAME = B.TABLE_NAME

              FOR XML PATH(''))

              ,

         1, 2, '') + ' FROM ' + TABLE_NAME + ' GROUP BY '

    + STUFF(

              (

              SELECT ', ' + B.COLUMN_NAME

              FROM information_schema.columns B

              WHERE A.TABLE_NAME = B.TABLE_NAME

              FOR XML PATH(''))

              ,

         1, 2, '') + ' HAVING COUNT(*) > 1'

    FROM information_schema.columns A

    WHERE ORDINAL_POSITION > 1

    GROUP BY TABLE_NAME[/font]

You could build diagnostic strings like this to examine the duplicates, and then similar strings to remove them.
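To give an idea of what the generator produces: for a hypothetical table named Payments whose columns are ID, paymentcode and paymentdate, the generated string (shown here wrapped across lines) would be roughly:

SELECT DISTINCT paymentcode, paymentdate
FROM Payments
GROUP BY paymentcode, paymentdate
HAVING COUNT(*) > 1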

    Lowell


--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • SQL_Quest (3/4/2009)


    beezell (3/4/2009)


This sounds like a bit of a complex problem. The issue will be that each table has a different column that will identify it as being duplicated. Is there some more information that might be useful in formulating rules for identifying duplicate values across all of the tables?

Here is an example of what might be a rule: every table has the column [ID] which identifies it. If this column has two rows with the same value, then those rows are considered duplicates.

    Cheers,

    Brian

    Hi,

The problem is that the PK is not duplicated. Other fields are being duplicated. For example:

CustomerID  refID  paymentcode  paymentdate
1           112    VCFX21       2/2/2000
2           113    VCFX21       2/2/2000
3           114    BCTN30       2/3/2000
4           119    BCTN30       2/3/2000

    So I want to remove the duplicates from the table.

    Hope I am clear in my question.

    Thanks for any help,

    \\K 🙂

    Commit to CAN I - Constant And Never ending Improvement

    Since you didn't post data in a readily consumable format and you haven't identified such things as which dupe you want to actually keep (first or last), you leave me to my own devices... hope you get it...

    Here's a test table with a wad of dupes... don't let the number 1 million scare you... it only takes about 30 seconds to build the table and add the PK.

DROP TABLE JBMTest
GO
--===== Create and populate a 1,000,000 row test table.
     -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
     -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
     -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
     -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
     -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
     -- for all rows.
     -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
     -- Jeff Moden
 SELECT TOP 1000000
        RowNum       = IDENTITY(INT,1,1),
        SomeInt      = ABS(CHECKSUM(NEWID()))%50000+1,
        SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                     + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
        SomeCSV      = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
        SomeMoney    = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
        SomeHex12    = RIGHT(NEWID(),12)
   INTO dbo.JBMTest
   FROM Master.dbo.SysColumns t1
  CROSS JOIN Master.dbo.SysColumns t2

--===== A table is not properly formed unless a Primary Key has been assigned
 ALTER TABLE dbo.JBMTest
   ADD PRIMARY KEY CLUSTERED (RowNum)

Now, let's just say we have a couple of rules... the first rule is that we only want to keep the "latest" dupe from each group of dupes... in this case, that means the one with the highest RowNum. The other rule is that what constitutes a dupe is determined by the columns SomeInt and SomeLetters2.

    So, first off... let's just see how many dupes we have... the -1 is because we're going to delete all but one of each group of dupes...

 SELECT SomeInt, SomeLetters2, COUNT(*)-1 AS DupeCount
   FROM dbo.JBMTest
  GROUP BY SomeInt, SomeLetters2
 HAVING COUNT(*) > 1
COMPUTE SUM(COUNT(*)-1)

    So, now we know what to expect... now we can delete the dupes and, if the rowcount doesn't match what came from above, we can do a ROLLBACK to keep our respective bosses from introducing us to the local soup line... 😀

BEGIN TRANSACTION

DELETE t1
  FROM dbo.JBMTest t1 WITH (TABLOCKX)
 INNER JOIN dbo.JBMTest t2 WITH (TABLOCKX)
    ON  t1.SomeInt      = t2.SomeInt
    AND t1.SomeLetters2 = t2.SomeLetters2
    AND t1.RowNum       < t2.RowNum

-- ROLLBACK -- COMMIT

Even if the table has 800,000 dupes in it, it still takes only a minute to execute. With only 14,000 dupes, it takes about 9 seconds to run.

    Now, most folks that know me real well would scream "Foul! That's a triangular join!" But, the optimizer doesn't think so because, guess what? No aggregation... it just does a hash match join and off it goes with no duplicated rows being generated in the background. It's nasty fast for what it does, which is a full index scan.

    Oh yeah... if the rowcount matches the earlier rowcount, don't forget to COMMIT or you'll keep the table locked. Obviously, you'll wanna do ROLLBACK if the rowcounts don't match.
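A minimal sketch of that verify-then-commit step in a single batch against the same JBMTest table (the variable names @ExpectedDupes and @DeletedRows are made up for illustration):

DECLARE @ExpectedDupes INT, @DeletedRows INT

-- How many rows we expect to delete: all but one row from each group of dupes
SELECT @ExpectedDupes = ISNULL(SUM(DupeCount), 0)
  FROM (SELECT COUNT(*)-1 AS DupeCount
          FROM dbo.JBMTest
         GROUP BY SomeInt, SomeLetters2
        HAVING COUNT(*) > 1) d

BEGIN TRANSACTION

DELETE t1
  FROM dbo.JBMTest t1 WITH (TABLOCKX)
 INNER JOIN dbo.JBMTest t2 WITH (TABLOCKX)
    ON  t1.SomeInt      = t2.SomeInt
    AND t1.SomeLetters2 = t2.SomeLetters2
    AND t1.RowNum       < t2.RowNum

SET @DeletedRows = @@ROWCOUNT   -- capture the rowcount immediately after the DELETE

IF @DeletedRows = @ExpectedDupes
    COMMIT      -- counts match: keep the delete
ELSE
    ROLLBACK    -- counts differ: undo and investigate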

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/4/2009)


    ... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.

    Hi,

    Thanks for the reply.

    Let me clarify my situation .....

The situation is that a third party has developed an ETL tool which takes a flat file from Business Objects, converts it into an XML file, and loads it into a database. The database has more than 300 tables, but only 8 of them are used (i.e. the ETL loads the data into 8 tables only). I have the 8 tables, and they are related by PK and FK relationships. The tables are very small, some with 80 rows and at most 300 rows.

Every time, Business Objects generates a whole new report (i.e. the old rows plus the new ones), so there is duplication. So far each row is duplicated only once, i.e. every row has exactly one duplicate.

I have to find the duplicates and remove them from all the tables that the ETL populates.

My approach is: I take the table and the field names used by Business Objects and then dig into the tables to find the duplicates. I am running the query below:

SELECT *
  FROM table_name
 WHERE pk_id NOT IN (SELECT MIN(pk_id)
                       FROM table_name
                      GROUP BY column_name1, column_name2)

to find the duplicates. But this is a time-consuming process.

Is there a way to cascade the delete, so that the corresponding rows in all tables that depend on the PK of the row I am deleting are removed as well?
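(For what it's worth, SQL Server supports this through the ON DELETE CASCADE option on a foreign key. A minimal sketch with hypothetical table and column names; an existing FK would first have to be dropped and re-created with this option:)

-- Hypothetical parent/child tables: deleting a row from dbo.ParentTable now
-- automatically removes the dependent rows from dbo.ChildTable.
ALTER TABLE dbo.ChildTable
  ADD CONSTRAINT FK_ChildTable_ParentTable
      FOREIGN KEY (parent_id)
      REFERENCES dbo.ParentTable (parent_id)
      ON DELETE CASCADE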

I am new to SQL Server and not very good at programming...

    Thanks in advance to all,

    \\K

    ______________________________________________________________________________________________________________________________________________________________________________________
    HTH !
    Kin
    MCTS : 2005, 2008
    Active SQL Server Community Contributor 🙂

  • SQL_Quest (3/5/2009)


    Jeff Moden (3/4/2009)


    ... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.

    Hi,

    Thanks for the reply.

    Let me clarify my situation .....

The situation is that a third party has developed an ETL tool which takes a flat file from Business Objects, converts it into an XML file, and loads it into a database. The database has more than 300 tables, but only 8 of them are used (i.e. the ETL loads the data into 8 tables only). I have the 8 tables, and they are related by PK and FK relationships. The tables are very small, some with 80 rows and at most 300 rows.

Every time, Business Objects generates a whole new report (i.e. the old rows plus the new ones), so there is duplication. So far each row is duplicated only once, i.e. every row has exactly one duplicate.

I have to find the duplicates and remove them from all the tables that the ETL populates.

My approach is: I take the table and the field names used by Business Objects and then dig into the tables to find the duplicates. I am running the query below:

SELECT *
  FROM table_name
 WHERE pk_id NOT IN (SELECT MIN(pk_id)
                       FROM table_name
                      GROUP BY column_name1, column_name2)

to find the duplicates. But this is a time-consuming process.

Is there a way to cascade the delete, so that the corresponding rows in all tables that depend on the PK of the row I am deleting are removed as well?

I am new to SQL Server and not very good at programming...

    Thanks in advance to all,

    \\K

    Ummm.... so what's wrong with the rather lengthy post a couple above this one that explained how to delete dupes?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
