Find duplicate rows from multiple tables

  • Hi All,

    Recently we ran into a problem of having duplicated records in our prod database. Below is the scenario:

Business Objects generates a report file, and that file is used by a third-party ETL tool to populate SQL Server database tables. The problem is that the ETL tool does not overwrite the existing records but instead appends to the tables.

For example: the report file contains 75 records. The ETL tool takes the file and loads it into the SQL Server database tables. The next week the report is generated again with 5 additional new rows, so the report file generated by Business Objects now contains 75 + 5 = 80 rows. The ETL tool takes those 80 records and appends them to the respective tables depending on the fields, so various tables now have duplicate records.

How can I find out which tables have duplicate records? There are 300 tables, and I have the list of tables and fields that Business Objects uses to create its report.

The environment is SQL Server 2005 Standard Edition on Windows Server 2003 Standard, both on SP2.

I am new to SQL programming.

    Please help as soon as possible,

    Thanks,

    \\K

    ______________________________________________________________________________________________________________________________________________________________________________________
    HTH !
    Kin
    MCTS : 2005, 2008
    Active SQL Server Community Contributor 🙂

This sounds like a bit of a complex problem. The issue will be that each table has a different column that will identify it as being duplicated. Is there some more information that might be useful in formulating rules for identifying duplicate values across all of the tables?

Here is an example of what might be a rule: every table has the column [ID] which identifies it. If this column has two rows with the same value, then those rows are considered duplicates.
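As a minimal sketch of that rule (the table name dbo.SomeTable is hypothetical), a simple GROUP BY/HAVING query would surface the offending [ID] values:

-- Hypothetical table name; lists every [ID] value that appears more than once
SELECT [ID], COUNT(*) AS DupeCount
FROM dbo.SomeTable
GROUP BY [ID]
HAVING COUNT(*) > 1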

    Cheers,

    Brian

  • beezell (3/4/2009)


This sounds like a bit of a complex problem. The issue will be that each table has a different column that will identify it as being duplicated. Is there some more information that might be useful in formulating rules for identifying duplicate values across all of the tables?

Here is an example of what might be a rule: every table has the column [ID] which identifies it. If this column has two rows with the same value, then those rows are considered duplicates.

    Cheers,

    Brian

    Hi,

The problem is that the PK is not duplicated. Other fields are being duplicated. For example:

CustomerID  refID  paymentcode  paymentdate
1           112    VCFX21       2/2/2000
2           113    VCFX21       2/2/2000
3           114    BCTN30       2/3/2000
4           119    BCTN30       2/3/2000

    So I want to remove the duplicates from the table.
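(A minimal sketch, not from the original thread: on SQL Server 2005, ROW_NUMBER() is one way to flag such rows. The table name dbo.Payments is hypothetical and matches the sample columns above; rows with rn > 1 are the extra copies.)

;WITH numbered AS
(
    SELECT CustomerID, refID, paymentcode, paymentdate,
           ROW_NUMBER() OVER (PARTITION BY paymentcode, paymentdate
                              ORDER BY CustomerID) AS rn
    FROM dbo.Payments
)
SELECT *            -- lists only the duplicate copies
FROM numbered
WHERE rn > 1
-- To remove them instead: DELETE FROM numbered WHERE rn > 1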

    Hope I am clear in my question.

    Thanks for any help,

    \\K 🙂

    Commit to CAN I - Constant And Never ending Improvement

    ______________________________________________________________________________________________________________________________________________________________________________________
    HTH !
    Kin
    MCTS : 2005, 2008
    Active SQL Server Community Contributor 🙂

Your example shows 4 columns, none of which intuitively seems to be the PK...

Is the PK of each table ALWAYS an identity field, or is it always the first column of the table (or both)?

If it is, you could do something like the following to at least identify dupes; if not, you have to find the commonalities somehow.

    [font="Courier New"]

    SELECT TABLE_NAME,

      'SELECT DISTINCT ' + STUFF(

              (

              SELECT ', ' + B.COLUMN_NAME

              FROM information_schema.columns B

              WHERE A.TABLE_NAME = B.TABLE_NAME

              FOR XML PATH(''))

              ,

         1, 2, '') + ' FROM ' + TABLE_NAME + ' GROUP BY '

    + STUFF(

              (

              SELECT ', ' + B.COLUMN_NAME

              FROM information_schema.columns B

              WHERE A.TABLE_NAME = B.TABLE_NAME

              FOR XML PATH(''))

              ,

         1, 2, '') + ' HAVING COUNT(*) > 1'

    FROM information_schema.columns A

    WHERE ORDINAL_POSITION > 1

    GROUP BY TABLE_NAME[/font]

You could build diagnostic strings like this to examine the duplicates, and then similar strings to remove them.
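To give an idea of what the generator produces: for a hypothetical table named Payments whose columns are ID, paymentcode and paymentdate, the generated string (shown here wrapped across lines) would be roughly:

SELECT DISTINCT paymentcode, paymentdate
FROM Payments
GROUP BY paymentcode, paymentdate
HAVING COUNT(*) > 1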

    Lowell


--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • SQL_Quest (3/4/2009)


    beezell (3/4/2009)


This sounds like a bit of a complex problem. The issue will be that each table has a different column that will identify it as being duplicated. Is there some more information that might be useful in formulating rules for identifying duplicate values across all of the tables?

Here is an example of what might be a rule: every table has the column [ID] which identifies it. If this column has two rows with the same value, then those rows are considered duplicates.

    Cheers,

    Brian

    Hi,

The problem is that the PK is not duplicated. Other fields are being duplicated. For example:

CustomerID  refID  paymentcode  paymentdate
1           112    VCFX21       2/2/2000
2           113    VCFX21       2/2/2000
3           114    BCTN30       2/3/2000
4           119    BCTN30       2/3/2000

    So I want to remove the duplicates from the table.

    Hope I am clear in my question.

    Thanks for any help,

    \\K 🙂

    Commit to CAN I - Constant And Never ending Improvement

    Since you didn't post data in a readily consumable format and you haven't identified such things as which dupe you want to actually keep (first or last), you leave me to my own devices... hope you get it...

    Here's a test table with a wad of dupes... don't let the number 1 million scare you... it only takes about 30 seconds to build the table and add the PK.

DROP TABLE JBMTest
GO
--===== Create and populate a 1,000,000 row test table.
     -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
     -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
     -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
     -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
     -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
     -- for all rows.
     -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
     -- Jeff Moden
 SELECT TOP 1000000
        RowNum       = IDENTITY(INT,1,1),
        SomeInt      = ABS(CHECKSUM(NEWID()))%50000+1,
        SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                     + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
        SomeCSV      = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
        SomeMoney    = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
        SomeHex12    = RIGHT(NEWID(),12)
   INTO dbo.JBMTest
   FROM Master.dbo.SysColumns t1
  CROSS JOIN Master.dbo.SysColumns t2

--===== A table is not properly formed unless a Primary Key has been assigned
 ALTER TABLE dbo.JBMTest
   ADD PRIMARY KEY CLUSTERED (RowNum)

Now, let's just say we have a couple of rules... the first rule is that we only want to keep the "latest" dupe from each group of dupes... in this case, that means the one with the highest RowNum. The other rule is that what constitutes a dupe is determined by the columns SomeInt and SomeLetters2.

    So, first off... let's just see how many dupes we have... the -1 is because we're going to delete all but one of each group of dupes...

 SELECT SomeInt, SomeLetters2, COUNT(*)-1 AS DupeCount
   FROM dbo.JBMTest
  GROUP BY SomeInt, SomeLetters2
 HAVING COUNT(*) > 1
COMPUTE SUM(COUNT(*)-1)

    So, now we know what to expect... now we can delete the dupes and, if the rowcount doesn't match what came from above, we can do a ROLLBACK to keep our respective bosses from introducing us to the local soup line... 😀

BEGIN TRANSACTION

DELETE t1
  FROM dbo.JBMTest t1 WITH (TABLOCKX)
 INNER JOIN dbo.JBMTest t2 WITH (TABLOCKX)
    ON  t1.SomeInt      = t2.SomeInt
    AND t1.SomeLetters2 = t2.SomeLetters2
    AND t1.RowNum       < t2.RowNum

-- ROLLBACK -- COMMIT

Even if the table has 800,000 dupes in it, it still takes only a minute to execute. With only 14,000 dupes, it takes about 9 seconds to run.

    Now, most folks that know me real well would scream "Foul! That's a triangular join!" But, the optimizer doesn't think so because, guess what? No aggregation... it just does a hash match join and off it goes with no duplicated rows being generated in the background. It's nasty fast for what it does, which is a full index scan.

    Oh yeah... if the rowcount matches the earlier rowcount, don't forget to COMMIT or you'll keep the table locked. Obviously, you'll wanna do ROLLBACK if the rowcounts don't match.
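A minimal sketch of that verify-then-commit step in a single batch against the same JBMTest table (the variable names @ExpectedDupes and @DeletedRows are made up for illustration):

DECLARE @ExpectedDupes INT, @DeletedRows INT

-- How many rows we expect to delete: all but one row from each group of dupes
SELECT @ExpectedDupes = ISNULL(SUM(DupeCount), 0)
  FROM (SELECT COUNT(*)-1 AS DupeCount
          FROM dbo.JBMTest
         GROUP BY SomeInt, SomeLetters2
        HAVING COUNT(*) > 1) d

BEGIN TRANSACTION

DELETE t1
  FROM dbo.JBMTest t1 WITH (TABLOCKX)
 INNER JOIN dbo.JBMTest t2 WITH (TABLOCKX)
    ON  t1.SomeInt      = t2.SomeInt
    AND t1.SomeLetters2 = t2.SomeLetters2
    AND t1.RowNum       < t2.RowNum

SET @DeletedRows = @@ROWCOUNT   -- capture the rowcount immediately after the DELETE

IF @DeletedRows = @ExpectedDupes
    COMMIT      -- counts match: keep the delete
ELSE
    ROLLBACK    -- counts differ: undo and investigate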

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/4/2009)


    ... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.

    Hi,

    Thanks for the reply.

    Let me clarify my situation .....

The situation is that a third party has developed an ETL tool which takes a flat file from Business Objects, converts it into an XML file, and loads it into a database. The database has more than 300 tables, but only 8 of them are used (i.e. the ETL loads the data into 8 tables only). I have the 8 tables, and they are related by PK and FK relationships. The tables are very small, some with 80 rows and at most 300 rows.

Every time, Business Objects generates a whole new report (i.e. the old rows plus the new ones), so there is duplication. So far each row is duplicated only once, i.e. every row has exactly one duplicate.

I have to find the duplicates and remove them from all the tables that the ETL populates.

My approach is: I take the table and the field names used by Business Objects and then dig into the tables to find the duplicates. I am running the query below:

SELECT *
  FROM table_name
 WHERE pk_id NOT IN (SELECT MIN(pk_id)
                       FROM table_name
                      GROUP BY column_name1, column_name2)

to find the duplicates. But this is a time-consuming process.

Is there a way to cascade the delete, so that the corresponding rows in all tables that depend on the PK of the row I am deleting are removed as well?
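(For what it's worth, SQL Server supports this through the ON DELETE CASCADE option on a foreign key. A minimal sketch with hypothetical table and column names; an existing FK would first have to be dropped and re-created with this option:)

-- Hypothetical parent/child tables: deleting a row from dbo.ParentTable now
-- automatically removes the dependent rows from dbo.ChildTable.
ALTER TABLE dbo.ChildTable
  ADD CONSTRAINT FK_ChildTable_ParentTable
      FOREIGN KEY (parent_id)
      REFERENCES dbo.ParentTable (parent_id)
      ON DELETE CASCADE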

I am new to SQL Server and not very good at programming...

    Thanks in advance to all,

    \\K

    ______________________________________________________________________________________________________________________________________________________________________________________
    HTH !
    Kin
    MCTS : 2005, 2008
    Active SQL Server Community Contributor 🙂

  • SQL_Quest (3/5/2009)


    Jeff Moden (3/4/2009)


    ... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.

    Hi,

    Thanks for the reply.

    Let me clarify my situation .....

The situation is that a third party has developed an ETL tool which takes a flat file from Business Objects, converts it into an XML file, and loads it into a database. The database has more than 300 tables, but only 8 of them are used (i.e. the ETL loads the data into 8 tables only). I have the 8 tables, and they are related by PK and FK relationships. The tables are very small, some with 80 rows and at most 300 rows.

Every time, Business Objects generates a whole new report (i.e. the old rows plus the new ones), so there is duplication. So far each row is duplicated only once, i.e. every row has exactly one duplicate.

I have to find the duplicates and remove them from all the tables that the ETL populates.

My approach is: I take the table and the field names used by Business Objects and then dig into the tables to find the duplicates. I am running the query below:

SELECT *
  FROM table_name
 WHERE pk_id NOT IN (SELECT MIN(pk_id)
                       FROM table_name
                      GROUP BY column_name1, column_name2)

to find the duplicates. But this is a time-consuming process.

Is there a way to cascade the delete, so that the corresponding rows in all tables that depend on the PK of the row I am deleting are removed as well?

I am new to SQL Server and not very good at programming...

    Thanks in advance to all,

    \\K

    Ummm.... so what's wrong with the rather lengthy post a couple above this one that explained how to delete dupes?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
