March 4, 2009 at 12:50 pm
Hi All,
Recently we ran into a problem of duplicated records in our prod database. Here is the scenario:
Business Objects generates a report file, and that file is used by a third-party ETL tool to populate SQL Server database tables. The problem is that the ETL tool does not overwrite the existing records, but instead appends to the table.
For example, the report file contains 75 records. The ETL tool takes the file and loads it into the SQL Server tables. The next week the report is generated again with 5 additional rows (these are new rows), so the report file generated by Business Objects now contains 75 + 5 = 80 rows. The ETL tool takes those 80 records and appends them to the respective tables, so various tables now contain duplicate records.
How can I find out which tables have duplicate records? There are 300 tables, and I have the list of tables and fields that Business Objects uses to create its report.
The environment is SQL 2k5 Standard Edition on Win 2k3 Standard, both on SP2.
I am new to SQL programming.
Please help as soon as possible,
Thanks,
\\K
______________________________________________________________________________________________________________________________________________________________________________________
HTH !
Kin
MCTS : 2005, 2008
Active SQL Server Community Contributor 🙂
March 4, 2009 at 12:56 pm
This sounds like a fairly complex problem. The issue is that each table has a different column that identifies a row as a duplicate. Is there more information that might be useful in formulating rules for identifying duplicate values across all of the tables?
Here is an example of such a rule: every table has a column [ID] that identifies the row. If two rows share the same value in this column, they are considered duplicates.
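A minimal sketch of that rule, assuming a hypothetical table and column (dbo.SomeTable and [ID] are placeholders, not names from your database):

```sql
-- Find [ID] values that occur more than once (hypothetical names)
SELECT ID, COUNT(*) AS DupeCount
FROM dbo.SomeTable
GROUP BY ID
HAVING COUNT(*) > 1
```

If a rule like this holds for all 300 tables, the check can be generated mechanically from the system catalog instead of being written by hand.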
Cheers,
Brian
March 4, 2009 at 5:19 pm
beezell (3/4/2009)
This sounds like a fairly complex problem. The issue is that each table has a different column that identifies a row as a duplicate. Is there more information that might be useful in formulating rules for identifying duplicate values across all of the tables? Here is an example of such a rule: every table has a column [ID] that identifies the row. If two rows share the same value in this column, they are considered duplicates.
Cheers,
Brian
Hi,
The problem is that the PK is not duplicated; other fields are being duplicated. For example:
CustomerID   refID   paymentcode   paymentdate
1            112     VCFX21        2/2/2000
2            113     VCFX21        2/2/2000
3            114     BCTN30        2/3/2000
4            119     BCTN30        2/3/2000
So I want to remove the duplicates from the table.
I hope my question is clear.
Thanks for any help,
\\K 🙂
Commit to CAN I - Constant And Never ending Improvement
March 4, 2009 at 6:07 pm
Your example shows 4 columns, none of which intuitively looks like the PK...
Is the PK of each table ALWAYS an identity field, or always the first column of the table (or both)?
If it is, you could do something like this to at least identify the dupes; if not, you have to find the commonalities somehow:
SELECT TABLE_NAME,
       'SELECT DISTINCT '
       + STUFF((SELECT ', ' + B.COLUMN_NAME
                FROM INFORMATION_SCHEMA.COLUMNS B
                WHERE A.TABLE_NAME = B.TABLE_NAME
                  AND B.ORDINAL_POSITION > 1 -- skip the PK in column 1
                FOR XML PATH('')), 1, 2, '')
       + ' FROM ' + TABLE_NAME
       + ' GROUP BY '
       + STUFF((SELECT ', ' + B.COLUMN_NAME
                FROM INFORMATION_SCHEMA.COLUMNS B
                WHERE A.TABLE_NAME = B.TABLE_NAME
                  AND B.ORDINAL_POSITION > 1 -- skip the PK here too
                FOR XML PATH('')), 1, 2, '')
       + ' HAVING COUNT(*) > 1'
FROM INFORMATION_SCHEMA.COLUMNS A
WHERE ORDINAL_POSITION > 1
GROUP BY TABLE_NAME
You could build diagnostic strings like this to generate, per table, the statements that examine and eventually remove the duplicates.
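To illustrate, for a hypothetical table Payments(PaymentID, refID, paymentcode, paymentdate) with PaymentID in ordinal position 1, the script above would emit a string along these lines (the table and column names here are assumptions for the example):

```sql
-- Generated dupe-finder for the hypothetical Payments table
SELECT DISTINCT refID, paymentcode, paymentdate
FROM Payments
GROUP BY refID, paymentcode, paymentdate
HAVING COUNT(*) > 1
```

Run the generated statements one per table; any that return rows point at tables containing duplicates.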
Lowell
March 4, 2009 at 10:33 pm
SQL_Quest (3/4/2009)
beezell (3/4/2009)
This sounds like a fairly complex problem. The issue is that each table has a different column that identifies a row as a duplicate. Is there more information that might be useful in formulating rules for identifying duplicate values across all of the tables? Here is an example of such a rule: every table has a column [ID] that identifies the row. If two rows share the same value in this column, they are considered duplicates.
Cheers,
Brian
Hi,
The problem is that the PK is not duplicated; other fields are being duplicated. For example:
CustomerID   refID   paymentcode   paymentdate
1            112     VCFX21        2/2/2000
2            113     VCFX21        2/2/2000
3            114     BCTN30        2/3/2000
4            119     BCTN30        2/3/2000
So I want to remove the duplicates from the table.
I hope my question is clear.
Thanks for any help,
\\K 🙂
Commit to CAN I - Constant And Never ending Improvement
Since you didn't post data in a readily consumable format and you haven't identified such things as which dupe you want to actually keep (first or last), you leave me to my own devices... hope you get it...
Here's a test table with a wad of dupes... don't let the number 1 million scare you... it only takes about 30 seconds to build the table and add the PK.
IF OBJECT_ID('dbo.JBMTest') IS NOT NULL
   DROP TABLE dbo.JBMTest
GO
--===== Create and populate a 1,000,000 row test table.
-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (RowNum)
Now, let's just say we have a couple of rules... the first rule is that we only want to keep the "latest" dupe from each group of dupes... in this case, that means the one with highest RowNum. The other rule is that which constitutes a dupe will be in the columns SomeInt and SomeLetters2.
So, first off... let's just see how many dupes we have... the -1 is because we're going to delete all but one of each group of dupes...
SELECT SomeInt, SomeLetters2, COUNT(*)-1 AS DupeCount
FROM dbo.JBMTest
GROUP BY SomeInt, SomeLetters2
HAVING COUNT(*) > 1
COMPUTE SUM(COUNT(*)-1)
So, now we know what to expect... now we can delete the dupes and, if the rowcount doesn't match what came from above, we can do a ROLLBACK to keep our respective bosses from introducing us to the local soup line... 😀
BEGIN TRANSACTION
DELETE t1
FROM dbo.JBMTest t1 WITH (TABLOCKX)
INNER JOIN dbo.JBMTest t2 WITH (TABLOCKX)
ON t1.SomeInt = t2.SomeInt
AND t1.SomeLetters2 = t2.SomeLetters2
AND t1.RowNum < t2.RowNum
-- ROLLBACK -- COMMIT
Even if the table has 800,000 dupes in it, it still takes only a minute to execute. With only 14,000 dupes, it takes about 9 seconds to run.
Now, most folks that know me real well would scream "Foul! That's a triangular join!" But, the optimizer doesn't think so because, guess what? No aggregation... it just does a hash match join and off it goes with no duplicated rows being generated in the background. It's nasty fast for what it does, which is a full index scan.
Oh yeah... if the rowcount matches the earlier rowcount, don't forget to COMMIT or you'll keep the table locked. Obviously, you'll wanna do ROLLBACK if the rowcounts don't match.
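Since this is 2k5, another option worth knowing (a sketch applying the same keep-the-latest rule, not a claim about which runs faster on your data) is a ROW_NUMBER() CTE, which avoids the self-join entirely:

```sql
-- Number the rows in each SomeInt/SomeLetters2 group, latest (highest RowNum) first,
-- then delete everything but row 1 of each group.
;WITH Numbered AS
(
    SELECT RowNum,
           ROW_NUMBER() OVER (PARTITION BY SomeInt, SomeLetters2
                              ORDER BY RowNum DESC) AS rn
    FROM dbo.JBMTest
)
DELETE FROM Numbered
WHERE rn > 1
```

Deleting through the CTE works here because the CTE references a single table; wrap it in the same BEGIN TRANSACTION / rowcount check as above before committing.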
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2009 at 10:40 pm
... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 5, 2009 at 12:18 pm
Jeff Moden (3/4/2009)
... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.
Hi,
Thanks for the reply.
Let me clarify my situation .....
The situation is that a 3rd party has developed an ETL tool which takes a flat file from Business Objects, converts it into an XML file, and loads it into a database. The database has more than 300 tables, but only 8 of them are used (i.e. the ETL loads data into 8 tables only). I have the 8 tables, and they are related by PK and FK relationships. The tables are very small: some have 80 rows, and the largest has 300 rows.
Every time, Business Objects generates a whole new report (i.e. the old rows plus the new), so there is duplication. So far there is only double duplication, i.e. each row has exactly one duplicate.
I have to find out the duplicates and remove them from all the tables the ETL populates.
My approach is: I take the table and field names used by Business Objects and then dig into the tables to find the duplicates. I am running the query below:
SELECT *
FROM table_name
WHERE pk_id NOT IN
      (SELECT MIN(pk_id)
       FROM table_name
       GROUP BY column_name1, column_name2)
to find the duplicates. But this is a time-consuming process.
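For what it's worth, the same pattern can be turned directly into a delete (a sketch only; table_name, pk_id and the column names are the same placeholders as in the query above):

```sql
-- Delete every row whose PK is not the minimum PK of its duplicate group
DELETE FROM table_name
WHERE pk_id NOT IN
      (SELECT MIN(pk_id)
       FROM table_name
       GROUP BY column_name1, column_name2)
```

With only 8 small tables, running this once per table (with the right grouping columns substituted in) is probably simpler than automating it.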
Is there a way to cascade the delete, so that the corresponding rows are removed from all the tables that depend on the PK row of the table I am deleting from?
I am new to SQL Server and not very good at programming...
Thanks in advance to all,
\\K
March 5, 2009 at 1:45 pm
SQL_Quest (3/5/2009)
Jeff Moden (3/4/2009)
... and, no... sorry... I don't know how to write a script to do it for 300 different tables auto-magically... no way to know which columns make up the "duplicate" columns. However, if ALL 300 tables are identical in structure in every way except the table name, post back because I do know how to pull that little trick off, especially in 2k5.Hi,
Thanks for the reply.
Let me clarify my situation .....
The situation is that a 3rd party has developed an ETL tool which takes a flat file from Business Objects, converts it into an XML file, and loads it into a database. The database has more than 300 tables, but only 8 of them are used (i.e. the ETL loads data into 8 tables only). I have the 8 tables, and they are related by PK and FK relationships. The tables are very small: some have 80 rows, and the largest has 300 rows.
Every time, Business Objects generates a whole new report (i.e. the old rows plus the new), so there is duplication. So far there is only double duplication, i.e. each row has exactly one duplicate.
I have to find out the duplicates and remove them from all the tables the ETL populates.
My approach is: I take the table and field names used by Business Objects and then dig into the tables to find the duplicates. I am running the query below:
SELECT *
FROM table_name
WHERE pk_id NOT IN
      (SELECT MIN(pk_id)
       FROM table_name
       GROUP BY column_name1, column_name2)
to find the duplicates. But this is a time-consuming process.
Is there a way to cascade the delete, so that the corresponding rows are removed from all the tables that depend on the PK row of the table I am deleting from?
I am new to SQL Server and not very good at programming...
Thanks in advance to all,
\\K
Ummm.... so what's wrong with the rather lengthy post a couple above this one that explained how to delete dupes?
--Jeff Moden
Change is inevitable... Change for the better is not.