April 29, 2009 at 4:02 am
Hi All,
I have a requirement to clean up a DB with more than half a million records with duplicates. The challange is i need to return the duplicate records and then merge them, keeping on recod for each record set. The reason for that is that the users have been updating any of the duplicate record, so no data shouldbe lost after the clean up. Is this possible may using using other tools like SSIS or any script?
Any suggestions welcome.
Thanks.
April 29, 2009 at 4:58 am
You can do this using a script, you will probably need to use SELECT DISTINCT and GROUP BY to remove duplicates. Without knowing exactly what you are doing it is hard to give specific advices
Post your table defs, and some sample data and I am sure someone can help out..
April 29, 2009 at 5:46 am
You can duplicte records by Count(*)/ Group By or Row_Number.
But how will you find out the latest update in those duplicate records. Is there any UpdatedOnDateTime column in your tables?
April 29, 2009 at 7:07 am
@phumlo1
what exactly you want to know?
1) how to select duplicate records?
2) How to cleanup your database without keeping duplicate records?
Please give some more specification.
"Don't limit your challenges, challenge your limits"
April 29, 2009 at 8:24 am
There are lots of methods of handling duplicates in SQL. To help you design the best one for your situation, we would need table definitions ("create table..." scripts) and some sample data to put in them (insert... scripts).
When you say you want to return the duplicates and then merge them, do you mean you need to view them first before you merge them? Or do you mean you need to keep a record of some sort of the duplicates that are being removed? Or something else?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 30, 2009 at 2:41 pm
I confirm all other posts. More information imply better answers. Maybe have a look to the link in my signature.
For now a very basic way to handle this:
DECLARE @t TABLE (DuplicateId INT)
INSERT INTO @t
SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 2
; WITH
todo (RowNum) AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY DuplicateId ORDER BY DuplicateId)
FROM @t
)
DELETE FROM todo WHERE RowNum != 1
SELECT * FROM @t
Greets
Flo
May 4, 2009 at 1:31 am
Hi All,
Thanks for all the responses. Here is a sample of my DB:
Fullname, ID_no, Modifiedon, Membershipno,Adress1, Address2, Postal Code
Joe Soap, 123, 2008-09-04, 001, 1 sloane str, Hampton, ,
Joe Soap, 123, 2008-09-06, 001, 1 sloane str, Hampton, 3009
Marie Smith,456, 2008-10-10, 002, 2 Brite str,,,
Marie Smith,456, 2008-10-11, 002, 2 Brite str,Hampton,,
Marie Smith, 465, 2008-10-12,,Brite str,Hampton,3009
In this sample Joe soap has 2 records, in his case i want to keep the second one because its got more data. In Marie's case she's got
3 records, all with different data, first one has no address2 and postal code, second has no postal code, third one has got no membershipno.The three records have pieces of data that i need for Maries complete record. I need to get all the pieces into one record(merge) and delete all others.
Many Thanks.
May 4, 2009 at 2:14 am
Hi All,
Apologies for sending the sample data in the format as in my previous post. will send a proper one.
May 5, 2009 at 5:13 am
This is complex case:
To resolve this issue you can try the following way--
1. First identify the duplicate based on your requirement (may not duplicate)
2. Find the number of records associated with the duplicates.
3. As per your requirement data may come from duplicate records, so you need to write a temp table and insert all the duplicate records into the temp table and check each column value with respective column of other records and insert the data into final temp table which you require.
4. Finally delete the duplicate records from main table and insert the records of final table.
You can do this by using dynamic query.
May 10, 2009 at 11:19 am
The only way I've found to handle this sort of thing requires several passes at the data. You want the most complete data possible for each duplicate:
-- Assuming that your table name is MemberShips and that ID_no is really unique
-- for each name.
-- This first get the dups
SELECT
Fullname, ID_no, MAX(Modifiedon) AS Modifieldon
, CAST(NULL AS VARCHAR(10)) AS Membershipno
, CAST(NULL AS VARCHAR(50)) AS Address1
, CAST(NULL AS VARCHAR(50)) AS Address2
, CAST(NULL(AS VARCHAR(10)) AS PostalCode
INTO #DUP
FROM MemberShips
GROUP BY
Fullname, ID_no
HAVING COUNT(*) > 1
-- Get tne non-null MembershipNo
SELECT M.ID_no, M.Membershipno
INTO #M
FROM #Dup
INNER JOIN Memberships M ON
#Dup.ID_no = M.ID_no
WHERE
M.Membershipno IS NOT NULL
-- Get the non-null Address1
SELECT
M.ID_no, M.Address1
INTO #A1
FROM #Dup
INNER JOIN MemberShips M ON
#Dup.ID_no = M.ID_no
WHERE
M.Address1 IS NOT NULL
-- Do the same for Address2, Postal Code, etc.
-- Use these tables to update the #Dup table.
UPDATE #Dup
SET #Dup.Membershipno = #M.Membershipno
FROM #Dup
INNER JOIN #M ON
#Dup.ID_no = #M.ID_no
-- Update addresses and so forth.
-- #Dup should have something for each column if there was one in
-- the original table.
-- Delete the dups in the original table.
DELETE M
FROM #Dup
INNER JOIN Memberships M ON
#Dup.ID_no = M.ID_no
-- Re-insert the completed records from #Dup
INSERT INTO Memeberships
( Fullname, ID_no, etc.)
SELECT
Fullname, IDNo, etc
FROM #DUP
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply