Copying data and deleting duplicates

Question

Copying data and deleting duplicates

l.danks

Mr or Mrs. 500

Points: 503
More actions
March 13, 2014 at 10:59 am

#305084

Hi all
I have a table structure with a 16 digit numeric ID field, two 2000 character varchar fields, and a timestamp. Because of an error in a webpage somewhere (as yet undiscovered), instead of having one row with a distinct ID field and comments in two varchar fields, the comments have been split into two rows, both with the same ID, but with different timestamps.
so instead of seeing:
ld_id, ld_comments1, ld_comments2, timestamp
123456789, les was here, and here too, 0x0000000001ECAC49
I see:
ld_id, ld_comments1, ld_comments2, timestamp
123456789, NULL, and here too, 0x0000000001ECAC49
123456789, les was here, NULL, 0x0000000001ECAC4A
I can see what needs to be done to fix the data (*I think*) - find the rows where the ids match, then identify the row with the larger timestamp and insert the comment from comments1 into the row with the same ID but lesser timestamp. This leaves us with two identical rows (excepting the timestamp). Finally deleting one of the (now) duplicated rows.
Any ideas how to crack this one peeps?
One note, not all IDs are duplicates, some have saved fine and would not need adjusting.
Any help appreciated as always

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

djj SSCoach Points: 18831 More actions · Answer 1

Use cte finding the max time stamp into one table and the min timestamp into another. Use COALESCE to insert into a temp table before updating the original

Unfortunately I do not have time to come up with an example.

Andrew Kernodle SSCertifiable Points: 5314 More actions · Answer 2

I'll have a shot at the UPDATE part of this:

CREATE TABLE #Test(ID int,Comments1 varchar(2000),Comments2 varchar(2000),Stamp TimeStamp)

INSERT INTO #Test(ID,Comments1,Comments2)

SELECT 123456789, NULL, 'and here too'

UNION ALL

SELECT 123456789,'les was here',NULL;

WITH CTE(ID,Comment) AS(

SELECT ID,Comments1

FROM #Test

WHERE Comments2 IS NULL)

UPDATE #Test

SET Comments1 = Comment

FROM #Test

INNER JOIN CTE

ON #Test.ID = CTE.ID

WHERE Comments1 IS NULL

SELECT * FROM #Test

This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.

I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!

After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM

WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.

EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.

- 😀

digitalox SSCommitted Points: 1830 More actions · Answer 3

this sql is really ugly but if you execute it on the table described above it will give the dupes combined into one row for each.

using it you insert them to temp table. use the dupe ids table to join on main table and delete them all. then insert the scrubbed from the temp table. hope that makes sense. Proceed at your own risk, test, and back up that table! 🙂

;WITH bad_recs --(application_key_value, rating_date, new_rating, Ranking)

AS

(

SELECT

ID,

Comments1,

Comments2

, Stamp,

Ranking = DENSE_RANK() OVER(PARTITION BY ID ORDER BY NEWID() ASC)

FROM #Test

)

SELECT

t1.id,

isnull(t1.Comments1,t2.Comments1) Comments1,

isnull(t1.Comments2,t2.Comments2) Comments2,

t1.stamp

from #Test t1

join #Test t2 on t1.id = t2.id and t1.Stamp<>t2.Stamp

--where t1.id in

join (

select id,MAX(stamp) stamp FROM bad_recs

WHERE Ranking > 1 group by id

) dupes on dupes.id = t1.id and t1.stamp = dupes.stamp

l.danks Mr or Mrs. 500 Points: 503 More actions · Answer 4

Thanks guys - I'm going to experiment with both of these queries today, I'll obviously report back on their success (feel the confidence 😀 )

And thanks for taking the time to think it through, I'm usually very poor at articulating what I'm trying to achieve.

l.danks Mr or Mrs. 500 Points: 503 More actions · Answer 5

hisakimatama (3/13/2014)
I'll have a shot at the UPDATE part of this:
This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.
I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!
After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM
WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.
EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.

Thanks hisakimatama, this has worked a treat. I now have my own copy of the real table with the problem rows updated.

I don't think I can just use the null value in comments2 to delete however since there are some cases of a truly distinct row where there are no comments been entered by choice.

l.danks Mr or Mrs. 500 Points: 503 More actions · Answer 6

digitalox (3/13/2014)
this sql is really ugly but if you execute it on the table described above it will give the dupes combined into one row for each.
using it you insert them to temp table. use the dupe ids table to join on main table and delete them all. then insert the scrubbed from the temp table. hope that makes sense. Proceed at your own risk, test, and back up that table! 🙂
;WITH bad_recs --(application_key_value, rating_date, new_rating, Ranking)
AS
(
SELECT
ID,
Comments1,
Comments2
, Stamp,
Ranking = DENSE_RANK() OVER(PARTITION BY ID ORDER BY NEWID() ASC)
FROM #Test
)
SELECT
t1.id,
isnull(t1.Comments1,t2.Comments1) Comments1,
isnull(t1.Comments2,t2.Comments2) Comments2,
t1.stamp
from #Test t1
join #Test t2 on t1.id = t2.id and t1.Stamp<>t2.Stamp
--where t1.id in
join (
select id,MAX(stamp) stamp FROM bad_recs
WHERE Ranking > 1 group by id
) dupes on dupes.id = t1.id and t1.stamp = dupes.stamp

Took a little time to try and understand this one, but cracked it, created new table for the join and deleted from the main table based on that...

..and bingo. Copy Data Fixed.

Now I just need to wait until the system is not being used and fix the real data.

Thank you both very much

Andrew Kernodle SSCertifiable Points: 5314 More actions · Answer 7

l.danks (3/14/2014)
hisakimatama (3/13/2014)
I'll have a shot at the UPDATE part of this:
This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.
I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!
After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM
WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.
EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.
Thanks hisakimatama, this has worked a treat. I now have my own copy of the real table with the problem rows updated.
I don't think I can just use the null value in comments2 to delete however since there are some cases of a truly distinct row where there are no comments been entered by choice.

Ah, gotcha. Glad we could help you work the problem out, though! 🙂

- 😀