March 13, 2014 at 10:59 am
Hi all
I have a table structure with a 16 digit numeric ID field, two 2000 character varchar fields, and a timestamp. Because of an error in a webpage somewhere (as yet undiscovered), instead of having one row with a distinct ID field and comments in two varchar fields, the comments have been split into two rows, both with the same ID, but with different timestamps.
so instead of seeing:
ld_id, ld_comments1, ld_comments2, timestamp
123456789, les was here, and here too, 0x0000000001ECAC49
I see:
ld_id, ld_comments1, ld_comments2, timestamp
123456789, NULL, and here too, 0x0000000001ECAC49
123456789, les was here, NULL, 0x0000000001ECAC4A
I can see what needs to be done to fix the data (*I think*) - find the rows where the ids match, then identify the row with the larger timestamp and insert the comment from comments1 into the row with the same ID but lesser timestamp. This leaves us with two identical rows (excepting the timestamp). Finally deleting one of the (now) duplicated rows.
Any ideas how to crack this one peeps?
One note, not all IDs are duplicates, some have saved fine and would not need adjusting.
Any help appreciated as always
March 13, 2014 at 12:27 pm
Use cte finding the max time stamp into one table and the min timestamp into another. Use COALESCE to insert into a temp table before updating the original
Unfortunately I do not have time to come up with an example.
March 13, 2014 at 1:55 pm
I'll have a shot at the UPDATE part of this:
CREATE TABLE #Test(ID int,Comments1 varchar(2000),Comments2 varchar(2000),Stamp TimeStamp)
INSERT INTO #Test(ID,Comments1,Comments2)
SELECT 123456789, NULL, 'and here too'
UNION ALL
SELECT 123456789,'les was here',NULL;
WITH CTE(ID,Comment) AS(
SELECT ID,Comments1
FROM #Test
WHERE Comments2 IS NULL)
UPDATE #Test
SET Comments1 = Comment
FROM #Test
INNER JOIN CTE
ON #Test.ID = CTE.ID
WHERE Comments1 IS NULL
SELECT * FROM #Test
This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.
I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!
After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM
WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.
EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.
- 😀
March 13, 2014 at 3:54 pm
this sql is really ugly but if you execute it on the table described above it will give the dupes combined into one row for each.
using it you insert them to temp table. use the dupe ids table to join on main table and delete them all. then insert the scrubbed from the temp table. hope that makes sense. Proceed at your own risk, test, and back up that table! 🙂
;WITH bad_recs --(application_key_value, rating_date, new_rating, Ranking)
AS
(
SELECT
ID,
Comments1,
Comments2
, Stamp,
Ranking = DENSE_RANK() OVER(PARTITION BY ID ORDER BY NEWID() ASC)
FROM #Test
)
SELECT
t1.id,
isnull(t1.Comments1,t2.Comments1) Comments1,
isnull(t1.Comments2,t2.Comments2) Comments2,
t1.stamp
from #Test t1
join #Test t2 on t1.id = t2.id and t1.Stamp<>t2.Stamp
--where t1.id in
join (
select id,MAX(stamp) stamp FROM bad_recs
WHERE Ranking > 1 group by id
) dupes on dupes.id = t1.id and t1.stamp = dupes.stamp
March 14, 2014 at 3:22 am
Thanks guys - I'm going to experiment with both of these queries today, I'll obviously report back on their success (feel the confidence 😀 )
And thanks for taking the time to think it through, I'm usually very poor at articulating what I'm trying to achieve.
March 14, 2014 at 6:48 am
hisakimatama (3/13/2014)
I'll have a shot at the UPDATE part of this:
This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.
I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!
After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM
WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.
EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.
Thanks hisakimatama, this has worked a treat. I now have my own copy of the real table with the problem rows updated.
I don't think I can just use the null value in comments2 to delete however since there are some cases of a truly distinct row where there are no comments been entered by choice.
March 14, 2014 at 7:43 am
digitalox (3/13/2014)
this sql is really ugly but if you execute it on the table described above it will give the dupes combined into one row for each.using it you insert them to temp table. use the dupe ids table to join on main table and delete them all. then insert the scrubbed from the temp table. hope that makes sense. Proceed at your own risk, test, and back up that table! 🙂
;WITH bad_recs --(application_key_value, rating_date, new_rating, Ranking)
AS
(
SELECT
ID,
Comments1,
Comments2
, Stamp,
Ranking = DENSE_RANK() OVER(PARTITION BY ID ORDER BY NEWID() ASC)
FROM #Test
)
SELECT
t1.id,
isnull(t1.Comments1,t2.Comments1) Comments1,
isnull(t1.Comments2,t2.Comments2) Comments2,
t1.stamp
from #Test t1
join #Test t2 on t1.id = t2.id and t1.Stamp<>t2.Stamp
--where t1.id in
join (
select id,MAX(stamp) stamp FROM bad_recs
WHERE Ranking > 1 group by id
) dupes on dupes.id = t1.id and t1.stamp = dupes.stamp
Took a little time to try and understand this one, but cracked it, created new table for the join and deleted from the main table based on that...
..and bingo. Copy Data Fixed.
Now I just need to wait until the system is not being used and fix the real data.
Thank you both very much
March 14, 2014 at 9:37 am
l.danks (3/14/2014)
hisakimatama (3/13/2014)
I'll have a shot at the UPDATE part of this:
This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.
I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!
After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM
WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.
EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.
Thanks hisakimatama, this has worked a treat. I now have my own copy of the real table with the problem rows updated.
I don't think I can just use the null value in comments2 to delete however since there are some cases of a truly distinct row where there are no comments been entered by choice.
Ah, gotcha. Glad we could help you work the problem out, though! 🙂
- 😀
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply