March 5, 2014 at 12:26 pm
I am struggling trying to clean some data and identify duplicate records. I used fuzzy grouping in SSIS and provided back a series of groups. The issue is some of the individual records can appear in multiple groups (so in reality the groups should be combined). This is best explained with an example:
Original Data
key1 key2
647942600014
647942285437
2324662490640
2324662285437
2324662066128
2222 2285437
2222 1111111
9999 1111111
9999 2222222
Should look like:
22222600014
22222285437
22222490640
22222066128
22221111111
22222222222
I only choose 2222 as the surviving key because it was the smallest number. I really do not care which number remains as long as it is the same across.
I tried playing with self joins between the tables but have had no success.
Any ideas would be appreciated.
I am using Sql Server 2008 and the number of records could 500K to 1MM.
Thanks.
March 5, 2014 at 12:34 pm
Does it need to be part of a SSIS package? If not, you can locate records with grouping:
select Key2
,min(key1) as RecordToKeep,
,count(key1) as RecordCount
from table
group by Key2
having count(key1) > 1
March 5, 2014 at 5:25 pm
I don't know anything about fuzzy grouping in SSIS but doesn't this do what you want?
WITH SampleData (key1, key2) AS
(
SELECT 64794, 2600014
UNION ALL SELECT 64794, 2285437
UNION ALL SELECT 232466, 2490640
UNION ALL SELECT 232466, 2285437
UNION ALL SELECT 232466, 2066128
UNION ALL SELECT 2222, 2285437
UNION ALL SELECT 2222, 1111111
UNION ALL SELECT 9999, 1111111
UNION ALL SELECT 9999, 2222222
)
SELECT key1=(SELECT MIN(key1) FROM SampleData)
,key2
FROM SampleData
GROUP BY key2;
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply