Combining keys to create a master list

  • I am struggling trying to clean some data and identify duplicate records. I used fuzzy grouping in SSIS and provided back a series of groups. The issue is some of the individual records can appear in multiple groups (so in reality the groups should be combined). This is best explained with an example:

    Original Data

    key1 key2

    647942600014

    647942285437

    2324662490640

    2324662285437

    2324662066128

    2222 2285437

    2222 1111111

    9999 1111111

    9999 2222222

    Should look like:

    22222600014

    22222285437

    22222490640

    22222066128

    22221111111

    22222222222

    I only choose 2222 as the surviving key because it was the smallest number. I really do not care which number remains as long as it is the same across.

    I tried playing with self joins between the tables but have had no success.

    Any ideas would be appreciated.

    I am using Sql Server 2008 and the number of records could 500K to 1MM.

    Thanks.

  • Does it need to be part of a SSIS package? If not, you can locate records with grouping:

    select Key2

    ,min(key1) as RecordToKeep,

    ,count(key1) as RecordCount

    from table

    group by Key2

    having count(key1) > 1

  • I don't know anything about fuzzy grouping in SSIS but doesn't this do what you want?

    WITH SampleData (key1, key2) AS

    (

    SELECT 64794, 2600014

    UNION ALL SELECT 64794, 2285437

    UNION ALL SELECT 232466, 2490640

    UNION ALL SELECT 232466, 2285437

    UNION ALL SELECT 232466, 2066128

    UNION ALL SELECT 2222, 2285437

    UNION ALL SELECT 2222, 1111111

    UNION ALL SELECT 9999, 1111111

    UNION ALL SELECT 9999, 2222222

    )

    SELECT key1=(SELECT MIN(key1) FROM SampleData)

    ,key2

    FROM SampleData

    GROUP BY key2;


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply