De-Dupe

  • Whats the best way to find the duplicate data in a sql server table?

    I am using SOUNDEX on Company Name, but it does not work best all the time. What's the alternates? Does anyone know about any third part tools?

  • I like to use T-SQL with ROW_COUNT()

    do you have a unique key that identifies the row?

    Can you provide some more details about the table and data?

  • Likely you would need multiple passes to find typos. Duplicates would be exact matches. typically that's a

    select dupfield

    , count (*)

    from mytable

    group by dupfield

    having count(*) > 1

    If you are looking for close entries that are typo'd, soundex might work, but you'd likely need a loose filter and then manually clean things up.

  • I have a company table containing all the information about different companies (name, contact, address etc).

    I want to identify the closely matching companies based on company name.

    e.g i want to identify "White horse Company" & "White horse Co."

    Currently i am using SOUNDEX and it works well but not a best option as at times it gives weird matching results.

  • U can't do it without having a certain rule defined to the search criteria, if u want to do it through coding.

    Like:-

    White horse Company White horse Company match

    White horse Company White whore Company not a match

    White horse Company White horse Co match

    White horse Company White horse Organization match

    White horse Company Horse White Company match

    White horse Company Horse White Organization match

    Above mentioned rules are possible to implement in the coding.

    Or you can try SSIS for fuzzy lookup

  • I would suggest the "Fuzzy Match" feature in SSIS. I have used it a couple of times to cross reference data between different systems and it has been very useful.

  • Ed-86789 (10/28/2009)


    I would suggest the "Fuzzy Match" feature in SSIS. I have used it a couple of times to cross reference data between different systems and it has been very useful.

    Thanks, Can you suggest a good article explaining how it works and how to use it?

  • Don't know of an article, but if you type "Fuzzy Lookup Transformation " in BOL it will describe it. Assuming you're familiar with SSIS, this is a component in a Data Flow. You supply an input result set, such as a table or query, and the lookup table that you want to match against ( in your case the same table - I think that should be possible - or a copy of it ). Then you designate which fields you want to match.

    It will output another result set containing data from the original and lookup sets and a ranking of similarity. You could use that ranking as a basis for de-duping. You would need to throw out the 100% matches of the record to itself.

  • Thanks for your help, reading BOL now.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply