Search part of string within part of string

  • Derrick,

    You ask the right question. First of all SSIS fuzzy lookup will find a

    match between "The Association of Back Pain Specialists" and

    "The Association Specialists of Pain Back " and even "Back Pain of Specialists Association The" for SURE (One just need to play with

    similarity and confidence). Those who are doubting in that can read about ETI index of just make a simple test if they cannot understand what is written in this article. The main point here is that in order to realize that "Association for Back Pain of Illinois" and "Association for Back Pain and Illness" are different companies fuzzy match is not enough at all, even if you use additional fields like address.

    What is really require here is semantic analyze and kind of semantic match and that is completely different story. Fuzzy match can only provide you with similar names and in the best case decrease the range of your potential candidates.

  • Didn't this work?


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • I had a similar problem a few years ago, and ended up with building an alias table.

    some tedious work ... but the users creativity does have limitations

  • I deal with this very often with data from my clients, If you take large vendor master you may have 25 different vendor numbers for same vendors. There is so many reasons for this that you can not control them all and over time it is an issue, some ex. decentralized systems, Update or replacement of systems , EDI and Poor user data entry or Business rules not followed.

    The issue for you is to clean up before trying to fix the dups.

    If you go into the data you will find thing like address fields have thing like CO Joe Schmo or better the name of the business in Address1. One on my favorite is PO Box 123, P.O. 123 , P.O. Box 123 and Box 123.

    There is no silver bullet to fix this but only different technique, I like to use scoring to try and do this type of clean up, were several points to match on and keep adding to the final score.

    Start with something like ZIP or part of name to find dups then build on this.

    There is several ways to clean the name remove common words like THE, LLP, INC you will need to identify for your data also remove &,@,., and - this will help.

    Scott
    I am a Senior Data Analyst/DBA
    I work with MS SQL, Oracle & several ETL Tools

    @thewolf66[/url]

Viewing 4 posts - 16 through 18 (of 18 total)

You must be logged in to reply to this topic. Login to reply