Fuzzy Grouping - Question

  • I am experimenting with the Fuzzy Grouping Data Flow Transformation and I am getting unexpected behavior. I set the similarity threshold down to 0.25 (I have also tried lower and higher values).

    The Issue:

    I am not sure why it won't group 'Dan' with 'Daniel'. It does it properly for X-RAY Technologist.

    Any Ideas?  Thanks.  Daniel

     

    _key_in  _key_out  _score    Name                Name_clean          _Similarity_Name
    1        1         1         DANIEL              DANIEL              1
    3        1         0.665343  DANIELSON           DANIEL              0.6653433
    2        2         1         DAN                 DAN                 1
    4        4         1         X-RAY TECHNOLOGIST  X-RAY TECHNOLOGIST  1
    5        4         0.9       XRAY TECHNOLOGIST   X-RAY TECHNOLOGIST  0.9
    7        4         0.77257   X-RAY TECH          X-RAY TECHNOLOGIST  0.7725695
    6        4         0.561458  XRAY TECH           X-RAY TECHNOLOGIST  0.5614583

  • From

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/FzDTSSQL05.asp

    Fuzzy Lookup and Fuzzy Grouping use a custom, domain-independent distance function that takes into account the edit distance (for example, "hits" is distance 2 from "bit"), the number of tokens, token order, and relative frequencies.

    In your case, the edit distance from DAN to DANIEL is 3, and there are no other tokens to contribute to the score.

    "X-RAY TECH" and "X-RAY TECHNOLOGIST", on the other hand, have a common token.

    However, the edit distance from DANIELSON to DANIEL is also 3. I can only guess that their "custom, domain-independent distance function" puts a different weight on deletions than on insertions.
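
    To check the edit distances quoted above, here is a minimal Python sketch of the classic Levenshtein distance (single-character inserts, deletes, and substitutions, each costing 1). This is only an illustration: it is not the function that Fuzzy Grouping uses, which also considers tokens, token order, and relative frequencies, and the name edit_distance is just a hypothetical helper.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between a and b.

    Illustration only: NOT the custom distance function used by
    Fuzzy Lookup / Fuzzy Grouping.
    """
    # previous[j] = distance between the first i-1 chars of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning the first i chars of a into "" takes i deletes
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


pairs = [("hits", "bit"), ("DAN", "DANIEL"), ("DANIELSON", "DANIEL")]
for left, right in pairs:
    print(left, right, edit_distance(left, right))
# hits bit 2
# DAN DANIEL 3
# DANIELSON DANIEL 3
```

    Both name pairs come out at the same raw distance of 3, so plain edit distance alone does not explain why DANIELSON is grouped with DANIEL while DAN is not.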

  • Thanks for the info.
