September 6, 2006 at 10:25 am
I am experimenting with the Fuzzy Grouping Data Flow Transformation and I am getting unexpected behavior. I set the similarity threshold down to .25 (I have also tried lower and higher values).
The Issue:
I am not sure why it won't group 'Dan' with 'Daniel'. It does it properly for X-RAY Technologist.
Any Ideas? Thanks. Daniel
_key_in | _key_out | _score | Name | Name_clean | _Similarity_Name |
1 | 1 | 1 | DANIEL | DANIEL | 1 |
3 | 1 | 0.665343 | DANIELSON | DANIEL | 0.6653433 |
2 | 2 | 1 | DAN | DAN | 1 |
4 | 4 | 1 | X-RAY TECHNOLOGIST | X-RAY TECHNOLOGIST | 1 |
5 | 4 | 0.9 | XRAY TECHNOLOGIST | X-RAY TECHNOLOGIST | 0.9 |
7 | 4 | 0.77257 | X-RAY TECH | X-RAY TECHNOLOGIST | 0.7725695 |
6 | 4 | 0.561458 | XRAY TECH | X-RAY TECHNOLOGIST | 0.5614583 |
September 25, 2006 at 11:49 am
From http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/FzDTSSQL05.asp:
Fuzzy Lookup and Fuzzy Grouping use a custom, domain-independent distance function that takes into account the edit distance (for example, "hits" is distance 2 from "bit"), the number of tokens, token order, and relative frequencies.
In your case, the edit distance from DAN to DANIEL is 3, and there are no other tokens to contribute to the score.
"X-RAY TECH" and "X-RAY TECHNOLOGIST", by contrast, have a common token.
However, the edit distance from DANIELSON to DANIEL is also 3. I can only guess that their "custom, domain-independent distance function" puts a different weight on deletions than on insertions.
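To check the edit distances quoted above, here is a minimal Python sketch of the classic Levenshtein distance (unit cost for each insert, delete, or substitute). This is an assumption for illustration only: it is not the Fuzzy Grouping distance function, which per the MSDN text also considers tokens, token order, and relative frequencies. The function name edit_distance is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance; NOT the SSIS Fuzzy Grouping function."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    pairs = [("DAN", "DANIEL"), ("DANIELSON", "DANIEL"), ("hits", "bit")]
    for left, right in pairs:
        print(left, right, edit_distance(left, right))
```

Both name pairs come out at the same raw distance, so plain edit distance alone does not explain why DANIELSON is grouped with DANIEL while DAN is not.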
September 25, 2006 at 3:30 pm
Thanks for the info.