Conditional Split - Wierd Results...

  • I have some data based on first and last names from two different systems / tables.

    We're trying to map these together. I'm using a conditional split to facilitate the matching.

    I set up these rules and I output an additional column called match with a text string....

    Rules are as follows:

    Exact Match ( _similarity == 1 ) && ( _confidence == 1 )

    Better Match (_Similarity >= 0.9) && (_Confidence >= 0.9)

    Good Match (_Similarity >= 0.8) && (_Confidence >= 8.0)

    Possible Match (_Similarity >= 0.7) && (_Confidence >= 0.7)

    Least Possible (_Similarity >= 0.6) && (_Confidence >= 0.5)

    Anything not in these ranges, were written out to a No Match Likely.

    I ran the transformation into a new table. My exact matches work perfectly. All of my matches with _similarity of 1 and _confidence of 1 fell out with the Exact Match.

    What I have run into is more than 50% of my matches fell into the No Match Likely. I reviewed those matches

    and have consistently found many records where my _similarity and confidence are both greater than my least possible match. I'm perplexed why this is.

    So, my data is looking like this:

    Name 1 Name 2 Similarity Confidence Match

    Elmer Fudde Elmer Fudd .8752 .7850 Not Likely

    Furthermore, I see stuff like this:

    Punch Villos Pancho villa .5865 .5 Possible Match.

    I also have records in all of these groupings that work.

    The approximate input record count is relatively small ( 170,000 records ) The input SQL is a query and the data is pre-sorted on _similarity and _confidence in desc order coming into the split. So the matching records should start at 1 & 1.

    Any thoughts what I am missing?

  • It's a complex one to try and diagnose, because I don't think you've provided quite enough info.

    My first comment would be that your conditions are not exclusive. Eg anything that is a good match is also a possible match and a least possible. Using 'between' logic would make things tighter.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • What data type is the field? It almost looks like a rounding error.

  • The data types are the ones generated by SQL. I'm checking the values of the _similarity and _confidence fields and adding a field based on a number range.

    I posted two examples of what should not be happening. The first example was not processed by the rules, and dropped into the default no match. The 2nd example should have not been processed by the rules but was.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply