Conditional Split - Wierd Results...

Question

Conditional Split - Wierd Results...

etm1109

Ten Centuries

Points: 1107
More actions
March 11, 2010 at 10:39 am

#138632

I have some data based on first and last names from two different systems / tables.
We're trying to map these together. I'm using a conditional split to facilitate the matching.
I set up these rules and I output an additional column called match with a text string....
Rules are as follows:
Exact Match ( _similarity == 1 ) && ( _confidence == 1 )
Better Match (_Similarity >= 0.9) && (_Confidence >= 0.9)
Good Match (_Similarity >= 0.8) && (_Confidence >= 8.0)
Possible Match (_Similarity >= 0.7) && (_Confidence >= 0.7)
Least Possible (_Similarity >= 0.6) && (_Confidence >= 0.5)
Anything not in these ranges, were written out to a No Match Likely.
I ran the transformation into a new table. My exact matches work perfectly. All of my matches with _similarity of 1 and _confidence of 1 fell out with the Exact Match.
What I have run into is more than 50% of my matches fell into the No Match Likely. I reviewed those matches
and have consistently found many records where my _similarity and confidence are both greater than my least possible match. I'm perplexed why this is.
So, my data is looking like this:
Name 1 Name 2 Similarity Confidence Match
Elmer Fudde Elmer Fudd .8752 .7850 Not Likely
Furthermore, I see stuff like this:
Punch Villos Pancho villa .5865 .5 Possible Match.
I also have records in all of these groupings that work.
The approximate input record count is relatively small ( 170,000 records ) The input SQL is a query and the data is pre-sorted on _similarity and _confidence in desc order coming into the split. So the matching records should start at 1 & 1.
Any thoughts what I am missing?

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 247222 More actions · Answer 1

It's a complex one to try and diagnose, because I don't think you've provided quite enough info.

My first comment would be that your conditions are not exclusive. Eg anything that is a good match is also a possible match and a least possible. Using 'between' logic would make things tighter.

JustMarie SSCertifiable Points: 7771 More actions · Answer 2

What data type is the field? It almost looks like a rounding error.

etm1109 Ten Centuries Points: 1107 More actions · Answer 3

The data types are the ones generated by SQL. I'm checking the values of the _similarity and _confidence fields and adding a field based on a number range.

I posted two examples of what should not be happening. The first example was not processed by the rules, and dropped into the default no match. The 2nd example should have not been processed by the rules but was.