Removing Repeating Data From a Fuzzy Match Result

Question

Removing Repeating Data From a Fuzzy Match Result

RonMexico

Hall of Fame

Points: 3474
More actions
August 30, 2017 at 8:53 am

#402446

I'm in the process of cleaning up a lookup table we have that is used to match on a name and provide a "clean" name based on that pairing. As you can imagine, since this is a process manually handled by someone, many of the "clean" names are becoming similar and should be combined together. I'm using fuzzy lookup in a SSIS package to get suggestions on names that appear to be similar. That all works great except the output provides a suggested match for a name and then, in return, lists the original name as a match for the suggested name. Hopefully the picture helps make that statement make a little more sense.

Since MIKES USED BIKES was a suggested name for MIKES BIKES I do not want MIKES BIKES to show up as a suggested name for MIKES USED BIKES. The desired result would be cleaning out the original table to look something like

Any suggestions on the best way to do this? I've been messing around with grouping the results and using some sort of rank so I can delete the unwanted rows but I'm not having any luck. Here is the how to create the sample data I'm using
create table Bikes ( Name varchar(50) ,SuggestedName varchar(50) )
insert into Bikes values ('MIKES BIKES', 'MIKES USED BIKES') ,('MIKES USED BIKES', 'MIKES BIKES') ,('WILD BIKE SALES', 'WILD BIKE SALES & SERVICE') ,('WILD BIKE SALES & SERVICE', 'WILD BIKE SALES') ,('BIKE PARTS & SERVICES', 'BIKES PART & SERVICE') ,('BIKES PART & SERVICE', 'BIKE PARTS & SERVICES') ,('OUTDOOR CYCLE', 'OUTDOOR CYCLERY ') ,('OUTDOOR CYCLERY ', 'OUTDOOR CYCLE') ,('JIMS CUSTOM BIKES', 'JIMS CUSTOM BIKES & SHOP') ,('JIMS CUSTOM BIKES & SHOP', 'JIMS CUSTOM BIKES')

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

sgmunson SSC Guru Points: 110640 More actions · Answer 1

Without any of your code to look at, I have no idea what you'll need to do. First order of business after you post your actual query, is to get the suggested values into their own table reference (possibly using some form of APPLY), and then eliminate any records where the suggested value is present in any of the original values. However, even that may depend on exactly how you generated the list you have.

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

RonMexico Hall of Fame Points: 3474 More actions · Answer 2

What are you looking for when you say after I post my actual query? I'm using the fuzzy lookup transformation within SSIS to do the matching and generate the output table I provided above.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

Have you tried messing with LAG and LEAD?

SELECT *,

KeepMe = CASE WHEN LAG(SuggestedName,1) OVER(ORDER BY [Name], SuggestedName) = [Name] THEN 1 ELSE 0 END,

LoseMe = CASE WHEN LEAD(SuggestedName,1) OVER(ORDER BY [Name], SuggestedName) = [Name] THEN 1 ELSE 0 END

FROM Bikes

ORDER BY [Name], SuggestedName

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

RonMexico Hall of Fame Points: 3474 More actions · Answer 4

I can't say I've ever had to use them. However, it seems to do what I need it to upon a quick glance. I will definitely look into what each one does as this will be very helpful to know. Thank you!

Avi1 SSCrazy Points: 2323 More actions · Answer 5

Sample data is actually very clean. Why suggested name change, for clean data it is suggested data which is not clean and for dirty data a clean value is assigned. Is this the case you always have a clean value available in your table in name column. You might have to just improve your fuzzy logic, using DIFFERENCE and some other method. LAG and LEAD is also good option, but if you data that starts with similar value like MIKE and MIKES then chances of you getting correct data is less.
If you provide some more inputs then it could be helpful.

sgmunson SSC Guru Points: 110640 More actions · Answer 6

RonMexico - Wednesday, August 30, 2017 9:23 AM
What are you looking for when you say after I post my actual query? I'm using the fuzzy lookup transformation within SSIS to do the matching and generate the output table I provided above.

You're asking to filter out rows where you don't like the fuzzy match results. As you're using SSIS to do that, and you clearly don't like the results, just filtering out records isn't going to do much more than leave out source records with no trace of the original data in the result, so I have to start asking what your actual objective is, and why you would want to continue using an algorithm that doesn't produce a good result to begin with, and also leaves out certain source records.

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 7

sgmunson - Thursday, August 31, 2017 7:22 AM
RonMexico - Wednesday, August 30, 2017 9:23 AM
What are you looking for when you say after I post my actual query? I'm using the fuzzy lookup transformation within SSIS to do the matching and generate the output table I provided above.
You're asking to filter out rows where you don't like the fuzzy match results. As you're using SSIS to do that, and you clearly don't like the results, just filtering out records isn't going to do much more than leave out source records with no trace of the original data in the result, so I have to start asking what your actual objective is, and why you would want to continue using an algorithm that doesn't produce a good result to begin with, and also leaves out certain source records.

It's not quite this, Steve. The results are similar to joining a table to itself on a fuzzy match. You get both sides in your result set. You can prevent this by joining on PK of one side < PK of the other, but I guess fuzzy transform isn't this sophisticated.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

sgmunson SSC Guru Points: 110640 More actions · Answer 8

ChrisM@Work - Thursday, August 31, 2017 7:28 AM
sgmunson - Thursday, August 31, 2017 7:22 AM
RonMexico - Wednesday, August 30, 2017 9:23 AM
What are you looking for when you say after I post my actual query? I'm using the fuzzy lookup transformation within SSIS to do the matching and generate the output table I provided above.
You're asking to filter out rows where you don't like the fuzzy match results. As you're using SSIS to do that, and you clearly don't like the results, just filtering out records isn't going to do much more than leave out source records with no trace of the original data in the result, so I have to start asking what your actual objective is, and why you would want to continue using an algorithm that doesn't produce a good result to begin with, and also leaves out certain source records.
It's not quite this, Steve. The results are similar to joining a table to itself on a fuzzy match. You get both sides in your result set. You can prevent this by joining on PK of one side < PK of the other, but I guess fuzzy transform isn't this sophisticated.

Yep. It's not. I'm also pretty sure I wouldn't want it to be that way though, unless you could control whether or not that particular functionality was active or not. Never destroy your inputs is one of the most basic principles of data management, and I just don't do that.

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

RonMexico Hall of Fame Points: 3474 More actions · Answer 9

Steve - I get what you are saying now. Here I thought I was doing the analysts a favor by eliminating what seemed to be duplicate data. I was essentially trying to give them an option to choose A or B but that doesn't apply if A and B are two separate companies and should stay that way. Thanks for pointing that out!

ndharmatej Grasshopper Points: 22 More actions · Answer 10

I think with the below query the desired result can be achieved.

select b1.Name,b1.SuggestedName
from (
select *, row_number() over(order by name)as 'slno'
From Bikes )as b1
join (
select *, row_number() over(order by name)as 'slno'
From Bikes )as b2
on b1.Name = b2.SuggestedName
and b1.slno < b2.slno

Thanks,
Dharma