Complex partial string matching

Question

Complex partial string matching

alistair_hayter

SSC Enthusiast

Points: 148
More actions
April 2, 2008 at 6:13 am

#188643

Hi,
I need to match records from two data sources. None of the records have exact matches. Most are like the follwoing example;
Record 1
"Garden' Bot'icals", "Large Tub", "500"
Record 2
"Gardeners Botanicals", "Tub", "500 litres"
I need to match any part of value 1 with any part of value 2.
T-SQL LIKE appears to be a bit to limited for this while Full Text Searching seems overkill and I'm not sure it is designed for this anyway.
I'm sure there's a simple way to do this and appreciate any advice on the best approach.
I'm using SQL Server 2005 and have limited knowledge of T-SQL and SPs.
Thanks,
Alistair.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

This is actually a significantly complex subject.

First, you have to define rules for what "matches any part". For example, "Tub" and "Target" both have the letter "t" in them, but that's obviously going too far. So, how about "Tube" and "Tug Boat"? Both have "tu" in them. Again, not a good match. "Tube" and "Tuba"? 3-letter parallel, but still probably not a good match. So, you have to be more precised about the rules, beyond "any part matches any other part". What is the cut-off?

SSIS in Enterpise Edition has some fuzzy matching code. You might explore that a bit. I haven't used it, I've just read that it exists, so I can't give you more of a pointer than that one.

If what you want is to use T-SQL, first establish some rules, then test the heck out of them.

Remember, SQL doesn't have the faintest amount of intelligence nor any judgement. It will match "Tube Socks" and "Tug Boat" if you tell it to. (Both start with "tu" and both have an "o" as the second letter of the second word.)

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

VIPS SSC Enthusiast Points: 104 More actions · Answer 2

Try SOUNDEX... split each word and find it both ways. I have used SONDEX but not in this exact manner. Let us know what worked.

GSquared SSC Guru Points: 260824 More actions · Answer 3

Soundexes for "Gardeners" and "Garden'" are the same (G635), but soundexes for "Botanicals" and "Bot'icals" aren't the same (B352 vs B300). It might help narrow down the similarities, but won't solve the issue here.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

VIPS SSC Enthusiast Points: 104 More actions · Answer 4

Yes, I agree its not the perfect solution. So depending the degree of perfection, it may or may not be used.

steveb. SSC-Forever Points: 46733 More actions · Answer 5

As the others have mentioned this could be a complex task,

How many rows of data are you trying to match?

Depending on the number of items it may help to setup a table with the correct spelling and match them both to this.

If both of the sets of data have a number in them, for example 500. you could match on this first to narrow down the results and then match on the string.

Depending on the data you will have to test and combine different methods of matching to achieve the desired result.

alistair_hayter SSC Enthusiast Points: 148 More actions · Answer 6

gsquared/VIPS, thank you for the replies. They've told me exactly what I needed to know which is that this is not a trivial task. I'll look into SSIS and SOUNDEX to see if/how these can help.

SteveB - I need to match approx. 6000 records. Each record has 5 fields that need to be matched. It sounds like this may be more suited to manual (human) rather than automation.

I'll let you know if I manage to automate any part of the process with SSIS, soundex etc.

Thanks again,

Alistair

GSquared SSC Guru Points: 260824 More actions · Answer 7

If you set up an "equivalances" table, with "Gardeners Botanicals" in a column called "Correct", and "Gardener' Bot'icals" in a column called "Variation", then you could have a person build a list of the variations, and use that to clean the data, or to leave the data as-is and use the new table as an intermediary for joins.

Will require human judgement somewhere along the way for this project.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Jacob Luebbers SSCarpal Tunnel Points: 4834 More actions · Answer 8

A CLR-based approach to this may be best - build a work-breaker function that takes your Source A and Source B strings as input. I'm making an assumption that the apostrophe char is used soly as an abbreviation indicator. The function would:

1. Break each column into an array of word tokens (split on whitespace).

2. Any token that contains an apostrophe is truncated at the first apostrophe instance.

3. Iterate through each token in Source A making a LIKE-style (starts with) comparison to every token in Source B.

4. Any matches found return true, else return false.

Not T-SQL at all I'm afraid, and potentially quite slow. Would also still need a human to double-check the output, but should at least narrow down the output.

EDIT: You could actually implement this as a UDF or two in TSQL, but it would be horrible performance-wise.

Regards,

Jacob