Textfield comparator.

Question


ben.brugman

SSChampion

Points: 13369
October 26, 2020 at 2:35 pm

#3801350

A client would like to be able to recognize similar free texts in a text field: given a value in a column, find similar texts in the same column. The result could be something like a percentage similarity.
Could anybody point me towards a SQL Server-oriented solution, or suggest search terms to Google for this?
The free texts are recipe-like, and the language in the fields will be Dutch.
Ben


DesNorton SSC-Insane Points: 24386 · Answer 1

I believe that you are looking for fuzzy matching. I have never had to do it before, but this looks like it might get you started:

https://www.google.com/search?rlz=1C1GCEA_enZA912ZA912&sxsrf=ALeKk00uYiKdhg1PIAY7XR2lKfHio_Q_cg%3A1603725887965&ei=P-qWX-O5OsuM8gKosZMo&q=sql+fuzzy+match&oq=sql+fuzzy&gs_lcp=CgZwc3ktYWIQAxgBMgUIABDJAzICCAAyAggAMgIIADICCAAyAggAMgIIADIHCAAQFBCHAjICCAAyAggAOgcIABBHELADOgQIIxAnOggIABDJAxCRAjoFCAAQkQI6BAgAEEM6BwgAEMkDEENQnf8EWPCKBWCwsgVoAXAAeACAAccDiAHhDJIBBzItMi4yLjGYAQCgAQGqAQdnd3Mtd2l6yAEIwAEB&sclient=psy-ab


ben.brugman SSChampion Points: 13369 · Answer 2

Thank you DesNorton,

It goes beyond fuzzy matching. It is not only 'single' fields like name, address, etc. that have to be compared, but complete 'free texts'; it is a kind of recipe that has to be matched. So yes, it's a match, and yes, it is very fuzzy, but the fuzzy match algorithms are geared to finding quite simple 'mistakes' in spelling etc. This goes beyond that.

I know there are algorithms that find 'similar' passages in student texts to detect copying, but I have no idea whether these techniques can be, or are, used within relational databases. For my application the search stays within the same database (probably the same column).

Thanks for your input,

Ben

Jeffrey Williams SSC Guru Points: 90303 · Answer 3

Are you thinking of Levenshtein Distance? Here is a definition: https://en.wikipedia.org/wiki/Levenshtein_distance#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,considered%20this%20distance%20in%201965.

This can be done in pure SQL - but it would probably be easier to implement in Python or as a CLR function.
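
To illustrate the Python route, here is a minimal sketch of the Levenshtein distance together with a percentage-style similarity score of the kind asked for in the question. The function names `levenshtein` and `similarity_percent` are made up for this example, and dividing by the length of the longer string is just one common way to turn the distance into a percentage, not the only one.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical helper: distance scaled to 0..100 by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

The cost grows with the product of the two string lengths, so for long free texts it is probably worth comparing only a shortlist of candidate rows rather than every row against every other row.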

Jeffrey Williams


ratbak Ten Centuries Points: 1358 · Answer 4

https://github.com/GuerrillaAnalytics/similarity is an updated version of the SimMetrics CLR assembly I've used for years.

ben.brugman SSChampion Points: 13369 · Answer 5

Jeffrey and ratbak, thanks for your pointers.

Sorry for this late response. The question came up because of a wish, not a hard requirement, and then something with a high priority came up, so I had to work on that. I should have thanked you for your time and attention, but this thread didn't have my attention at that moment. I should have been more polite and considerate toward you (sorry).

Both answers could be the start of a solution: maybe first substituting some words that are equivalent or similar, then applying the algorithm, and then finding the texts with the 'shortest' Levenshtein distance, or finding groups of texts where the Levenshtein distance between them is shortest.

For your information, the texts are recipes for preparing medication, so equivalence (big question mark) is based more on understanding the recipe than on actual equivalence in writing.

Both techniques are new to me, so I'll first have a further look into your links and see if I can come up with an implementation without spending too much time on this. (As said, it is not a requirement, but it would be nice to have.) Then I'll see if this produces a workable situation.
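
A rough sketch of that pipeline in Python, purely as an illustration: the `EQUIVALENTS` table, the function names and the sample texts are all invented for this example, and the edit distance is computed over whole words rather than characters, which is one possible choice for free texts and not something taken from the linked libraries.

```python
import re
from itertools import combinations

# Hypothetical equivalence table; real entries would come from domain experts.
EQUIVALENTS = {"milligram": "mg", "milliliter": "ml"}


def normalize(text):
    """Lower-case, split into words, and replace known equivalents."""
    words = re.findall(r"\w+", text.lower())
    return [EQUIVALENTS.get(word, word) for word in words]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_pairs(texts):
    """Return (distance, index_a, index_b) for every pair, closest first."""
    tokens = [normalize(text) for text in texts]
    pairs = [(edit_distance(tokens[i], tokens[j]), i, j)
             for i, j in combinations(range(len(texts)), 2)]
    return sorted(pairs)


# Invented sample texts, only to show the call.
texts = [
    "Los 5 milligram op in 10 ml water",
    "Los 5 mg op in 10 milliliter water",
    "Meng de zalf met vaseline",
]
ranked = closest_pairs(texts)
```

Comparing every pair is quadratic in the number of rows, so on a large table this would need some pre-filtering. Note also that this only measures similarity in wording; it does not capture whether two recipes mean the same thing.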

Thanks again,

Ben