November 26, 2018 at 6:30 am
Sergiy - Sunday, November 25, 2018 6:16 PMMust be a fascinating technical thing which is absolutely useless for actual users.
May be even worse - damaging for real life applications.
Soundex was invented as a manual substitution code to help office clerks find names on a Rolodex card system. It predates electronic computers by a good few decades. As you say, it is extremely vague and as I said in the original articles it produces around 8000 combinations to represent the millions of English words.
Most of the algorithms have something to do with finding the names of people so their scope is limited. Lawrence Philips metaphone algorithms go beyond simple name matching. Metaphone3 is a commercial algorithm which significantly improves match rates while diminishing false positives.
Not sure who you consider to be "users" so the only comment I can offer as to its "absolutely useless" nature is that it improves the performance of a search facility in connecting people to products for which they wish to exchange money. People don't buy what they can't find,
No idea what you mean by "damaging for real life applications".
November 26, 2018 at 6:43 am
simon.hewitt - Monday, November 26, 2018 4:24 AMSergiy:
Most of David's code is general C# so can be used at the front end. The code to use it as back-end is only a few additional lines.For the problem you describe, no algorithm will be able to do what you want I suspect.
My solution (again for front end) is simple: I create an alias list containing the 'real' or 'known' text and a list of known aliases.
Yep, that's a pretty obvious solution, and, of course, I used this approach back then.
There was also a protocol for collecting "unrecognized" variations within data uploads and inform users about new names to be allocated.
As for front/back end - this solution (I mean the approach, not the implementation) is to unreliable to be used anywhere
Just not good enough.
It must contain way more intelligence to be trusted.
It must have a way to figure out that "Lond" must be rather mistyped "Pond" (letters .P and L are pretty close) than "Lord" or "London"or "Long" which are way more closer by the sound of it.
_____________
Code for TallyGenerator
November 26, 2018 at 7:05 am
David.Poole - Monday, November 26, 2018 6:30 AMNot sure who you consider to be "users" so the only comment I can offer as to its "absolutely useless" nature is that it improves the performance of a search facility in connecting people to products for which they wish to exchange money. People don't buy what they can't find,No idea what you mean by "damaging for real life applications".
I don't think it's actually used in popular search engines.
Type the sequence of symbols "cthdth" into Google Search inbox - it will return you articles about "Ñервер", which is Russian for "server", because that's how English/Russian keyboards are mapped. And there would not be any suggestion based on SOUNDEX mapping.
Using it may be damaging, if placed in some automated interface.
And because you posted it on SQL Server site it's natural to assume the method to be used in a back end, within some automated process.
I saw the consequences of mistakes made by this kind of methods, so I can tell you - it may be very much damaging.
_____________
Code for TallyGenerator
November 26, 2018 at 8:46 am
Sergiy - Monday, November 26, 2018 7:05 AMType the sequence of symbols "cthdth" into Google Search inbox - it will return you articles about "Ñервер", which is Russian for "server", because that's how English/Russian keyboards are mapped. And there would not be any suggestion based on SOUNDEX mapping.Using it may be damaging, if placed in some automated interface.
And because you posted it on SQL Server site it's natural to assume the method to be used in a back end, within some automated process.
I saw the consequences of mistakes made by this kind of methods, so I can tell you - it may be very much damaging.
Can you give an example of the automated interface you have in mind and what the scenario was where the consequences were damaging? Context is all.
ElasticSearch and Solr both give options to use a range of phonetic algorithms.
November 26, 2018 at 12:47 pm
I gave an example in the post above.
Recognising buyer names in suppliers' data feeds.
If done wrong - invoices generated for wrong customers, processed according to inappropriate rules in a wrong context, payments done by wrong organisations to wrong organisations.
Considering - big organisations (like universities) pay only once a month, - the financial penalties (fees and interest) would be significant, in our case they could easily exceed the service fee they paid to us.
And they would probably start looking for another service provider.
Is it damaging enough?
_____________
Code for TallyGenerator
November 26, 2018 at 2:15 pm
Sergiy - Monday, November 26, 2018 12:47 PMI gave an example in the post above.Recognising buyer names in suppliers' data feeds.If done wrong - invoices generated for wrong customers, processed according to inappropriate rules in a wrong context, payments done by wrong organisations to wrong organisations. Considering - big organisations (like universities) pay only once a month, - the financial penalties (fees and interest) would be significant, in our case they could easily exceed the service fee they paid to us.And they would probably start looking for another service provider.Is it damaging enough?
Yes but I can't imagine anyone being daft enough to use it like that. It's for searching for things like holidays in Fuerteventura or pre raphaelite wallpaper.
December 1, 2018 at 8:23 am
David.Poole - Monday, November 26, 2018 2:15 PMYes but I can't imagine anyone being daft enough to use it like that. It's for searching for things like holidays in Fuerteventura or pre raphaelite wallpaper.
OK, so the use cases are about searching for some name misheard on the radio, and it must be in relatively small database, as running SOUNDEX against any other word stored in a large database (I cannot see how any kind of indexing could apply here) would take forever.
To me - it looks like extremely rare use case.
_____________
Code for TallyGenerator
December 1, 2018 at 3:08 pm
Sergiy - Saturday, December 1, 2018 8:23 AMOK, so the use cases are about searching for some name misheard on the radio, and it must be in relatively small database, as running SOUNDEX against any other word stored in a large database (I cannot see how any kind of indexing could apply here) would take forever.
To me - it looks like extremely rare use case.
The algorithms are deterministic so values can be precomputed. its as fast as any other index on a text based term such as postcode
Levenshtein, on the other hand is a bit of a sod.
December 1, 2018 at 7:13 pm
David.Poole - Saturday, December 1, 2018 3:08 PMSergiy - Saturday, December 1, 2018 8:23 AMOK, so the use cases are about searching for some name misheard on the radio, and it must be in relatively small database, as running SOUNDEX against any other word stored in a large database (I cannot see how any kind of indexing could apply here) would take forever.
To me - it looks like extremely rare use case.The algorithms are deterministic so values can be precomputed. its as fast as any other index on a text based term such as postcode
Levenshtein, on the other hand is a bit of a sod.
You know, you can easily decrypt MD5 hashes - by generating them for all possible password combinations.
There are even hard drives for sale with tables matching words with hashes.
Work is still in progress - the servers are still to go through all word variations.
_____________
Code for TallyGenerator
December 2, 2018 at 4:44 am
Sergiy - Saturday, December 1, 2018 7:13 PMYou know, you can easily decrypt MD5 hashes - by generating them for all possible password combinations.
There are even hard drives for sale with tables matching words with hashes.
Work is still in progress - the servers are still to go through all word variations.
Yes, there's a YouTube video somewhere about someone using GPUs to brute force passwords. The conclusion from that is that anything that relies purely on an encrypted password is dead in the water. Hackers just need to win once, security guys need to win every time. This is made even more difficult when the vulnerability might be in a connected system rather than the one you have control over.
Then there are those who persist in building their own encryption and then use it to obfuscate production data. I'm glad i'm not the compliance officer fighting that battle
December 2, 2018 at 6:02 am
David.Poole - Sunday, December 2, 2018 4:44 AMSergiy - Saturday, December 1, 2018 7:13 PMYou know, you can easily decrypt MD5 hashes - by generating them for all possible password combinations.
There are even hard drives for sale with tables matching words with hashes.
Work is still in progress - the servers are still to go through all word variations.Yes, there's a YouTube video somewhere about someone using GPUs to brute force passwords. The conclusion from that is that anything that relies purely on an encrypted password is dead in the water. Hackers just need to win once, security guys need to win every time. This is made even more difficult when the vulnerability might be in a connected system rather than the one you have control over.
Then there are those who persist in building their own encryption and then use it to obfuscate production data. I'm glad i'm not the compliance officer fighting that battle
My point is - those tables of hashes being sold and delivered on physical hard drives, last time I checked it was 3 or 4 TB of data. Just for the table of 2 columns with certain and simple correspondence between the values.
For SOUNDEX you'll have multiple matches, with different levels of similarity. so the database will be more complicated and bigger in size.
I don't think indexing such a massive database is a winning game.
_____________
Code for TallyGenerator
December 2, 2018 at 2:36 pm
Wait until i write up the experiment for which I went to all this trouble to build these functions. All will become clear
Viewing 12 posts - 16 through 26 (of 26 total)
You must be logged in to reply to this topic. Login to reply