June 28, 2013 at 9:03 am
While I wouldn't trust the results without checking them by eye (or expect all cases to be found) DIFFERENCE can be used as a quick and dirty way of identifying a short list of possible duplicate records.
SELECT P1.*, P2.*
FROM People AS P1
INNER JOIN People AS P2
ON P1.Date_of_Birth = P2.Date_of_Birth
AND P1.Last_Name = P2.Last_Name
WHERE P1.Person_ID <> P2.Person_ID
AND DIFFERENCE(P1.First_Name,P2.First_Name) = 4
Fuzzy grouping through SSIS is a lot more sophisticated but the above query (or variants swapping first and last name) will give a quick idea of what you're facing very quickly.
June 28, 2013 at 9:05 am
BarneyL (6/28/2013)
While I wouldn't trust the results without checking them by eye (or expect all cases to be found) DIFFERENCE can be used as a quick and dirty way of identifying a short list of possible duplicate records.
I actually had to script something with a similar objective quite recently, and it didn't even occur to me to use this approach. I think I'll revisit that, thanks!
ron
-----
a haiku...
NULL is not zero
NULL is not an empty string
NULL is the unknown
June 28, 2013 at 10:03 am
Thanks Ron, awesome question.
----------------------------------------------------------------------------
Sacramento SQL Server users group - http://sac.sqlpass.org
Follow me on Twitter - @SQLDCH
----------------------------------------------------------------------------
June 28, 2013 at 12:47 pm
nice question..
Thanks..
June 28, 2013 at 12:57 pm
Thanks - really cool question.
- webrunner
-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
June 28, 2013 at 1:44 pm
ronmoses (6/28/2013)
It doesn't strike me as a particularly useful function, especially given these examples:
SELECT SOUNDEX('cent') --C530
SELECT SOUNDEX('scent') --S253
SELECT DIFFERENCE('cent','scent') --2, a very weak match
SELECT SOUNDEX('through') --T620
SELECT SOUNDEX('threw') --T600
SELECT DIFFERENCE('through','threw') --3, a fair match
--an attempt at regional diplomacy?
SELECT SOUNDEX('route') --R300
SELECT SOUNDEX('root') --R300
SELECT SOUNDEX('rout') --R300
SELECT DIFFERENCE('root','rout') --4, a perfect match
SELECT DIFFERENCE('root','route') --4, a perfect match
SELECT DIFFERENCE('rout','route') --4, a perfect match
--let's see if the US and Canada can put this to rest once and for all:
SELECT SOUNDEX('about') --A130
SELECT SOUNDEX('aboot') --A130
SELECT SOUNDEX('abowt') --A130
SELECT SOUNDEX('abeet') --A130
--all of these will DIFFERENCE with a result of 4, of course
So that seems pretty useless.
ron
Well, I guess it depends on what variety of English you speak. I imagine there really is somewhere where people pronounce English words the way soundex and difference seem to indicate, but it isn't anywhere I've ever been. If it has to be used outside of that place (wherever it is) your "pretty useless" seems rather accurate. But you made a good question out of it despite that, although it's maybe a bit too culture-specific for anyone who cares about the point. And I hadn't come across the four "about"s before I saw your comment.
Tom
June 28, 2013 at 5:27 pm
thanks for the question.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
June 28, 2013 at 6:01 pm
Fun question. I have never found a good use for these functions either. Your post with all the other examples clearly illustrates how useless it really is. I would like to see some cases where these functions can be used and actually relied on for accurate results. (Accurate to humans that is). Maybe Difference works better with some other languages?
June 29, 2013 at 7:38 am
BarneyL (6/28/2013)
While I wouldn't trust the results without checking them by eye (or expect all cases to be found) DIFFERENCE can be used as a quick and dirty way of identifying a short list of possible duplicate records.
SELECT P1.*, P2.*
FROM People AS P1
INNER JOIN People AS P2
ON P1.Date_of_Birth = P2.Date_of_Birth
AND P1.Last_Name = P2.Last_Name
WHERE P1.Person_ID <> P2.Person_ID
AND DIFFERENCE(P1.First_Name,P2.First_Name) = 4
Fuzzy grouping through SSIS is a lot more sophisticated but the above query (or variants swapping first and last name) will give a quick idea of what you're facing very quickly.
Although I did not know about the DIFFERENCE function, I have been using SOUNDEX for many years, for the same purpose as above. We always review possible matches, but it has been very useful.
June 30, 2013 at 4:36 am
ronmoses (6/28/2013)
It doesn't strike me as a particularly useful function, especially given these examples:
SOUNDEX and DIFFERENCE can be useful, but you have to know their specifications. Here is a link to the SOUNDEX algorithm:
http://en.wikipedia.org/wiki/Soundex
As you see, vowels (except when in the first position) are ignored. So the title of the post is misleading, the perfect match between SQL and sequel is caused by them being the same after removing all vowels, not because of anything pronounciation related.
SELECT DIFFERENCE('SQL','soooooooooquiiiiiaaaaaaaoooiiuueeoioueel')
June 30, 2013 at 10:33 am
Hugo Kornelis (6/30/2013)
ronmoses (6/28/2013)
It doesn't strike me as a particularly useful function, especially given these examples:SOUNDEX and DIFFERENCE can be useful, but you have to know their specifications. Here is a link to the SOUNDEX algorithm:
http://en.wikipedia.org/wiki/Soundex
As you see, vowels (except when in the first position) are ignored. So the title of the post is misleading, the perfect match between SQL and sequel is caused by them being the same after removing all vowels, not because of anything pronounciation related.
SELECT DIFFERENCE('SQL','soooooooooquiiiiiaaaaaaaoooiiuueeoioueel')
It doesn't just ignore vowels (except the first letter of the word), the treatment of consonants isn't very helpful either.
SELECT DIFFERENCE('sql','sjzcgjkqsxzlll')
Anyway, this question is clearly about soundex, a function that's been around a lot longer than SQL has, so if this is a fair QOTD would questions about mathematical functions that are used in SQL be fair game? I suspect they would be allowed by the editor if someone produced something as amusing as this question; but it's hard to think of amusing questions about SQRT or POWER. It's fairly easy to produce something that looks amusing but actually isn't (because the only sensible way to answer it is run the code, it's far too difficult to do by hand: for example if you run
select round(sqrt(6),2) as A, round(power(88.27,0.2),2) as B, round(power(88.3,0.2),2) as C,
round(power(89.17,0.2),2) as D, round(power(87.40,0.2),2) as E, round(sqrt(6),3) as Fwhich columns in the resulting row have the highest and lowest numbers in them? C has the highest - a side-effect of the target type for the implicit conversion, I believe, although BOL seems to disagree, and F the lowest (that's because of the rounding). Of course an even worse (because it's far too hard and you can't just run the code) question could be devised for soundex, eg how many words in standard English have soundex code 520 (possible answers 0, 3 /2/1, 4/5/6, 7/8/9, 10/11/12, 13, 14, and 15, and 16 or more; it's certainly at least 10, but working out whether it's more than 15 would probably require a search through the h section of the OED; including proper nouns like Hanse (from which we get the adjective "hanseatic", which of course has a different soundex code) would lead to objections from people who don't realise that soundex was originally designed specifically for proper nouns so it would be bizarre to exclude them, while excluding them would annoy those who are familiar with soundex and its uses; and the cultural bias in favour of peope with native or bilingual fluency in English in a question like that would be pretty extreme.
Tom
June 30, 2013 at 11:31 am
Hugo Kornelis (6/30/2013)
So the title of the post is misleading
Not intentionally so, of course 🙂
Thank you for the added information, those results make perfect sense now.
ron
-----
a haiku...
NULL is not zero
NULL is not an empty string
NULL is the unknown
July 1, 2013 at 12:25 am
Nice one......
July 1, 2013 at 12:27 am
Good one...+1
July 1, 2013 at 5:54 am
Hugo Kornelis (6/30/2013)
ronmoses (6/28/2013)
It doesn't strike me as a particularly useful function, especially given these examples:SOUNDEX and DIFFERENCE can be useful, but you have to know their specifications. Here is a link to the SOUNDEX algorithm:
Armed with this knowledge, I just updated a script I'd been using for several months and improved its effectiveness immeasurably. So I'm very glad I ignorantly threw this question out to the world. (Especially to Hugo.) 😀
thanks!
ron
-----
a haiku...
NULL is not zero
NULL is not an empty string
NULL is the unknown
Viewing 15 posts - 16 through 30 (of 43 total)
You must be logged in to reply to this topic. Login to reply