April 4, 2013 at 7:26 am
Hi All,
Is there any built in function available for checking the similarity of the phrases
eg: US Eastern District Court is similar to Eastern District Court of US.
So if I am comparing both i should get result 1.
I need this for checking the duplicate entries in a table having so much data, so that we can delete the duplicates.
Thanks In Advance...
Thanks
April 4, 2013 at 7:36 am
two things come to mind -
full text indexing
and if you have enterprise edition fuzzy grouping in SSIS advanced data flow transformations.
---------------------------------------------------------------------
April 4, 2013 at 8:03 am
Your question is almost verbatim the MSDN definition of SOUNDEX. http://msdn.microsoft.com/en-us/library/ms187384.aspx
Now with the two phrases you listed the SOUNDEX would not be even close because the order of the words is different. We can leverage the age old DelimitedSplit8K here quite nicely. The idea here is order all words in each phrase alphabetically so we have a consistent order of the words for SOUNDEX.
if OBJECT_ID('tempdb..#List') is not null
drop table #List
create table #List
(
ListID int identity,
Phrase varchar(50)
)
insert #List
select 'US Eastern District Court'
union all
select 'Eastern District Court of US'
;with List as
(
--First we need to split the values on the words
select * from #List
cross apply dbo.DelimitedSplit8K(Phrase, ' ')
)
, stuffedList as
(
--Now we sort the words alphabetically
select ListID, STUFF((select Item + ' '
from List l2
where l2.ListID = l1.ListID
order by Item
for XML PATH('')), 1, 0, '') as FullList
from List l1
group by ListID
)
select *, SOUNDEX(FullList) from stuffedList
Please see the link in my signature about splitting strings. In there you will find the code and the logic for how the DelimitedSplit8K function works.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
April 5, 2013 at 3:19 am
Thanks Sean for your post..
For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..
Court District Eastern US C630
Court District of Southern US C630
So how can I avoid this?
Thanks
April 5, 2013 at 7:16 am
deepzzzz (4/5/2013)
Thanks Sean for your post..For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..
Court District Eastern US C630
Court District of Southern US C630
So how can I avoid this?
SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.
If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 5, 2013 at 7:39 am
Jeff Moden (4/5/2013)
deepzzzz (4/5/2013)
Thanks Sean for your post..For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..
Court District Eastern US C630
Court District of Southern US C630
So how can I avoid this?
SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.
If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.
I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
April 5, 2013 at 9:36 am
Check out the following post that I started last July: http://www.sqlservercentral.com/Forums/Topic1337370-391-1.aspx
Chris@Home kindly provided a really good token matching algortihm and it's ultra fast as it's implemented as iTVF.
SOUNDX is rubbish. Alternative to the Chri's algorithm in the above post . Also Metaphone, Double Metaphone (both free) and Metaphone3 which I purchased for $40 recently and they're all really good especially with US English/British words.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
April 8, 2013 at 6:20 pm
Sean Lange (4/5/2013)
I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.
I'm probably not giving it enough credit for whatever its intended use may have been but I've really been disappointed by it. From BOL...
[font="Arial Black"]Vowels are ignored in the comparison. Nonalphabetic characters are used to end the comparison.[/font]
In other words, a dash or a space or a digit or just about any punctuation will stop the comparison. That means either it does just the first word or first part of a hyphenated word.
These are probably bad examples of what I don't like about it, but it's the best I can do on short notice.
SELECT DIFFERENCE('Glif', 'Geoffrey');
SELECT DIFFERENCE('Glare', 'Geoffrey');
Because none of the vowels (including the letter "y" in this case), are considered, these two comparisons are both given a "3 out of 4" as a match. That's usually nowhere near close enough for my purposes.
Then take the following example which is just about spot on.
SELECT DIFFERENCE('Geoffrey', 'Mr Geoffrey');
That's only given a "2" because only the "Mr" of the second operand is evaluated because any non-alphabetic character, including a space, will stop the comparison.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 8, 2013 at 11:58 pm
Whatever clever function you may find/create the phrase
'US Eastern District Court'
will still be matching
'US Western District Court'
better than
'United States Eastern District Court'
It's a lot of general knowledge of yours which allows you to match a row to this row but not to that one.
Unless you pass that knowledge to you database (yep, creating tables, storing samples and patterns) you will always program errors.
_____________
Code for TallyGenerator
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply