August 15, 2005 at 8:27 am
Hi,
I downloaded the update, and replaced the dll's. I guess that's all I need to do...
I have another issue (more like a confession), possibly out of NYSIIS and LEVENSHTEIN context, but might help others...
I was using soundex to filter out the results, and udf_levenshtein to sort them, which gave me quite a bit of performance increase, and sounded like a great idea... I was thinking that NYSIIS is an improved soundex, so filtering with soundex would do just fine...
Well, it does UNLESS you have punctuation characters such as ' (as in O'brian) ...
declare @myname varchar(255)
declare @mydex varchar(4)
declare @mydex2 varchar(10)
set @myname = 'o''brian'
set @mydex = soundex(@myname)
set @mydex2 = dbo.udf_nysiis(@myname)
select @mydex --( returns O000)
select @mydex2 -- (returns OBRAN)
however
declare @mydex varchar(4)
declare @mydex2 varchar(10)
set @myname = 'obrian'
set @mydex = soundex(@myname)
set @mydex2 = dbo.udf_nysiis(@myname)
select @mydex --( returns O165)
select @mydex2 --( returns OBRAN )
So, soundex does NOT eliminate the punctuation errors, but NYSIIS does.
Just fyi...
I guess I will be researching some more "optimization"
Thank you again...
August 15, 2005 at 8:59 am
That's it, just copy the new DLLs into the MSSQL\BINN directory (may require you to stop and restart MSSQLServer service).
That's a very good point. The handling of punctuation is implementation-specific. There's really no 'standards body' or organization that sets specific rules for implementation of the various phonetic algorithms. (This is really apparent with newer algorithms, like Double Metaphone!) It appears that the MS implementation of soundex stops encoding when it hits a punctuation mark. You'll get 'O000' if you try to encode SOUNDEX('O.BRIEN'), SOUNDEX('O''BRIEN') or SOUNDEX ('O BRIEN').
The NYSIIS implementation in the toolkit, on the other hand, ignores punctuation altogether. One way to deal with it on the SOUNDEX side is to eliminate all non-alphabetic characters from your string before attempting to encode it with SOUNDEX. This could be done in a UDF, but keep in mind that string manipulations and looping in T-SQL are not extremely efficient (that's one of the main reasons I implemented NYSIIS and Levenshtein Edit Distance as XP's in C++).
I would recommend that you create a separate table with the proper SOUNDEX encoding and the NYSIIS encoding, and relate it to your main table. That way you would only have to perform these string manipulations once; also, you'll be able to take advantage of proper indexes on the Encodings table in your queries. All of the encoding overhead would be handled at load time, and your queries would run a lot more efficiently. The down-side is that you'll be using up more space to store these encodings, but it shouldn't be significantly more. For 1 million rows, you're looking at about ~10 MB to store the Soundex encodings and six-byte NYSIIS encodings. Proper non-clustered indexes will take up additional space, of course.
You could cut down the amount of space used in the Encodings table by using the actual name as a foreign key, to avoid duplicate entries for "JOHNSON, JAMES" and other common names (example).
Alternatively, since NYSIIS encodes the first letter as itself (as does SOUNDEX), you could narrow down your searches by just comparing the first letter.
SELECT * FROM People
WHERE Name LIKE 'J%'
AND dbo.udf_NYSIIS('JOHNSON') = dbo.udf_NYSIIS(Name)
Thanks!
March 8, 2006 at 9:45 am
Hi all,
I hope this doesn't fall on deaf ears. There's a memory leak in both of these extended procedures. In proc.cpp, you allocate new params structures but don't delete them.
in proc.cpp in the xp_levenshtein dll you need to add the lines
delete P1;
delete P2;
delete P3;
before the line
return sqlerr;
and in proc.cpp in the xp_nysiis project likewise, you need to add the lines
delete P1;
delete P2;
before the same line;
return sqlerr;
Hope that helps someone, as the provided dlls helped bring down one of our production sql servers!
All the best,
Leon
March 8, 2006 at 5:53 pm
Hi Leon,
Thanks for pointing that out. I actually included that in the clean-up of the most recent version, but sent the SSC guys an older version. I've asked them to repost the updated version that handles the memory leak, a potential memory reclamation issue, and includes a DROP_XP.SQL script as well as the REDIST directory with support DLL's required by some installations (some installations might have difficulty locating the ODS and VC++ Runtime libraries if they're not in the \Binn directory).
Thanks again for the feedback, and sorry for the inconvenience.
July 13, 2006 at 10:05 pm
This is exactly what I've been looking for; thank you. However I am running SQLServer 2005 32-bit (Std Edition) on Win2003 Server 64-bit. Should I be compatable with the DLL's and toolkit?
Thanks again...
July 13, 2006 at 10:39 pm
Oh my... I haven't tested any of these on 64-bit Win2003 or SQL 2005. They were only tested on SQL 2000, 32-bit. XP's on SQL 2005 have been deprecated but are still allowed... for now. A better bet might be to convert the routines to .NET assemblies for SQL 2K5. The downloads contain the C++ source code for these routines. If you were using SQL 2000, I'd recommend using the routines here: http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp, with more optimizations and additional routines added.
August 14, 2006 at 2:59 pm
Um... I checked the example given under the NYSIIS and they don't all return JANSAN
Johnson J525 JANSAN
Johnsen J525 JANSAN
Johanson J525 JAHANSAN
Johannsen J525 JAHANSAN
September 15, 2006 at 2:54 pm
Well, this is only a tool to help you out filtering better results than the crappy soundex function, if you make than many mistakes writing down your name, then its normal it doesnt return anything.
If I look for "De la pointe" using soundex, writing: "De la ponte",
it will return a LOT more results than it should...
basically everything starting with DE[SPACE]... will be returned!
Wow.
Will test NYSIIS and let you know how better it is or not.
September 15, 2006 at 5:41 pm
The code was changed to reflect the NYSIIS "standard" a little better after the article was published (see this discussion board for details). The "H" in Johanson and Johannsen is not considered silent because it's surrounded by vowels; therefore it's not dropped. In the original version it was dropped.
September 15, 2006 at 5:51 pm
Yes, unfortunately the SQL Server SOUNDEX() function stops encoding a name when it encounters a non-alphabetic character. So "DE LA POINTE", "DE LA SOUL", etc. will all encode to "D000". One work-around with Soundex is to eliminate all spaces from a name before encoding with the REPLACE() function:
SELECT
SOUNDEX(REPLACE('DE LA POINTE',' ', ''))
Another option (that NARA recommends) is to eliminate common prefixes like 'DE', 'LA', 'VAN', 'CON', etc. from the name before Soundex-encoding it. See http://www.archives.gov/publications/general-info-leaflets/55.html for more info. Also note that the MS SOUNDEX() function does not follow the official NARA Soundex standard, and may return results that don't match up with external data sources that encode correctly.
This NYSIIS implementation ignores invalid characters (spaces, etc.) in the name and encodes until it runs out of alphabetic characters. Also see http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp for the update to this set. The update includes NYSIIS, Double-Metaphone, Celko Soundex, Daitch-Mokotoff Soundex, Levenshtein Edit Distance and Jaro-Winkler Distance functions.
April 13, 2009 at 6:25 pm
Hi,
Having issues registering the nyssis.dll file on Windows server 2008.
Is there Any updated version of this file is available?
or please tell me how i can use it on windows server 2008 operating system.
Thanks in advance
ravi
October 27, 2009 at 3:46 pm
I have had this implemented for about 2 years on a 32-bit sql 2000 server and it's working like a champ. However, we are now implenting on a sql 2005 64-bit server, and it is blowing up on me. My event viewer shows the following error:
"Description:
Could not load the DLL xp_nysiis.dll, or one of the DLLs it references. Reason: 193(xp_nysiis.dll is not a valid Win32 application.)."
Please help.
May 31, 2011 at 10:20 am
Hello All -
Does anyone have any solutions on how to get NYSIIS.DLL to run on a 64-Bit machine ??
December 20, 2013 at 3:02 pm
I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next. Unfortunately I've found inconsistent results with some of the online tools that I've been using to test this (http://www.dropby.com/NYSIIS.html). Anyway... it's a start for those that want to use it:
Declare
@Word varchar(50),
@FirstCharacter char(1),
@Length int,
@LetterPosition int,
@PreviousPosition int,
@NextPosition int,
@IteratedWord varchar(50)
SET @Word = 'Alexander'
SET @Word = replace(@Word,' ','')
SET @Word = rtrim(ltrim(@Word))
SET @Word = Upper(@Word)
SET @Length = len(@Word)
/*
1.Transcode first characters of name:
MAC»MCC
KN»NN
K»C
PH»FF
PF»FF
SCH»SSS
*/
IF left(@Word,3) = 'MAC' BEGIN SET @Word = 'MCC' + right(@Word,@Length-3) END
IF left(@Word,2) = 'KN' BEGIN SET @Word = 'NN' + right(@Word,@Length-2) END
IF left(@Word,1) = 'K' BEGIN SET @Word = 'C' + right(@Word,@Length-1) END
IF left(@Word,2) in ('PH','PF') BEGIN SET @Word = 'FF' + right(@Word,@Length-2) END
IF left(@Word,3) = 'SCH' BEGIN SET @Word = 'SSS' + right(@Word,@Length-3) END
PRINT '1: ' + @Word
/* 2.Transcode last characters of name:
EE, IE»Y
DT,RT,RD,NT,ND»D */
IF right(@Word,2) in ('EE','IE') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'Y' END
IF right(@Word,2) in ('DT','RT','RD','NT','ND') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'D' END
SET @Length = len(@Word)
PRINT '2: ' + @Word
/*3.First character of key = first character of name. */
SET @FirstCharacter = left(@Word,1)
PRINT '3: ' + @FirstCharacter
--Set @Word to remaining characters for further processing
SET @Word = right(@Word,@Length-1)
SET @Length = len(@Word)
PRINT 'Trimmed of First Character: ' + @Word
/*
4.Transcode remaining characters by following these rules, incrementing by one character each time:
EV»AFelse A,E,I,O,U » A
Q»G
Z»S
M»N
KN»Nelse K » C
SCH»SSS
PH»FF
H»If previous or next is nonvowel, previous
W»If previous is vowel, previous
Add current to key if current != last key character
*/
SET @Word = replace(@Word,'EV','AF')
SET @Word = replace(@Word,'E','A')
SET @Word = replace(@Word,'I','A')
SET @Word = replace(@Word,'O','A')
SET @Word = replace(@Word,'U','A')
SET @Word = replace(@Word,'Q','G')
SET @Word = replace(@Word,'Z','S')
SET @Word = replace(@Word,'M','N')
SET @Word = replace(@Word,'KN','N')
SET @Word = replace(@Word,'K','C')
SET @Word = replace(@Word,'SCH','SSS')
SET @Word = replace(@Word,'PH','FF')
--Iterate through each letter for "H" conversion
SET @IteratedWord = @Word
PRINT 'Iterated word: ' + @IteratedWord
SET @LetterPosition = 1
PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
While @LetterPosition <= @Length
BEGIN
SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END
SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END
IF substring(@IteratedWord,@LetterPosition,1) = 'H' and
(substring(@IteratedWord,@PreviousPosition,1) not in ('A','E','I','O','U') or
substring(@IteratedWord,@NextPosition,1) not in ('A','E','I','O','U'))
BEGIN
PRINT 'STUFF H!' + ' ' + @Word
SET @Word = stuff(@Word,@LetterPosition,1,'')
PRINT 'New Word: ' + isnull(@Word,'Null')
END
SET @LetterPosition = @LetterPosition + 1
SET @Length = len(@Word)
END
--Iterate through each letter for "W" conversion
SET @IteratedWord = @Word
PRINT 'Iterated word: ' + @IteratedWord
SET @LetterPosition = 1
PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
While @LetterPosition <= @Length
BEGIN
SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END
SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END
SET @Length = len(@Word)
IF substring(@IteratedWord,@LetterPosition,1) = 'W' and
substring(@IteratedWord,@PreviousPosition,1) in ('A','E','I','O','U')
BEGIN
PRINT 'STUFF W!' + ' ' + @Word
SET @Word = stuff(@Word,@LetterPosition,1,'')
PRINT 'New Word: ' + isnull(@Word,'Null')
END
SET @LetterPosition = @LetterPosition + 1
SET @Length = len(@Word)
END
/* 5. If last character is S, remove it */
IF right(@Word,1) = 'S'
BEGIN
PRINT 'Remove last letter S'
SET @Word = left(@Word,@Length-1)
SET @Length = len(@Word)
END
PRINT '5: ' + @Word
/* 6.If last characters are AY, replace with Y */
IF right(@Word,2) = 'AY'
BEGIN
PRINT 'Remove last characters AY'
SET @Word = stuff(@Word,@Length-1,2,'Y')
SET @Length = len(@Word)
END
PRINT '6: ' + @Word
/* 7.If last character is A, remove it */
IF right(@Word,1) = 'A'
BEGIN
PRINT 'Remove last letter A'
SET @Word = left(@Word,@Length-1)
SET @Length = len(@Word)
END
PRINT '7: ' + @Word
/* 8.Collapse all strings of repeated characters */
PRINT 'Collapse repeated characters...'
SET @LetterPosition = 1
While @LetterPosition < @Length
BEGIN
IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter Matchs. Remove Letter.'
SET @Word = stuff(@Word,@LetterPosition+1,1,'')
END
IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter does not match. Advance'
SET @LetterPosition = @LetterPosition + 1
END
SET @Length = len(@Word)
END
PRINT '8: ' + @Word
/*9.Add original first character of name as first character of key*/
SET @Word = @FirstCharacter + @Word
PRINT '9: ' + @Word
--Only supposed to use first six characters but display the rest
PRINT left(@Word,6) + Case When @Length > 6 Then '['+ substring(@Word,7,@Length-5)+']' Else '' END
December 20, 2013 at 6:42 pm
AVB (12/20/2013)
I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next.
I'm glad to see someone working on this outside the world of CLR even if I think the CLR will always win performance-wise.
Something to possibly consider. While loops tend to be a bit sluggish in T-SQL. The loops in your code might be an exception because they're "memory only" loops. However, please consider the following. Here's your snippet for getting rid of multiple adjacent identical characters.
/* 8.Collapse all strings of repeated characters */
PRINT 'Collapse repeated characters...'
SET @LetterPosition = 1
While @LetterPosition < @Length
BEGIN
IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter Matchs. Remove Letter.'
SET @Word = stuff(@Word,@LetterPosition+1,1,'')
END
IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter does not match. Advance'
SET @LetterPosition = @LetterPosition + 1
END
SET @Length = len(@Word)
END
PRINT '8: ' + @Word
Here's a "Tally Table" version that does the same thing.
--===== Remove all duplicated characters (JBM)
DECLARE @Word VARCHAR(1000);
SELECT @Word = 'AAAAABBBBCDEFFFFFFFFFFFFFFGHHHHHHHHIIIIIIIJJKLMNNNNNNNNNNO';
SELECT @Word =
(
SELECT SUBSTRING(@Word,t.N,1)
FROM dbo.Tally t
WHERE t.N <= DATALENGTH(@Word)
AND SUBSTRING(@Word,t.N,1) <> SUBSTRING(@Word,t.N-1,1)
ORDER BY t.N
FOR XML PATH(''),TYPE
).value('(./text())[1]','varchar(max)')
;
PRINT @Word;
Of course, you do lose the capability to the the individual actions taken against each and every character but I suspect you won't want that in the final product anyway.
Looking through the code, there are several places where you can convert the code to similar functionality or even combined prior to the de-duplication of characters.
You can find out more about the Tally Table and how it can be used as a replacement for certain kinds of loops at the following URL.
http://www.sqlservercentral.com/articles/T-SQL/62867/
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 16 through 30 (of 37 total)
You must be logged in to reply to this topic. Login to reply