Soundex - Experiments with SQL CLR
Occasionally something you read causes you to radically change your way of working. There are two books that have had such a profound effect on me.
- The Art of Agile Development by James Shore and Shane Warden
- Beginning Visual C++ 4 by Ivor Horton
The former really switched me on to agile development and the latter, (although I have only ever used C++ in hobby projects) gave me a much deeper understanding of what goes on under the hood within applications.
I recently found I needed to use the SOUNDEX function and although the SQL Server implementation of soundex follows a standard I really needed something with a bit more oomph. Could I combine what I learnt in both books to product a better SOUNDEX function?
Strings and efficiency
Under the covers a string is simply a contiguous array of bytes. If you want to append strings together then your computer has to work out how long both strings are, reserve that amount of contiguous space, and then copy the concatenated strings into that new space before releasing the original memory. If you have to do a lot of concatenation then this is a very expensive operation.
In .NET and Java you have a StringBuilder class. This postpones the problem of concatenation until the final string is requested. It does this by maintaining a list of pointers to the various strings in the array and assembling the final string when the ToString() method is called.
In Ivor Horton's book he has an exercise where you build a simple command line calculator. One of the functions of that calculator is to interpret what has been typed in and the first part of this is to remove any spaces from within the formula. He achieved this by writing a simple function called eatspaces()
// Function to eliminate spaces from a string void eatspaces(char* str) {int
i(0);// 'Copy to' index to string
int
j(0);// 'Copy from' index to string
while
((*(str + i) = *(str + j++)) != '\0')// Loop while character is not \0
if
(*(str + i) != ' ')// Increment i as long as
i++;// character is not a space
return
; }
The full explanation of the code is covered by the book but in brief it works by keeping two position-in-string variables, i and j . Variable j increments every time the routine goes around the loop. Variable i increments only when the character pointed to in the next position to the 2nd counter is not a space. This routine does not require any additional memory locations for manipulating the string and so is extremely efficient.
If you are an old VB programmer you will probably remember doing something similar in VB. You would start by assign a certain number of characters to a string. Then you would write mid(myString,x,1)="<your character>". This was far faster than writing TargetString=TargetString & mid(MyString,x,1) within a loop.
Test Driven Development
In "The Art of Agile Development" James and Shane say that the best way of developing new functionality is to take a lot of very small steps with simple tests at each stage. They believe that this is faster than trying to take bigger steps.
I am going to combine techniques from Ivor Horton and James Shore and Shane Warden to experiment with SOUNDEX.
The standard rules of SOUNDEX
Just to recap soundex has a fairly simple rule set which is as follows:
- Always retain the first letter
- For all other letters up to the first non-alphabetic character replace them with numbers as in the following table.
Number | Letter |
---|---|
0 | A,E,H,I,O,U,W,Y. In other workds vowels including the Welsh vowels H, W and Y |
1 | B,F,P, V |
2 | C, G, J, K, Q, S, X, Z |
3 | D, T |
4 | L |
5 | N, M |
6 | R |
- If any adjacent numbers occur in pairs or more only retain the first number
- Strip out the zeroes.
- If longer than four characters then truncate to 4 characters
- If shorter than four characters then pad out with zeroes.
The following SQL statement demonstrates this adequately.
WITH
cte(Letter, Number)AS
(SELECT
CHAR
(number), SUBSTRING(SOUNDEX('A'
+CHAR(number)),2,1)FROM
master..spt_valuesWHERE
type='P'
AND numberBETWEEN
65 AND 90 )SELECT
Number, LetterFROM
cteORDER BY
Number,Letter
The SOUNDEX function is adequate for basic misspellings as these usually fall into 3 camps:
- Wrong vowel
- Double-letters where they should be single and single where they should be double.
- Similar sounding consonants
What do I think it wrong with Soundex?
Actually it is not SOUNDEX itself I have issues with; it is the implementation of SOUNDEX; I think can be improved. SOUNDEX was first patented in 1918 as a manual system. One of the earliest uses of soundex was in a rolodex card system of voters for the American census. http://www.census.gov/history/www/genealogy/decennial_census_records/soundex_1.html.
For its original purpose in the early C20th it was a good solution but the I should like to make a couple of improvements:
- Take any non-alphabetic characters as word breaks and soundex each word
- Do not impose a fixed word length or at least allow a tuneable word length.
Before I leap into the code I'd like to do some analysis of the problem I'm trying to solve and carry out some basic design work.
Basic design and requirements
Where a SOUNDEX code contains letters they will always be upper case. I am going to have two counters to work through my CharArray.
- ValidCharacterPosition - points to the last valid character position and increments only for specific criteria
- CurrentCharacterPosition - points to the current character being evaluated. This increments for every character in the array.
Since the character at the position pointed to by " ValidCharacterPosition " can only ever be a letter, number or space then the rules for incrementing " ValidCharacterPosition " are pretty straight forward.
ValidCharacterPosition | CurrentCharacterPosition | Action |
---|---|---|
Letter | Not same letter | If ValidCharacterPosition and CurrentCharacterPosition do not point to position one then increment ValidCharacterPosition and replace character at that location with Soundex character |
Number | Not same number CurrentCharacter is zero | Replace character at ValidCharacterPosition with Soundex character Increment ValidCharacterPosition |
Number | Not same number and CurrentCharacter is not zero | Increment ValidCharacterPosition Replace character at ValidCharacterPosition with Soundex character |
Space | Letter | Increment ValidCharacterPosition Replace character at ValidCharacterPosition with Soundex character |
Setting up the Visual Studio project
I am using Visual Studio 2008 so my first step was to create a new database project as follows. From the "File" menu choose "New Project" to get up the dialogue box as shown below. I've chosen a "Database" project with both the solution and project name set to "Phonetics".
As the project is a database project I was asked for an initial connection so I simply chose one of my Adventureworks instances
Once my "Phonetics" project was created I selected it. By right-clicking on the "Phonetics" project and choosing " User Defined Function" I then added a user defined function called "SoundexLong" as shown in the dialogue box below.
I now have a simple boiler plate user defined function.
using
System;using
System.Data;using
System.Data.SqlClient;using
System.Data.SqlTypes;using
Microsoft.SqlServer.Server;public partial class
UserDefinedFunctions { [Microsoft.SqlServer.Server.SqlFunction] public static SqlString SoundexLong() {// Put your code here
return new
SqlString("Hello"); } };
I changed the class name from "UserDefinedFunctions" to "Phonetics".
Choosing my project again I right-click and choose "Properties" and change my assembly name to "Phonetics" before building the project to get an initial DLL file.
Setting up the test project
Because I want to practice Test Driven Development (TDD) I want to add a test project. The easiest way for me to do this was to choose "New Test" from the "Test" menu at the top of the screen. This produced a Wizard that I filled in as shown.
Once the new test project had been created I chose "References" in the solution explorer, right-clicked and added the following references from the .NET tab
- System.Data
- System.XML
I also added the matching "Using" statements to the header of my "SoundexTests.cs" file.
using
System.Data;using
System.Xml;
From the project tab I added the reference to my "Phonetics" project.
At this stage I have a single test method called TestMethod1() which I edited as follows:
[TestMethod]public void
StartUpTest() { Assert.AreEqual("Hello"
,[Phonetics.SoundexLong()); }
I can now build my solution and run my tests. The tests can be run from the menu as shown below:
At this stage the tests pass and the test window at the bottom of the screen will appear as follows:
You can demonstrate failure by editing the test and rerunning it as follows:
Assert.AreEqual("Goodbye"
,Phonetics.SoundexLong());
Now we have a basic programme up and running with the test harness in place and functioning as required. We can now start building the SoundexLong function.
Writing the first real test
In Test Driven Development we write the test first knowing it will fail because we haven't written any code yet. We then write just enough code for it to pass.
Thinking about SoundexLong then if I pass in a single alphabetic character then I should get the same alphabetic character back out. With this in mind I edit the StartUpTest method to read as follows:
[TestMethod]public void
StartUpTest() { Assert.AreEqual("D"
,Phonetics.SoundexLong("D"
)); }
Because the SoundexLong function does not actually take an argument the solution won't even compile so I edit the signature of the function as follows:
public static
SqlString SoundexLong(string
InputString)
This does compile but the tests fail. Strictly speaking I could make the test pass simply by changing the body of my function as follows:
return new
SqlString(InputString.ToString());
However, my starting point is putting my InputString argument into a CharArray and then returning it back out again so I edit the body of my function as follows:
char
[] inputArray = InputString.Value.ToUpper().ToCharArray();return new
SqlString(inputArray.ToString());
And…..it Fails!
CharArray.ToString() simply returns the name of data type and not the contents. When I started to write this function this was something I didn't know. If I had not written a test or taken too many development steps forward then this would have been harder to pin down. To get the test to pass I had to amend my return statement as follows:-
return new
string(inputArray);
Phase two - Iterating through the string
What I have so far is basically changing the type of my InputString. What I have to do is prove that I can iterate through the string. Remember we write tests first, code later to pass the test? To that ends I add a new test to the bottom of my SoundexTests.cs file
[TestMethod]public void
SoundexCharTest() { Assert.AreEqual("Dave"
, Phonetics.SoundexLong("Dave"
)); }
I now change my SoundexLong function as follows:
public static
SqlString SoundexLong(string InputString) {char
[] inputArray = InputString.Value.ToUpper().ToCharArray();int
currentCharacterPosition = 0;int
validCharacterPosition = 0;while
(currentCharacterPosition < inputArray.Length) { inputArray[validCharacterPosition] = inputArray[currentCharacterPosition++]; }return new
string(inputArray); }
Run the tests and…..fail!
What has happened is that I forgot to increment my currentCharacterPosition pointer so all the characters from my string get copied to the first character. However this does demonstrate another important point. TDD is not a silver bullet, it is fallible. If I had used "David" rather than "Dave" the test would have passed giving me a false positive.
Correcting my code to the following resulted in a "Pass".
public static
SqlString SoundexLong(string InputString) {char
[] inputArray = InputString.Value.ToUpper().ToCharArray();int
currentCharacterPosition = 0;int
validCharacterPosition = 0;while
(currentCharacterPosition < inputArray.Length) { inputArray[validCharacterPosition] = inputArray[currentCharacterPosition++]; validCharacterPosition++; }return new
string(inputArray); }
Refactoring - Iterating through the input string.
The actual SOUNDEX character values are best returned by a function so, although we don't want to write the body of the function yet we can refactor our code to use a function.
Our new SoundexChar function becomes
private static char
SoundexChar(char
a) {return
a; }
We also modify our SoundexLong code so that
inputArray[validCharacterPosition] = inputArray[currentCharacterPosition++];
becomes
inputArray[validCharacterPosition] = SoundexChar(inputArray[currentCharacterPosition++]);
If the modifications are made correctly then both tests will pass. So existing tests prove that changed code still carries out the expected function.
Refactoring the Tests
So far we have proved that we can iterate through our string and call a function. Although it doesn't sound like much but at each stage we have validated our code using automated tests. What we have to do now is write a test for the actual SOUNDEX character values and the code to make it pass. Remember we are trying to develop by using simple increments so I am going to refactor the SoundexCharTest to test for basic character substitution.
[TestMethod]public void
SoundexCharTest() { Assert.AreEqual("D010"
, Phonetics.SoundexLong("Dave"
)); }
What I am not trying to do at present is:
- Get rid of vowels
- Get rid of double occurrences
- Detect when my starting character is not a letter
- Word breaks
These will come later.
Our first step is to modify our SoundexChar function to make the character substitutions.
private static char
SoundexChar(char
a) {switch
(a) {case
'A'
:case
'E'
:case
'I'
:case
'O'
:case
'U'
:case
'W'
:case
'H'
:case
'Y'
: a ='0'
;break
;case
'B'
:case
'F'
:case
'P'
:case
'V'
: a ='1'
;break
;case
'C'
:case
'G'
:case
'J'
:case
'K'
:case
'Q'
:case
'S'
:case
'X'
:case
'Z'
: a ='2'
;break
;case
'D'
:case
'T'
: a ='3'
;break
;case
'L'
: a ='4'
;break
;case
'N'
:case
'M'
: a ='5'
;break
;case
'R'
: a ='6'
;break
;default
: a =' '
;break
; }return
a; }
Note that any non-alphabetic characters return a space
The only change we need to make to our SoundexLong function is to change the starting position of our two variables to the following:
int currentCharacterPosition = 1; int validCharacterPosition = 1;
Adding a test for when our input string is not a character
We know that we want our SoundexLong character to be more robust than the inbuilt T-SQL SOUNDEX function so we are going to have to deal with cases where an input string does not being with a letter.
Our simple test is as follows:
public void
LetterOneNotAlphaTest() { Assert.AreEqual("D010"
, Phonetics.SoundexLong("1Dave"
)); }
This means we need a function that determines whether or not a character is a letter or not.
private static bool
IsLetter(char a){bool
returnValue=false
;int
i = Convert.ToInt32(a);if
((i >= Convert.ToInt32('A') && i <= Convert.ToInt32('Z'))) { returnValue =true
; }return
returnValue; }
There are a couple of points to note here
- An IsLetter() function is generically useful and would probably sit in a centralised maintained code library rather than in an individual solution
- This implementation simply checks for capital letters as we know our inputString is being forced to upper case. A more robust implementation would also check for lower case letters.
In our main code we revert our two variables to their original values
int
currentCharacterPosition = 0;int
validCharacterPosition = 0;
We then take a look at the first character in the string and if it is not a letter then we replace it with a space.
if
(IsLetter(inputArray[0])) { validCharacterPosition++; }else
{ inputArray[0] = ' '; } currentCharacterPosition++;
Our while loop now needs to detect whether or not the validCharacterPosition is a space and the next one is a letter. If so then we will grab the letter, otherwise we will grab the SoundexChar value.
while
(currentCharacterPosition < inputArray.Length) { if (inputArray[validCharacterPosition] == ' ' && IsLetter(inputArray[currentCharacterPosition])) { inputArray[validCharacterPosition] = inputArray[currentCharacterPosition++]; }else
{ inputArray[validCharacterPosition] = SoundexChar(inputArray[currentCharacterPosition++]); } validCharacterPosition++; }
We run the tests and….fail!
What is happening here is by discarding the invalid first character our validCharacterPosition never reaches the end of the string. We are correctly encoding "DAVE" but in discarding the first character we are leaving the trailing "E".
What we need is some additional code after the "while" loop to replace any remaining characters. To make the test pass we need to change the bottom of our function to wipe any extraneous characters as follows:-
while
(validCharacterPosition < currentCharacterPosition) { inputArray[validCharacterPosition++] = ' '; }return new string(
inputArray).Trim();
Multiple word test
One of the key things we want to achieve is to have our SoundexLong function handle multiple words. To do this we add a new test
[TestMethod]public void
TwoWordTest() { Assert.AreEqual("D010 P0040"
, Phonetics.SoundexLong("Dave Poole"
)); }
This fails as our existing code encodes the letter "P" in Poole whereas we want it to treat this as the start of a new word. The key to fixing this is to look at when we increment the validCharacterPosition variable. We should only increment it when we want just before we write a new valid character and not after we have just written one.
This time to make the test pass we have three amendments to make. For our handling of the first character of our inputString discard the increment of the validCharacterPosition.
if
(!IsLetter(inputArray[0]))
{
inputArray[0] = ' ';
}
Change our if statement within the while loop to increment validCharacterPosition just as we want to write the next valid character.
if
(inputArray[validCharacterPosition] ==' '
&& IsLetter(inputArray[currentCharacterPosition])) { inputArray[++validCharacterPosition] = inputArray[currentCharacterPosition++]; }else
{ inputArray[++validCharacterPosition] = SoundexChar(inputArray[currentCharacterPosition++]); }
As we increment only when we want to write a valid character we have to add an additional increment between our main while loop and the loop to erase extraneous strings.
Tests for double occurrences
As mentioned earlier one of the strengths of SOUNDEX is that it handles vowels and double consonants. The key to this working with our SoundexLong function lies with the events that cause us to increment our validCharacterPosition variable. The value that SoundexLong should produce for Llangollen should be L0520405.
- Only take into account the first "L" at the start of the word
- Only take into account the first "L" in the middle of the word.
So what we have to amend our SoundexLong function to do is to increment our validCharacterPosition variable whenever neither of the following conditions are true
- The last valid character matches the current character
- The last valid character matches the SOUNDEX character
This does mean that we have to refactor our TwoWordTest.
Assert.AreEqual("D010 P0040"
, Phonetics.SoundexLong("Dave Poole"
));
Will become
Assert.AreEqual("D010 P040"
, Phonetics.SoundexLong("Dave Poole"
));
We will also add our "Llangollen" test.
[TestMethod]public void
LlangollenTest() { Assert.AreEqual("L0520405"
, Phonetics.SoundexLong("llangollen"
)); }
This is going to make the logic of our if statement within our SoundexLong statement more complicated so the first thing we are going to do is refactor it.
We are going to take the increment of currentCharacterPosition out of the code to write the validCharacterPosition and put it at the end of the while loop
inputArray[validCharacterPosition] = inputArray[currentCharacterPosition++];
will become
inputArray[validCharacterPosition] = inputArray[currentCharacterPosition];
…etc.
Our test for a space followed by a letter will be split into two separate if statements
if (inputArray[validCharacterPosition] == ' ')
{
if
(IsLetter(inputArray[currentCharacterPosition])){
inputArray[++validCharacterPosition] = inputArray[currentCharacterPosition];
}
}
This is because we need to detect when an alternative to a space has been detected. This means we gain an else statement.
else
{if
(!((inputArray[validCharacterPosition] == inputArray[currentCharacterPosition]) || inputArray[validCharacterPosition]==SoundexChar(inputArray[currentCharacterPosition]))) { inputArray[++validCharacterPosition] = SoundexChar(inputArray[currentCharacterPosition]); } }
Refactoring tests to deal with vowels
All vowels, including the Welsh vowels (W, H & Y) are silent delimiters within SOUNDEX.
- Annotate will start A53 as the SOUNDEX value of 'n' will only be counted once.
- Anonymous will start A55 as the letter 'O' separates the two 'n' characters.
This means that all our tests that currently expect the return value to contain zeros must be changed so they no longer do so. We could also do with a test that has multiple words and multiple different vowels.
[TestMethod]public void
LlangollenHeritageRailwayTest() { Assert.AreEqual("L5245 H632 R4"
, Phonetics.SoundexLong("llangollen heritage railway"
)); }
The change to take into account the zeros is a slight refactoring of the 'if' statement above.
if
(!((inputArray[validCharacterPosition] == inputArray[currentCharacterPosition]) || inputArray[validCharacterPosition]==SoundexChar(inputArray[currentCharacterPosition]))) {// Only increment the validCharacterPosition if it is not a vowel
if
(inputArray[validCharacterPosition] != '0') { validCharacterPosition++; } inputArray[validCharacterPosition] = SoundexChar(inputArray[currentCharacterPosition]); }
We have taken the ++validCharacterposition out of the assignment and put it in only when the validCharacterPosition is not zero (indicating that it is a silent vowel).
We run our tests and…..fail.
What is happening here is that vowels are assigned a value of zero but then overwritten by the next valid non-vowel. The problem is that when a vowel is at the end of a word then there is no next valid non-vowel!
Fortunately this is easily corrected by adding the following code immediately after the end of the first while loop.
// Blank out any trailing vowel
if
(inputArray[validCharacterPosition]=='0'
) { inputArray[validCharacterPosition]=' '
; }
Further Tests
Within our main "Phonetics" solution there is a T-SQL file called Test.SQL
Once you have deployed your working solution to SQL Server you can edit this file to run the T-SQL implementation of your function by clicking on the "Start Debugging" toolbar icon or pressing F5.
By amending the Test.SQL as follows I uncovered an embarrassing omission in my test coverage.
IF
EXISTS(SELECT
1FROM
INFORMATION_SCHEMA.TABLESWHERE
TABLE_NAME='Soundextest'
)BEGIN DROP TABLE
dbo.SoundexTest'Table dropped: dbo.SoundexTest'
END
;CREATE TABLE
dbo.SoundexTest(InputStringVARCHAR
(100) NOT NULL)INSERT INTO
dbo.SoundexTestVALUES
('Dave Poole'
),('Llangollen'
),('Llangollen Heritage Railway'
),(''
)SELECT
InputString, Soundex_Long=dbo.SoundexLong(InputString)FROM
dbo.SoundexTestSELECT
dbo.SoundexLong(null)IF
EXISTS(SELECT
1FROM
INFORMATION_SCHEMA.TABLESWHERE
TABLE_NAME='Soundextest'
)BEGIN DROP TABLE
dbo.SoundexTest'Table dropped: dbo.SoundexTest'
END
;
Both passing an empty string and null character caused a .NET exception to be thrown. This led me to add two further tests.
[TestMethod]public void
ZeroLengthStringTest() { Assert.AreEqual("", Phonetics.SoundexLong("")); } [TestMethod]public void
NullStringTest() { Assert.AreEqual("", Phonetics.SoundexLong(null)); }
To detect these values I added the following code as the start of our SoundexLong body
if
(String.IsNullOrEmpty(InputString))
{
InputString=" ";
}
It would also be a good idea to test that all alphabetic character have been covered by our SoundexChar() function.
[TestMethod]public void
AllAlphabeticsTest() {int
TotalLength = 0;for
(int
i = Convert.ToInt32('A'
); i <= Convert.ToInt32('Z'
); i++) { TotalLength+=Phonetics.SoundexLong("A"
+ Convert.ToChar(i)).ToString().Length; } Assert.AreEqual(44, TotalLength); }
The way that this test works is by generating pairs of letters.
- AA
- AB
- AC
- …
- AZ
There are 8 vowels (remember W, H & Y) and therefore 18 consonants. If our function is working correctly then each call to the consonant combination should produce a 2 character string whereas calls to the vowel combination should produce a single character.
2x18 = 36
1x 8 = 8
36 + 8 = 44
If we have missed any letters from SoundexChar() then they will get replaced by a space which will be trimmed and therefore the total length of the string will be less than 44 characters long.
Concluding thoughts
Personally I find that the small incremental approach recommended by Shore & Warden works for me. Prior to adopting this approach I would try to do too much at once and for every 5 minutes saved by taking big steps forward I would lose 10 minutes in trying to isolate the bugs.
May people have stated that at first Test Driven Development feels like an unnatural act. It definitely forced me to think how I was going to test some aspect of my code and I found this made me think much more deeply about the design of my function. Anything that encourages measured thought before cutting code has to be a good thing. Measure twice, cut once!
There is a very definite skill in designing tests that add value to the solution. I found myself writing tests and then finding that I could achieve the same thing with fewer, better thought out tests.
Where I found the most benefit in the tests was when I refactored the code. That is really what they are there for, to confirm that if I've changed something, have I screwed anything up.
Does the function achieve what I hoped it would achieve? Well that is a story for another time but the query below gives a clue.
USE
AdventureworksGO DECLARE
@SearchTermVARCHAR
(100)SET
@SearchTerm='Airodynamick-Bike'
SET
@SearchTerm ='%'
+ dbo.SoundexLong(@SearchTerm)+'%'
SELECT
ProductDescriptionID ,Description ,PhoneticDescription=dbo.SoundexLong(Description)FROM
Production.ProductDescriptionWHERE
dbo.SoundexLong(Description)LIKE
@SearchTerm