Introduction
After a recent SQL Server Central article on the improvements in SQL 2005 User-Defined Data Types (UDT's) by author Dinesh Priyankara, I set out to research practical applications for SQL 2005 UDT's. The result of my research is this implementation of the Double Metaphone algorithm for SQL 2005.
Download the source code and binaries by clicking the link at the bottom of this article.
From Soundex to Metaphone to DoubleMetaphone
In late 1990, Lawrence Phillips introduced the Metaphone phonetic matching algorithm. This algorithm provided several advantages over Soundex, the de facto standard for phonetic matching at the time. One of the major advantages of Metaphone is that it encoded certain groups of letters together, whereas Soundex encoded only a single letter at a time. For instance, the word "GNU" encodes in Soundex as 'G500' indicating that the 'G' and 'N' are pronounced separately. In Metaphone "GNU" it encodes as 'N', indicating that the 'G' in 'GN' is silent1.
In June 2000, Mr. Phillips improved phonetic matching again with the introduction of the Double Metaphone algorithm2,3. This algorithm recognized that different letter combinations can be pronounced differently in different circumstances. For instance, in Spanish the name "VILLA" would be pronounced "VEE-YAH", however the 'LL' combination can also cause "VILLA" to be pronounced "VIL-UH" in English. Double Metaphone accounts for this discrepancy in pronunciation by giving us a primary encoding of 'FL' and an alternate encoding of 'F' for the word "VILLA". Soundex returns the single encoding of 'V400' for the same word.
The User-Defined Data Type
When designing this application, the first thing we need to do is determine the language we want to write code in. For this example, we'll give VB Express 2005 (Beta 2)4 along with SQL Express 2005 (June CTP)5 a shot. Note that the versions are very important, since we're still in Beta-ville and things are constantly changing.
With that out of the way, we can get into the implementation details and decide what type of result(s) we want to store. A good starting point is the output we can expect from Double Metaphone. According to Mr. Phillips, Double Metaphone should return two encoded strings, each 4 characters. If an encoded string is not 4 characters long, it should be right padded with spaces to make it 4 characters long. To this end, we'll implement a SQL Server 2005 .NET User-Defined Data Type to hold both of our encodings in a single object. Here are some of the highlights of this UDT (See the download file for complete source code):
' This UDT is DoubleMetaphoneResults. It contains the results of a Double Metaphone ' Encoding. The namespace for both the DoubleMetaphoneResults UDT and DoubleMetaphone ' class is Phonetics.Tools Namespace Phonetic.Tools ' Here we define it as Format.UserDefined. This is a requirement for UDT's that ' expose string or non-value (reference) properties. We also declared it Serializable(), ' although MS has stated this will not be a requirement in the future. <Serializable(), _ SqlUserDefinedType(Format.UserDefined, IsByteOrdered:=True, MaxByteSize:=16), _ StructLayout(LayoutKind.Sequential)> _ Public Class DoubleMetaphoneResult Implements INullable, IBinarySerialize ' SQL 2005 .NET UDT's need to Implement INullable. Since it is Format.UserDefined, ' we have to implement IBinarySerialize as well ' IBinarySerialize requires that we implement our own Read and Write Methods Public Sub Write(ByVal w As System.IO.BinaryWriter) Implements IBinarySerialize.Write ' <snip> End Sub Public Sub Read(ByVal r As System.IO.BinaryReader) Implements IBinarySerialize.Read ' <snip> End Sub ' INullable requires us to implement a read-only Null and IsNull properties Private _IsNull As Boolean Public Shared ReadOnly Property Null() As DoubleMetaphoneResult '<snip> End Property Public ReadOnly Property IsNull() As Boolean Implements INullable.IsNull '<snip> End Property ' Parse allows us to parse an input string into a DoubleMetaphoneResult ' In our case we accept two 4 byte double-metaphone strings concatenated ' into one 8-byte string as input Public Shared Function Parse(ByVal data As SqlString) As DoubleMetaphoneResult ' <snip> End Function ' ToString() is another handy function that converts our DoubleMetaphoneResult ' into one 8-byte string. The first four bytes represent the first, or primary, ' encoding. The last four bytes represent the second, or alternate, encoding Public Overrides Function ToString() As String ' <snip> End Function ' Now for the properties that actually make up the meat of what we're trying to ' do. There is a First property that exposes the first, or primary, encoding ' and a Second property that exposes the second, or alternate, encoding. ' Note that I did not name these properties Primary and Alternate because ' Primary is a reserved keyword. Private _First As String Private _Second As String Public Property First() As String ' <snip> End Property Public Property Second() As String ' <snip> End Property End Class End Namespace |
If you decide to explore the source code, you'll notice that the actual public properties we want to expose take up very little code. The majority of the code in our .NET UDT is pretty much "boiler-plate" interface implementation code. Since we have exposed string type properties in our UDT we have to use Format.UserDefined serialization. If we exposed only value types (Integers, etc.), we could have used the Format.Native serialization, which is faster, instead. This would have made things a little simpler since we wouldn't have had to implement the IBinarySerialize interface either.
Here's our SQL 2005/.NET UDT:
Overview of Phonetic.Tools.DoubleMetaphoneResult UDT Class | ||
Public Properties | ||
First | The first, or primary, Double Metaphone encoded string | |
Second | The second, or alternate, Double Metaphone encoded string | |
Null | A Null DoubleMetaphoneResult UDT object (Read Only) | |
IsNull | Returns True if a DoubleMetaphoneResult UDT object is Null (Read Only) | |
Public Methods | ||
Read() | Implements the IBinarySerialize.Read() method | |
Write() | Implements the IBinarySerialize.Write() method | |
Parse() | Parses a string into a DoubleMetaphoneResult object | |
ToString() | Converts a DoubleMetaphoneResult object into a string |
The User-Defined Functions
Now that we have our UDT defined, we need to implement the class that will utilize our UDT. We will implement two functionas: 1) Generate a DoubleMetaphoneResult from an input string and 2) Compare two DoubleMetaphoneResult objects.
Double Metaphone is a fairly complex algorithm that takes literally hundreds of combinations of letters, exceptions to normal pronunciation and other special criteria into account. The implementation and description of Double Metaphone could fill an entire article by itself. For the implementation, I ported Mr. Phillips' C++ version directly over to VB Express 2005. So to stay on target, we'll skip most of the implementation details in this article. There are links in the references section, as well as the downloadable source code, that provide grater detail.
In order to implement Double Metaphone encoding functionality, we will create a class called Phonetic.Tools.DoubleMetaphone. This class will have two public methods: DoubleMetaphoneEncode() and DoubleMetaphoneCompare(). These two methods will be the basis for our SQL Server 2005 Double Metaphone User-Defined Functions.
There are a couple of items that are noteworthy during the implementation of the public methods:
- Our methods must be "Shared" (or "static" for you C# programmers out there). This isn't that big of a deal for us to change in this instance.
- The methods are preceded with the <SqlFunction(IsDeterministic:=True)> attribute, indicating that they are functionally enabled for SQL Server 2005 and that they are Deterministic functions.
- Additionally, the .NET String data type corresponds to the SQL 2005 NVARCHAR data type, and cannot accept VARCHAR values. We will have to make sure that our SQL-side function definitions account for this.
Apart from those items, implementing a SQL Server 2005 UDF in .NET was as simple as writing a Function in Visual Basic. Our DoubleMetaphone class exposes only two public methods:
Overview of Phonetic.Tools.DoubleMetaphoneResult UDT Class | ||
Public Methods | ||
DoubleMetaphoneEncode() | Accepts an NVARCHAR value and encodes it using Double Metaphone. Returns a DoubleMetaphoneResult object. | |
DoubleMetaphoneCompare() | Accepts two DoubleMetaphoneResult objects as inputs, and compares them. Returns an Integer between 0 and 3 indicating the match score. |
The DoubleMetaphoneEncode() function accepts an NVARCHAR string as input and then encodes it using the Double Metaphone algorithm. The result is returned in a DoubleMetaphoneResult UDT object. The DoubleMetaphoneCompare() function accepts two encoded DoubleMetaphoneResult UDT objects and returns a match score based on the following table6:
DoubleMetaphoneCompare() Results. | |
Score | Description |
3 | Strong Match. Primary1 = Primary2 |
2 | Medium Match. Primary1 = Alternate2 |
2 | Medium Match. Primary2 = Alternate1 |
1 | Weak Match. Alternate1 = Alternate2 |
0 | No Match |
The value returned by the DoubleMetaphoneCompare() function can be used to filter matches and narrow down search result sets.
Installation and Usage
To install Double Metaphone, just follow these steps:
- Copy the DoubleMetaphone.dll file from the DoubleMetaphone\Install directory of the ZIP file to your SQL Server 2005 MSSQL\Binn directory.
- Open and run the DoubleM_Install.SQL script (located in the DoubleMetaphone\SQL_Queries directory of the ZIP file). You may have to modify the directory path of the CREATE ASSEMBLY statement.
To install the sample [People] table with about 4,500 surnames for testing, run the Create_People_Table.SQL script. Once the [People] table is installed, you can run DoubleM_Query_Demo.SQL which demonstrates one possible method of using the UDF's and UDT's.
To uninstall, just run the DoubleM_Drop.SQL script. It removes the [People] table, drops the UDF's and UDT, and drops the DoubleMetaphone assembly.