In February 2025, Microsoft introduced new T-SQL functions in Azure SQL Database and SQL Database in Microsoft Fabric to enhance fuzzy string matching capabilities. You can visit Exciting new T-SQL features: Regex support, Fuzzy string-matching, and bigint support in DATEADD – preview for more details.
This article will deep dive into newly introduced functions and explain the differences and use cases of these newly introduced functions supporting fuzzy search.
The Need for Fuzzy String Matching in SQL Server
Handling user input in search operations can be challenging due to misspellings, variations in spelling, and inconsistent data entry. Traditional SQL queries using LIKE or exact matches often fail to retrieve relevant results when minor differences exist between stored data and search queries. This is particularly problematic in applications dealing with customer records, product names, or natural language searches, where users may not always enter exact strings.
Previously, developers relied on external solutions like Elasticsearch or Python-based fuzzy matching to overcome these challenges because SQL Server lacked built-in support for approximate string matching. These external solutions provided more advanced similarity metrics, phonetic matching, and flexible indexing methods that allowed for efficient fuzzy searches in large datasets. However, integrating them required additional infrastructure, performance tuning, and maintenance, making them complex and costly to implement. Other SQL-based solutions, both paid and free, have also been used for approximate string matching, such as:
- SQL Server Full-Text Search (built-in but limited in fuzzy capabilities)
- Levenshtein Function Implementation (custom implementations using T-SQL or CLR functions)
- Soundex and Metaphone Algorithms (available in some SQL databases for phonetic matching. There are some implementations for SQL Server as well)
- Commercial Solutions like Azure Cognitive Search or Oracle Text (providing more advanced fuzzy search capabilities)
However, these approaches often come with disadvantages such as increased complexity, performance overhead, limited customization, or additional licensing costs. Microsoft has now introduced built-in fuzzy search capabilities in SQL Server, eliminating the need for third-party tools and enabling more efficient approximate string matching within SQL queries.
These functions—EDIT_DISTANCE, EDIT_DISTANCE_SIMILARITY, JARO_WINKLER_DISTANCE, and JARO_WINKLER_SIMILARITY—allow developers to assess the similarity between strings directly within SQL queries.
Understanding the differences between these functions and their appropriate use cases is essential for effective data processing. Here's a quick comparison:
- EDIT_DISTANCE: Measures the number of character edits (insertions, deletions, substitutions) needed to transform one string into another.
- EDIT_DISTANCE_SIMILARITY: Returns a similarity percentage based on the edit distance, making it useful for ranking potential matches.
- JARO_WINKLER_DISTANCE: Calculates similarity by considering matching characters and their transpositions, with more weight given to initial character similarity.
- JARO_WINKLER_SIMILARITY: Similar to Jaro-Winkler Distance but returns a normalized similarity score between 0 and 1.
Each function has unique strengths, and selecting the right one depends on the type of data and use case. Let's explore each function in detail.
Setting Up a Test Environment for Performance Evaluation
To effectively evaluate the performance of these fuzzy string-matching functions, it's essential to test them on a significant dataset. A large dataset helps assess query performance under realistic conditions, revealing potential bottlenecks and ensuring scalability. It also provides a more accurate representation of real-world variations in data, allowing for meaningful comparisons of different fuzzy matching techniques. Below is a SQL script to create a Persons table and populate it with 1000 random records:
/*Setup*//*Create person table with only FirstName*/IF OBJECT_ID('Persons', 'U') IS NOT NULL DROP TABLE Persons; CREATE TABLE Persons ( FirstName NVARCHAR(50) ); GO /*Create random records*/SET NOCOUNT ON; -- Create temporary table for sample first names DECLARE @FirstNames TABLE (Name NVARCHAR(50)); -- Populate First Names INSERT INTO @FirstNames VALUES ('James'), ('Mary'), ('John'), ('Patricia'), ('Robert'), ('Jennifer'), ('Michael'), ('Linda'), ('William'), ('Elizabeth'), ('David'), ('Barbara'), ('Richard'), ('Susan'), ('Joseph'), ('Jessica'), ('Charles'), ('Sarah'), ('Thomas'), ('Karen'), ('Amit'), ('Raj'), ('Suresh'), ('Neha'), ('Priya'), ('Sanjay'), ('Anil'), ('Pooja'), ('Ravi'), ('Manoj'), ('Alok'), ('Deepak'), ('Dipak'), ('Sunil'), ('Rahul'), ('Vikram'), ('Arun'), ('Kiran'), ('Varun'), ('Anjali'), ('Meera'), ('Rohan'), ('Kartik'), ('Vishal'), ('Aarav'), ('Ishaan'), ('Divya'), ('Sneha'), ('Harsh'), ('Aditya'), ('Nikhil'), ('Gaurav'), ('Kavya'), ('Chris'), ('Christopher'), ('Christian'), ('Christina'), ('Christoph'), ('Kriss'), ('Darrick'), ('Derrick'), ('Dereck'), ('Dirk'), ('Eric'), ('Erik'), ('Jon'), ('Jonathan'), ('Johnny'), ('Jack'), ('Jackson'), ('Stephen'), ('Steven'), ('Stefan'), ('Steph'), ('Steve'), ('Catherine'), ('Katherine'), ('Kathryn'), ('Cathy'), ('Kathy'), ('Kathleen'), ('Zachary'), ('Zack'), ('Zac'), ('Zak'), ('Philip'), ('Phillip'), ('Phil'), ('Filip'), ('Mathew'), ('Matthew'), ('Matt'), ('Matthias'); -- Insert Random Data into Persons Table DECLARE @Counter INT = 0; DECLARE @RandomFirstName NVARCHAR(50); WHILE @Counter < 1000 BEGIN -- Select random first name SELECT TOP 1 @RandomFirstName = Name FROM @FirstNames ORDER BY NEWID(); -- Insert into the table INSERT INTO Persons (FirstName) VALUES (@RandomFirstName); SET @Counter = @Counter + 1; END; PRINT '100,000 records inserted successfully.'; SELECT * FROM dbo.Persons;
1. EDIT_DISTANCE
Description: Calculates the number of single-character edits (insertions, deletions, substitutions, or transpositions) required to transform one string into another.
Use Case: Ideal for identifying exact differences between two names, such as detecting typos or minor variations.
Example:
DECLARE @QueryString NVARCHAR(256)='Aron' SELECT P.FirstName,EDIT_DISTANCE(p.FirstName, @QueryString) as ED_Distance FROM dbo.Persons AS P WHERE EDIT_DISTANCE(p.FirstName, @QueryString) <3
Result:
FirstName | SearchString | Distance |
---|---|---|
Arun | Aron | 1 |
This indicates that only one edit (substitution of 'a' with 'o') is needed to transform "Aron" into "Arun".
2. EDIT_DISTANCE_SIMILARITY
Description: Returns a similarity score between 0 and 100 based on the edit distance, where 0 indicates no similarity and 100 indicates an exact match.
Use Case: Useful for quantifying the similarity between names, especially when determining thresholds for matches.
Example:
DECLARE @QueryString NVARCHAR(256)='Aron' SELECT P.FirstName,EDIT_DISTANCE_SIMILARITY(p.FirstName, @QueryString) as ED_Distance FROM dbo.Persons AS P
Result:
FirstName | SearchString | Similarity |
---|---|---|
Arun | Aron | 75 |
A similarity score of 75 suggests a high degree of resemblance between the two names.
Another example:
FirstName | SearchString | Similarity |
---|---|---|
Alok | Aron | 0 |
Here score 0 indicates no similarity between search string and name.
3. JARO_WINKLER_DISTANCE
Description: Calculates a distance metric between two names using the Jaro-Winkler algorithm, which gives more weight to names that match from the beginning for a set prefix length.
Use Case: Effective for comparing shorter names, where common prefixes are significant.
Example:
DECLARE @QueryString NVARCHAR(256)='chris' SELECT P.FirstName ,JARO_WINKLER_DISTANCE(p.FirstName, @QueryString) as JW_Distance FROM dbo.Persons AS P WHERE JARO_WINKLER_DISTANCE(p.FirstName, @QueryString)<.3
Result:
FirstName | JW_Distance |
Kriss | 0.2666667 |
Christoph | 0.2518519 |
Christina | 0.2518519 |
Christian | 0.2518519 |
Christopher | 0.2787879 |
Chris | 0.1333334 |
Please note that a lower distance value indicates higher similarity.
4. JARO_WINKLER_SIMILARITY
Description: Returns a similarity score between 0 and 1 using the Jaro-Winkler algorithm, where 0 indicates no match and 1 indicates an exact match.
Use Case: Suitable for applications requiring a normalized similarity score, particularly when matching names.
Example:
DECLARE @QueryString NVARCHAR(256)='chris' SELECT DISTINCT P.FirstName ,JARO_WINKLER_SIMILARITY(p.FirstName, @QueryString) as JW_Similarity FROM dbo.Persons AS P WHERE JARO_WINKLER_DISTANCE(p.FirstName, @QueryString)<.3
Result:
FirstName | JW_Similarity |
Christoph | 0.748148 |
Kriss | 0.733333 |
Christian | 0.748148 |
Chris | 0.866667 |
Christina | 0.748148 |
Christopher | 0.721212 |
A similarity score close to 1 indicates a strong resemblance between the names.
Case Sensitivity in Fuzzy Matching
These functions are case-sensitive, meaning that variations in capitalization can change similarity scores significantly. For example, changing 'Aron' to 'aron' can increase the edit distance, leading to different results. If case-insensitive comparisons are required, one approach is to convert both the input and stored values to the same case using LOWER() or UPPER(), like this:
DECLARE @QueryString NVARCHAR(256)='aron'; SELECT P.FirstName, EDIT_DISTANCE(LOWER(P.FirstName), LOWER(@QueryString)) AS ED_Distance FROM dbo.Persons AS P WHERE EDIT_DISTANCE(LOWER(P.FirstName), LOWER(@QueryString)) < 3;
This ensures that case differences do not impact the similarity score., meaning that variations in capitalization can lead to different results. For example, changing 'Aron' to 'aron' can significantly affect the similarity score.
Example:
DECLARE @QueryString NVARCHAR(256)='aron' SELECT P.FirstName, EDIT_DISTANCE(P.FirstName, @QueryString) AS ED_Distance FROM dbo.Persons AS P WHERE EDIT_DISTANCE(P.FirstName, @QueryString) < 3;
Result:
FirstName | SearchString | Distance |
---|---|---|
Arun | aron | 2 |
Since 'Aron' and 'aron' have different cases, SQL Server treats them as more distinct, leading to a higher edit distance.
Choosing the Right Function
- EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY: Best for scenarios where the exact number of edits or a percentage-based similarity is needed. Suitable for longer names where specific differences are important.
- JARO_WINKLER_DISTANCE and JARO_WINKLER_SIMILARITY: Preferable for shorter names, where common prefixes are significant. These functions are more sensitive to similarities at the beginning of names.
By leveraging these functions, developers can implement robust fuzzy matching and name comparison operations directly within SQL Server, enhancing data quality and user experience. Experimenting with these functions on different datasets and adjusting similarity thresholds can help optimize results for various use cases.