SQLServerCentral Article

Implementing Fuzzy Search in SQL Server Using New Inbuilt Functions

,

In February 2025, Microsoft introduced new T-SQL functions in Azure SQL Database and SQL Database in Microsoft Fabric to enhance fuzzy string matching capabilities. You can visit Exciting new T-SQL features: Regex support, Fuzzy string-matching, and bigint support in DATEADD – preview for more details.

This article will deep dive into newly introduced functions and explain the differences and use cases of these newly introduced functions supporting fuzzy search.

The Need for Fuzzy String Matching in SQL Server

Handling user input in search operations can be challenging due to misspellings, variations in spelling, and inconsistent data entry. Traditional SQL queries using LIKE or exact matches often fail to retrieve relevant results when minor differences exist between stored data and search queries. This is particularly problematic in applications dealing with customer records, product names, or natural language searches, where users may not always enter exact strings.

Previously, developers relied on external solutions like Elasticsearch or Python-based fuzzy matching to overcome these challenges because SQL Server lacked built-in support for approximate string matching. These external solutions provided more advanced similarity metrics, phonetic matching, and flexible indexing methods that allowed for efficient fuzzy searches in large datasets. However, integrating them required additional infrastructure, performance tuning, and maintenance, making them complex and costly to implement. Other SQL-based solutions, both paid and free, have also been used for approximate string matching, such as:

  • SQL Server Full-Text Search (built-in but limited in fuzzy capabilities)
  • Levenshtein Function Implementation (custom implementations using T-SQL or CLR functions)
  • Soundex and Metaphone Algorithms (available in some SQL databases for phonetic matching. There are some implementations for SQL Server as well)
  • Commercial Solutions like Azure Cognitive Search or Oracle Text (providing more advanced fuzzy search capabilities)

However, these approaches often come with disadvantages such as increased complexity, performance overhead, limited customization, or additional licensing costs. Microsoft has now introduced built-in fuzzy search capabilities in SQL Server, eliminating the need for third-party tools and enabling more efficient approximate string matching within SQL queries.

These functions—EDIT_DISTANCE, EDIT_DISTANCE_SIMILARITY, JARO_WINKLER_DISTANCE, and JARO_WINKLER_SIMILARITY—allow developers to assess the similarity between strings directly within SQL queries.

Understanding the differences between these functions and their appropriate use cases is essential for effective data processing. Here's a quick comparison:

  • EDIT_DISTANCE: Measures the number of character edits (insertions, deletions, substitutions) needed to transform one string into another.
  • EDIT_DISTANCE_SIMILARITY: Returns a similarity percentage based on the edit distance, making it useful for ranking potential matches.
  • JARO_WINKLER_DISTANCE: Calculates similarity by considering matching characters and their transpositions, with more weight given to initial character similarity.
  • JARO_WINKLER_SIMILARITY: Similar to Jaro-Winkler Distance but returns a normalized similarity score between 0 and 1.

Each function has unique strengths, and selecting the right one depends on the type of data and use case. Let's explore each function in detail.

Setting Up a Test Environment for Performance Evaluation

To effectively evaluate the performance of these fuzzy string-matching functions, it's essential to test them on a significant dataset. A large dataset helps assess query performance under realistic conditions, revealing potential bottlenecks and ensuring scalability. It also provides a more accurate representation of real-world variations in data, allowing for meaningful comparisons of different fuzzy matching techniques. Below is a SQL script to create a Persons table and populate it with 1000 random records:

/*Setup*//*Create person table with only FirstName*/IF OBJECT_ID('Persons', 'U') IS NOT NULL DROP TABLE Persons;
    CREATE TABLE Persons (
        FirstName NVARCHAR(50)
    );
GO
/*Create random records*/SET NOCOUNT ON;
-- Create temporary table for sample first names
DECLARE @FirstNames TABLE (Name NVARCHAR(50));
-- Populate First Names
INSERT INTO @FirstNames VALUES 
('James'), ('Mary'), ('John'), ('Patricia'), ('Robert'), ('Jennifer'), ('Michael'), ('Linda'), 
('William'), ('Elizabeth'), ('David'), ('Barbara'), ('Richard'), ('Susan'), ('Joseph'), ('Jessica'), 
('Charles'), ('Sarah'), ('Thomas'), ('Karen'), ('Amit'), ('Raj'), ('Suresh'), ('Neha'), ('Priya'),
('Sanjay'), ('Anil'), ('Pooja'), ('Ravi'), ('Manoj'), ('Alok'), ('Deepak'), ('Dipak'), ('Sunil'), ('Rahul'),
('Vikram'), ('Arun'), ('Kiran'), ('Varun'), ('Anjali'), ('Meera'), ('Rohan'), ('Kartik'), ('Vishal'),
('Aarav'), ('Ishaan'), ('Divya'), ('Sneha'), ('Harsh'), ('Aditya'), ('Nikhil'), ('Gaurav'), ('Kavya'),
('Chris'), ('Christopher'), ('Christian'), ('Christina'), ('Christoph'), ('Kriss'),
('Darrick'), ('Derrick'), ('Dereck'), ('Dirk'), ('Eric'), ('Erik'),
('Jon'), ('Jonathan'), ('Johnny'), ('Jack'), ('Jackson'),
('Stephen'), ('Steven'), ('Stefan'), ('Steph'), ('Steve'),
('Catherine'), ('Katherine'), ('Kathryn'), ('Cathy'), ('Kathy'), ('Kathleen'),
('Zachary'), ('Zack'), ('Zac'), ('Zak'),
('Philip'), ('Phillip'), ('Phil'), ('Filip'),
('Mathew'), ('Matthew'), ('Matt'), ('Matthias');
-- Insert Random Data into Persons Table
DECLARE @Counter INT = 0;
DECLARE @RandomFirstName NVARCHAR(50);
WHILE @Counter < 1000
BEGIN
    -- Select random first name
    SELECT TOP 1 @RandomFirstName = Name FROM @FirstNames ORDER BY NEWID();
    
    -- Insert into the table
    INSERT INTO Persons (FirstName)
    VALUES (@RandomFirstName);
    SET @Counter = @Counter + 1;
END;
PRINT '100,000 records inserted successfully.';
SELECT * FROM dbo.Persons;

1. EDIT_DISTANCE

Description: Calculates the number of single-character edits (insertions, deletions, substitutions, or transpositions) required to transform one string into another.

Use Case: Ideal for identifying exact differences between two names, such as detecting typos or minor variations.

Example:

DECLARE @QueryString NVARCHAR(256)='Aron'
SELECT  P.FirstName,EDIT_DISTANCE(p.FirstName, @QueryString) as ED_Distance
FROM dbo.Persons AS P
WHERE EDIT_DISTANCE(p.FirstName, @QueryString) <3

Result:

FirstNameSearchStringDistance
ArunAron1

This indicates that only one edit (substitution of 'a' with 'o') is needed to transform "Aron" into "Arun".

2. EDIT_DISTANCE_SIMILARITY

Description: Returns a similarity score between 0 and 100 based on the edit distance, where 0 indicates no similarity and 100 indicates an exact match.

Use Case: Useful for quantifying the similarity between names, especially when determining thresholds for matches.

Example:

DECLARE @QueryString NVARCHAR(256)='Aron'
SELECT  P.FirstName,EDIT_DISTANCE_SIMILARITY(p.FirstName, @QueryString) as ED_Distance
FROM dbo.Persons AS P

Result:

FirstNameSearchStringSimilarity
ArunAron75

A similarity score of 75 suggests a high degree of resemblance between the two names.

Another example:

FirstNameSearchStringSimilarity
AlokAron0

Here score 0 indicates no similarity between search string and name.

3. JARO_WINKLER_DISTANCE

Description: Calculates a distance metric between two names using the Jaro-Winkler algorithm, which gives more weight to names that match from the beginning for a set prefix length.

Use Case: Effective for comparing shorter names, where common prefixes are significant.

Example:

DECLARE @QueryString NVARCHAR(256)='chris'
SELECT  P.FirstName
,JARO_WINKLER_DISTANCE(p.FirstName, @QueryString) as JW_Distance
FROM dbo.Persons AS P
WHERE JARO_WINKLER_DISTANCE(p.FirstName, @QueryString)<.3

Result:

FirstNameJW_Distance
Kriss0.2666667
Christoph0.2518519
Christina0.2518519
Christian0.2518519
Christopher0.2787879
Chris0.1333334

Please note that a lower distance value indicates higher similarity.

4. JARO_WINKLER_SIMILARITY

Description: Returns a similarity score between 0 and 1 using the Jaro-Winkler algorithm, where 0 indicates no match and 1 indicates an exact match.

Use Case: Suitable for applications requiring a normalized similarity score, particularly when matching names.

Example:

DECLARE @QueryString NVARCHAR(256)='chris'
SELECT DISTINCT P.FirstName
,JARO_WINKLER_SIMILARITY(p.FirstName, @QueryString) as JW_Similarity
FROM dbo.Persons AS P
WHERE JARO_WINKLER_DISTANCE(p.FirstName, @QueryString)<.3

Result:

FirstNameJW_Similarity
Christoph0.748148
Kriss0.733333
Christian0.748148
Chris0.866667
Christina0.748148
Christopher0.721212

A similarity score close to 1 indicates a strong resemblance between the names.

Case Sensitivity in Fuzzy Matching

These functions are case-sensitive, meaning that variations in capitalization can change similarity scores significantly. For example, changing 'Aron' to 'aron' can increase the edit distance, leading to different results. If case-insensitive comparisons are required, one approach is to convert both the input and stored values to the same case using LOWER() or UPPER(), like this:

DECLARE @QueryString NVARCHAR(256)='aron';
SELECT P.FirstName, EDIT_DISTANCE(LOWER(P.FirstName), LOWER(@QueryString)) AS ED_Distance
FROM dbo.Persons AS P
WHERE EDIT_DISTANCE(LOWER(P.FirstName), LOWER(@QueryString)) < 3;

This ensures that case differences do not impact the similarity score., meaning that variations in capitalization can lead to different results. For example, changing 'Aron' to 'aron' can significantly affect the similarity score.

Example:

DECLARE @QueryString NVARCHAR(256)='aron'
SELECT P.FirstName, EDIT_DISTANCE(P.FirstName, @QueryString) AS ED_Distance
FROM dbo.Persons AS P
WHERE EDIT_DISTANCE(P.FirstName, @QueryString) < 3;

Result:

FirstNameSearchStringDistance
Arunaron2

Since 'Aron' and 'aron' have different cases, SQL Server treats them as more distinct, leading to a higher edit distance.

Choosing the Right Function

  • EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY: Best for scenarios where the exact number of edits or a percentage-based similarity is needed. Suitable for longer names where specific differences are important.
  • JARO_WINKLER_DISTANCE and JARO_WINKLER_SIMILARITY: Preferable for shorter names, where common prefixes are significant. These functions are more sensitive to similarities at the beginning of names.

By leveraging these functions, developers can implement robust fuzzy matching and name comparison operations directly within SQL Server, enhancing data quality and user experience. Experimenting with these functions on different datasets and adjusting similarity thresholds can help optimize results for various use cases.

Rate

4.83 (6)

You rated this post out of 5. Change rating

Share

Share

Rate

4.83 (6)

You rated this post out of 5. Change rating