I've had several clients now that have requested a stored procedure that will return records based on key words entered by a user. There can be one or many key words and all have to match in order to match the item or product. I used to do this using a LIKE '%keyword1%' AND LIKE '%keyword2%' type of query using Dynamic SQL.
Using the LIKE operator like this on a large table can make for a long running query since no indexes can be used. A search query on an auto parts table with 1.2 million items took over 30 seconds to run for one of my clients. I developed a technique that would index each word in the item description in another table and use that for a search procedure. The average run time on that procedure is now 1.5 seconds.
In this article I'm going to use the AdventureWorks database to build such a table and show how to use it. Keep in mind that there isn't much performance gain in this example since AdventureWorks had just over 500 products in it. ( I would never bother with this technique on a table that small.)
There are 2 tables that you have to create: KeyWordItems and KeyWordsToExclude. The first table has the primary key of the product or item table - in this case ProductID from Production.Product. The second table contains words that you want to ignore. This could be 'a', 'an', 'the' and so forth. Another type of word to ignore in most cases would be any single character word, which would exclude such things as '&' or "#" if there are spaces on either side.
Here is the code to create and populate the KeyWordsToExclude table:
CREATE TABLE KeyWordsToExclude ( KeyWord VARCHAR(50) NOT NULL ) GO ALTER TABLE KeyWordsToExclude ADD CONSTRAINT PK_KeyWordsToExclude PRIMARY KEY CLUSTERED (KeyWord) GO -- Populate the KeyWordsToExclude table with whatever you don't want to -- use. INSERT INTO KeyWordsToExclude ( KeyWord ) SELECT 'A' UNION ALL SELECT 'AN' UNION ALL SELECT 'THE' GO
Here is the KeyWordItems table:
CREATE TABLE KeyWordItems ( KeyWord VARCHAR(50) NOT NULL , ProductID INT NOT NULL ) GO ALTER TABLE KeyWordItems ADD CONSTRAINT PK_KeyWordItems PRIMARY KEY CLUSTERED (KeyWord, ProductID) GO
In order to populate the KeyWordItems table, you should have Jeff Moden's table valued function - DelimitedSplit8KNEW. It is lightning fast and I'm including it here for simplicity (I hope Jeff doesn't mind).
CREATE FUNCTION dbo.DelimitedSplit8KNEW --===== Created by Jeff Moden (Prototype: Testing Still in Progress) --===== Define I/O parameters ( @pString VARCHAR(8000), @pDelimiter CHAR(1) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ), --10 E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100 E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000 cteTally(N) AS ( SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E4 ) SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY t.N), ItemValue = SUBSTRING(@pString,t.N+1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0),DATALENGTH(@pString)+1)-t.N-1) FROM cteTally t WHERE t.N BETWEEN 0 AND DATALENGTH(@pString) AND (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0) ; GO
Now we come to actually populating the KeyWordItems table. There are a few quirks in the AdventureWorks product names. Slashes are used to separate some words - e.g. 'LL Mountain Seat/Saddle'. Also some product names have commas after a word - e.g. 'Mountain Bike Socks, M'. This type of punctuation is not likely to be entered by a user searching by product name and so needs to be ignored. In the code below there are 2 REPLACE functions that will replace a '/' with a space (which makes it a separate word) and strip out commas.
SELECT RecID = IDENTITY (INT, 1, 1) , CAST(P.ProductID AS INT) AS ProductID, Split.ItemValue AS KeyWord INTO #PW -- ProductWords FROM Production.Product P CROSS APPLY (SELECT ItemValue FROM dbo.DelimitedSplit8KNEW(REPLACE(REPLACE(P.Name, '/', ' '), ',', ''), ' ')) AS Split WHERE LEN(ItemValue) > 1 -- Delete the words that are in KeyWordsToExclude (This could have been -- included -- above but done as a separate step to keep the SQL simple. In -- AdventureWorks there aren't any product names that will be excluded. DELETE PW FROM #PW PW INNER JOIN KeyWordsToExclude KWE ON PW.KeyWord = KWE.KeyWord -- Delete any product key words with the same word twice. Again -- AdventureWorks product names don't have the same word twice in them. -- Your data may differ. DELETE PW FROM #PW PW INNER JOIN (SELECT MIN(RecID) AS MinRecID, ProductID, KeyWord FROM #PW GROUP BY ProductID, KeyWord HAVING COUNT(*) > 1 ) AS X ON PW.ProductID = X.ProductID AND PW.KeyWord = X.KeyWord AND PW.RecID <> X.MinRecID -- Populate the KeyWordItems table. INSERT INTO KeyWordItems ( ProductID, KeyWord ) SELECT ProductID, KeyWord FROM #PW
It is VERY important to note that the SAME logic that is used to populate the KeyWordItems table must be used when handling the user input - i.e. the same punctuation handling, the same ignored words, the same restriction on no single letter words, etc. Otherwise you may end up where something that should match that won't match.
Once the key words table has been created it will have to be maintined either in a procedure or trigger when an update occurs to the column you are using for the words (item description, product name, etc.). If any change is made to that column the procedure or trigger should delete and re-add the key words using the same logic that was used to initially populate it.
Now for some actual queries.
Here is a typical query for 2 key words - usually created in Dynamic SQL - that would search the Products table.
SELECT ProductID, Name, Color, Size, ListPrice FROM Production.Product WHERE Name LIKE '%nut%' ANDName LIKE '%hex%'
This would return something like:
Here are the actual stats from SET STATISTICS TIME ON and SET STATISTICS TIME ON which was set before running the query above:
Table 'Product'. Scan count 1, logical reads 23, physical reads 3
, read-ahead reads 0, lob logical reads 0, lob physical reads 0
, lob read-ahead reads 0.
SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 270 ms.
The important point above is the table scan on the Product table. This is what the query plan for the above query looks like:
Here is the procedure that will do the search using the new key word tables:
CREATE PROCEDURE ItemSearch @KeyWords VARCHAR(256) AS SET NOCOUNT ON; -- We need to know how many words the user entered for the search. DECLARE @KWCount INT -- Temp table of passed in key words CREATE TABLE #KW ( RecIDINT IDENTITY(1, 1) , KeyWordVARCHAR(50)) -- Same function and same logic used to populate KeyWordItems table. INSERT INTO #KW ( KeyWord ) SELECT ItemValue FROM dbo.DelimitedSplit8KNEW(REPLACE(REPLACE(@KeyWords,'/',' '),',',''),' ') AS Split WHERE NOT EXISTS (SELECT 1 FROM KeyWordsToExclude KWE WHERE KWE.KeyWord = Split.ItemValue) -- Delete any duplicate words. Same as was done for the KeyWordItems -- table. DELETE KW FROM #KW KW INNER JOIN (SELECT MIN(RecID) AS MinRecID, KeyWord FROM #KW GROUP BY KeyWord HAVING COUNT(*) > 1 ) AS X ON KW.KeyWord = X.KeyWord AND KW.RecID <> X.MinRecID -- Number of words in the temp table. SELECT @KWCount = X.RecCount FROM (SELECT COUNT(*) AS RecCount FROM #KW ) AS X -- Create a little stub table from the temp table joined to the -- KeyWordItems table. Using this stub table avoids -- joining to the larger Product table multiple times and then -- aggregating the results. The GROUP BY also does a sort on ProductID -- so that this stub table and the main Product table are eligible -- for a MERGE JOIN. You may have to index it in order to get a -- MERGE in the query plan. -- The HAVING part means that the product name must match the #KW table -- the same number of times as the count of words. This produces the -- same effect as the AND operator in the original LIKE query. SELECT KWI.ProductID INTO #KWP -- Stub table of Products with the matching key words. FROM #KW INNER JOIN KeyWordItems KWI ON #KW.KeyWord = KWI.KeyWord GROUP BY KWI.ProductID HAVING COUNT(*) = @KWCount -- Return the requested Products SELECT P.ProductID, P.Name, P.Color, P.Size, P.ListPrice FROM #KWP KWP INNER JOIN Production.Product P ON KWP.ProductID = P.ProductID GO
Now we run the stored procedure for the same key words:
EXEC ItemSearch @KeyWords = 'nut hex'
This produces the same results as the original LIKE query. Here are the stats for the procedure:
Table 'Product'. Scan count 0, logical reads 78, physical reads 2
, read-ahead reads 0, lob logical reads 0, lob physical reads 0
, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 175 ms.
The important part of this is that there is no Scan count on the Product table. Here is what the actual query plan looks like:
Note that we get a Clustered Index Seek on the Product table instead of a scan.
As I mentioned before, since AdventureWorks only has just over 500 products, this technique doesn't do much. When I used it against an item table of auto parts of over 1.2 million the performance differences was astounding.
This technique is most useful for searches where all criteria (and AND condition) must be met in order to qualify for a match. I hope someone finds this useful.
Todd Fifield