February 28, 2011 at 11:16 am
#1 This will be faster than full text search for large loads (hundreds of queries per minute) on a db with millions of records/documents to search.
I did this exact thing in Oracle in 1999. We loaded the entire database of U.S. - Books In Print from the Library of Congress and built a search engine on that for a website and a cash register slash inventory management system for small book sellers.
Full text search was too slow for a high load environment, and I'd bet money on the fact that SQL Server is the same. You just can't push hundreds of full text searches per minute with a high number of documents. The method here will be faster.
------------------------------------------------
#2 This method doesn't allow wildcards
We dynamically built a SQL query for each keyword. This allows for the wildcards, and for special processing to allow wildcard suffixes. (See #3)
------------------------------------------------
#3 By reversing the kewords in the keywords table, you get a new feature no database has (without a full table scan)
One additional thing I did is to reverse all the keywords in the keywords table. Since Wildcard LIKE searches were allowed on the keywords table (like in full text search), having each keyword spelled backwards allowed a wonderful feature:
Search keywords: "hold her %ly"
[search for cheesy love novels: "hold her gently", "hold her gingerly", "hold her tightly"]
Normally for a wildcard search to use an index, the wildcard can't be at the begining (ex: giv%, matching give, giving, giver), otherwise a full table scan results on the keyword table. With a keyword table containing hundreds of thousands of rows that slows things down considerably. Reversing the keywords in the keyword table allows the keyword parser to detect wildcards and construct a where clause that includes the keyword reversed:
select <cols> from document where docid in (
select docid from dockeyword where keywordid IN (SELECT keywordid from keyword where keyword LIKE 'hold')
union
select docid from dockeyword where keywordid IN (SELECT keywordid from keyword where keyword LIKE 'her')
union
select docid from dockeyword where keywordid IN (SELECT keywordid from keyword where reversekeyword LIKE 'yl%' --note keyword spelled backwards)
)
There are number of different ways to do the query above. I just presented the easiest to grok. One other way that is interesting is to do a group by:
select <cols> from document d where docid in (
select docid from dockeyword dkw
inner join keyword kw on dkw.keywordid = kw.keywordid
where keyword = 'hold' or keyword = 'her' or reversekeyword like 'yl%'
group by docid
having count(docid) = 3
select
February 28, 2011 at 11:28 am
Thanks everyone for your comments. I should have mentioned in the article about Full Text searches in SQL Server.
As one poster mentioned, it's kind of a pain in the neck to start with. It's a can of worms that doesn't necessarily need to be opened for just one simple application where you know exactly the type of user input you're going to get - i.e. very structured user input. In 12 years of programming SQL Server I have only had 1 client that even had Full Text enabled. This auto parts dealer is really huge and the didn't have it.
This technique certainly won't handle every kind of search and wouldn't be appropriate for many applications. For one, if the data has a lot of noise punctuation and you don't really know what all is there so you can't predict it.
As I mentioned in the article, this is for item/product type descriptions. This would be in a very controlled environment where most everything is known about the noise words ahead of time.
In any case, I'm grateful for the discussion.
Todd Fifield
February 28, 2011 at 11:33 am
tfifield (2/28/2011)
Thanks everyone for your comments. I should have mentioned in the article about Full Text searches in SQL Server.As one poster mentioned, it's kind of a pain in the neck to start with. It's a can of worms that doesn't necessarily need to be opened for just one simple application where you know exactly the type of user input you're going to get - i.e. very structured user input. In 12 years of programming SQL Server I have only had 1 client that even had Full Text enabled. This auto parts dealer is really huge and the didn't have it.
It certainly won't handle every kind of search and wouldn't be appropriate for many applications. For one, if the data has a lot of noise punctuation and you don't really know what all is there so you can't predict it.
As I mentioned in the article, this is for item/product type descriptions. This would be in a very controlled environment where most everything is known about the noise words ahead of time.
In any case, I'm grateful for the discussion.
Todd Fifield
I'm still interested to know how you're handling plurals, tradenames, abbreviations, etc.?
February 28, 2011 at 11:34 am
quickdraw (2/28/2011)
#3 By reversing the kewords in the keywords table, you get a new feature no database has (without a full table scan)One additional thing I did is to reverse all the keywords in the keywords table. Since Wildcard LIKE searches were allowed on the keywords table (like in full text search), having each keyword spelled backwards allowed a wonderful feature:
...
Normally for a wildcard search to use an index, the wildcard can't be at the begining (ex: giv%, matching give, giving, giver), otherwise a full table scan results on the keyword table. With a keyword table containing hundreds of thousands of rows that slows things down considerably. Reversing the keywords in the keyword table allows the keyword parser to detect wildcards and construct a where clause that includes the keyword reversed:
select <cols> from document where docid in (
select docid from dockeyword where keywordid IN (SELECT keywordid from keyword where keyword LIKE 'hold')
union
select docid from dockeyword where keywordid IN (SELECT keywordid from keyword where keyword LIKE 'her')
union
select docid from dockeyword where keywordid IN (SELECT keywordid from keyword where reversekeyword LIKE 'yl%' --note keyword spelled backwards)
)
That is a great suggestion, reversing keywords. Nice.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
February 28, 2011 at 11:41 am
The product I support actually works somewhat similar to the way the article says to do it. One of the major difference is that the list of excluded words is coded into the proc, not stored in a table. Another is that we do a sequential check on the keywords. IE - Get a list of items with keyword 1 then compare to a list of items with keyword2 etc. It'll be interesting to see what kind of performance bump happens from doing the check in a set based manner instead of procedurally. We do have a benefit in that the list of terms is delivered to us on a quarterly basis so we get updates already split out so no work is required for us to do that.
February 28, 2011 at 12:07 pm
We added a synonyms table that cross referenced common abbreviations, common misspellings, and plural forms. We also built our own table of stop words to strip from the list of keywords before beginning the search.
February 28, 2011 at 1:37 pm
Useful solution if FTS is not enabled or or is inaccessible for other reasons. Many ISPs in the beginning did not offer FTS, for example.
When designed properly, FTS can be the fastest solution with stemming, stop words, synonyms, etc., that can be made into a sophisticated natural language feature. FTS indexes can be offloaded to faster SSDs or similar fast access devices. Lastly, FTS can be fully automated. The only problem I had was migrating an FTS from SQL 2005 to SQL 2008. Since I had all the scripts, I just recreated that part.
March 1, 2011 at 2:21 pm
quickdraw (2/28/2011)
#1 This will be faster than full text search for large loads (hundreds of queries per minute) on a db with millions of records/documents to search....
Full text search was too slow for a high load environment, and I'd bet money on the fact that SQL Server is the same. You just can't push hundreds of full text searches per minute with a high number of documents. The method here will be faster.
------------------------------------------------
#2 This method doesn't allow wildcards
We dynamically built a SQL query for each keyword. This allows for the wildcards, and for special processing to allow wildcard suffixes. (See #3)
------------------------------------------------
#3 By reversing the kewords in the keywords table, you get a new feature no database has (without a full table scan)
One additional thing I did is to reverse all the keywords in the keywords table. Since Wildcard LIKE searches were allowed on the keywords table (like in full text search), having each keyword spelled backwards allowed a wonderful feature:
Search keywords: "hold her %ly"
[search for cheesy love novels: "hold her gently", "hold her gingerly", "hold her tightly"]
re: #1: We've been using FTS in SS2005 and SS2008 on moderate sized tables (about 4M items) with high concurrency and it performs very well. The key (for us) was to isolate the search results from the rest of query. For example:
SELECT ... FROM A
JOIN (SELECT key FROM A WHERE CONTAINS( search )) as S on S.key = A.key
re: #2: The search predicate syntax is a bit involved so we normalize the user supplied text before using it as a search expression. For example, 'IRON MAN' becomes '("IRON*" and "MAN*") or "IRONMAN*"' and that meets most of business needs. We also strip out any punctuation and support special prefixes that the user app can send to modify the search behavior (e.g.: '>IRON MAN' becomes "IRON MAN*" which looks for an item with a word starting with IRON and any following word starting with MAN). We also let "power users" enter raw search expression.
re: #3: Our full text indexes are based on indexed views which merge several columns together and also clean up the indexed text.
CREATE VIEW IndexedConsultant WITH SCHEMA_BINDING
as SELECT *, replace(firstName+ ' '+ lastName+ ' '+ roleTitle+ ' '+ contractReviews + placementNotes,'+',' ')) as combinedText
FROM Consultant ...
This improved search behavior and performance for our app since we only needed to search one column and could locate items where the matches are not necessarily from a single column. In the following example any consultant's named Jones who are Engineers will be found along with any consultants who might have been placed at Jones Engineering.
SELECT as consultantId FROM CONTAINSTABLE(IndexedConsultant,dbo.fNomalizeSearch('Jones Engineer'))
You could add your reversing logic to the column definition of the indexed text.
All of our full text searches are handled by UDFs with no dynamic SQL required. The results can then be joined back to their base tables and the logic easily called by user app, included in stored procedures, or included in more complex searches. So, I suggest you should take another look at the built-in FTS capabilities because it would be difficult to match it's flexibility, speed, and code-clarity.
March 1, 2011 at 3:36 pm
Thanks antonio. You've given some great info.
Currently I'm using dynamic SQL, but intended to convert it one day. Now I know it can be done. Awesome!!!
Your use of FTS on a calculated field in a view is a new concept to me. Contains can query multiple separate fields or a whole table, but requires more setup. Did you test performace?
Any chance youd share your normailze query function. I do the same but simplistically. All punctuation except numberics and hyphens are converted to space, then split on spaces to join all the terms with AND. My users search for "IT", so I expand it to "information technology" since its a stop word otherwise.
March 1, 2011 at 7:52 pm
@ Antonio,
Just so I can file this away in my brain for later:
What is your max sustained throughput (total users and max queries per minute)? How many CPUs?
Thanks.
March 1, 2011 at 7:56 pm
@ KermitTheRock,
Indexed views rock! Keep in mind though, you need SS Enterprise to get this feature. Most ISP licences are Standard so unless you paid the big nut for Enterprise you'll have to do something else like a computed column in the table itself.
March 2, 2011 at 3:39 pm
Actually, Indexed Views ARE available in the Standard edition of SQL Server. You just have to manage them yourself, using explicit query hints.
From http://msdn.microsoft.com/en-us/library/ms187864.aspx:
"Indexed views can be created in any edition of SQL Server 2008. In SQL Server 2008 Enterprise, the query optimizer automatically considers the indexed view. To use an indexed view in all other editions, the NOEXPAND table hint must be used."
Kevin
--
Please upgrade to .sig 2.0
March 3, 2011 at 1:48 pm
sqlnoobie (2/28/2011)
tfifield (2/28/2011)
I'm still interested to know how you're handling plurals, tradenames, abbreviations, etc.?
sqlnoobie,
I wouldn't use this technique if I had to handle abbreviations, trademarks, etc. This technique works very well for item/product descriptions in a controlled environment. The first case was item descriptions for wine and the second one was for item descriptions for auto parts. None of these had abbreviations or other shortened words.
The users would enter something like 'cabernet 2001' and expect to get all wines from 2001 that had the word cabernet in the descripton. Likewise a user would enter 'brake pads' and expect to see a list of all items with brake pads in the description.
Todd Fifield
March 3, 2011 at 2:23 pm
Todd,
This is why FTS is hard. splitting some text by whitespace, building a list of "words" and pointers back to the text isn't difficult.
Knowing when and how to group a some characters as a word is hard, detecting mispellings is hard. MSSQL handles the first using its parsing methods, but does not handle the second at all. Perhaps all queries need to be spell checked before going to the index.
March 4, 2011 at 12:59 pm
KermitTheRock (3/3/2011)
Todd,This is why FTS is hard. splitting some text by whitespace, building a list of "words" and pointers back to the text isn't difficult.
Knowing when and how to group a some characters as a word is hard, detecting mispellings is hard. MSSQL handles the first using its parsing methods, but does not handle the second at all. Perhaps all queries need to be spell checked before going to the index.
Kermit,
For the applications I was doing this for, misspellings and plural/singular weren't an issue.
Take wine snobs ordering on-line. If they type in blanc (French) they don't want to see blanco (Spanish). If they had a typo, they wouldn't get anything back in some cases and realize that they had misspelled the word and re-enter the search criteria. Since the search is so bloody fast, they haven't complained yet about not finding misspelled words.
Stores ordering auto parts. Some items are very different if plural. They type in brake drum (singular) and that's what they want to see - all items with brake drum in the description, which would be packaged as a single brake drum. This item is also sold as a set under a different item number. They type in brake drums and they see all items sold as a set.
In the auto parts application especially the users were very pleased since the query using the LIKE operator used to take over 30 seconds to complete and they would see both single items and kits if they typed brake drum. Using the technique in the article the average time was 1.5 seconds to return the item list for them to pick from.
Todd Fifield
Viewing 15 posts - 16 through 30 (of 38 total)
You must be logged in to reply to this topic. Login to reply