October 30, 2009 at 12:19 pm
Currently I am working on a database table that has a very large text field (we will call it ExtractedText). The query has to look for wildcard searches through the whole text, and determine how many records each term hits on.
Currently the basis of the query looks like this:
select ID FROM TABLE WHERE ExtractedText like '%BILL%' UNION ALL
select ID FROM TABLE WHERE ExtractedText like '%Tommy%' UNION ALL
select ID FROM TABLE WHERE ExtractedText like '%Willy%'
NOTE: Each term has to be run on each record, because I need the number of records each term hits on.
We are doing this for over 300 different terms on a large database, so days of running are not uncommon.
I am looking for ways to speed this up.
I have recently heard of Full Text indexes which may be a solution to the problem, but the results would have to be exactly the same as if I ran the above query. Would that be the case with a Full Text Index search?
Are there other techniques I can use to speed this up?
October 30, 2009 at 6:55 pm
huston.dunlap (10/30/2009)
Currently I am working on a database table that has a very large text field (we will call it ExtractedText). The query has to look for wildcard searches through the whole text, and determine how many records each term hits on.Currently the basis of the query looks like this:
select ID FROM TABLE WHERE ExtractedText like '%BILL%' UNION ALL
select ID FROM TABLE WHERE ExtractedText like '%Tommy%' UNION ALL
select ID FROM TABLE WHERE ExtractedText like '%Willy%'
NOTE: Each term has to be run on each record, because I need the number of records each term hits on.
We are doing this for over 300 different terms on a large database, so days of running are not uncommon.
I am looking for ways to speed this up.
I have recently heard of Full Text indexes which may be a solution to the problem, but the results would have to be exactly the same as if I ran the above query. Would that be the case with a Full Text Index search?
Are there other techniques I can use to speed this up?
1) Your query has no possible way of telling you how many records each term hit on. You will simply get a single collection of IDs that are NOT correlated with the term each hit on.
you can fix this (AND eliminate 2 table/index scans) in the above example with something such as this:
select sum(case when ExtractedText like '%BILL%' then 1 else 0 end) as billcount,
sum(case when ExtractedText like '%Tommy%' then 1 else 0 end) as Tommycount,
sum(case when ExtractedText like '%Willy%' then 1 else 0 end) as Willycount
FROM TABLE
WHERE ExtractedText like '%Tommy%'
OR ExtractedText like '%Willy%'
OR ExtractedText like '%BILL%'
2) if you index the big column you can at least do an index scan instead of a table scan. This could save a good bit of time (at the expense of a huge index) if the table has lots of other columns
3) full text indexing should be able to help with this query and you should set up a quick proof of concept
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 3, 2009 at 12:10 pm
Index scans will never scale, performance will always be directly related to table size, and there is no way I have found not to scan when using forward and backwards wild card searching. If there is anyway you can only wild card on the backside, you have a chance to seek the index. Many times this is not possible per requirements.
Again look at full text indexing, I don't feel there is a wealth of information out there, but there is enough of it to get you where you are doing. If you have a MS support contract the FTS group was very helpful to us with many questions we had. Especially look at 2008, where FTS is now not a separate component but integrated into the engine itself. You will get really good performance, but you in fuzzy search land which is an odd land to be in.
Here is a good one:
http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features/
Mr. Cotter always has some great insight on this topic.
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply