Need Fast ''Contains'' Search!

  • I've got an 11 million+ row table with demographic information about various businesses.  The application I am working on has a search feature that lets the users return rows from the table.  One of the criteria we let them specify is the Name of the business.  We have an index on the Name column, and up until now it's worked pretty well because all of the searches on that column have been 'begins with' searches using a dynamically generated LIKE <criteria>% in the where clause.

    The users are now asking for a 'contains' search in addition to the 'begins with'.  It's easy enough to implement, changing the LIKE to use %<criteria>%, but the performance is terrible because doing that causes a table scan of the 11 million rows!

    Can anyone advise me on a technique to create a faster performing search on a varchar field that has a leading wild card?  Any and all advice is welcome.  Thanks in advance.

    Joe

  • Very good question.

    I would venture to say, that the answer to this question is not a piece of sql code. However, maybe somebody will know rightout of SQL text indexing features.

    You may want to consider is creation of additional indeces. These indeces will be built by parsing the business names.

    Then the name to be searched for, would have to parsed as well and the pieces identified.

    and then instead of just one query, multiple queries would have to be written that use these indeces and search for the pieces. The pieces are usually ordered and the queries would search in that order too.

    My thinking is based on the fact that a business names is formed from more atomic words.

    For example, they may have prime words or class words and various other qualifier words, owners name, etc.

    So, the additional would be the business class name index or the prime word that describes the business, etc.

    Let's keep talking and see what comes up.

    Ben

  • Joseph,

    You might want to consider using SQL Server 2000 Full-text Search (FTS) and the CONTAINS or FREETEXT predicates. Specificly, search the BOL for "full text" (using the double quotes) and you will find examples of how to first create a FT Index, run a Full Population and then execute FTS queries, such as:

    select * from pub_info where contains(*, '"book*"')

    The above query will return all rows from the FT-enabled and populated pubs database table pub_info that contain book, books, booking, etc. However, this alone will not satsify your searchers request for a faster searching for "*ook" as SQL FTS only supports word-based trailing wildcards and not a precceeding "*" wildcard search as T-SQL LIKE does. However, as you discovered with T-SQL LIKE, it does support the preceeding "%" wildcard, but performs poorly as such a T-SQL LIKE query will always perform a table or index scan, regardless of the presence of a covering index.

    Grasshopper, what are you talking about? Could you provide specific examples? Without examples, you are just confusing the issue.

    Regards,

    John

    SQL Full Text Search Blog

    http://spaces.msn.com/members/jtkane/

     


    John T. Kane

  • I have had to implement something like this on a MSDE, where FT search is not included.

    I then added a column 'ReverseName' thats stores select reverse(Name). Whenever a search condition starts with a wildcard, I reverse the search string and search the ReverseName column instead. It may work if you have appropriate indexes in place!

    Ola

     

     

  • You are correct that SQL FTS queries using CONTAINS or FREETEXT are not supported with SQL MSDE 2000, so a pure T-SQL solution is your only option. However, with a table that contains 11 million rows, I have to think that you are coming close to the 2GB database size limit for MSDE 2000.

    Furthermore, I don't think that just reversing the text of your Name column and storing it as ReverseName (again, increasing the table size and getting closer to the 2 GB database size limit) will provide the solution you are looking for without having to provide both the leading and trailing T-SQL LIKE wildcard "%" symbols in your T-SQL LIKE query. If you cannot elimate the leading T-SQL LIKE wildcard "%", then the MSDE 2000 query optimizer will ALWAYS select a table or index scan on your 11 million row table.

    I've done a simple test using the pubs database and authors table storing the reverse of the column au_fname via  reverse(au_fname) and then reversing a test user string "ann" as there are two first names that start with ann in this table, i.e., ann and anne. I tested both leading and trailing T-SQL LIKE wildcard "%" symbols and got different results, as you can see below:

    use pubs

    go

    SELECT au_id, au_lname, au_fname, reverse(au_fname) as au_fname_rev

     INTO Authors_search

      FROM authors

    go

    -- Find all authors with the first name 'ann' using trailing %

    select * from Authors_search where au_fname like 'ann%'

    /* -- returns:

    au_id       au_lname                       au_fname             au_fname_rev        

    ----------- ------------------------------ -------------------- --------------------

    427-17-2319 Dull                           Ann                  nnA

    899-46-2035 Ringer                         Anne                 ennA

    (2 row(s) affected)

    */

    -- Find all authors with the first name 'ann' with leading & trailing wildcards

    select * from Authors_search where au_fname like '%ann%'

    /* -- returns:

    au_id       au_lname                       au_fname             au_fname_rev        

    ----------- ------------------------------ -------------------- --------------------

    427-17-2319 Dull                           Ann                  nnA

    899-46-2035 Ringer                         Anne                 ennA

    (2 row(s) affected)

    */

    -- using the reversed first name column (au_fname_rev), reverse the search word to find

    -- all author's first names that start with 'ann' using both leading and trailing wildcards

    -- using leading "%" wildcard with LIKE

    declare @search_word varchar(20), @rev_search_word varchar(20)

    set @search_word = 'ann'

    set @rev_search_word = reverse(@search_word)

    select * from Authors_search where au_fname_rev like '%'+@rev_search_word+'%'

    /* -- returns:

    au_id       au_lname                       au_fname             au_fname_rev        

    ----------- ------------------------------ -------------------- --------------------

    427-17-2319 Dull                           Ann                  nnA

    899-46-2035 Ringer                         Anne                 ennA

    (2 row(s) affected)

    */

    -- using the reversed first name column (au_fname_rev), reverse the search word to find

    -- all authors first names that start with 'ann' using both leading and trailing wildcards

    -- using NO leading "%" wildcard with LIKE

    declare @search_word varchar(20), @rev_search_word varchar(20)

    set @search_word = 'ann'

    set @rev_search_word = reverse(@search_word)

    select * from Authors_search where au_fname_rev like @rev_search_word+'%'

    /* -- returns:

    au_id       au_lname                       au_fname             au_fname_rev        

    ----------- ------------------------------ -------------------- --------------------

    427-17-2319 Dull                           Ann                  nnA

    (1 row(s) affected)

    */

    As you can see from this simple example that only by using both leading and trailing T-SQL LIKE wildcard "%" symbols and reversing the user's input string, can you find the first names ann and anne using a reversed column. There is a non T-SQL LIKE solution to this problem, but it will require defining and creating mutiple tables and parsing your 11 million row table's column text into unique word tokens, but this solution will most likely casuse your database to exceed the 2 GB limit of MSDE 2000. Let me know if you're interested in such a solution via email: jt-kane at comcast dot net.

    Regards,

    John

    SQL Full Text Search Blog

    http://spaces.msn.com/members/jtkane/

     


    John T. Kane

  • John (and everyone else),

    Thanks so much for your replies.  I will look into the FTS and Contains() functionality to see if that will help.  BTW, John, the 11 million row table I am working with is on a full version of SQL 2000, it was someone else in this thread that's using MSDE.

    If there are any other ideas, I'd love to hear them.  Thanks again to you all.

    Joe

  • John,

    Obviously, FT search is the proper choice.

    When reading Joseph's first post again, I realize he needs to put BOTH leading and trailing wildcards. My suggestion fits, when there is only a leading wildcard. SQL Server would then use an index if the table is "big enough" etc.

    Number 4 of your queries : select * from Authors_search where au_fname_rev like @rev_search_word+'%'

    finds all author names that ENDS with 'ann'. This is where the reversed word index can be useful.

  • Can the following work...

    Select * From Company WHERE

    CHARINDEX ('ABC',Name) > 0

    NOTE: if possible put the clustered index on the 'Name' column

  • I tried the CHARINDEX() route but it wasn't really fast enough.  Thanks for the suggestion though.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply