How to best design a database schema for fast text searching

  • The schema that I am designing has several tables that contain fields for holding large amounts of ASCII text.  These tables will be searched very frequently.  My question is threefold:

    1) When should I use a varchar, and when should I use a text field?

    2) Should I put these fields in their own tables?

    3) What kinds of things can I do to make searching as fast as possible?

    Please keep in mind that I am not a DBA, and don't understand a lot of the jargon. 

    Thanks.

  • If you're designing a new system, I would suggest having your company (no offense to you) get a True DBA to get the application off the ground on the right foot. If that's not possible...

    Use the datatype that's right for the data. That's it.

    If you have a large blob of text that exceeds 8,000 characters, then you have no choice but to go with a text field.

    What is the nature of the data you're going to store?

    If it is newspaper articles, book text, whitepapers, and such, then it's okay to store it in large text fields. You also might want to leverage Microsoft's full-text indexing, or a third-party product such as Turbo Search.

    If you are using the text data type, leave it in the same table, but do not select the "text in row" option; read Books Online for more info on this.
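
    For illustration, that option is controlled with sp_tableoption; this is a minimal sketch only, using a made-up dbo.Articles table name:

        -- Keep text data off the data page (do not enable "text in row"),
        -- so the row holds only a pointer to the text pages
        EXEC sp_tableoption N'dbo.Articles', 'text in row', 'OFF'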

    If it is names, addresses, etc., then the data needs to be normalized into a proper database table structure.

     

  • Thanks for the help, Ray.

    As far as text vs. varchar - most of these fields will be constrained to less than 8k.  In fact, most are less than 1k.  I was just wondering if searching might be more efficient with a text field instead of a varchar.  Also, are there other considerations for choosing one type over the other?  For example, I know about row size limitations.  But what about operators?  I know that there are some operators that you can use on varchar that you can't use on text.  Are there operators that you can use on text that you can't use on varchar?

    In regards to moving the fields to different tables, here is my line of thinking: the tables that contain the text/varchar fields also contain other information.  For example, one table holds data relating to a user's profile.  If these fields are being searched on a lot, will this slow down other queries that operate on the same table but don't necessarily use the text/varchar fields?  Is there enough of a performance gain to justify making my schema more complex?

    I've seen many articles about full text indexing, but most of them deal with how to use it, and not with what it actually is or does.  Can you give me a quick blurb on it?

    And as far as getting a True DBA (capitalized and everything), you have no idea how much I'd like to.  However, budget constraints being what they are.....

  • I'm going to post direct replies to your questions, vs. a grand strategy...

    1) When should I use a varchar, and when should I use a text field?

    A. You should use varchar when you want to store less than 8,000 bytes of text; anything longer should be stored in a text field.

    2) Should I put these fields in their own tables?

    A. Text stored in varchar columns should be placed in your base tables. If you are using text fields, then you can place that text in a "child" table with a pointer back to the "parent" or base table, maintaining a primary key to foreign key relationship between the two tables (see the sketch after these answers).

    3) What kinds of things can I do to make searching as fast as possible?

    A. Several things... If you use T-SQL LIKE, you should place a non-clustered index on the varchar fields and then use only a trailing wildcard, such as LIKE 'search%', which will return search, searched, searching, etc. This ensures that the non-clustered index will be used and therefore the text searching will be fast. However, if you have requirements for a leading wildcard (LIKE '%search%'), then SQL Server will not use the non-clustered index, and performance will be much slower relative to the number of rows in your base tables.  Additionally, you can implement "Full-Text Search" (FTS) to overcome the performance hit with T-SQL LIKE; however, FTS is a word-based search method vs. LIKE's pattern-search method, and you may get different results from your text searches.
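
    To make the above concrete, here is a rough sketch only; the table and column names (a user-profile table, as in your example) are placeholders, not a prescribed design, and the last query assumes a full-text catalog and index have already been set up on the Summary column:

        -- "Parent" table: short text stays in varchar columns
        CREATE TABLE dbo.UserProfile (
            UserProfileID int          NOT NULL PRIMARY KEY,
            UserName      varchar(100) NOT NULL,
            Summary       varchar(800) NULL   -- under 8,000 bytes, so varchar
        )

        -- "Child" table: anything over 8,000 bytes goes into a text column,
        -- linked back to the parent by a primary key / foreign key relationship
        CREATE TABLE dbo.UserProfileText (
            UserProfileID int  NOT NULL PRIMARY KEY
                               REFERENCES dbo.UserProfile (UserProfileID),
            Biography     text NULL
        )

        -- Non-clustered index so a trailing-wildcard LIKE can seek on the index
        -- (Summary is kept at varchar(800), under the 900-byte index key limit)
        CREATE NONCLUSTERED INDEX IX_UserProfile_Summary ON dbo.UserProfile (Summary)

        -- Trailing wildcard: the index can be used, so this stays fast
        SELECT UserProfileID FROM dbo.UserProfile WHERE Summary LIKE 'search%'

        -- Leading wildcard: the index cannot be used, so the table is scanned
        SELECT UserProfileID FROM dbo.UserProfile WHERE Summary LIKE '%search%'

        -- Word-based Full-Text Search alternative (different semantics than LIKE)
        SELECT UserProfileID FROM dbo.UserProfile WHERE CONTAINS(Summary, 'search')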

    If you decide to pursue SQL FTS, you should review all the links in "SQL Server 2000 Full-Text Search Resources and Links" at:

    http://spaces.msn.com/members/jtkane/Blog/cns!1pWDBCiDX1uvH5ATJmNCVLPQ!305.entry

    Regards,

    John

    SQL Full Text Search Blog

    http://spaces.msn.com/members/jtkane/


    John T. Kane

  • Thank you, John.  I will definitely be checking out those links.  So, tell me - is FTS as fast or faster than T-SQL LIKE, or are there situations when LIKE (combined with some sort of indexing) is faster?  I am leaning toward FTS because it seems to offer more flexibility, but if LIKE has a performance edge, I may consider using that instead.

  • You're welcome, Afro,

    Much like the "flight envelope" of an aircraft that defines the flight characteristics of that plane, i.e., at a given altitude and speed the plane is flyable, and outside that flight envelope it fails to fly... comparing T-SQL LIKE and FTS works the same way: at the lower end of the spectrum (a low number of rows [altitude] and a low number of words per row [speed]), LIKE will be faster than FTS. However, as you gain altitude (more rows) and speed (more words per row), FTS will be faster. The exact parameters depend upon your server's hardware and software configuration (see the above links) as well as the total number of rows you plan to FT Index. A SWAG at these numbers is that LIKE will most likely be faster with tables that have fewer than 100,000 rows, while FTS will be faster with more than 100K rows. Note, if you plan to FT Index more than 2 million rows, be advised that FT Indexing and FT Search will need to be tuned, e.g., read the above links!

    Enjoy!

    John

    SQL Full Text Search Blog

    http://spaces.msn.com/members/jtkane/


    John T. Kane

  • Thanks for the advice, John.  One last concern -

    New rows added to the table will not show up in a search until the full-text index is rebuilt, correct?  How resource-intensive is the task of rebuilding the index?  For example, say you had a table with 1M rows and two text fields on which you had a full-text index.  Assuming that you are running SQL2K on a reasonably fast machine, how long do you think it would take to rebuild the index?  I know that this is a pretty specific question, but I'm basically just looking for a ballpark figure here.
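
    For reference, by "rebuilt" I mean kicking off a new population on the catalog, something along these lines (the catalog name is made up):

        -- Full repopulation of a hypothetical full-text catalog
        EXEC sp_fulltext_catalog 'ProfileCatalog', 'start_full'

        -- Or an incremental population, which only picks up changed rows
        EXEC sp_fulltext_catalog 'ProfileCatalog', 'start_incremental'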
