1.5 Million Rows. How to Optimize Query?

  • Hi guys,

    I am asking this question on behalf of a friend. I have little knowledge of SQL 2005 but my friend is quite knowledgeable, although this is the first time he is dealing with large database for a client. So here's the story.

    His client has a database containing 1.5 million books. Now he is setting up a website which will enable users to search books. Searching by ISBN is no problem as it only takes 1 second. The problem is, searching by Title takes more than 20 seconds, which is unacceptable. My friend has only done smaller database and he just recently thought of implementing indexing and now looking for other ideas.

    Each row contains book details such as Title, Author1, Author2, Author3, Publisher, Publication Date, ISBN, etc.

    Can anyone who are more experienced in doing large database share with me some design ideas? His client is aiming for 8seconds or less.

    Thanks in advance!

  • do you have any idea what form the queries are taking? Is he trying to offer keyword-based searching of the titles? Exact matches? Topical matches? Phrase-matching?

    Knowing what types of queries are allowed, along with the DB structure would be useful. And since we're talking speed - the indexing scheme would be critical to know as well.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • As Matt mentioned, there is lots to look at. The specific query will matter, but indexing is probably more important. Check for indexes on the title or other relevant fields.

    Also, author1, author2, etc isn't normalized and would increase the time for searching by author. You've have to search for "Steve Jones" in both author1 and author2, using an OR clause.

  • I agree with Matt that there is not a lot of advice that can be given with the generality of the question. Steve also makes an accurate point, that if this a "wide" table with each author in its own column will reduce performance as well.

    Here a few general tips:

    1. Determine if the data is frequently updated or static. Be more aggressive with indexes if static as it will improve query performance and you do not have to be as concerned with insert/update/delete performance.

    2. Determine what columns are going to be most used for searches and prioritize indexing those columns. I would venture to guess that most searches will be by title and\or author and then genre.

    3. If using Like in the query I recommend making the search a "Starts With" search as leading wildcards cause scans while trailing wildcards still allow for seeks.

    4. Limit data returned by the search so you are better able to create "covering" indexes and reduce volume over the network. I would limit to Title, lead author, ISBN, publisher, and publication date with drill-down based on ISBN.

    Really indexing and performance is as much art as science.

  • If he already has a design that has 1.2 million rows and he's "considering indexing" then he's way behind the eight-ball already. I'd suggest focusing on one thing, clustered indexes. I suspect, based on the limited information provided, that there will be either no clustered indexes at all or all the clusters will be on the PK which is probably an identity value and not used in most of the queries. Spend time figuring out how best to distribute the data across the cluster. What parts of the data are most likely to benefit from types of grouping that clusters provide. Don't hesitate to use a compound index, multiple columns, as the cluster.

    If you want detailed help, execution plans, the query itself and as much of the data model as can be released into the public domain will all really make it easier to help.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks guys for all the response.

    I just receive some info from my friend about how to optimize speed and wonder if this indeed can work. Anyway, the table is already narrowed just to contain primary key which is the ISBN and title. It is search by Title.

    eka: The solution is to create index of words that belongs to a book which include words in title, author etc. One assumption using this solution is , user have to enter the search text word in complete in which case we do exact match for the words and identify all books related to words.

    In addition creating indexes for the indexes in database will improve the performance greatly.

    But problem, is creating such index by parsing through all books title and author to be done using some external program This program will do complete scan of ttlingv2 table and generate word index.

    Using this index in 'keyword' or 'title' search will be fast as we looks for complete word and not part of. But you might see that creating word of indexes for all title is going to be large and not sure whether we will face any other problem because of that.

    Writing the program to create word index takes time and also current sql query will all change as it need to be rewritten to use this word index.

    Is it overwhelming? Did I convey you what I meant?

    Maz: yes

    i think what you are saying is to extract each word from the title, am i right?

    eka: Yes. All word leaving articles and other common words.

    So there will be one separate table to maintain word index and this will be reloaded every time when we load the db with latest data from ingram.

    Maz: how sure are you that this will increase speed?

    and how long will it take to implement?

    eka: This table will be indexed and when we search for the word, we do exact match for the word and not the part of the word/ broken down text. The part of text is what we do currently for title search and author.

    Will this work?

    Thanks!

  • Tell your friend to look up full-text indexing. There's no need to reinvent the wheel. The built-in functionality of Full-text indexing will do just that: no need for any external program, etc....

    As to his indexes: having the title as the second column in the clustered index will do exactly nothing for him when he's searching by title.

    That means the entire table is being scanned for the title searches, and thus the slowness. At best he would need a second index, but still - it doesn't sounds like he's doing exactly title matches, so traiditional indexes aren't going to help. Full-text indexing is the way there.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • So, if I understand this correctly, the plan is to break down a title into it's component words, store each word in seperate table with a link back the title table and then perform any searches against this "word" table? What if the user types in two words?

    It sounds like tons of work to avoid using the SQL Server Full-Text. Why not pursue that?

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks to Matt and Grant!

    Using the Full-text index, my friend has been able to reduce the query speed by more than half.

    But there was a problem. The time taken is still quite long and he is uncertain about some options since this is the first time he's using full-text.

    When I was created full-text index, it asked for 'track change option' to be set manual, automatic, do-not. I set as do-not. I am not sure whether this makes any difference. Also the index got created in no time, I am in doubt whether if at all it got created though there is improvement in search time.

    Thanks!

  • To determine what the queries are doing your friend needs to look at the execution plans. That will identify where the time is being spent and suggest possible solutions.

    Also, I may be wrong, but I think he needs to update the indexes either as the changes occur or incrementally over time. Not updating them means as the data changes the index goes out of date.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Also - keep in mind that unless he also changed the query to USE the full-text index, chances are - it's not using it. Full-text uses different operators to do the querying, so you'd have to use things like CONTAIN, etc... to tap into it fully.

    Checking the exec plan like Grant mentioned will tell you if it's being used.

    Unless the table gets a LOT of updates - I'd set the Track changes to automatic. Stale indexes will make you return inaccurate info.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • This is the query:

    Select *

    from

    (

    select (top 500 row_number() OVER (ORDER BY(rank desc) as resnum, rank, description, table2.col, table3.col, table4.col

    from table1 inner join FREETEXTTABLE(table1, description, 'application') AS ft ON table1.id1 = ft.)

    LEFT JOIN table2 ON table1.id2 = table2.id1

    LEFT JOIN table3 ON table1.id3 = table3.id1

    LEFT JOIN table4 ON table1.id4 = table4.id1

    ) AS Results

    WHERE resnum BETWEEN 21 AND 30

    In the above query, table2, table3 table4 contains required supplementary data.

    Table 1 is where we do full text search for description.

    We retrieve only 10 records at a time to incorporate pagination and also in order to limit number of records that are handled on hand.

  • sherlockian21 (5/30/2008)


    This is the query:

    Select *

    from

    (

    select (top 500 row_number() OVER (ORDER BY(rank desc) as resnum, rank, description, table2.col, table3.col, table4.col

    from table1 inner join FREETEXTTABLE(table1, description, 'application') AS ft ON table1.id1 = ft.)

    LEFT JOIN table2 ON table1.id2 = table2.id1

    LEFT JOIN table3 ON table1.id3 = table3.id1

    LEFT JOIN table4 ON table1.id4 = table4.id1

    ) AS Results

    WHERE resnum BETWEEN 21 AND 30

    In the above query, table2, table3 table4 contains required supplementary data.

    Table 1 is where we do full text search for description.

    We retrieve only 10 records at a time to incorporate pagination and also in order to limit number of records that are handled on hand.

    You're joining a multi-million row table to a function?

    join FREETEXTTABLE(table1, description, 'application') AS ft ON table1.id1 = ft.)

    Depending on the function, this could be the number one cause of the problem. Is it a multi-statement function? You need to look at the execution plan. You're either going to see the code within the function expanded out as if it were simply a part of this select statement, or you're going to see a scan against the function. If you see the scan, there's a problem.

    EDIT:

    Ignore the stupid person and his stupid statement.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • You're joining a multi-million row table to a function?

    join FREETEXTTABLE(table1, description, 'application') AS ft ON table1.id1 = ft.)

    Is there a more efficent way to use Full text indexing?

    I am only new to the topic, but this method is the one used in BOL.

  • Whoops. I'm a moron.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply