Performance Question

  • I have a very large table called People, with a field (int) named PersonID serving as the primary key. There is also a field (int) named Status. The table has a non-clustered index on the Status field.

    Currently there are over 100,000 records in the table.

    The following query takes forever:

    select top 1 Name from People where Status = 0

    However, if I do this:

    select top 1 PersonID from People where Status = 0

    It comes back in a flash. Suppose it returns 94837. If I then do:

    select top 1 Name from People where PersonID = 94837

    This too executes in zero time.

    Can anyone explain this behavior?

    Why doesn't the original query work just as fast?

    Thanks,

    Jamie

  • I built a million row table as you described. I made it so that the Status column has only 5 different possiblities and verified the distribution as an average of 20,000 rows for each value of Status.

    All 3 queries return in such a short time that they all register as 0 milliseconds.

    The only way I can see things happening as you claim (first query taking forever), is if you have a huge rowcount differences between a small number of statuses or the table isn't actually a table... it's a view or a synonym pointing to a view.

    How long are we talking about when you say the first query takes forever? And is that query the actual query that takes forever or did you remove a variable from the where clause, because even with no indexes, the first query finds the answer in less than 15 milliseconds on my machine.

    Also, do you have at least Service Pack 2a Installed? I hear that earlier versions of 2k5 had some problems with performance here and there.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jamiejulius (3/12/2008)


    The following query takes forever:

    select top 1 Name from People where Status = 0

    However, if I do this:

    select top 1 PersonID from People where Status = 0

    It comes back in a flash.

    How long does this take?:

    select top 1 Name from People where Status = 0

    Order by Name

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • jamiejulius (3/12/2008)


    I have a very large table called People, with a field (int) named PersonID serving as the primary key. There is also a field (int) named Status. The table has a non-clustered index on the Status field.

    1) select top 1 Name from People where Status = 0 -- slow

    2)select top 1 PersonID from People where Status = 0 -- fast

    3) select top 1 Name from People where PersonID = 94837 -- fast

    Can anyone explain this behavior?

    Why doesn't the original query work just as fast?

    Thanks,

    Jamie

    Ok, so I've paraphrased the original quote for clarification...

    Let's start with query 2 first. It's faster because the column in the select list is the clustered index (PK). Therefore it doesn't have to sort the table for the first row.

    In query 3 the clustered index is used for the sort and only returns one row to begin with.

    Finally, query 1. If there are numerous rows with status 0, as Jeff mentions, then the name needs to be sorted to return the first row. Check your query plan and I'm certain you'll find a SORT operation.

    DAB

  • The first thing I'd do is look at the execution plan of each query. If the one that calls for the Name column straight away has something weird in it, like a hash join or table scan, that the others don't, that might be a good place to start investigating.

    After that, if no joy, the next thing I'd do in this case is check the table statistics and the fragmentation on the indexes, including the clustered index. I've seen similar behavior in tables with heavy fragmentation and/or badly out-of-date statistics.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I keep seeing the words "bookmark lookup" float by.

    I'm thinking that's what's causing your first query to be slow. Since the name isn't included in the non-clustered index, a bookmark lookup has to happen. PersonID is primary key (probably also the clustered key), so it's part of BOTH indexes. And the last one is fast since it's a clustered index seek.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Hi all,

    Thanks for all your good ideas. In response to your questions:

    1. It's a real table. Nothing unordinary.

    2. "Forever" means over a minute.

    3. No sorting. The query is indeed as simple as I posted.

    SQLServerLifer said:

    >> Finally, query 1. If there are numerous rows with status 0, as

    >> Jeff mentions, then the name needs to be sorted to return the

    >> first row.

    There are about 30,000 records with Status 0. Are you saying that the index will have to pull out all 30,000 and then they will be sorted by PersonID? All I want is one record, I don't care which, as long as it has Status 0.

    Folks,

    I think the problem might be related to locking and I'd appreciate your views on this.

    The table is accessed concurrently by other threads. Here's a thought...

    When the original query (1) is broken into queries (2) and (3), I've essentially made the server's life much easier in regard to the integrity of the data. Query (2) is required only to return any record so long as the status is 0. It can do this via the non-clustered index. It does not need to impose any table locks on the table itself. Likewise, query (3) is a simple PK lookup.

    What possibly makes query (1) harder, is that now the server wants to prevent a table modification between the time it finds a relevant PersonID via the index and the time it retrieves the Name field from the table. It therefore probably needs to maintain a lock on the table before accessing the index. Without such a lock, the index could return PersonID 94837 as a suitable Person, but before it accesses the Name field, a different thread updates that record to a different name and different status. It might then return a name, which never had status 0. I assume that the server must prevent such a result.

    When I break the query into two, I am essentially relieving the server from having to prevent such a scenario. It becomes my problem.

    Due to heavy concurrency, obtaining the lock might take time.

    Comments, please...

    Thanks,

    Jamie

  • Indeed the problem had to do with locking:

    I changed the original query to:

    select top 1 Name from People with(READUNCOMMITTED) where Status = 0

    And it returns a value immediately.

    Thanks again for all your help,

    Jamie

  • While you fixed your problem, you still have another one lurking. You have an index on a column that has 30,000 identical values out of a 100,000 row table. That index is probably never going to help you out in any way.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the comment. Actually I think it helps very much since I'm only asking for the first record which matches.

  • Jamie, can you try this:

    select top 1 Name from People where Status = 0

    order by PersonID

    Quoting from technet

    http://technet.microsoft.com/en-us/library/ms187373.aspx

    "READUNCOMMITTED

    Specifies that dirty reads are allowed. ...Allowing dirty reads can cause higher concurrency, but at the cost of reading data modifications that then are rolled back by other transactions. This may generate errors for your transaction, or present users with data that was never committed."

    so I'm not sure if you want to use that, since you might wind up with data that gets rolled back after your query. Hence, you get the wrong answer.

    Also, don't forget that the server will cache the execution plan, so the first time you run the query it might be slow but if you run it again it might be quick. I ran similar query against a 1 million+ table, the first time took 11 seconds but subsequent times were 00.00.00 according to the status bar.

    Grant,

    Since the status column is an int, I am guessing that there is a table of statuses that gets joined on the status column, so I would think you do need an index on that column, what do you think?

  • jamiejulius (3/13/2008)


    Thanks for the comment. Actually I think it helps very much since I'm only asking for the first record which matches.

    If the index is being used - that's great. Non-clustered indexes with selectivity that low often get "ignored", or discarded by the optimizer.

    Again - if you're seeing the execution plan use that index - then the index is doing what it needs to.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • WILLIAM MITCHELL (3/14/2008)


    Grant,

    Since the status column is an int, I am guessing that there is a table of statuses that gets joined on the status column, so I would think you do need an index on that column, what do you think?

    Yes & no. It really depends on how selective the column is and how that column is used in the query. Let's assume for a moment that selectivity is fairly low, a safe assumption since we know that 33% of the rows have a single value. In all likelihood then, this index is not being used. But, if you know that you're always going to reference this column and a client column, then making a compound index would be very good. Because then that column is made more selective by pairing it with another. At least that's generally how I've seen things go.

    Still, it's like Jeff said. If the index is being used in an efficient manner (a series of scans or something, while the index is used, aren't good), great. Leave it in place. I just lean pretty heavily in the direction that it's not helping based on the information we've been provided.

    That's my 1.5 cents.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply