Cannot understand why Query Analyzer is picking a specific index.

  • Query Analyzer is picking a specific index when I run a query and I don't know why. Here are the table and index details:

    --query
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m
    WHERE m.FirstName LIKE 'K%'
    AND m.Region_No > 6
    AND m.Member_No < 5000
    GO

    --indexes
    index_name               index_description                                   index_keys
    -----------------------  --------------------------------------------------  -------------------------------
    member_corporation_link  nonclustered located on PRIMARY                     corp_no
    member_ident             clustered, unique, primary key located on PRIMARY   member_no
    member_region_link       nonclustered located on PRIMARY                     region_no
    MemberCovering2          nonclustered located on PRIMARY                     region_no, firstname, member_no
    MemberCovering3          nonclustered located on PRIMARY                     member_no, region_no, firstname
    MemberFirstName          nonclustered located on PRIMARY                     firstname

    --data distribution in the table
    SELECT COUNT(*) FROM member
    --10000 rows total

    SELECT COUNT(*) FROM member
    WHERE firstname LIKE 'K%'
    --428 rows out of 10000, roughly 1/23, 0.043

    SELECT COUNT(*) FROM member
    WHERE Region_No > 6
    --3754 out of 10000, roughly 3/8, 0.375

    SELECT COUNT(*) FROM member
    WHERE Member_No < 5000
    --4999 out of 10000, roughly 1/2, 0.5

    Query Analyzer always picks index MemberCovering2, even though I thought it would pick MemberCovering3, as it is a covering index. Can anyone enlighten me as to why this is? There is little difference in query performance when I force it to use either index, with MemberCovering2 taking 44% of the batch cost and MemberCovering3 taking 56%. I can't see why MemberCovering3 takes more processing. All help appreciated. Thanks.
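
    For reference, one way to see which index gets picked is to view the plan as text (a sketch using standard SHOWPLAN syntax, which works in Query Analyzer):

    --display the estimated plan as text instead of running the query
    SET SHOWPLAN_TEXT ON
    GO
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m
    WHERE m.FirstName LIKE 'K%'
    AND m.Region_No > 6
    AND m.Member_No < 5000
    GO
    SET SHOWPLAN_TEXT OFF
    GO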

  • What are the statistics on those indexes?

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • Since all of the filters are inequalities, SQL can only seek on one of them. The other two have to be applied as a second step after the index seek.

    The available indexes mean that either it can apply the m.Region_No > 6 predicate and then filter out the member numbers and first names that don't match, or it can apply the m.Member_No < 5000 predicate and filter out the region numbers and first names that don't match.

    The query optimiser will use the statistics on the two indexes to work out which option will reduce the rows to be considered the most, i.e. whether m.Region_No > 6 or m.Member_No < 5000 returns fewer rows. It believes that the filter on region number returns fewer rows than the filter on member number.

    If it's wrong, it's because the statistics are out of date or inaccurate.
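
    If you want to verify this yourself, force each index in turn and compare the IO (a sketch using the indexes above; the table hint syntax is standard T-SQL):

    --compare logical reads with each covering index forced
    SET STATISTICS IO ON
    GO
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m WITH (INDEX(MemberCovering2))
    WHERE m.FirstName LIKE 'K%' AND m.Region_No > 6 AND m.Member_No < 5000

    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m WITH (INDEX(MemberCovering3))
    WHERE m.FirstName LIKE 'K%' AND m.Region_No > 6 AND m.Member_No < 5000
    GO
    SET STATISTICS IO OFF
    GO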

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • When I run DBCC SHOW_STATISTICS for each statistic, what info do you require? I've never read statistics before, so I'm not sure what I'm looking at or for.
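
    For reference, this is the command I'm running (table and index names from my first post):

    --show the statistics for one of the indexes above
    DBCC SHOW_STATISTICS ('dbo.Member', MemberCovering2)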

  • GilaMonster (1/7/2009)


    Since all of the filters are inequalities, SQL can only seek on one of them. The other two have to be applied as a second step after the index seek.

    The available indexes mean that either it can apply the m.Region_No > 6 predicate and then filter out the member numbers and first names that don't match, or it can apply the m.Member_No < 5000 predicate and filter out the region numbers and first names that don't match.

    The query optimiser will use the statistics on the two indexes to work out which option will reduce the rows to be considered the most, i.e. whether m.Region_No > 6 or m.Member_No < 5000 returns fewer rows. It believes that the filter on region number returns fewer rows than the filter on member number.

    If it's wrong, it's because the statistics are out of date or inaccurate.

    Ah, gotcha.

    The optimiser would be correct in thinking that Region_No returns fewer rows than Member_No, so it is correct. Thanks for the help here, much appreciated.

    As per my last post, if anyone can advise on how to understand the output of DBCC SHOW_STATISTICS, I'd appreciate it.

    Thanks again.

  • Hi Gail,

    After the first seek is done, are the other filters applied on the index or on the resultant dataset returned from the first filter (the index seek)?

    Also, I guess I need to ask: why would it use the covering index over the single-field index if only one predicate is used? I assume because it still needs the data from those fields in the result set?

    Is that right?

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • Christopher Stobbs (1/7/2009)


    Hi Gail,

    After the first seek is done, are the other filters applied on the index or on the resultant dataset returned from the first filter (the index seek)?

    The latter.

    The index seek results in a rowset of rows that satisfied the seek condition.

    That rowset is then filtered based on the other conditions, and rows that don't match are discarded.

    The resulting rowset is then passed on to the next query operator (join, aggregation, or just the select operator).

    Also, I guess I need to ask: why would it use the covering index over the single-field index if only one predicate is used? I assume because it still needs the data from those fields in the result set?

    Yup. They're needed, both for a filter and for display; it's just that they can't all be seek predicates. If you look at the exec plan for this, the index seek operation would show Region_No > 6 as a seek predicate and the other two as predicates.

    I mention it in this post, but it's probably something that needs a whole post of its own - http://sqlinthewild.co.za/index.php/2008/04/23/execution-plans-important-properties/

    If the optimiser picked the single-field index, it would then have to do bookmark lookups for each row returned to fetch the remaining columns in order to do the filter, and bookmark lookups are expensive. Hence it will rather choose a covering index, as it means far fewer IOs and far less work.
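
    To see the difference, force one of the single-column indexes and check the plan (a sketch; the lookup shows up against each qualifying row):

    --forcing the narrow index adds a bookmark lookup per qualifying row
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m WITH (INDEX(member_region_link))
    WHERE m.FirstName LIKE 'K%' AND m.Region_No > 6 AND m.Member_No < 5000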

    p.s. I love your sig Christopher.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gail,

    Thanks again,

    I have just created a few tables and played around with different data selectivity and with the columns being returned. I tried to work out before running the query which index would be used, and, with the education I just received from you, I pretty much got it right every time 🙂

    Thanks a lot

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • winston Smith (1/7/2009)

    Ah, gotcha.

    The optimiser would be correct in thinking that Region_No returns fewer rows than Member_No, so it is correct. Thanks for the help here, much appreciated.

    As per my last post, if anyone can advise on how to understand the output of DBCC SHOW_STATISTICS, I'd appreciate it.

    Thanks again.

    Just to do a quick rundown of DBCC SHOW_STATISTICS: assuming you are looking at the histogram output, essentially what you see is

    RANGE_HI_KEY - Upper-bound value of the histogram step (a value of the leftmost column of the index you requested stats for)

    RANGE_ROWS - Estimated number of rows whose value falls within the step, excluding the upper bound

    EQ_ROWS - Estimated number of rows in the table whose value equals the RANGE_HI_KEY

    DISTINCT_RANGE_ROWS - Estimated number of distinct values within the step, excluding the upper bound

    AVG_RANGE_ROWS - Average number of rows per distinct value within the step, excluding the upper bound.

    To put this into a real-world perspective, I took some stats from one of my tables that I knew were out of date. I then updated the stats with FULLSCAN and captured the data again. It is as follows:

    Before

    RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
    79            0           2567     0                    1
    80            0           69       0                    1
    107           0           857      0                    1
    577           0           65       0                    1
    581           0           102      0                    1
    586           0           198      0                    1
    593           0           57       0                    1
    613           0           1544     0                    1
    616           0           18       0                    1
    617           0           2769     0                    1
    618           0           4543     0                    1
    644           0           25       0                    1

    After

    RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
    79            0           18932    0                    1
    80            0           2064     0                    1
    107           0           5156     0                    1
    424           0           34       0                    1
    577           0           134      0                    1
    581           0           102      0                    1
    586           0           445      0                    1
    588           0           476      0                    1
    593           0           231      0                    1
    613           0           7398     0                    1
    616           0           61       0                    1
    617           0           10397    0                    1
    618           0           19269    0                    1
    644           0           104      0                    1

    Notice that in the after output the count of rows equal to the upper bound (EQ_ROWS) is significantly different from the before, and the after values are exactly correct (thanks to doing the update stats with full scan). Having differences like these can certainly impact how your statistics are used by the optimizer and whether it will be able to choose the correct index for a specific task. Hopefully this is somewhat helpful.
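
    For reference, a minimal sketch of the refresh described above (table and index names are placeholders):

    --refresh one statistics object by reading every row, then re-examine it
    UPDATE STATISTICS dbo.SomeTable SomeIndex WITH FULLSCAN
    DBCC SHOW_STATISTICS ('dbo.SomeTable', SomeIndex)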

  • Yes, this is really great stuff, thanks a mil. I'm just getting into optimization, outside of the dynamic views and functions, which are not all that reliable. The SHOW_STATISTICS command confused the heck out of me but is slowly starting to make sense. A few more hours fooling around with different indexes and stats should give me a better understanding, on top of what you have just shown me.

    Thanks again all.

  • GilaMonster (1/7/2009)


    The query optimiser will use the statistics on the two indexes to work out which option will reduce the rows to be considered the most, i.e. whether m.Region_No > 6 or m.Member_No < 5000 returns fewer rows. It believes that the filter on region number returns fewer rows than the filter on member number.

    If it's wrong, it's because the statistics are out of date or inaccurate.

    So the optimizer picks whatever returns fewer rows. I was told in an SQL course that when indexing for AND queries, you should always try to use an index on the most selective column.

    I would have thought that Member_No would be more selective, as there is a different one for each row, unlike region, where there can be many rows for each region.

    Unless my definition of selective is incorrect. Am I correct or incorrect in saying that a column that is distinct for each row is more selective? Apologies, I know this is not an English forum.

    Note: statistics are up to date.

  • winston Smith (1/8/2009)


    I was told in an SQL course that when indexing for AND queries, you should always try to use an index on the most selective column.

    I hate it when people say that and don't explain why. It leads to so many misunderstandings later. (Like the person who heard that and then went and put the primary key column as the leading column of every index and then wondered why performance went through the floor)

    Unless my definition of selective is incorrect. Am I correct or incorrect in saying that a column that is distinct for each row is more selective?

    Your definition of selective is correct. The problem is that the statement that 'most selective column first is best' is incomplete.

    SQL only keeps the histogram (the row distribution) for the first column of the index, and so that's the one that SQL will use to see if it thinks that an index is useful. Hence a very non-selective leading column may lead to the index not being used much. Note, I said may.

    SQL does, however, keep the densities of the other columns, so it knows approximately how unique the left-based combinations of the columns are.

    The main purpose of an index is to reduce the number of rows in consideration for a query as fast as possible. So, let's say we have the following scenario.

    SELECT col1, col2
    FROM SomeTable
    WHERE col1 = @Var1 AND col2 = @Var2

    Further, let's assume that we have two indexes, one on Col1 and one on Col2. There are a million rows in SomeTable, 50000 of those rows satisfy the col1 = @var1 filter and 2000 of them satisfy the col2 = @var2 filter. 25 rows satisfy both conditions.

    Which index is SQL going to use to run that query?

    It's not hard to see that it's going to take the second index, because fewer rows are returned, and fewer rows have to be fetched from the cluster to do the second filter.

    Now, let's widen the first index so that, on SomeTable, we have two indexes, one (col1, col2) and one (col2). Now SQL's going to pick the first index, the one that has col1 as the leading column, even though col1 is less selective. It does that because the index is composite and hence it can do both filters in one operation and locate exactly the 25 rows that it needs.
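
    In index terms, the two scenarios look something like this (a sketch; the index names are mine):

    --scenario 1: two single-column indexes; SQL picks whichever filter is more selective
    CREATE INDEX ix_col1 ON SomeTable (col1)
    CREATE INDEX ix_col2 ON SomeTable (col2)

    --scenario 2: widen the first index; now both filters can be seek predicates
    CREATE INDEX ix_col1_col2 ON SomeTable (col1, col2)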

    So, that's the story for equality predicates. Inequalities are a whole nother story, and not a simple one.

    One thing to note with inequalities is that, because they are range scans, and because of the way indexes are ordered, if a column in an index is used for an inequality match, none of the columns after it can be used for seek predicates, i.e. any filters on them have to be applied after the seek is done and the rows returned.

    Think of a phone book. Finding Brown, M is very easy because the phone book is ordered surname, initial. Now, consider finding all the people with an initial of M and a surname starting with B. It's no longer a 1-step seek, is it?

    Let's take a hypothetical table that has 3 columns, an identity (hence unique and starting from 1), a datetime (say the date the row was inserted) and a string that has 100 different values. Let's say the table has 10000 rows.

    The identity column is highly selective, it's unique.

    The date is quite selective; let's say there are 5000 unique values for it, spread between 1 Jan 2008 and 31 Dec 2008.

    The string is not very selective. On average, each value will appear in the table 100 times.

    There are four indexes on the table, all covering: (id, dt, string), (dt, id, string), (string, id, dt), (string, dt, id).
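
    In T-SQL, that setup might look like this (a sketch; the names are mine):

    CREATE TABLE HypotheticalTable (
        id INT IDENTITY(1,1) NOT NULL,  --unique, starting from 1
        dt DATETIME NOT NULL,           --insert date, ~5000 distinct values
        string VARCHAR(100) NOT NULL    --100 different values
    )
    CREATE INDEX ix_id_dt_string ON HypotheticalTable (id, dt, string)
    CREATE INDEX ix_dt_id_string ON HypotheticalTable (dt, id, string)
    CREATE INDEX ix_string_id_dt ON HypotheticalTable (string, id, dt)
    CREATE INDEX ix_string_dt_id ON HypotheticalTable (string, dt, id)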

    If the 'most selective' rule applied, you'd think that SQL would always use the one leading with ID.

    Now, let's take a query

    SELECT * FROM HypotheticalTable
    WHERE ID > 1000           -- 9000 qualifying rows
    AND dt < '2008/07/01'     -- assume 5000 qualifying rows
    AND string = 'abc'        -- 100 qualifying rows

    Which index is SQL most likely to use?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Awesome example Gail...

    I hope everyone reads this thread as it's a perfect example of teaching someone to fish rather than just feeding them...

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • I'm thinking the optimizer will pick the (string, dt, id) index, as it has fewer rows to return with the string, a max of 100, rather than the 5000 or 9000 rows that would be returned for either of the other two columns.

    Once it has its 100 rows, it can filter them on date, whereas if it filtered on id, all 100 individual rows could still come back from the 100 rows already filtered on the string column?

    Is this the correct train of thought?

    Gail, thanks for taking the time and patience to go into great detail on this. I have searched the web and cannot find this level of detail anywhere. Wherever you work, I hope you are extremely well paid, and whatever you get, it's not half enough for your expertise.

  • winston Smith (1/8/2009)


    I'm thinking the optimizer will pick the (string, dt, id) index, as it has fewer rows to return with the string, a max of 100, rather than the 5000 or 9000 rows that would be returned for either of the other two columns.

    Ding, ding, ding. We have a winner.

    At least, that's what I think the optimiser will do. It's surprised me before.

    SQL knows the data distribution of the first column, and it knows the density of the first and second columns together, so it can work out that, if the data's evenly distributed, it should get around 50 rows if it does a seek on string and dt (100 rows for the string, roughly halved by the half-year date filter).

    Once it has its 100 rows, it can filter them on date, whereas if it filtered on id, all 100 individual rows could still come back from the 100 rows already filtered on the string column?

    It can and will do the equality and one of the inequalities as a single operation, an index seek. It's the second inequality that will have to be applied afterwards, because the data that qualifies for the one inequality won't be in the correct order to simply seek and read for the other one.

    Basically, the way things work in this case (at a highish level) is as follows.

    1) Seek on the index, look for where the string = 'abc' rows are, start from the lowest date. Easy, because the rows are ordered by string and within string by dt

    2) Read along the index leaf pages, checking the values of id to see if they qualify. If they do, return the row

    3) When the date reaches '2008/07/01' or the value of string changes, stop reading as no further rows can possibly qualify.
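
    In plan terms, the whole operation looks roughly like this (a conceptual sketch, not actual plan output):

    --Index Seek on (string, dt, id)
    --  Seek Predicate:     string = 'abc' AND dt < '2008/07/01'  --one ranged seek
    --  Residual Predicate: id > 1000                             --checked on each row read
    SELECT * FROM HypotheticalTable
    WHERE ID > 1000 AND dt < '2008/07/01' AND string = 'abc'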

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
