Cannot understand why Query Analyzer is picking a specific index.

  • Query Analyzer is picking a specific index when I run a query and I don't know why. Here are the table and index details:

    --query
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m
    WHERE m.FirstName LIKE 'K%'
    AND m.Region_No > 6
    AND m.Member_No < 5000
    GO

    --indexes
    index_name               index_description                                   index_keys
    -----------------------  --------------------------------------------------  -------------------------------
    member_corporation_link  nonclustered located on PRIMARY                     corp_no
    member_ident             clustered, unique, primary key located on PRIMARY   member_no
    member_region_link       nonclustered located on PRIMARY                     region_no
    MemberCovering2          nonclustered located on PRIMARY                     region_no, firstname, member_no
    MemberCovering3          nonclustered located on PRIMARY                     member_no, region_no, firstname
    MemberFirstName          nonclustered located on PRIMARY                     firstname

    --data distribution in the table
    SELECT COUNT(*) FROM member
    --10000 rows total

    SELECT COUNT(*) FROM member
    WHERE firstname LIKE 'K%'
    --428 rows out of 10000, roughly 1/23, 0.043

    SELECT COUNT(*) FROM member
    WHERE Region_No > 6
    --3754 out of 10000, roughly 3/8, 0.375

    SELECT COUNT(*) FROM member
    WHERE Member_No < 5000
    --4999 out of 10000, roughly 1/2, 0.5

    Query Analyzer always picks index MemberCovering2, even though I thought it would pick MemberCovering3, as it is a covering index. Can anyone enlighten me as to why this is? There is little difference in query performance when I force it to use either index, with MemberCovering2 taking 44% of the batch cost and MemberCovering3 taking 56%. I can't see why MemberCovering3 takes more processing. All help appreciated. Thanks.
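
    For reference, one way to see which index gets picked is to view the plan as text (a sketch using standard SHOWPLAN syntax, which works in Query Analyzer):

    --display the estimated plan as text instead of running the query
    SET SHOWPLAN_TEXT ON
    GO
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m
    WHERE m.FirstName LIKE 'K%'
    AND m.Region_No > 6
    AND m.Member_No < 5000
    GO
    SET SHOWPLAN_TEXT OFF
    GO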

  • What are the statistics on those indexes?

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • Since all of the filters are inequalities, SQL can only seek on one of them. The other two have to be applied as a second step after the index seek.

    The available indexes mean that either it can apply the m.Region_No > 6 predicate and then filter out the member numbers and first names that don't match, or it can apply the m.Member_No < 5000 predicate and filter out the region numbers and first names that don't match.

    The query optimiser will use the statistics on the two indexes to work out which option will reduce the rows to be considered the most, i.e. whether m.Region_No > 6 or m.Member_No < 5000 returns fewer rows. It believes that the filter on region number returns fewer rows than the filter on member number.

    If it's wrong, it's because the statistics are out of date or inaccurate.
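
    If you want to verify this yourself, force each index in turn and compare the IO (a sketch using the indexes above; the table hint syntax is standard T-SQL):

    --compare logical reads with each covering index forced
    SET STATISTICS IO ON
    GO
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m WITH (INDEX(MemberCovering2))
    WHERE m.FirstName LIKE 'K%' AND m.Region_No > 6 AND m.Member_No < 5000

    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m WITH (INDEX(MemberCovering3))
    WHERE m.FirstName LIKE 'K%' AND m.Region_No > 6 AND m.Member_No < 5000
    GO
    SET STATISTICS IO OFF
    GO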

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • When I run DBCC SHOW_STATISTICS for each statistic, what info do you require? I've never read statistics before, so I'm not sure what I'm looking at or for.
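
    For reference, this is the command I'm running (table and index names from my first post):

    --show the statistics for one of the indexes above
    DBCC SHOW_STATISTICS ('dbo.Member', MemberCovering2)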

  • GilaMonster (1/7/2009)


    Since all of the filters are inequalities, SQL can only seek on one of them. The other two have to be applied as a second step after the index seek.

    The available indexes mean that either it can apply the m.Region_No > 6 predicate and then filter out the member numbers and first names that don't match, or it can apply the m.Member_No < 5000 predicate and filter out the region numbers and first names that don't match.

    The query optimiser will use the statistics on the two indexes to work out which option will reduce the rows to be considered the most, i.e. whether m.Region_No > 6 or m.Member_No < 5000 returns fewer rows. It believes that the filter on region number returns fewer rows than the filter on member number.

    If it's wrong, it's because the statistics are out of date or inaccurate.

    Ah, gotcha.

    The optimiser would be correct in thinking that Region_No returns fewer rows than Member_No, so it is correct. Thanks for the help here, much appreciated.

    As per my last post, if anyone can advise on how to understand the output of DBCC SHOW_STATISTICS, I'd appreciate it.

    Thanks again.

  • Hi Gail,

    After the first seek is done, are the other filters applied on the index or on the resultant dataset returned from the first filter (the index seek)?

    Also, I guess I need to ask: why would it use the covering index over the single-field index if only one predicate is used? I assume because it still needs the data from those fields in the result set?

    Is that right?

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • Christopher Stobbs (1/7/2009)


    Hi Gail,

    After the first seek is done, are the other filters applied on the index or on the resultant dataset returned from the first filter (the index seek)?

    The latter.

    The index seek results in a rowset of rows that satisfied the seek condition.

    That rowset is then filtered based on the other conditions, and rows that don't match are discarded.

    The resulting rowset is then passed on to the next query operator (join, aggregation, or just the select operator).

    Also, I guess I need to ask: why would it use the covering index over the single-field index if only one predicate is used? I assume because it still needs the data from those fields in the result set?

    Yup. They're needed, both for a filter and for display; it's just that they can't all be seek predicates. If you look at the exec plan for this, the index seek operation would show Region_No > 6 as a seek predicate and the other two as predicates.

    I mention it in this post, but it's probably something that needs a whole post of its own - http://sqlinthewild.co.za/index.php/2008/04/23/execution-plans-important-properties/

    If the optimiser picked the single-field index, it would then have to do bookmark lookups for each row returned to fetch the remaining columns in order to do the filter, and bookmark lookups are expensive. Hence it will rather choose a covering index, as it means far fewer IOs and far less work.
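
    To see the difference, force one of the single-column indexes and check the plan (a sketch; the lookup shows up against each qualifying row):

    --forcing the narrow index adds a bookmark lookup per qualifying row
    SELECT m.Member_No, m.FirstName, m.Region_No
    FROM dbo.Member AS m WITH (INDEX(member_region_link))
    WHERE m.FirstName LIKE 'K%' AND m.Region_No > 6 AND m.Member_No < 5000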

    p.s. I love your sig Christopher.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gail,

    Thanks again,

    I have just created a few tables and played around with different data selectivity and with the columns being returned. I tried to work out before running the query which index would be used, and, with the education I just received from you, I pretty much got it right every time 🙂

    Thanks a lot

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • winston Smith (1/7/2009)

    Ah, gotcha.

    The optimiser would be correct in thinking that Region_No returns fewer rows than Member_No, so it is correct. Thanks for the help here, much appreciated.

    As per my last post, if anyone can advise on how to understand the output of DBCC SHOW_STATISTICS, I'd appreciate it.

    Thanks again.

    Just to do a quick rundown of DBCC SHOW_STATISTICS: assuming you are looking at the histogram output, essentially what you see is

    RANGE_HI_KEY - Upper-bound value of the histogram step (a value of the leftmost column of the index you requested stats for)

    RANGE_ROWS - Estimated number of rows whose value falls within the step, excluding the upper bound

    EQ_ROWS - Estimated number of rows in the table whose value equals the RANGE_HI_KEY

    DISTINCT_RANGE_ROWS - Estimated number of distinct values within the step, excluding the upper bound

    AVG_RANGE_ROWS - Average number of rows per distinct value within the step, excluding the upper bound.

    To put this into a real-world perspective, I took some stats from one of my tables that I knew were out of date. I then updated the stats with FULLSCAN and captured the data again. It is as follows:

    Before

    RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
    79            0           2567     0                    1
    80            0           69       0                    1
    107           0           857      0                    1
    577           0           65       0                    1
    581           0           102      0                    1
    586           0           198      0                    1
    593           0           57       0                    1
    613           0           1544     0                    1
    616           0           18       0                    1
    617           0           2769     0                    1
    618           0           4543     0                    1
    644           0           25       0                    1

    After

    RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
    79            0           18932    0                    1
    80            0           2064     0                    1
    107           0           5156     0                    1
    424           0           34       0                    1
    577           0           134      0                    1
    581           0           102      0                    1
    586           0           445      0                    1
    588           0           476      0                    1
    593           0           231      0                    1
    613           0           7398     0                    1
    616           0           61       0                    1
    617           0           10397    0                    1
    618           0           19269    0                    1
    644           0           104      0                    1

    Notice that in the after output the count of rows equal to the upper bound (EQ_ROWS) is significantly different from the before, and the after values are exactly correct (thanks to doing the update stats with full scan). Having differences like these can certainly impact how your statistics are used by the optimizer and whether it will be able to choose the correct index for a specific task. Hopefully this is somewhat helpful.
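
    For reference, a minimal sketch of the refresh described above (table and index names are placeholders):

    --refresh one statistics object by reading every row, then re-examine it
    UPDATE STATISTICS dbo.SomeTable SomeIndex WITH FULLSCAN
    DBCC SHOW_STATISTICS ('dbo.SomeTable', SomeIndex)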

  • Yes, this is really great stuff, thanks a mil. I'm just getting into optimization, outside of the dynamic views and functions, which are not all that reliable. The SHOW_STATISTICS command confused the heck out of me but is slowly starting to make sense. A few more hours fooling around with different indexes and stats should give me a better understanding, on top of what you have just shown me.

    Thanks again all.

  • GilaMonster (1/7/2009)


    The query optimiser will use the statistics on the two indexes to work out which option will reduce the rows to be considered the most, i.e. whether m.Region_No > 6 or m.Member_No < 5000 returns fewer rows. It believes that the filter on region number returns fewer rows than the filter on member number.

    If it's wrong, it's because the statistics are out of date or inaccurate.

    So the optimizer picks whatever returns fewer rows. I was told in an SQL course that when indexing for AND queries, you should always try to use an index on the most selective column.

    I would have thought that Member_No would be more selective, as there is a different one for each row, unlike region, where there can be many rows for each region.

    Unless my definition of selective is incorrect. Am I correct or incorrect in saying that a column that is distinct for each row is more selective? Apologies, I know this is not an English forum.

    Note: statistics are up to date.

  • winston Smith (1/8/2009)


    I was told in an SQL course that when indexing for AND queries, you should always try to use an index on the most selective column.

    I hate it when people say that and don't explain why. It leads to so many misunderstandings later. (Like the person who heard that and then went and put the primary key column as the leading column of every index and then wondered why performance went through the floor)

    Unless my definition of selective is incorrect. Am I correct or incorrect in saying that a column that is distinct for each row is more selective?

    Your definition of selective is correct. The problem is that the statement that 'most selective column first is best' is incomplete.

    SQL only keeps the histogram (the row distribution) for the first column of the index, and so that's the one that SQL will use to see if it thinks that an index is useful. Hence a very non-selective leading column may lead to the index not being used much. Note, I said may.

    SQL does, however, keep the densities of the other columns, so it knows approximately how unique the left-based combinations of the columns are.

    The main purpose of an index is to reduce the number of rows in consideration for a query as fast as possible. So, let's say we have the following scenario.

    SELECT col1, col2
    FROM SomeTable
    WHERE col1 = @Var1 AND col2 = @Var2

    Further, let's assume that we have two indexes, one on Col1 and one on Col2. There are a million rows in SomeTable, 50000 of those rows satisfy the col1 = @var1 filter and 2000 of them satisfy the col2 = @var2 filter. 25 rows satisfy both conditions.

    Which index is SQL going to use to run that query?

    It's not hard to see that it's going to take the second index, because fewer rows are returned, and fewer rows have to be fetched from the cluster to do the second filter.

    Now, let's widen the first index so that, on SomeTable, we have two indexes, one (col1, col2) and one (col2). Now SQL's going to pick the first index, the one that has col1 as the leading column, even though col1 is less selective. It does that because the index is composite and hence it can do both filters in one operation and locate exactly the 25 rows that it needs.
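
    In index terms, the two scenarios look something like this (a sketch; the index names are mine):

    --scenario 1: two single-column indexes; SQL picks whichever filter is more selective
    CREATE INDEX ix_col1 ON SomeTable (col1)
    CREATE INDEX ix_col2 ON SomeTable (col2)

    --scenario 2: widen the first index; now both filters can be seek predicates
    CREATE INDEX ix_col1_col2 ON SomeTable (col1, col2)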

    So, that's the story for equality predicates. Inequalities are a whole nother story, and not a simple one.

    One thing to note with inequalities is that, because they are range scans, and because of the way indexes are ordered, if a column in an index is used for an inequality match, none of the columns after it can be used for seek predicates, i.e. any filters on them have to be applied after the seek is done and the rows returned.

    Think of a phone book. Finding Brown, M is very easy because the phone book is ordered surname, initial. Now, consider finding all the people with an initial of M and a surname starting with B. It's no longer a 1-step seek, is it?

    Let's take a hypothetical table that has 3 columns, an identity (hence unique and starting from 1), a datetime (say the date the row was inserted) and a string that has 100 different values. Let's say the table has 10000 rows.

    The identity column is highly selective, it's unique.

    The date is quite selective; let's say there are 5000 unique values for it, spread between 1 Jan 2008 and 31 Dec 2008.

    The string is not very selective. On average, each value will appear in the table 100 times.

    There are four indexes on the table, all covering: (id, dt, string), (dt, id, string), (string, id, dt), (string, dt, id).
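
    In T-SQL, that setup might look like this (a sketch; the names are mine):

    CREATE TABLE HypotheticalTable (
        id INT IDENTITY(1,1) NOT NULL,  --unique, starting from 1
        dt DATETIME NOT NULL,           --insert date, ~5000 distinct values
        string VARCHAR(100) NOT NULL    --100 different values
    )
    CREATE INDEX ix_id_dt_string ON HypotheticalTable (id, dt, string)
    CREATE INDEX ix_dt_id_string ON HypotheticalTable (dt, id, string)
    CREATE INDEX ix_string_id_dt ON HypotheticalTable (string, id, dt)
    CREATE INDEX ix_string_dt_id ON HypotheticalTable (string, dt, id)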

    If the 'most selective' rule applied, you'd think that SQL would always use the one leading with ID.

    Now, let's take a query

    SELECT * FROM HypotheticalTable
    WHERE ID > 1000           -- 9000 qualifying rows
    AND dt < '2008/07/01'     -- assume 5000 qualifying rows
    AND string = 'abc'        -- 100 qualifying rows

    Which index is SQL most likely to use?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Awesome example Gail...

    I hope everyone reads this thread as it's a perfect example of teaching someone to fish rather than just feeding them...

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices
    Numbers / Tally Tables

    SQL-4-Life
  • I'm thinking the optimizer will pick the (string, dt, id) index, as it has fewer rows to return with the string, a max of 100, rather than the 5000 or 9000 rows that would be returned for either of the other two columns.

    Once it has its 100 rows, it can filter them on date, whereas if it filtered on id, all 100 individual rows could still come back from the 100 rows already filtered on the string column?

    Is this the correct train of thought?

    Gail, thanks for taking the time and patience to go into great detail on this. I have searched the web and cannot find this level of detail anywhere. Wherever you work, I hope you are extremely well paid, and whatever you get, it's not half enough for your expertise.

  • winston Smith (1/8/2009)


    I'm thinking the optimizer will pick the (string, dt, id) index, as it has fewer rows to return with the string, a max of 100, rather than the 5000 or 9000 rows that would be returned for either of the other two columns.

    Ding, ding, ding. We have a winner.

    At least, that's what I think the optimiser will do. It's surprised me before.

    SQL knows the data distribution of the first column, and it knows the density of the first and second columns together, so it can work out that, if the data's evenly distributed, it should get around 50 rows if it does a seek on string and dt (100 rows for the string, roughly halved by the half-year date filter).

    Once it has its 100 rows, it can filter them on date, whereas if it filtered on id, all 100 individual rows could still come back from the 100 rows already filtered on the string column?

    It can and will do the equality and one of the inequalities as a single operation, an index seek. It's the second inequality that will have to be applied afterwards, because the data that qualifies for the one inequality won't be in the correct order to simply seek and read for the other one.

    Basically, the way things work in this case (at a highish level) is as follows.

    1) Seek on the index, look for where the string = 'abc' rows are, start from the lowest date. Easy, because the rows are ordered by string and within string by dt

    2) Read along the index leaf pages, checking the values of id to see if they qualify. If they do, return the row

    3) When the date reaches '2008/07/01' or the value of string changes, stop reading as no further rows can possibly qualify.
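
    In plan terms, the whole operation looks roughly like this (a conceptual sketch, not actual plan output):

    --Index Seek on (string, dt, id)
    --  Seek Predicate:     string = 'abc' AND dt < '2008/07/01'  --one ranged seek
    --  Residual Predicate: id > 1000                             --checked on each row read
    SELECT * FROM HypotheticalTable
    WHERE ID > 1000 AND dt < '2008/07/01' AND string = 'abc'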

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
