Index Implementation Question

  • For the sake of argument, suppose I have a 10-million-record table with no index at all. The fields in this table include DataWeek (a smalldatetime; 10 unique values) and CustomerKey (int; 1000 unique values). Suppose I want to create a single nonclustered index involving both fields.

    Hypothetically, if EVERY query EVER run against this table used an equality filter in the WHERE clause on both fields (e.g., WHERE DataWeek = '4/12/2008' AND CustomerKey = 100), would it matter whether i defined my index as (DataWeek, CustomerKey) or (CustomerKey, DataWeek), and if so, how?

  • Steve F. (4/22/2008)


    For the sake of argument, suppose I have a 10-million-record table with no index at all. The fields in this table include DataWeek (a smalldatetime; 10 unique values) and CustomerKey (int; 1000 unique values). Suppose I want to create a single nonclustered index involving both fields.

    Hypothetically, if EVERY query EVER run against this table used an equality filter in the WHERE clause on both fields (e.g., WHERE DataWeek = '4/12/2008' AND CustomerKey = 100), would it matter whether i defined my index as (DataWeek, CustomerKey) or (CustomerKey, DataWeek), and if so, how?

    Hypothetically, probably not. This looks like something that a good test would help solve if there are any takers out there (Jeff, Matt).

    Is your hypothetical situation ever going to happen, probably not, and then you do have to look carefully at how you construct your indexes.

    😎

  • In that case, I have another question: In what ways is the index physically implemented differently depending on the order of the columns within it? I understand that the first field in an index determines whether SQL Server can use it, but is the index ordered first by the first column, making range queries more effective on the first column in the index? I'm just prodding for info here; I'm curious.

    Thanks again.

  • You should put the most selective column first ie the column with the greater number of distinct values.

    The index would be ordered by the first column then second column so you are correct in that a range query is fine on the first column but the index may not help on the second column if the query doesn't include the first. Also statistics are really only useful for the first column - although the stats include cardinality / selectivity estimates for subsequent columns, they are only relevant in some situations eg determining whether the index can cover the query.

  • The index is implemented much like a telephone directory. If you have a composite index (colour, age as a very contrived example) then the index pages look something like this

    Black 75

    Blue 40

    Blue 50

    Blue 55

    Green 1

    Green 10

    Red 0

    Red 100

    ...

    If all your queries are equality on both columns, the order doesn't matter. It becomes important when you have range seeks or single column seeks.

    The contrived example above can be used to seek the following queries

    WHERE Colour = @x and Age = @y

    WHERE Colour = @x and Age between @y1 and @y2

    WHERE Colour BETWEEN @x1 and @x2

    The folowing requires a scan, because the query does not inlcude the left column of the index

    WHERE Age = @y

    The advice about making an indexes first column the most selective is often mentioned. I tend to disagreee with it. The index columns should be arranged so that the queries can use it. The first column should be the one that can appear alone in queries, not the most selective column

    When it does a 2-column seek, the 2 columns are matched in a single operation. Again using the above example,

    WHERE Colour = @x and Age = @y

    does not mean that SQL will find all the colours = @x, then seek again to find the age @y. If does a single seek on both conditions.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Oh, and with the selectivity you're talking about (1000 unique values in 10 million rows), if that index is not covering, SQL is not going to use that index. Too expensive to do the RID lookups.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • And some testing...

    SET STATISTICS IO ON

    GO

    SET STATISTICS TIME ON

    GO

    CREATE TABLE TestingIndexOrder (

    ID INT IDENTITY,

    DayOfWeek smallint,

    CustomerKey int

    )

    GO

    insert INTO TestingIndexOrder (DayOfWeek, CustomerKey)

    SELECT DATEPART(dw,DATEADD(dd,number, '1900/01/01')), FLOOR(RAND(number*5452)*1000)

    FROM (

    SELECT TOP 4000000 v1.number, v1.number + v2.number AS Num

    FROM master..spt_values v1, master..spt_values v2 WHERE v1.name is NULL AND v2.name IS null

    UNION ALL

    SELECT TOP 4000000 v1.number, v1.number + v2.number AS Num

    FROM master..spt_values v1, master..spt_values v2 WHERE v1.name is NULL AND v2.name IS null

    UNION ALL

    SELECT TOP 2000000 v1.number, v1.number + v2.number AS Num

    FROM master..spt_values v1, master..spt_values v2 WHERE v1.name is NULL AND v2.name IS null ) x

    GO

    CREATE INDEX idx_test1 ON TestingIndexOrder (DayOfWeek, CustomerKey)

    GO

    SELECT DayOfWeek, CustomerKey FROM TestingIndexOrder WHERE DayOfWeek =3 AND CustomerKey = 504

    /*

    (4096 row(s) affected)

    Table 'TestingIndexOrder'. Scan count 1, logical reads 14, physical reads 0, read-ahead reads 0

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 121 ms.

    */

    DROP INDEX idx_test1 ON TestingIndexOrder

    CREATE INDEX idx_test2 ON TestingIndexOrder (CustomerKey, DayOfWeek)

    GO

    SELECT DayOfWeek, CustomerKey FROM TestingIndexOrder WHERE DayOfWeek =3 AND CustomerKey = 504

    /*

    (4096 row(s) affected)

    Table 'TestingIndexOrder'. Scan count 1, logical reads 14, physical reads 0, read-ahead reads 0

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 123 ms.

    */

    DROP TABLE TestingIndexOrder

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail,

    Thanks for jumping in on this and doing some testing. I tried to set up a test myself, but my mind wasn't working right and the test data I contrived didn't meet the needs of the hypothetical question.

    I agree with you regarding the first column of an index, it shouldn't always be the most selective, but should be based on the queries used on the tables.

    😎

  • I've seen a case where a person took the most-selectivel advice literally, and made the leading column of every index the primary key (an identity)

    He was very curious as to why he wasn't seeing index seeks. (none of the queries filtered on the PK)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Great info. Thanks all.

    Any recommendations for books that cover this topic in depth?

  • GilaMonster (4/24/2008)


    When it does a 2-column seek, the 2 columns are matched in a single operation. Again using the above example,

    WHERE Colour = @x and Age = @y

    does not mean that SQL will find all the colours = @x, then seek again to find the age @y. If does a single seek on both conditions.

    Hmm... Does this mean that the order of the columns in an index -after- the first column is irrelevant?

  • No. You can generalise my 2 column example to as many columns as you have in the index.

    For example, a 3 column index (A, B, C) is most optimal for queries of the following forms

    WHERE A = @a and B = @b-2 and C=@c

    WHERE A = @a and B = @b-2 and C between @c1 and @c2

    WHERE A = @a and B = @b-2

    WHERE A = @a and B between @b1 and @b2

    WHERE A = @a

    WHERE A between @a1 and @a2

    It's less optimal for things like

    WHERE A = @a and C=@c (because the where clause predicates are not a left-based subset of the index.)

    It will require a seek on A=@a, followed by a filter to do C=@c

    Queries of the following form require a scan

    WHERE B = @b-2 and C=@c

    Think of a phone book, where entries are ordered by surname, first name, phone number

    Which is easier, to find all the people with a surname Brown (1st column) and a phone number of 555-6572 (3rd column)

    or to find all the people with a surname Brown (1st column) and a first name of Matt (2nd column)?

    As for books, have a look at the 4th book in the Inside SQL Server 2005 series. Query tuning and optimisation. There should be something in there.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (4/24/2008)


    The advice about making an indexes first column the most selective is often mentioned. I tend to disagreee with it. The index columns should be arranged so that the queries can use it. The first column should be the one that can appear alone in queries, not the most selective column.

    When it does a 2-column seek, the 2 columns are matched in a single operation. Again using the above example,

    WHERE Colour = @x and Age = @y

    does not mean that SQL will find all the colours = @x, then seek again to find the age @y. If does a single seek on both conditions.

    I agree with the above, however the non-leaf pages still need to be read - if the less selective column is first it can potentially mean more reads (only one more if I'm thinking straight, but 1 is more than 0!).

    Of course, if there is any chance of different queries being used in future then the index would have to be reevaluated.

  • matt stockham (4/24/2008)


    I agree with the above, however the non-leaf pages still need to be read - if the less selective column is first it can potentially mean more reads (only one more if I'm thinking straight, but 1 is more than 0!).

    Could you please explain how that scenario works?

    Bear in mind, that for a multi-column index, all the columns are in all the pages. Leaf and non-leaf (Ignoring the SQL 2005 include columns, that is)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • check this out

    i remember the query optimization stuff gail mentioned about can be found in the book "SQL Server Query Performance Tuning Distilled by sajal dam"

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply