Why does this scan an entire NC index?

  • I'm puzzled:

    CREATE TABLE [Edr_T] (

    [EdrRecordID] [int] IDENTITY (1, 1) NOT NULL ,

    [EDRDate] [datetime] NOT NULL ,

    [BTOrderNumber] [varchar] (26) NULL ,

    [OrderType] [varchar] (7) NULL ,

    [OrderStatus] [varchar] (4) NULL ,

    [CLI] [varchar] (14) NULL ,

    [EffectiveActivityDate] [datetime] NULL ,

    [BTAccountNumber] [varchar] (8) NULL ,

    [StatusUpdated] [bit] NOT NULL ,

    [RentalLiabilityDate] [varchar] (10) NULL ,

    CONSTRAINT [PK_Edr_T] PRIMARY KEY CLUSTERED

    (

    [EdrRecordID]

    ) WITH FILLFACTOR = 90 ON [PRIMARY]

    CREATE INDEX [IX_StatusUpdated] ON [dbo].[Edr_T]([StatusUpdated], [BTOrderNumber]) WITH FILLFACTOR = 90 ON [PRIMARY]

    GO

    CREATE INDEX [IX_BTOrderNumber] ON [dbo].[Edr_T]([BTOrderNumber], [StatusUpdated]) WITH FILLFACTOR = 90 ON [PRIMARY]

    GO

    The EDR_T table contains 355,000 rows.

    Every row has a value of 1 for StatusUpdated.

    If I run the following simple query:

    SELECT BTOrderNumber

    FROM EDR_T

    WHERE StatusUpdated = 0

    I would expect it to use the IX_StatusUpdated index, quickly determine that there are NO matching records (since the first value it will find is a value of 1), and immediately return "0 rows affected".

    In reality, it scans the entire IX_BTOrderNumber index, with an estimated subtree cost of 1.37, and a logical read value (from statistics IO)  of 1228.

    Even if I force it to use the IX_StatusUpdated index, the estimated subtree cost is still 1.37 and the logical read value is still 1228.

    I've dropped and recreated all the indexes and PK constraints, I've deleted all the statistics and just recreated the ones I need, but still I get the same behaviour.

    Why does it not instinctively choose the IX_StatusUpdated index when the outermost column of this index matches the column in the Search Argument in the WHERE clause, and why does it scan the entire index when it ought to know that if the first value is >0 there cannot be any rows that equal zero?

  • Having an INDEX on a column where you have only 2 different values really won't buy you much.

    What you may want to consider is something like

    IF EXISTS (SELECT * FROM dbo.EDR_T WHERE StatusUpdated = 0)

      BEGIN

        Stuff here

      END

    This will immediately jump into the "Stuff" part if it finds any orders that have not been updated.  Granted you will still need to go find them (using your code from above)...



    Good Hunting!

    AJ Ahrens


    webmaster@kritter.net

  • I'd agree with you if those two values were equally distributed (e.g. M and F  for Male and Female).  However, in our case, we KNOW that the number of records matching 0 should be very few, so this is an excellent index - very selective.  Furthermore, it's a covered index, and so should be even more efficient.  The optimizer clearly recognizes that it's covering the query since it chooses to scan the NC index and not the clustered index.

    You're quite right with your code snippet - the code should jump IMMEDIATELY into the "Stuff" part - my issue is that it's scanning all 2500-odd pages in the index before deciding what to do,  when a quick look at the first page should tell it that there cannot possibly be any records matching 0.

  • I think the culprit is the Bit specification on StatusUpdated.

    I changed it to int, and the query plan uses the IX_StatusUpdated.

    My guess is that the query optimiser looks at the bit specification and assumes that it is not selective enough - without looking at the statistics which would tell it there are in fact no qualifying rows. It therefore chooses to do the index scan on the other index. (why - i don't know).

    It uses an index scan instead of a clustered index scan as IX_BTOrderNumber is a covering index for the query.



    Best Regards,

    Otto Schreibke

    The future is a foreign country, they do things differently there.
    (Stolen from Arthur C Clarke)

  • That's a very good spot indeed, Otto!  I've made the change, and it does indeed now do an index seek, requiring only 3 logical reads.  Just what I thought it should be doing in the first place.  Thanks very much for your help!

    I suspect that your explanation is also spot-on.  It's a non-obvious way to behave, but I suppose it could be argued that a bit column is more likely to have a fairly even distribution of values than not.  It's not an argument that I agree with (as our data shows), and I still don't see why the optimizer wouldn't consider statistics on a bit column.  Maybe I can do some more digging on that now I know what the explanation is ...

    Thanks again.

  • Perhaps another piece of the puzzle:

    When I was testing, i tried to do a DBCC Show_Statistics on the index. With StatusUpdated defined as in int, it returned the three grids as expected. With StatusUpdated defined as a bit - I got the text messages but no grid.

    I wonder if the bit specification at the start of the index is causing the statistics creation to fail which is rendering the whole index somehow 'invisible'. That would explain why the optimiser is using the OrderNumber-status index in preference to the status-orderNumber index.

    Regards

    Otto



    Best Regards,

    Otto Schreibke

    The future is a foreign country, they do things differently there.
    (Stolen from Arthur C Clarke)

  • "However, in our case, we KNOW that the number of records matching 0 should be very few, so this is an excellent index - very selective."

    BUT the problem for you is that SQL doesn't know that you are searching for the few amongst the millions....the query plan would be efficient IF you were looking for all the '1's.

    The flip-flop nature of bit fields means that the selectivity can only be 50%....which means they are not good candidates for indices.  Moving to a int field, opens the possibility for far more values, albeit you are only using 2.  In effect the 'larger possible value range', increases it's chances of being selected as an index.....very interesting...maybe the optimiser in future releases can spot + resolve this minor anomoly.

  • My guess is the optimiser is being too clever for your own good here ... as it knows bit can have only two values it assumes it will not be very selective so opts for a scan.

  • I did a bit more digging on the 'restricted set of values' theme

    I tried the following variants of the StatusUpdated column:

    bit

    Integer

    Integer with a check constraint (StatusUpdated in (0, 1))

    Char(1)

    Char(1) with a check constraint (StatusUpdated in ('Y', 'N'))

    I ran query

    SELECT BTOrderNumber

    FROM EDR_T

    WHERE StatusUpdated = 0 -- 'N' if char

    Except for the case where the column was defined as a bit, the optimiser used a scan on index IX_StatusUpdated.

    Using the Char(1) with check constraint configuration, I populated the query with 100% 'Y' values in the StatusUpdated column.

    Running query

    SELECT BTOrderNumber, ordertype

    FROM EDR_T

    WHERE StatusUpdated = 'Y'

    resulted in a clustered index scan as neither of the two nonclustered indexes cover this query.

     

    Changing the query to

    SELECT BTOrderNumber, ordertype

    FROM EDR_T

    WHERE StatusUpdated = 'N'

    Resulted in an index seek on IX_StatusUpdated and a bookmark lookup, indicating the Optimiser has recognised the fact there are very few rows (namely 0) with StatusUpdated = 'N'.

    My conclusion is that the problem probably lies with the bit definition, not the restricted set of values as the check constraint imposes similar restrictions to two distinct values. It could of course be that the optimiser doesn't take check constraints into account.



    Best Regards,

    Otto Schreibke

    The future is a foreign country, they do things differently there.
    (Stolen from Arthur C Clarke)

  • Andrew,

    Sorry, but you're making my point for me.  The optimizer SHOULD know that there are only a few 0's.  That's what statistics are for.  There's nothing wrong with having an index on a 2-value column; so long as the value that you're searching for is sufficiently scarce (and therefore selective) the index should be used.  If the search argument in my query was   "...WHERE StatuUpdated = 1"  then I wouldn't expect the NC index to be used, as there would be too many matching rows.

    Otto has neatly shown that it appears to be the BIT datatype that makes the difference, although why this should be so I'm not sure.  The fact that they can only have one of two values has nothing whatever to do with their selectivity.  That is down to the frequency with which either one of those values occurs.

  • I think I may have an explanation for this now.

    When writing 

    select bit_col from table where bit_col = 1

    the value of 1 is an integer (because we haven't said that it should be treated as anything else).  This means that we're comparing a BIT column with an INTEGER expression, and so an implicit conversion needs to happen.  According to the datatype hierarchy listings, INTEGER is higher than BIT, so it is the column data that gets converted rather than the interger expression.

    This means that an index on the BIT column cannot be used.

    If we write:

    select bit_col from table where bit_col = cast(1 as bit)

    this will now use the index.

    Of course, the usual rules of index selection still apply - the number of records in the table that match the value of 1 must be very small.

  • Nice catch, confirmed :

    GO

    Create table test

    ( id int not null identity(1,1) primary key clustered,

    someBit bit not null

    )

    GO

    CREATE INDEX IX_text_Somebit on dbo.Test (SomeBit)

    GO

    Insert into dbo.Test (SomeBit) values (1)

    Insert into dbo.Test (SomeBit) values (0)

    GO

    SET SHOWPLAN_TEXT ON

    GO

    Select * from dbo.Test where SomeBit = CAST(1 AS BIT)

    -- |--Index Seek(OBJECT[Ideal].[dbo].[test].[IX_text_Somebit]), SEEK[test].[someBit]=1) ORDERED FORWARD)

    GO

    Select * from dbo.Test where SomeBit = 1

    -- |--Index Scan(OBJECT[Ideal].[dbo].[test].[IX_text_Somebit]), WHEREConvert([test].[someBit])=[@1]))

    GO

    SET SHOWPLAN_TEXT OFF

    GO

    DROP TABLE TEST

  • As an update to the previous posts, I happened to try this again in SQL2005, just to demonstrate the effect to a colleague.

    It doesn't do it! SQL2005 happily uses an index SEEK, whether or not the expression is first cast as a BIT. Explicitly converting the table column doesn't stop an index seek being done either, so it looks as if Microsoft have fixed this limitation.

    The graphical showplan output is different, too. Rather than showing an NC index seek leading to a BookmarkLookup, the output shows a NC index seek plus a PK seek, each accounting for 50% of the total cost, and the output from each being joined using a nested loop.

    I guess this is a better description of how the bookmark lookup actually works (the PK value obtained from the NC index lookup is passed into the PK index in order to locate the whole row in the main table). If I drop the PK from the table, the showplan output now shows a separate RID lookup being joined with the NC index seek results in its place.

    Interesting.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply