Date string in WHERE causes table scan instead of index seek

  • Joseph Fallon (6/11/2009)


    ...

    select * from [investing fund transactions] where allocation_date > '2009-06-01'

    and prealloc = 1

    Returns 50 records, uses index

    select * from [investing fund transactions] where allocation_date < '2009-06-01'

    and prealloc = 1

    Returns 3625 records, does not use index

    select fund_no from [investing fund transactions] where allocation_date < '2009-06-01'

    and prealloc = 1

    Always uses the index

    See previous posts in this thread regarding the optimizer's decisions regarding using an index and the number of records likely to be returned. The first select is a "SELECT *" and the "<" date is pretty recent, so you're basically asking "Give me almost everything in the table, then filter it by this bit", which it decides is a table scan. The third query will use the index if fund_no is covered, I believe.

    You'd probably get better results having the index on prealloc then allocation_date, and changing the query to where prealloc = 1 and allocation_date < '2009-06-01'?

    --------
    [font="Tahoma"]I love deadlines. I like the whooshing sound they make as they fly by. -Douglas Adams[/font]

  • GilaMonster (6/11/2009)


    Joseph Fallon (6/11/2009)


    When I don't use "SELECT * FROM " the query always uses the covering index. Thanks for the tip Mr Stobbs.

    Is the index really covering for SELECT *? Does it have every single column in the table somewhere in the index (key or include)?

    You got me - I thought that a covering index only had to include the fields in the WHERE clause.

    Rob Goddard (6/11/2009)


    Joseph Fallon (6/11/2009)


    ...

    select * from [investing fund transactions] where allocation_date > '2009-06-01'

    and prealloc = 1

    Returns 50 records, uses index

    select * from [investing fund transactions] where allocation_date < '2009-06-01'

    and prealloc = 1

    Returns 3625 records, does not use index

    select fund_no from [investing fund transactions] where allocation_date < '2009-06-01'

    and prealloc = 1

    Always uses the index

    See previous posts in this thread regarding the optimizer's decisions regarding using an index and the number of records likely to be returned. The first select is a "SELECT *" and the "<" date is pretty recent, so you're basically asking "Give me almost everything in the table, then filter it by this bit", which it decides is a table scan. The third query will use the index if fund_no is covered, I believe.

    You'd probably get better results having the index on prealloc then allocation_date, and changing the query to where prealloc = 1 and allocation_date < '2009-06-01'?

    Thanks Rob, will try that.

  • Joseph Fallon (6/11/2009)


    GilaMonster (6/11/2009)


    Joseph Fallon (6/11/2009)


    When I don't use "SELECT * FROM " the query always uses the covering index. Thanks for the tip Mr Stobbs.

    Is the index really covering for SELECT *? Does it have every single column in the table somewhere in the index (key or include)?

    You got me - I thought that a covering index only had to include the fields in the WHERE clause.

    An index is covering for a query if it contains all of the columns that the query needs. Means that the query can be satisfied without needing to read from the cluster/heap.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Christopher Stobbs (6/11/2009)


    Index should be: (BitField, DateField).

    If BitField is really a BIT field then it cannot be included into an index.

    And if it's not it should not really be used in an index, at least not at the first place.

    Always go for most selective one first.

    0/1 fields may help only if you are after the selection of 1% or less from the total recordset.

    _____________
    Code for TallyGenerator

  • Hi Sergiy,

    The index that I created to test the OP's problem definitely has a field in postion 1 that is of type 'BIT'.

    I do agree with you I would probably never use a bit field in an index as you mentioned the selectivity is low.

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • Sergiy (6/11/2009)


    Christopher Stobbs (6/11/2009)


    Index should be: (BitField, DateField).

    If BitField is really a BIT field then it cannot be included into an index.

    Sure it can. It's only the LOB data types that can't be in the index key. Enterprise manager won't allow a bit in the index key, but that's a tool issue not a restriction in SQL.

    And if it's not it should not really be used in an index, at least not at the first place.

    Always go for most selective one first.

    http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (6/12/2009)


    Sergiy (6/11/2009)


    Christopher Stobbs (6/11/2009)


    Index should be: (BitField, DateField).

    If BitField is really a BIT field then it cannot be included into an index.

    Sure it can. It's only the LOB data types that can't be in the index key. Enterprise manager won't allow a bit in the index key, but that's a tool issue not a restriction in SQL.

    Well, let's say I totally support designers of EM in this.

    It's been so long time since I tried to create an index on BIT column I forgot it's may be done.

    And if it's not it should not really be used in an index, at least not at the first place.

    Always go for most selective one first.

    http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/

    I did not find anything in the article which would contradict my point.

    Or was it in support?

    I found some logical errors and misleading points in that article though.

    For example, you (it's your article, right?) are talking about non-clustered index on "heap" table there. But you non-clustered index is a single index on the table and covers all the columns of it. So, effectively, it's a copy of the table, its version with clustered index.

    By using wrong terms you lead people to wrong conclusions.

    Another error is here:

    For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.

    Inequalities give the same kind of idea about number of rows as equalities.

    And order of columns in an index is as important as for equality predicates.

    What is missing in your article is a second index.

    You need a table with non-selective (bit-like, 0 and 1) column and highly selective another one (like datetime column discussed in this topic).

    Create 2 indexes: (Col1, Col2) and (Col2).

    Then you run run a query :

    WHERE Col1 = 1 AND Col2 Between Val1 and Val2.

    If you take close values for Val1 and Val2 optimizer will always choose to go for inequality first and then scan for equality.

    You would do the same.

    Imagine, you need to find in a book all the odd pages having keywords between "keyphrase" and "keyword".

    You've got 2 indexes:

    - one will give you the list of all odd pages, and each page reference has the list of keywords on it;

    - another one has list of all keywords followed be list of pages where they are.

    Would you even bother to use 1st index?

    Of course, if you'd need to select keywords between "B" and "Y" then you'd go for odd pages first, as that list would be more selective.

    In conclusion I repeat my point:

    Always go for most selective column first.

    No matter if it's equality or inequality comparison.

    _____________
    Code for TallyGenerator

  • Sergiy (6/13/2009)


    I did not find anything in the article which would contradict my point.

    Or was it in support?

    No, it's in contradiction. I do not and never have agreed with the statement 'most selective column first', especially not when given alone without any other info on index column order.

    For example, you (it's your article, right?) are talking about non-clustered index on "heap" table there. But you non-clustered index is a single index on the table and covers all the columns of it. So, effectively, it's a copy of the table, its version with clustered index.

    It's a covering, nonclustered index on a heap. I could add 10 more columns to the table and it wouldn't matter in the slightest as long as they weren't part of the query.

    Another error is here:

    For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.

    Inequalities give the same kind of idea about number of rows as equalities.

    Not from selectivity alone.

    If I said that the selectivity of ColX (int, not null) is 0.5, then you can correctly assume that on average an equality match that returns data will return 2 rows. That's what selectivity means. From the selectivity alone however there's no way to say how many rows an inequality will return.

    Can you, from nothing but the selectivity (which we'll say is 0.5 here) tell how many rows will be returned by (ColX < 100)?

    Note, that piece you quoted was only talking about selectivity. Not the histogram, the data distribution or anything else.

    And order of columns in an index is as important as for equality predicates

    Yes and no. (btw, there's another post on equality predicates)

    With equalities (Say ColA = @a AND ColB = @b-2 AND ColC = @C), you'll get an index seek on all three regardless of what the order of the columns in the index are (Providing only those 3 columns are in that index). That is not the case when the predicate becomes (Say ColA = @a AND ColB = @b-2 AND ColC < @C). Put ColC as the leading column and SQL has to do a single column seek on ColC (seek predicate) and filter the other two later (predicate).

    What is missing in your article is a second index.

    You need a table with non-selective (bit-like, 0 and 1) column and highly selective another one (like datetime column discussed in this topic).

    Create 2 indexes: (Col1, Col2) and (Col2).

    Then you run run a query :

    WHERE Col1 = 1 AND Col2 Between Val1 and Val2.

    If you take close values for Val1 and Val2 optimizer will always choose to go for inequality first and then scan for equality.

    CREATE TABLE InequalityTest (

    IsEven BIT NOT NULL,

    SomeValue INT

    )

    Insert into InequalityTest (IsEven, SomeValue)

    SELECT RowNo%2, RowNo FROM

    (SELECT TOP (100000) ROW_NUMBER() OVER (Order by (select 1)) AS RowNo FROM master.sys.columns a cross join master.sys.columns b) sub

    Create Index idx_1 ON InequalityTest (IsEven, SomeValue)

    Create Index idx_2 ON InequalityTest (SomeValue, IsEven)

    -- They're both on both columns because comparing a covering index with a noncovering index is senseless. Bookmark/key lookups are expensive.

    GO

    -- nonclustered index seek on idx_1

    SELECT * FROM InequalityTest WHERE IsEven=1 AND SomeValue Between 1 and 2 OPTION (RECOMPILE)

    -- nonclustered index seek on idx_1

    SELECT * FROM InequalityTest WHERE IsEven=1 AND SomeValue Between 10 and 20 OPTION (RECOMPILE)

    -- nonclustered index seek on idx_1

    SELECT * FROM InequalityTest WHERE IsEven=1 AND SomeValue Between 100 and 200 OPTION (RECOMPILE)

    -- nonclustered index seek on idx_1

    SELECT * FROM InequalityTest WHERE IsEven=1 AND SomeValue Between 1000 and 2000 OPTION (RECOMPILE)

    -- nonclustered index seek on idx_1

    SELECT * FROM InequalityTest WHERE IsEven=1 AND SomeValue Between 10000 and 20000 OPTION (RECOMPILE)

    The recompile is there to ensure that I'm not getting a cached plan reused. In all of those cases the optimiser chose the index with the bit first, ie it decided that it was better to do the equality first, then the inequality

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail,

    I'm not sure how bad is SQL2005, but my experience is built on SQL2000 real life scenarios and it quite contradicts your statement.

    Because it's actually SQL2000 forum, I allowed myself to converted your example into SQL2000 compatible script.

    CREATE TABLE InequalityTest (

    IsEven BIT NOT NULL,

    SomeValue INT

    )

    Insert into InequalityTest

    (IsEven, SomeValue)

    SELECT N%2, N

    FROM dbo.Tally

    -- (65536 row(s) affected)

    Create Index idx_1 ON InequalityTest (IsEven, SomeValue)

    Create Index idx_2 ON InequalityTest (SomeValue, IsEven)

    GO

    I believe I did not change anything meanungful so far.

    Then I ran same set of tests.

    SET STATISTICS PROFILE ON

    -- nonclustered index seek on idx_1

    DBCC FREEPROCCACHE

    SELECT * FROM InequalityTest

    WHERE IsEven=1 AND SomeValue Between 1 and 2

    -- nonclustered index seek on idx_1

    DBCC FREEPROCCACHE

    SELECT * FROM InequalityTest

    WHERE IsEven=1 AND SomeValue Between 10 and 20

    -- nonclustered index seek on idx_1

    DBCC FREEPROCCACHE

    SELECT * FROM InequalityTest

    WHERE IsEven=1 AND SomeValue Between 100 and 200

    -- nonclustered index seek on idx_1

    DBCC FREEPROCCACHE

    SELECT * FROM InequalityTest

    WHERE IsEven=1 AND SomeValue Between 1000 and 2000

    -- nonclustered index seek on idx_1

    DBCC FREEPROCCACHE

    SELECT * FROM InequalityTest

    WHERE IsEven=1 AND SomeValue Between 10000 and 20000

    SET STATISTICS PROFILE OFF

    Correct me if I'm wrong but DBCC FREEPROCCACHE deletes all stored plans and forces recompilation of all queries.

    Here is shortened version of the query plan for 1st query:

    Rows Executes StmtText

    1 1 SELECT * FROM [InequalityTest] WHERE [IsEven]=@1 AND ([SomeValue]>=@2 AND [SomeValue]= Convert([@2]) AND [InequalityTest].[SomeValue] <= Convert([@3])), WHERE: (Convert([InequalityTest].[IsEven])=[@1]) ORDERED FORWARD)

    Same query plan was chosen for all other queries, even for least selective on inequal comparison.

    Full statistics of the queries are in the attachment.

    As you can see SQL Server 2000 always went for index seek on idx_2, never on idx_1.

    And it makes perfect sense in terms of used resourses, execution time and overall performance.

    Will try you scripts on SQL2005, and if the test will confirm your statment it will be another point to the statement that SQL2000 remains the best SQL Server platform MS ever developed.

    _____________
    Code for TallyGenerator

  • To add to the picture.

    I decided to force using index idx_1:

    SET STATISTICS PROFILE ON

    DBCC FREEPROCCACHE

    SELECT * FROM InequalityTest

    WHERE IsEven=1 AND SomeValue Between 10000 and 20000

    DBCC FREEPROCCACHE

    SELECT * FROM InequalityTest WITH (index(idx_1))

    WHERE IsEven=1 AND SomeValue Between 10000 and 20000

    SET STATISTICS PROFILE OFF

    Query plans are attached.

    As you can see the option using idx_1 is about 7 times (!) more expensive than option using idx_2.

    _____________
    Code for TallyGenerator

  • Can you show Statistics IO and Statistics time output please? The costs in the exec plan are estimates and I've known them to be wrong before.

    On SQL 2008 (I don't have 2000 available anywhere any longer), I get the following execution characteristics

    First the index with the bit column first. This is the index that 2008 selects by default

    SELECT * FROM InequalityTest WITH (INDEX=idx_1)

    WHERE IsEven=1 AND SomeValue Between 10000 and 20000

    Table 'InequalityTest'. Scan count 1, logical reads 15, physical reads 0

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 228 ms.

    35% of the overall query cost

    Now forcing the index with the column used in the inequality first.

    SELECT * FROM InequalityTest WITH (INDEX=idx_2)

    WHERE IsEven=1 AND SomeValue Between 10000 and 20000

    Table 'InequalityTest'. Scan count 1, logical reads 26, physical reads 0

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 345 ms.

    65% of the overall query cost.

    So forcing the index that has the column used for the inequality first incurs 11 more IOs and 100 ms longer execution time.

    Exec plans (as .sqlplan files) attached. You'll notice that in the first query the index seek's seek predicte is on two columns while in the second the seek predicate is only on one column and there's a secondary filter (predicate) applied after.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Here it is:

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    DBCC FREEPROCCACHE

    SELECT IsEven,SomeValue FROM InequalityTest

    WHERE IsEven=1 AND SomeValue Between 10000 and 20000

    DBCC FREEPROCCACHE

    SELECT IsEven,SomeValue FROM InequalityTest WITH (index(idx_1))

    WHERE IsEven=1 AND SomeValue Between 10000 and 20000

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    Output:

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 4 ms.

    SQL Server parse and compile time:

    CPU time = 0 ms, elapsed time = 1 ms.

    (5000 row(s) affected)

    Table 'InequalityTest'. Scan count 1, logical reads 25, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 40 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    SQL Server Execution Times:

    CPU time = 3 ms, elapsed time = 3 ms.

    (5000 row(s) affected)

    Table 'InequalityTest'. Scan count 1, logical reads 155, physical reads 0, read-ahead reads 0.

    _____________
    Code for TallyGenerator

  • Interesting. The second's a full index scan. On 2008 I get seeks on both.

    You'll note that the number of reads of the more optimal one on 2000 is the same as the number of reads on the less optimal on on 2008. Looks like the optimiser got improved between the versions and can now recognise that a seek can be done.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I'm wondering if there's an implicit conversion happening. I've seen it before on comparisons with BIT columns and this suggests it's here too.

    Convert([InequalityTest].[IsEven])=1

    If there is, it'll explain why there's an index scan despite a sargable predicate.

    If you're willing to humour me, can you try these?

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    DBCC FREEPROCCACHE

    SELECT IsEven,SomeValue FROM InequalityTest

    WHERE IsEven=CAST(1 AS BIT) AND SomeValue Between 10000 and 20000

    DBCC FREEPROCCACHE

    SELECT IsEven,SomeValue FROM InequalityTest WITH (index(idx_1))

    WHERE IsEven=CAST(1 AS BIT) AND SomeValue Between 10000 and 20000

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I cannot remeber if this was fixed in a patch, but SQL200 used to have issues with BIT columns. You might want to re-run your tests by CASTing the literal value to a bit before hand.

    I used to work on a SQL 2000 DB with tables in 100s of millions a billions of rows and there were several BIT fields and the simple act of casting the literal could increase performace by more than double in some cases.

    EDIT: Crap, figures I didn't read the last reply by Gail. It looks like she alrady got it.. 🙂

Viewing 15 posts - 16 through 30 (of 31 total)

You must be logged in to reply to this topic. Login to reply