How storage engine does clustered index scan operation

  • I would like to know, how the storage engine does the "clustered index scan" operation. I mean whether the clustered index scan, always reads the data pages of the table, sequentially or not...

    I'm running a query against a large table (Table Record Count - 11 Million; Table Size - 52 GB). See below the simplified version

    SELECT TOP 1000

    RowID, DateColumn_1, CHARColumn_2, CHARColumn_3,INTColumn_4

    FROM

    LargeTable

    WHERE

    DateColumn_1 >= '01/01/2008'

    AND DateColumn_1 <= '01/01/2009'

    Here,

    1. The column RowID is an identity column and it is defined as a primary key (clustered index) in the table.

    2. Some of the columns in the select clause, are not covered by any of the indices in that table.

    3. DateColumn_1 has an Non Clustered Index. I have attached the column statistics for your reference.

    4. Within this date range (used in the query), there are actually 2.6 Million records in the database.

    5. No other queries are executed in parallel, with this query.

    When i execute this query in a SQL 2005 environment (20 GB RAM), the (actual) execution plan shows a clustered index scan. But the query completes the execution in a second. Using the statistics IO option, i noticed that only about 3000 logical pages were actually read for this query execution. I have 2 questions,

    1. The date column "DateColumn_1" used in the where clause, is like a createdate column. It is logically aligned with the RowID primary key. i.e, Year 2005 records would have been inserted first in the table, followed by Year 2006, 2007, 2008,... I expected that the clustered index scan, would read the data pages sequentially from the first page of the table (Say from year 2005 data pages). In such case, the execution time should be more than what i have got.

    But actually that is not the case, as the execution completed in a second and entire operation has read only 3000 logical pages. How does the storage engine perform the clustered index scan? Is there any algorithm used internally instead of sequential read? or am i missing anything obvious?

    2. Why the SQL optimizer didn't use a non clustered index seek (on the date column) and a key lookup (on the clustered index) to complete the query. Even though the date range is really big, the query expects only the top 1000 records. As you normally expect, for smaller date range, the query uses the non clustered index on the date column.

    Thanks in advance for any help. If you need more details, please let me know.

  • A clustered index scan doesn't have to be in the order of the clustered index. It's just a read of all1 of the leaf pages of the table. Two ways it can be executed, either by reading along the leaf level following the next and previous page pointers or it can use the IAM page (which lists what pages are allocated to the index) and that reads pages in the order they are in the file.

    I'm guessing that what happened here is that, because of the TOP, SQL knew that it only needed some of the table and hence terminated the scan once it had enough rows. Because there's no ORDER BY, any 1000 rows will satisfy the query.

    I would further guess that the nonclustered index isn't used because it's not covering and there are too many rows to make the lookups optimal. If you're curious, force the nonclustered index and see how the query stats compare.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail,

    Thanks for your reply.

    I understand that because of the TOP operator, SQL terminated the clustered index scan once it had enough rows. But what surprises me, is the no. of logical reads (around 3000) actually used for that scan... That table has around 5 Million Data pages in total... Out of the 5 Million data pages, the optimizer completes the request in just 3000 records... Seems like, the optimizer knew that the request could be satisfied with few page scan and their possible location too.... I guess that may be the reason for using the clustered index scan rather than the key lookup... But i don't know, how the SQL engine made that decision and how it executed the same.

    I even tried this for different (larger) date ranges. I noticed that whenever the clustered index scan is used, the logical read in most (but not all) of the scenario is very less...

    For the second part, I think the key lookup would still be optimal as the no. of records it needed to return, is only 1000 (Even though there are more qualifying rows). Please let me know if my understanding is wrong.

    I actually forced the query to use Non clustered index. The query execution completed quickly as expected and used only around 3000 logical reads for that. Based on the stats, there is no difference between the two approaches.

    I also would like to add that, most of the data pages are in the RAM. Also the table is not partitioned but the database is spread across in 5 data files (across 5 drives). I guess these things doesn't make any difference...

    Please let me know your thoughts.

  • MathiPrakash (12/27/2009)


    Out of the 5 Million data pages, the optimizer completes the request in just 3000 records...

    Not 3000 records. 3000 pages. (unless you have rows > 4000 bytes). Reads are measures in pages.

    For the second part, I think the key lookup would still be optimal as the no. of records it needed to return, is only 1000 (Even though there are more qualifying rows). Please let me know if my understanding is wrong.

    Can't say without seeing the exec plan, but generally the index seek/scan will return more rows than necessary and the TOP operator is then applied right at the end of the execution plan, just before returning the data. This is so that joins/filters can't reduce the number of valid rows to below that needed for the TOP (unless there really are fewer qualifying rows in the table).

    I actually forced the query to use Non clustered index. The query execution completed quickly as expected and used only around 3000 logical reads for that. Based on the stats, there is no difference between the two approaches.

    The optimiser prefers not to use key lookups, because the IOs are random rather than sequential. An index scan on an unfragmented index would be sequential IOs. Random IOs are more costly because of the time required to move the disk head to the new location

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail,

    See the attached screenshot of the execution plan, for the query where i forced the non clustered index. As you can see, the non clustered index seek returns exactly 1000 records for Key lookup operation.... I have even tried changing the TOP 1000 with TOP 1, the result is the same. Also my query doesn't have any other filter or join condition.

    In this case, I'm wondering why the optimizer doesn't do the key lookup, without me enforcing it.

    Sincere apologies for not attaching the actual execution plan file, as Iā€™m prohibited from doing so here šŸ™

    Thanks anyway for your help.

  • MathiPrakash (12/27/2009)


    In this case, I'm wondering why the optimizer doesn't do the key lookup, without me enforcing it.

    Offhand, not sure, but I did say in the previous post why the optimiser doesn't like key lookups.

    Force the seek if you like, just make notes to test frequently to ensure that the forced index is still fast. However, since both plans are about the same duration and about the same reads, why bother? It runs fine with the default plan?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • >>WHERE DateColumn_1 >= '01/01/2008' AND DateColumn_1 <= '01/01/2009'

    I would like to question that WHERE clause. Given the dates picked I would think you are looking for one year's worth of data. yet your syntax will get you 1 year and part of 1 day's worth of data (all of the 'extra' day if you have no time values other than 00:00:00). Is that what you intended?

    If you just want all of 2008, this is proper:

    WHERE DateColumn_1 >= '01/01/2008' AND DateColumn_1 < '01/01/2009' --note the missing second equals

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Gail,

    Thanks for helping me... Actually I don't have any performance issue with the default plan. It is just that, I noticed few behavior which I couldn't understand so I wanted to clarify. I'm okay with either plan. I will test it for sure.

    SQL Guru,

    I understand your point. Thanks. I just made the query by myself to explain my situation in a simple way. šŸ™‚

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply