Unexpected Clustered Index Scan

  • We have 10 large tables of 15 million+ rows each, all with a single clustered index on a column called individual_id_number. These tables are never updated and never have rows inserted into them. I tried to update a smaller 100k row table against these large tables with the update below and got some unexpected results:

    UPDATE a
    SET a.field1 = b.field1
    FROM dbo.TABLE_TO_UPDATE a
    JOIN dbo.LARGE_TABLE_1 b
        ON a.individual_id_number = b.individual_id_number

    It seems that when the smaller table was joined against one of the large tables with roughly 15-20 million rows the plan used an index scan, while joining against a table with roughly 21+ million rows produced a seek.

    The estimated operator cost never changed much between the scan and the seek, but whenever the execution plan showed a scan it always came with a high estimated I/O cost (~1000), while the seeks were around 0.00001 or so. I guess I just don't get why the optimizer would use a scan over a seek on such a simple update. Any ideas?
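    As a side note for anyone reproducing this: the figures above are per-operator estimates read from the graphical plan. The actual IO of each variant can be cross-checked with the standard STATISTICS IO output (generic T-SQL, not something from the original post):

    SET STATISTICS IO ON;
    -- run the UPDATE here, then compare the 'logical reads' reported
    -- for TABLE_TO_UPDATE and LARGE_TABLE_1 between the two plans
    SET STATISTICS IO OFF;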

  • Please post table definitions, index definitions and execution plan, as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Sorry about that. I went ahead and attached the index script as well as the actual execution plans. The first plan hits the smaller 16 million row table, while the second plan hits the 20 million row table.

    ALL tables have the exact same structure save the name of the id column, which is nothing more than an identity (no PK).

    The following is an abbreviated version of the CREATE TABLE, since the tables themselves have about 155 more int/tinyint columns like the education_level, none of which are part of the update statement.

    CREATE TABLE [dbo].[KBM_MAIN_0](
        [kbm_main_0_id] [int] IDENTITY(1,1) NOT NULL,
        [census_education_level_id] [tinyint] NULL,
        [individual_id_number] [bigint] NULL
    ) ON [PRIMARY]
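    The attached index script itself is not reproduced in the thread. Based on the description (a single non-unique clustered index on individual_id_number, no primary key), it presumably looks something like this reconstruction; the actual index name and options are not shown:

    -- Hypothetical reconstruction of the attached index script
    CREATE CLUSTERED INDEX [IX_KBM_MAIN_0_individual_id_number]
        ON [dbo].[KBM_MAIN_0] ([individual_id_number]);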

  • rossnruthie (5/13/2010)


    Sorry about that. I went ahead and attached the index script as well as the actual execution plans. The first plan hits the smaller 16 million row table, while the second plan hits the 20 million row table.

    ALL tables have the exact same structure save the name of the id column, which is nothing more than an identity (no PK).

    The following is an abbreviated version of the CREATE TABLE, since the tables themselves have about 155 more int/tinyint columns like the education_level, none of which are part of the update statement.

    CREATE TABLE [dbo].[KBM_MAIN_0](
        [kbm_main_0_id] [int] IDENTITY(1,1) NOT NULL,
        [census_education_level_id] [tinyint] NULL,
        [individual_id_number] [bigint] NULL
    ) ON [PRIMARY]

    The seek operation in the 2nd plan shows a huge difference between estimated and actual rows. Second, there is no sargable condition, so the optimizer is happy to go with a scan rather than a seek.

    Additionally, I have the impression that these two plans were captured with different amounts of data from the SAME tables.

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • rossnruthie (5/13/2010)


    We have 10 large tables of 15 million+ rows each, all with a single clustered index on a column called individual_id_number. These tables are never updated and never have rows inserted into them. I tried to update a smaller 100k row table against these large tables with the update below and got some unexpected results:

    UPDATE a
    SET a.field1 = b.field1
    FROM dbo.TABLE_TO_UPDATE a
    JOIN dbo.LARGE_TABLE_1 b
        ON a.individual_id_number = b.individual_id_number

    It seems that when the smaller table was joined against one of the large tables with roughly 15-20 million rows the plan used an index scan, while joining against a table with roughly 21+ million rows produced a seek.

    The estimated operator cost never changed much between the scan and the seek, but whenever the execution plan showed a scan it always came with a high estimated I/O cost (~1000), while the seeks were around 0.00001 or so. I guess I just don't get why the optimizer would use a scan over a seek on such a simple update. Any ideas?

    The IO cost you are talking about needs to be multiplied by the number of executions. The index scan is executed only once, so its total cost is the same as the estimated cost.

    The index seek is executed 444,222 times, so you must multiply the estimated IO cost by 444,222 to get the total estimated IO cost.

    The reason the optimizer chooses a scan or a seek in this case is simply that it chooses the plan with the lowest estimated total cost.

    If you think the optimizer chose the wrong plan, you can force it to always use a loop join by using a hint.
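    A minimal sketch of such a hint, reusing the table and column names from the original post (the join-hint syntax is standard T-SQL; appending OPTION (LOOP JOIN) to the statement would work as well):

    -- Forces a nested loops join, overriding the optimizer's cost-based choice
    UPDATE a
    SET a.field1 = b.field1
    FROM dbo.TABLE_TO_UPDATE a
    INNER LOOP JOIN dbo.LARGE_TABLE_1 b
        ON a.individual_id_number = b.individual_id_number;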

    If you are interested in speeding up this particular query, you could create a unique non-clustered index on individual_id_number INCLUDE(field1) on the large table. Such an index is much narrower than the original table, so the join requires much less IO. Making the index unique also tells the optimizer that there will be at most one hit for each individual, which helps it generate a better plan.
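    As a sketch, again reusing the names from the thread (field1 stands in for whatever column is actually being copied):

    -- Narrow covering index for the join; assumes individual_id_number
    -- really is unique in the large table, otherwise the CREATE will fail
    CREATE UNIQUE NONCLUSTERED INDEX IX_LARGE_TABLE_1_individual
        ON dbo.LARGE_TABLE_1 (individual_id_number)
        INCLUDE (field1);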

    Note that with this index, the optimizer will tend to choose an index scan even more than before, but that is probably exactly what you want.

    /SG

  • Stefan_G (5/14/2010)


    If you think the optimizer chose the wrong plan, you can force it to always use a loop join by using a hint.

    However, a loop join between 16 million rows and 20 million rows is insane. Loop joins work well if one of the tables has a small number of rows or a fairly selective filter. Neither is the case here.

    Note that with this index, the optimizer will tend to choose an index scan even more than before, but that is probably exactly what you want.

    Indeed, because it has only 2 real options here:

    1) Scan one table and seek for matching rows in the other. Insanely expensive considering the number of rows.

    2) Scan both (there are no filter predicates here) and do a merge or hash join.
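    For completeness, option 2 can be forced explicitly with a join hint. This is merely an illustrative sketch using the names from the original post; the optimizer normally picks the hash or merge join on its own here:

    -- Forces a hash join: both inputs are scanned once,
    -- instead of seeking into the big table per row
    UPDATE a
    SET a.field1 = b.field1
    FROM dbo.TABLE_TO_UPDATE a
    INNER HASH JOIN dbo.LARGE_TABLE_1 b
        ON a.individual_id_number = b.individual_id_number;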

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/14/2010)


    However, a loop join between 16 million rows and 20 million rows is insane. Loop joins work well if one of the tables has a small number of rows or a fairly selective filter. Neither is the case here.

    You have misread the query.

    The join is from 400,000 rows to 16 million rows.

    So the optimizer has to choose between performing 400,000 seeks into the big table or a single scan of 16 million rows.

    Not so easy to know which plan is best...
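    As a rough back-of-envelope illustration (an assumption-laden sketch, since the real numbers depend on index depth, page density, and caching): if the clustered index on the big table is about 4 levels deep, 400,000 seeks touch on the order of 400,000 x 4 = 1.6 million pages in random order, while a single scan reads each page of the table exactly once, sequentially. Which side wins depends on the table's total page count relative to that, which is why the cutover point moves with table size.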
