Introduction to Indexes: Part 2 – The clustered index

  • Sure, you can get index intersections. Doesn't mean you always will nor that they are the most optimal approach

    p.s. Yes, part 3 needs an edit in that area, but it did not say that SQL would never seek on all three and intersect.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • "A scan of the clustered index is equivalent to a table scan. It’s a read of all of the data pages in the table...

    This is a full scan of all the data pages, i.e. a table scan..."

    So if a table has 4.5 million rows and the Clustered Index Scan operator properties sheet displays "Actual number of rows 720939" for a query against that table, it's scanning the whole table but only returning 720,939 rows? The sheet subtitle says "Scanning a clustered index, entirely or only a range".

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (10/24/2011)


    So if a table has 4.5 million rows and the Clustered Index Scan operator properties sheet displays "Actual number of rows 720939" for a query against that table, it's scanning the whole table but only returning 720,939 rows?

    Quite likely, yes. There are some cases where a clustered index (or nonclustered index) scan can abort early, that's usually due to TOP or some constructs of Row_number.

    The Actual rows is always the rows returned, not the rows read, so if there's a predicate on that scan, it's the number of rows that qualified for the predicate, not the number read from the table.

    The sheet subtitle says "Scanning a clustered index, entirely or only a range".

    A bit incorrect. A range scan (<clustered key> between @SomeValue and @SomeOtherValue) appears as a seek.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (10/24/2011)


    ChrisM@Work (10/24/2011)


    So if a table has 4.5 million rows and the Clustered Index Scan operator properties sheet displays "Actual number of rows 720939" for a query against that table, it's scanning the whole table but only returning 720,939 rows?

    Quite likely, yes. There are some cases where a clustered index (or nonclustered index) scan can abort early, that's usually due to TOP or some constructs of Row_number.

    The Actual rows is always the rows returned, not the rows read, so if there's a predicate on that scan, it's the number of rows that qualified for the predicate, not the number read from the table.

    The sheet subtitle says "Scanning a clustered index, entirely or only a range".

    A bit incorrect. A range scan (<clustered key> between @SomeValue and @SomeOtherValue) appears as a seek.

    Awesome. Many thanks Gail.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • An integer column defined as identity (a very good and popular choice for clustered index) is 4 bytes.

    It's certainly popular, but most often not good. The no-thought identity as clustered key likely causes more performance problems than any other single thing.

    Most "guidelines" for the clus key affect mostly the INSERT of new rows, which is done only once. Keep in mind that rows will typically be read dozens or hundreds of times after being INSERTed. In fact, even very likely thousands of times if you force too many table scans by having the wrong clustered key.

    Nonclustered indexes should not be thought of as the main way to access data. You want the primary access to be through the clustered index.

    Often a (date) OR (parent_key, child_key) (such as order_number, item_number) makes a vastly better clustering key than identity.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (7/6/2012)


    An integer column defined as identity (a very good and popular choice for clustered index) is 4 bytes.

    It's certainly popular, but most often not good. The no-thought identity as clustered key likely causes more performance problems than any other single thing.

    We'll have to agree to disagree on that matter. I hold to the school of thought that the clustered index should first organise the data (narrow, unchanging, ever-increasing, unique). If it can also support queries that's a bonus.

    There's only one clustered index, multiple nonclustered indexes, clustered index to organise the table, nonclusters to support the queries.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/6/2012)

    There's only one clustered index, multiple nonclustered indexes, clustered index to organise the table, nonclusters to support the queries.

    Every column added to a query invalidates the existing covering index and will revert back to a table scan.

    How do you keep up with every single-column change to thousands of queries for thousands of databases and tens of thousands, or even millions, of tables, to keep all your non-clus indexes as truly covering?

    Just not possible, practical or worthwhile.

    The most important performance aspect of a table is determining the correct clustered index.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (7/6/2012)


    GilaMonster (7/6/2012)Every column added to a query invalidates the existing covering index and will revert back to a table scan.

    Not necessarily a table scan. Maybe a seek + lookup, maybe index intersection, maybe an index scan. SQL has lots of options for indexes.

    How do you keep up with every single-column change to thousands of queries for thousands of databases and tens of thousands, or even millions, of tables, to keep all your non-clus indexes as truly covering?

    I don't. I make covering indexes for critical queries, ensure that not-so-critical queries perform within requirements and make sure that new developments are correctly performance tested and that the developers have made any necessary index creations or modifications before the change are deployed to production.

    How do you keep up with changes to all the queries across thousands of databases and tens of thousands, or even millions tables to keep all your nonclustered indexes up to date and useful for the queries that don't use the clustered index, seeing as I have very seldom seen tables with one and only one access path? (and the cluster can only support one access path)

    The most important performance aspect of a table is determining the correct clustered index.

    Absolutely, and what I want is one that is narrow - doesn't make nonclustered indexes larger than they need to be -, unique - doesn't require a uniquifier column - , unchanging - that isn't going to require that every nonclustered index on the table be updated when the clustering key value is changed - and ever-increasing - that isn't going to cause huge fragmentation due to inserts.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • The most important performance aspect of a table is determining the correct clustered index.

    Absolutely, and what I want is one that is narrow - doesn't make nonclustered indexes larger than they need to be -, unique - doesn't require a uniquifier column - , unchanging - that isn't going to require that every nonclustered index on the table be updated when the clustering key value is changed - and ever-increasing - that isn't going to cause huge fragmentation due to inserts.

    The same rote quote, and, yes, those are good general rules, but not enough by themselves.

    Please, those of you learning about clustered indexes, look beyond the no-thought approach.

    Instead, consider the actual requirements of your specific table before determining the clustered index key(s) for that table.

    You'll save yourself huge performance problems and the extra work required to re-index the table properly later.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • As a matter of fact, you can’t avoid the scans (table / index) in real life or practical scenarios but you can minimize those and that’s what Gail is trying to explain here.

    Every columns that you add to query as predicate may or may not be covering at the moment. It needs not to be covering especially in ad-hoc query environments it’s very common. If a query that executes rarely, I don’t see any reason to worry much on it unless an explicit weight / priority is assign to that query.

    Also, you would not like to include all the columns in non-clustered index just to make it covering. Would you?

  • Dev (7/8/2012)


    As a matter of fact, you can’t avoid the scans (table / index) in real life or practical scenarios but you can minimize those and that’s what Gail is trying to explain here.

    Every columns that you add to query as predicate may or may not be covering at the moment. It needs not to be covering especially in ad-hoc query environments it’s very common. If a query that executes rarely, I don’t see any reason to worry much on it unless an explicit weight / priority is assign to that query.

    Also, you would not like to include all the columns in non-clustered index just to make it covering. Would you?

    Not at all. That's why I clearly stated, the (vast) majority of the queries should be using the clustered index to limit rows.

    What Gail stated is:

    clustered index to organise the table, nonclusters to support the queries.

    I make covering indexes for critical queries

    If nonclus are to support the queries, they need to be:

    1) covering indexes OR

    2) select only a very small % of the total rows

    Since you can't directly control #2, aren't you then forced to make covering indexes?

    If the query restricts to a minimum range of rows based on the clustered index, I don't have to make any adjustments then even if they select/compare every column in the table.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • What does that have to do with the clustered index key, though? You certainly can't cover every possible query with the clustered index (unless you have an incredibly simple database), so you have to have nonclustered ones--and if you have them, then having the clustering key as small as possible (since it needs to be repeated in every NC index) is surely a good idea, which is pretty much what Gail said?

  • Let me put this way; I have a table XYZ and an ID (Primary, Clustered Index). I typically don’t query on ID column but it’s used in many queries for joins. I typically search based on Description (Non-Clustered index).

    Now in this case, assume I don’t have Non-Clustered index in place and I query based on Description column, it will scan the cluster. Will it be a good plan? If I have Non-Clustered index (on description) in place, based on selectivity optimizer may pick the index. Isn’t it a good plan?

    In continuation to the same, if the query selects few non-key columns as well, I may include couple of columns (mostly queried) in it. If this index can satisfy 60-80% queries on that table I will consider it a good index.

    The major fact, we can have only one Clustered Index on a table (ignoring workaround with INCLUDE). ‘select only a very small % of the total rows’ is applicable to Clustered Index as well for seek operation else it will scan the cluster.

  • Dev (7/9/2012)‘select only a very small % of the total rows’ is applicable to Clustered Index as well for seek operation else it will scan the cluster.

    False, of course. If, for example, you have a table with two years' worth of data, 2010 and 2011, clustered on date, and you specify >= '20110101', SQL will read only the 2011 data.

    That is another main reason you should leverage the clus index as much as possible.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (7/9/2012)


    Dev (7/9/2012)‘select only a very small % of the total rows’ is applicable to Clustered Index as well for seek operation else it will scan the cluster.

    False, of course. If, for example, you have a table with two years' worth of data, 2010 and 2011, clustered on date, and you specify >= '20110101', SQL will read only the 2011 data.

    That is another main reason you should leverage the clus index as much as possible.

    It’s just an optimization added to SCAN operation but it won’t be good anyways if the data is skewed and ‘>= 20110101’ returns 90% of rows.

Viewing 15 posts - 106 through 120 (of 122 total)

You must be logged in to reply to this topic. Login to reply