August 28, 2018 at 2:48 pm
ScottPletcher - Tuesday, August 28, 2018 1:19 PMJeff Moden - Tuesday, August 28, 2018 12:15 PMScottPletcher - Tuesday, August 28, 2018 11:54 AMJeff Moden - Tuesday, August 28, 2018 11:43 AMScottPletcher - Tuesday, August 28, 2018 9:29 AMIf you (almost) always query by COLA or ( COLA, COLB ), then cluster the table on ( COLA, COLB ). Then you'll always get a seek with either of those searchesm, and you don't have to constantly adjust non-clus indexes as stats change, or constantly test stats. [Nonclus indexes are typically not useful for any significant numbers of rows (unless you make them covering indexes, in which case you end up storing table data 2-6x to maintain all the covering mini-tables).]Btw, forget any myth about (almost) "always clustering by identity". IF you (almost) always query the table that way, cluster it that way. All your queries should speed up without additional work.
Before you cluster on a column, understand the fragmentation patterns during inserts and expansive updates or you may be in for a nasty surprise when it comes to page splits and future read-aheads, which can cause huge performance issues. Also understand that the leaf level of clustered indexes is usually far wider than a proper non-clustered index and the performance and memory ramifications for that little issue. Clustering is not a panacea of performance. There are, indeed, other serious side affects that you must also take into consideration.
Unless the values are just extraordinarily random, that's almost never an issue. Nothing's a panacea, but best clustering is the closest thing to a panacea that you have.
Don't forget that when the data is properly clustered, you can then remove all non-clus indexes that start with ( ColA ) or ( ColA, ColB ).
Contrary to popular belief, extraordinarily random columns would actually be better than anything that has a limited domain, suffers out of order inserts or updates, or is subject to inserts followed by expansive updates. And, no... clustering isn't even close to being a panacea because they don't actually know of all the other things that I speak of.
I've got thirty years of experience with rdbms tuning, the last 17 years focused on SQL Server. I've had only a handful of dbs where the proper clustering for a majority of tables was an identity column. There's most often something much better.
Yes, one should check index usage stats, missing index stats, cardinality and other things, but most often, identity is the wrong clustering for best overall performance. All the I/O from the gazillions of covering indexes really adds up.
I'm not talking about the IDENTITY thing. I'm saying that Clustering isn't a panacea based only on the most common used scripts. And stop knocking your bloody ring. You aren't the only one that has one. 😉 I may not have 30 years in rdbms tuning but I have been dedicated to SQL Server for 22. Even then, there are people that have pedigrees that will make us both look like kids. 😛 I've also done a huge amount of testing in the last 10 months because of problems in most indexes caused by multiple common beliefs including the one that says you should cluster based on the most common queries.
Hmmm.... that does give me an idea for an article.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 29, 2018 at 8:23 am
Jeff Moden - Tuesday, August 28, 2018 2:48 PMScottPletcher - Tuesday, August 28, 2018 1:19 PMJeff Moden - Tuesday, August 28, 2018 12:15 PMScottPletcher - Tuesday, August 28, 2018 11:54 AMJeff Moden - Tuesday, August 28, 2018 11:43 AMScottPletcher - Tuesday, August 28, 2018 9:29 AMIf you (almost) always query by COLA or ( COLA, COLB ), then cluster the table on ( COLA, COLB ). Then you'll always get a seek with either of those searchesm, and you don't have to constantly adjust non-clus indexes as stats change, or constantly test stats. [Nonclus indexes are typically not useful for any significant numbers of rows (unless you make them covering indexes, in which case you end up storing table data 2-6x to maintain all the covering mini-tables).]Btw, forget any myth about (almost) "always clustering by identity". IF you (almost) always query the table that way, cluster it that way. All your queries should speed up without additional work.
Before you cluster on a column, understand the fragmentation patterns during inserts and expansive updates or you may be in for a nasty surprise when it comes to page splits and future read-aheads, which can cause huge performance issues. Also understand that the leaf level of clustered indexes is usually far wider than a proper non-clustered index and the performance and memory ramifications for that little issue. Clustering is not a panacea of performance. There are, indeed, other serious side affects that you must also take into consideration.
Unless the values are just extraordinarily random, that's almost never an issue. Nothing's a panacea, but best clustering is the closest thing to a panacea that you have.
Don't forget that when the data is properly clustered, you can then remove all non-clus indexes that start with ( ColA ) or ( ColA, ColB ).
Contrary to popular belief, extraordinarily random columns would actually be better than anything that has a limited domain, suffers out of order inserts or updates, or is subject to inserts followed by expansive updates. And, no... clustering isn't even close to being a panacea because they don't actually know of all the other things that I speak of.
I've got thirty years of experience with rdbms tuning, the last 17 years focused on SQL Server. I've had only a handful of dbs where the proper clustering for a majority of tables was an identity column. There's most often something much better.
Yes, one should check index usage stats, missing index stats, cardinality and other things, but most often, identity is the wrong clustering for best overall performance. All the I/O from the gazillions of covering indexes really adds up.
I'm not talking about the IDENTITY thing. I'm saying that Clustering isn't a panacea based only on the most common used scripts. And stop knocking your bloody ring. You aren't the only one that has one. 😉 I may not have 30 years in rdbms tuning but I have been dedicated to SQL Server for 22. Even then, there are people that have pedigrees that will make us both look like kids. 😛 I've also done a huge amount of testing in the last 10 months because of problems in most indexes caused by multiple common beliefs including the one that says you should cluster based on the most common queries.
Hmmm.... that does give me an idea for an article.
I didn't say cluster based only on "most common queries". I said if (almost) all of the critical queries are done using a given column(s).
Although I'd take clustering based on a large majority of queries over one based on a "default" of identity any day. There should be no such thing as "default clustering key".
Isolated fragmentation can be easily and efficiently deal with by defragmenting the table.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 29, 2018 at 8:24 am
It'd be interesting if the current poster would cluster his table on ( ColA, ColB ), then run the original queries. The results will be vastly better than before, and they won't have to worry about stats or hitting a tipping point later where the query than ran fine yesterday suddenly scans the entire table today.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 29, 2018 at 8:43 am
ScottPletcher - Wednesday, August 29, 2018 8:23 AMI didn't say cluster based only on "most common queries". I said if (almost) all of the critical queries are done using a given column(s).Although I'd take clustering based on a large majority of queries over one based on a "default" of identity any day. There should be no such thing as "default clustering key".
Isolated fragmentation can be easily and efficiently deal with by defragmenting the table.
Ok... there's not much difference there, Scott. Clustering on the "critical queries" can actually slow things down quite a bit because of the number of pages that would need to be loaded into memory instead of a nice narrow dedicated non-clustered index.
And, not.... the "isolated fragmentation" won't be "efficiently deal with by defragmenting the table especially if the table is large.
And there you go again with the "Identity" thing. I'm not talking about the "Identity" thing nor am I suggesting that anyone automatically default to it. But there is a much larger picture that must be considered than just "queries" when selecting a clustered index.
Another thing that people seem to forget about when selecting a clustered key is the effect that page splits will have on the system and table, especially if the table is wide. You can end up with some rather serious "siloing" and some "across the board" page splits on INSERT because the inserts will likely be "out of logical order" for the clustered index. That can very quickly result in wasting 50% (and sometimes much more) of your memory not to mention the additional entries made (everything about a page split, including data movement caused by the split, is fully logged) on the log file (which further slows things down) and the additional blocking (page splits occur across at least 3 pages, 2 existing and 1 new) caused by the system level transaction that occurs. Since clustered indexes tend to be much wider that non-clustered indexes, any FILL FACTOR may end up being an additional total waste of memory because INSERTs don't follow a FILL FACTOR. They only use space on pages where the new rows fit (including that caused by a FILL FACTOR) and will do a split if not. Since the rows are much wider, there's a much greater chance (sometimes a guarantee) of splits if the inserts into the CI are, in fact, out of logical order.
Suggesting that the clustered index be based on the either the most critical or the most common queries is simply replacing one bad default mentality with another, especially on larger tables. If your most critical query depends only on a couple of columns in a large table, clustering based on that query is a huge waste of the clustered index and will cause either SQL Server to opt to use a scan instead of a seek and range scan or cause the query to run much slower than it should because of all the extra pages of data that it has to plog through.
The bottom line is as with all else in SQL Server... "It Depends". Substituting one bad default thought process for another one isn't the way to go.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 29, 2018 at 9:04 am
It is gratifying to have stimulated such a lively debate particularly when three of my earlier posts received no replies at all. From my perspective the matter has been satisfactorily resolved, as the query now returns instant results. By way of further background it's a third party-supplied schema, so I am rather stuck with it, and the problem arose in a test environment that has been little used lately. Hopefully regular use and index maintenance will ensure that it doesn't recur in production.
August 29, 2018 at 9:23 am
jontris - Wednesday, August 29, 2018 9:04 AMIt is gratifying to have stimulated such a lively debate particularly when three of my earlier posts received no replies at all. From my perspective the matter has been satisfactorily resolved, as the query now returns instant results. By way of further background it's a third party-supplied schema, so I am rather stuck with it, and the problem arose in a test environment that has been little used lately. Hopefully regular use and index maintenance will ensure that it doesn't recur in production.
Your post (this one) was made yesterday and a lot of people responded even though there was very little to go on (no table information, no number of rows, no execution plan, etc) so I'm not sure what you mean by "received no replies at all". When you say, "earlier posts", are you speaking of different posts other than this one? If so, it may have been due to the same lack of information and people simply didn't know how to respond.
Also, was there anything that you did to get your query to "return instant results" other than rebuilding an existing index or was that the whole solution?
Understood on the limits of 3rd party stuff. It can be really frustrating.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 29, 2018 at 10:16 am
Jeff Moden - Wednesday, August 29, 2018 8:43 AMScottPletcher - Wednesday, August 29, 2018 8:23 AMI didn't say cluster based only on "most common queries". I said if (almost) all of the critical queries are done using a given column(s).Although I'd take clustering based on a large majority of queries over one based on a "default" of identity any day. There should be no such thing as "default clustering key".
Isolated fragmentation can be easily and efficiently deal with by defragmenting the table.
Ok... there's not much difference there, Scott. Clustering on the "critical queries" can actually slow things down quite a bit because of the number of pages that would need to be loaded into memory instead of a nice narrow dedicated non-clustered index.
And, not.... the "isolated fragmentation" won't be "efficiently deal with by defragmenting the table especially if the table is large.
And there you go again with the "Identity" thing. I'm not talking about the "Identity" thing nor am I suggesting that anyone automatically default to it. But there is a much larger picture that must be considered than just "queries" when selecting a clustered index.
Another thing that people seem to forget about when selecting a clustered key is the effect that page splits will have on the system and table, especially if the table is wide. You can end up with some rather serious "siloing" and some "across the board" page splits on INSERT because the inserts will likely be "out of logical order" for the clustered index. That can very quickly result in wasting 50% (and sometimes much more) of your memory not to mention the additional entries made (everything about a page split, including data movement caused by the split, is fully logged) on the log file (which further slows things down) and the additional blocking (page splits occur across at least 3 pages, 2 existing and 1 new) caused by the system level transaction that occurs. Since clustered indexes tend to be much wider that non-clustered indexes, any FILL FACTOR may end up being an additional total waste of memory because INSERTs don't follow a FILL FACTOR. They only use space on pages where the new rows fit (including that caused by a FILL FACTOR) and will do a split if not. Since the rows are much wider, there's a much greater chance (sometimes a guarantee) of splits if the inserts into the CI are, in fact, out of logical order.
Suggesting that the clustered index be based on the either the most critical or the most common queries is simply replacing one bad default mentality with another, especially on larger tables. If your most critical query depends only on a couple of columns in a large table, clustering based on that query is a huge waste of the clustered index and will cause either SQL Server to opt to use a scan instead of a seek and range scan or cause the query to run much slower than it should because of all the extra pages of data that it has to plog through.
The bottom line is as with all else in SQL Server... "It Depends". Substituting one bad default thought process for another one isn't the way to go.
It's true that many tables are very poorly normalized -- or not normalized at all -- and this leads to more overhead for splits than necessary. That is a major problem, but it's not caused by properly clustering the table, it's caused by the complete lack of logical and physical data modeling, or by having them done by programmers. Still, it's quite common on the pre-existing tables I've tuned and it's actually a major issue only in rare cases.
"And, no.... the "isolated fragmentation" won't be "efficiently deal with by defragmenting the table especially if the table is large."
Again, not really true. SQL will only defragment those pages that need it. Yes, SQL will need to review all the pages, but I'm not sure now (SQL 2016) that SQL even has to move them into a buffer to determine if they need defragmented. Even if it does, it's an online operation and wouldn't need to be run excessively. And very limited areas of a very large table that are the typical max of 50% fragmented aren't going to be a killer for such a large table.
The big thing is not to forget about all the overhead of all those nonclus, covering indexes. Yes, each may individually be relatively narrow, esp. at first, but over time you get more more of them and they become wider. As queries using a covering index add column(s), it(they) must be added to the index. That's where you end up with the 2-8x more overall space being needed, not to mention all the additional INSERTs and UPDATEs.
Also, instead of a page split on one index (the clus), we have splits on potentially, and eventually, all the covering indexes: 6, 8, however many.
It's also not true that SQL will "opt to use a scan instead of a seek and range scan" if you cluster the table by the WHERE = column(s). SQL does not go back to a full scan for a clus index. There is no tipping point: SQL can do a seek then scan even if you're reading 70% of the table. Many people confuse this aspect of the clus index.
In fact, that's one good way to help find tables that are badly clustered now. Say the table is clustered on id/identity, and you see lots of nonclus indexes keyed by colA with different INCLUDEs. It's reasonable to review and determine if perhaps the table should be clustered on colA to being with.
Finally, for standard "child" tables, such as "order_items", the lead clus key should usually be the parent key, which is most often an identity anyway, such as "order_id". Thus, a clus key such as ( order_id, item_id ) is normally vastly better than just ( item_id ) and is still generally sequential and narrow.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 30, 2018 at 3:13 am
@jeff, my 'earlier posts' were different posts and I was trying to be tongue-in cheek rather than make a serious complaint. Perhaps irony doesn't carry into the forums very well. Probably, as you say, people didn't know how to respond to those posts, although I've noticed that doesn't always stop people (tongue-in-cheek again...). Also they were in the 2008 forum, so perhaps of less general interest.
Rebuilding the index, or rather refreshing stats with FULLSCAN (pace Grant) was the whole solution and I'm annoyed with myself for not thinking of it. It's not as if I haven't seen it before. But thanks again, everyone, for your contributions.
August 30, 2018 at 6:43 am
Good discussion here. Choosing clustering keys and fill factors strikes me as one of those "black arts" when it comes to SQL Server and I'm not surprised people gravitate towards the defaults. "It depends" is almost always the answer but having good guidance on how to make the decision, apropos of nothing else, would certainly be helpful.
August 30, 2018 at 8:19 am
jontris - Thursday, August 30, 2018 3:13 AM@jeff, my 'earlier posts' were different posts and I was trying to be tongue-in cheek rather than make a serious complaint. Perhaps irony doesn't carry into the forums very well. Probably, as you say, people didn't know how to respond to those posts, although I've noticed that doesn't always stop people (tongue-in-cheek again...). Also they were in the 2008 forum, so perhaps of less general interest.
Rebuilding the index, or rather refreshing stats with FULLSCAN (pace Grant) was the whole solution and I'm annoyed with myself for not thinking of it. It's not as if I haven't seen it before. But thanks again, everyone, for your contributions.
Ah... understood. Thank you very much for the feedback and, you're correct, I took it as a complaint and missed the irony. Also and just to be sure because so much can be lost in the written word, my response was meant to help improve responses for you in the future and not as a whack across the nose with the proverbial rolled up newspaper.
Shifting gears, I still have more to reply about Scott's latest post but I've been a bit overcome with urgencies at work and in the presentation world. It will be interesting though.
Heh... and your response about being annoyed with yourself drew a big smile from me because I'm frequently annoyed with myself, as well. It's one of the ways that I learn and am forced to remember things. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
October 19, 2018 at 7:42 am
jontris - Tuesday, August 28, 2018 5:30 AMOn a SQL 2016 instance I have a large table with dozens of columns including:
COLA INT
COLB BIT
and nonclustered index:
(COLA ASC, COLB ASC)I observe the following:
select * from table where COLA = n and COLB = 1 /* index scan */
select * from table where COLA = n and COLB = 0 /* index seek */
select * from table where COLA = n /* index seek */
select COLA, COLB from table where COLA = n and COLB = 1 /* index seek */
In other words it behaves as expected in every case except when there is the combination of 'select *' for the column list and 'COLB = 1' as second predicate. The seeks are instant but the scan takes minutes, so this is a problem. I haven't attempted to isolate the problem further which could take forever, but am I missing something obvious?
If anyone is still interested, this saga has a new twist. The problem recurred and this time reloading stats with fullscan didn't make any difference. One slight correction to the way I described it originally is that where it doesn't do an index seek it does a table scan not an index scan. But anyway, it's not using index seek.
A further thing I discovered is that "select * from table where COLA = n and convert(int, COLB) = 1" does generate an index seek.
The only server-side solution I've been able to find is to create an index on COLA only and DROP the COLA/COLB index. (Disabling wasn't enough.)
October 19, 2018 at 1:17 pm
jontris - Friday, October 19, 2018 7:42 AMIf anyone is still interested, this saga has a new twist. The problem recurred and this time reloading stats with fullscan didn't make any difference. One slight correction to the way I described it originally is that where it doesn't do an index seek it does a table scan not an index scan. But anyway, it's not using index seek.
A further thing I discovered is that "select * from table where COLA = n and convert(int, COLB) = 1" does generate an index seek.
The only server-side solution I've been able to find is to create an index on COLA only and DROP the COLA/COLB index. (Disabling wasn't enough.)
Not seeing the plans, the data distribution or anything else... I'm grasping at very tiny straws. Possibility selectivity of the compound index being overly selective... Seriously, without the details, it's so very difficult to say.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 22, 2018 at 10:21 am
Thanks Grant but I'm going to be bold and say that there is no rationale, and the optimiser is just getting it wrong this time. This should not astonish us too much. What is perhaps more astonishing is how much it gets it right.
October 22, 2018 at 11:25 am
jontris - Monday, October 22, 2018 10:21 AMThanks Grant but I'm going to be bold and say that there is no rationale, and the optimiser is just getting it wrong this time. This should not astonish us too much. What is perhaps more astonishing is how much it gets it right.
The optimizer is doing the best it can. I really think to genuinely address this issue, you must cluster the table on the best column(s). Otherwise you will stay in a cycle of issues with this table, queries working well for a while, then suddenly having issues again.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 15 posts - 16 through 30 (of 37 total)
You must be logged in to reply to this topic. Login to reply