September 28, 2010 at 12:48 pm
The WidgetDate field is the single most relevant column for searching/querying, however it can change often as I mentioned, so it's a bad candidate for clustered.
Please clarify "often".
Scott Pletcher, SQL Server MVP 2008-2010
September 28, 2010 at 12:49 pm
With all this talk about the clustered index, and churn on the date field, don't forget that all of those numerous non-clustered indexes
My hope was that with a clus index on date you would not need any non clus indexes, since date is the prime factor in searching. Hopefully date will limit the rows to search enough that a non clus index would not be needed or even really useful.
Scott Pletcher, SQL Server MVP 2008-2010
September 28, 2010 at 2:29 pm
WayneS (9/28/2010)
With all this talk about the clustered index, and churn on the date field, don't forget that all of those numerous non-clustered indexes will be converted to carrying the clustered index columns in it (and in the case of the non-unique date field, a "uniqueifier" hidden "column") to be able to go to the proper row. Thus, all of this churn will also affect all of the NC indexes.Edit: and after reading Craig's last post a little closer, I see that this is already covered in the MS quote.
Thanks Wayne, I'm sorry I wasn't clearer about that. I concur, just my phrasing apparently stank. 🙂
All indexes would need an update everytime a clustered index item changes, so these dates would churn (at least datawise) all the subindexes, for each change in the clustered. The ID field keeps that to a minimum.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
September 29, 2010 at 2:45 am
Hello,
as far as I know a clustered index is best for sequential reading. If you create a clustered index using Attrib_Z, any query like
SELECT * FROM Widgets WHERE Attrib_Z = @someValue
will run fast because all rows with the same value (or close value) in the the clustered field are stored together in few data pages; for a query like
SELECT * FROM Widgets ORDER BY Attrib_Z
this clustered index could be perfect.
For this ID column, if it has no meaning - no other than have an unique key - you will always never read this table ordered by ID, so there is no reason for a clustered index in this field.
If you noticed that WidgetDate, Attrib_K, Attrib_M are the most constraining columns, try to create an index (Attrib_K, Attrib_M, WidgetDate); I put WidgetDate as last field because Attrib_K and Attrib_M conditions are for exact values, but WidgetDate condition is for a range.
About your strategy creating a temp table, you was trying an access plan and you confirmed that accessing data through this new index is a good choice. Now I think is best don't create any temp table, simply define this index and lets SQL Server choose a plan, maybe you are lucky and SQL Server does a good choice.
Regards,
Francesc
September 29, 2010 at 7:45 am
frfernan (9/29/2010)
...as far as I know a clustered index is best for sequential reading. If you create a clustered index using Attrib_Z, any query like
SELECT * FROM Widgets WHERE Attrib_Z = @someValue
will run fast because all rows with the same value (or close value) in the the clustered field are stored together in few data pages; for a query like
...
SELECT * FROM Widgets ORDER BY Attrib_Z
this clustered index could be perfect.
For this ID column, if it has no meaning - no other than have an unique key - you will always never read this table ordered by ID, so there is no reason for a clustered index in this field.
...
1. if you have non-clustered index on Attrib_Z, as long as INDEX SEEK can be used (when required) for the query execution, the query also will "run fast".
2. How often you would use the quiry stated for example ("SELECT * FROM Widgets ORDER BY Attrib_Z" )? What about if table has few millions of rows?
I beleive, I am right to suggests, that most of the queries in the RDB will contain some kind of mechanism to limit number of rows they return (WHERE clause or JOINs). For these queries it will be more important which columns are used in WHERE clause or for JOIN other than in ORDER BY (as sorting better to be performed on the final resultset before output).
Using IDENTITY column and having it as CLUSTERED & PK, makes a lot of sence in most OLTP and OLAP systems. It helps performance of most data insert and retrieve queries. Making your PK independed of any external factors.
There are a lot of theoretical discussions about how "it is wrong to have Identity column, or making Identity as ID", however from practical point of view - it is absolutely perfect and works nice with a lot of technologies used for writing application for RDBs (take any ORM solution as example).
Using any "natural" key as PK can be dangerous, because it is based on the assumption (not the fact) that such key is unique. There is no real guarantee that externally generated ID will never be reused or duplicated. TAX number, CIS or NIN, car engine number or shassi number - as long as your system is not an original system generating them, they all are bad candidates for PK or unique index (clustered or non-clustered. Identity column will defenetly guarantee that in your table, this value will be unique, making it perfect candidate for PK and unique clustered index in most of the cases (Please Note: MOST OF THE CASES - IS NOT ALWAYS).
September 29, 2010 at 8:03 am
Yes, an ident is unique, but that does not in and of itself ever make it the proper choice for a clus index.
The clus index column(s) should always be chosen with regard to what is best for the business need at hand, not based on the idea that an ident is automatically almost always the best clus value.
Scott Pletcher, SQL Server MVP 2008-2010
September 29, 2010 at 9:14 am
Hello Eugene,
oops, I'm seeing I wrote "...you will always never read this table ordered by ID...", what a mistake, I tried to write "...you will almost never read this table ordered by ID...", I hope all people understand my original aim.
Eugene Elutin (9/29/2010)
1. if you have non-clustered index on Attrib_Z, as long as INDEX SEEK can be used (when required) for the query execution, the query also will "run fast".
Yes, of course.
2. How often you would use the quiry stated for example ("SELECT * FROM Widgets ORDER BY Attrib_Z" )? What about if table has few millions of rows?
I beleive, I am right to suggests, that most of the queries in the RDB will contain some kind of mechanism to limit number of rows they return (WHERE clause or JOINs). For these queries it will be more important which columns are used in WHERE clause or for JOIN other than in ORDER BY (as sorting better to be performed on the final resultset before output).
Haha, you are right it is a extreme example, I don't remember last time I did it. But there are other more common examples:
At end of day, if you need to consolidate all movements (money movements, or warehouse movements) for this day - or week, or month -, you probably will start your process from a query like
SELECT * FROM Movements WHERE MovementDate >= @StartingDate ORDER BY MaterialCode, MovementDate
In this case a clustered index by MovementDate can be useful because all rows should be stored together (as you pointed out, here WHERE looks more important than ORDER BY). A clustered index by ClientCode + MovementDate looks also useful, asking for the movements made by a client is a natural query, we must analyze the whole application in order to define the clustered indexes.
What I tried to state is that clustering an index has a goal, keep together rows having close keys, and then read them fast. and in the Jordan question I didn't see the need for it.
There are a lot of theoretical discussions about how "it is wrong to have Identity column, or making Identity as ID"
The eternal question... I have my own criteria but I doesn't know these discussions, drop me a link please, simply for the sake of curiosity.
Yours,
Francesc
September 29, 2010 at 9:55 am
frfernan (9/29/2010)
...At end of day, if you need to consolidate all movements (money movements, or warehouse movements) for this day - or week, or month -, you probably will start your process from a query like
SELECT * FROM Movements WHERE MovementDate >= @StartingDate ORDER BY MaterialCode, MovementDate
In this case a clustered index by MovementDate can be useful because all rows should be stored together (as you pointed out, here WHERE looks more important than ORDER BY). A clustered index by ClientCode + MovementDate looks also useful, asking for the movements made by a client is a natural query, we must analyze the whole application in order to define the clustered indexes.
...
There are a lot of theoretical discussions about how "it is wrong to have Identity column, or making Identity as ID"
The eternal question... I have my own criteria but I doesn't know these discussions, drop me a link please, simply for the sake of curiosity.
Yours,
Francesc
Hi Francesc,
1. Your "Movements" table example looks very familiar to me (and I guess to many who worked with transactional data). From practical prospective I can say that such tables are always better to have some Identity Column and have it clustered! No date, no code - nothing but Identity.
This king of tables are, usually, subject of huge insert's volumes, making need for insert to be fast and lightweight as possible. Clustering on Identity very helpull for this. Having additional indices on Date and Code (and other) columns will help your select queries no much less than having them as part of clustered key. But making them clustered would be very bad from "insert" prospective.
2. Easy! Joe Celko is your man - one of the SQL greate theoretics and also big fun of old techonologies (for some reason he loves to refer to old days of IT> mainframes, punch cards and tapes :-D)!
Links?
Plenty:
http://www.dbasupport.com/forums/showthread.php?t=9915
You will find Joe Celko name (and often his posts) in many discussions about Identity columns on many different SQL related forums including this one.
Just Google: "Joe Celko Identity"
and you will be loaded with info. 😀
September 29, 2010 at 9:59 am
His name is Joe Celko.
Scott Pletcher, SQL Server MVP 2008-2010
September 29, 2010 at 10:07 am
Post edited...
He is so well known that even googling for "Joe Celco" will find him easely! 😉
September 30, 2010 at 8:17 am
Eugene Elutin (9/29/2010)
1. Your "Movements" table example looks very familiar to me (and I guess to many who worked with transactional data). From practical prospective I can say that such tables are always better to have some Identity Column and have it clustered! No date, no code - nothing but Identity.This king of tables are, usually, subject of huge insert's volumes, making need for insert to be fast and lightweight as possible. Clustering on Identity very helpull for this. Having additional indices on Date and Code (and other) columns will help your select queries no much less than having them as part of clustered key. But making them clustered would be very bad from "insert" prospective.
A negative reason for clustered indexes. Add rows always at bottom and forget headaches due to splitting data pages or similar happiness sources (but with more chances to lock problems). This solution is the same than no Identity, no cluster, right?.
September 30, 2010 at 9:10 am
Thanks to everyone for the additional ideas on this issue.
TheSQLGuru: Your post was really helpful. I wasn't aware of how to view physical stats.
scott.pletcher (9/28/2010)
The WidgetDate field is the single most relevant column for searching/querying, however it can change often as I mentioned, so it's a bad candidate for clustered.
Please clarify "often".
I checked with our support people on this. About 10% of the records will change within 1-3 weeks of being entered, but after that they stabilize. Given this information, I might go with a clustered on Date.
scott.pletcher (9/28/2010)
One could also partition the table to reduce the reorg/rebuild time. Presumably at some point older dates become stable and are no longer changed (?).
Exactly right. I've been looking into partitioning lately and it seems we have no choice, given the growth of certain tables. If that speeds up inserts & Date updates, it's another selling point.
scott.pletcher (9/28/2010)
With all this talk about the clustered index, and churn on the date field, don't forget that all of those numerous non-clustered indexes
My hope was that with a clus index on date you would not need any non clus indexes, since date is the prime factor in searching. Hopefully date will limit the rows to search enough that a non clus index would not be needed or even really useful.
There will always be a need for some non-clustered indexes on this table. There are numerous other types of queries against the table (not involving date, e.g. some doing lookups by primary key ID or other factors). So, I gather having several non-clustered indexes is a drawback when there's a clustered on Date, in this case?
So if I go with Date as my clustered, I guess I would create the primary key constraint on ID as a non-clustered. That's how I have it set up now.
September 30, 2010 at 9:19 am
Nonclustered indexes generate additional I/O whether there is a clus index or not. As always, every index adds overhead to a table. That's all I meant.
I was hoping that all -- or nearly all -- of your queries would specify a date / date range, so that a clus index seek would be used, and (almost) no other indexes would be needed.
Again, as I stated originally, if the date churns [/b]a lot[/b], clustered on date may not be acceptable.
However, even then, sometimes you can separate historical from current data, where you can give different FILLFACTORs on each, and just use a very large fillf on the new data. Presumably aged data is much more static, and so could have a lot higher fillf.
Scott Pletcher, SQL Server MVP 2008-2010
October 1, 2010 at 6:34 am
frfernan (9/30/2010)
Eugene Elutin (9/29/2010)
1. Your "Movements" table example looks very familiar to me (and I guess to many who worked with transactional data). From practical prospective I can say that such tables are always better to have some Identity Column and have it clustered! No date, no code - nothing but Identity.This king of tables are, usually, subject of huge insert's volumes, making need for insert to be fast and lightweight as possible. Clustering on Identity very helpull for this. Having additional indices on Date and Code (and other) columns will help your select queries no much less than having them as part of clustered key. But making them clustered would be very bad from "insert" prospective.
A negative reason for clustered indexes. Add rows always at bottom and forget headaches due to splitting data pages or similar happiness sources (but with more chances to lock problems). This solution is the same than no Identity, no cluster, right?.
What lock problems? How it inreases the chances of locks? Which version of SQL Server (or Sybase) you are talknig about?
Do you think that if you have no cluster the records are always added "at bottom"? Having clustered index on Identity is not the same as having no clustered index and Identity column.
As I said before, transactional tables are subjects of havy inserts volumes but actually never should be subject of any update. So, having clustered index on identity (together with high fillfactor) removes possibility of page splitting during insert (without adding chances for lock problems untill you still in SQL6.5 or 7 or its old dad Sybase).
If you have no cluster there is no gurantee
Viewing 14 posts - 16 through 28 (of 28 total)
You must be logged in to reply to this topic. Login to reply