April 11, 2007 at 5:26 am
We have a number of largish tables which have only nonclustered indexes. Is this a sensible design, or do tables without any clustered indexes perform more slowly when queried?
cheers
Ian
April 11, 2007 at 5:50 am
Sounds facetious, but the answer is that it depends.
If you post a sample create table script with some typical usage queries and data population statistics we can give you some more useful answers.
April 11, 2007 at 5:57 am
In a way that answers my question. I was really wanting to know if there was a hard and fast rule stating that clustered tables will be quicker to query.
If you say 'it depends' then presumably there is no such hard and fast rule.
April 11, 2007 at 6:53 am
Well, if you can sensibly cluster on your core search requirements, then there is a good chance of a significant benefit. As I say, it all depends on how you are using it.
April 11, 2007 at 7:14 am
The advantage of clustered indexes is that the data is physically ordered, whereas other indexes are pointers. You should try to place the clustered index on the field used most often in your queries. I have also heard that they are good for join conditions as well.
Hope that helps
Marvin
Marvin Dillard
Senior Consultant
Claraview Inc
April 11, 2007 at 11:51 pm
You have a number of partial answers. I'll try to be more comprehensive.
First, the allocation behaviour. A table with no clustered index acts as a heap,
which means that *every* insertion goes at the end. Pages are 8k, allocated in
extents of 8 pages, or 64 K. When you delete records from the table, pages
are only ever freed when an entire extent gets emptied. So if you have a table
from which you delete 99 of 100 records over time, it'll grow infinitely and
become very sparse. Thus, my default policy is that *every* table is clustered.
Second, the index behaviour. A record in a non-clustered table is identified by
a RID, which is 6 bytes. Every index entry has a key value and a RID. After
finding the key value you want, the page containing that RID must be loaded.
The index is a tree, and only the leaf pages point to records. Typically you
should arrange that your key values are of a size, and system memory is such,
that at least all the non-leaf (internal) nodes of an index are in cache. Thus,
with a large table accessed by an index, there'll be two disk accesses to fetch a
single record (but less than 2*N to fetch N records).
If a record has been updated and is now larger than the space previously
allocated to it, it might not have been moved and re-indexed. Instead, it might
have been replaced by a small "forwarding" record which gives the new location.
This costs an additional disk access, and is another reason to use a clustered
index - they don't need any forwarding.
When accessing a record by a clustered index, the row data is stored in the leaf
pages of the index. Only the key values are stored in internal nodes. Typically
the record will be much bigger than just the key value, so the non-leaf nodes
are few. In particular, the leaf nodes take as much space as if they were stored
in data pages in a non-clustered table, and the internal nodes take as much
space as a normal non-clustered index over the same fields. The system (and
the key value) should be sized so that everything except the leaf nodes and the
key-only layer just above them (the second-outermost layer of the tree) fits in
cache. This takes the same two disk accesses (minus any forwarding) to fetch
one record.
In a clustered index which is on a small key (say, a 4-byte IDENTITY field), it's
likely that *all* the non-leaf nodes fit in cache, so accessing a record takes only
a single real disk access (as it might with a non-clustered index over a small key).
So that works well, just as well as if the same data was stored non-clustered.
Accessing a clustered table using a secondary index costs about the same as if
you had a non-clustered table, *as long as* the clustering key size is small.
This is because the first search in the secondary index returns, not a RID, but a
key value for the clustered index. The clustered index must then be searched
for the record - but in the case of a small key, that's cheap.
So here's the crunch. If you cluster over a large key value, not only does it make
it less likely that the internal nodes of the clustered index will fit in cache (causing
extra reads), but it does the same to *every* secondary index - because each of these
must contain a key value for *both* the secondary index and the clustered one.
Essentially, using a clustered key value the same size as a RID has about the same
performance as non-clustered, but using a larger key impacts performance by
the *square* of the key size ratio.
A final issue I've observed with some versions of SQL Server is that the query
optimiser *overestimates* the cost of using a secondary index on a clustered table,
and will often choose a much more costly query plan instead. I don't believe
I've seen this with SQL2KSP4 or anything more recent, so most folk will be ok,
and the rest should upgrade anyhow.
My final advice, based on hard experience, is this:
* Always use a clustered index (for the allocation behaviour)
* Always use a small (4-16 bytes) clustering key.
* Add an IDENTITY field if you don't have a suitable key already.
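The three rules above can be sketched in DDL. The table and column names here are just an illustration, not from any earlier post:

```sql
-- Hypothetical table following the advice above: a narrow 4-byte
-- IDENTITY clustering key, even though the business key lives elsewhere.
CREATE TABLE dbo.Orders (
    OrderID     int IDENTITY(1,1) NOT NULL,   -- small surrogate clustering key
    OrderNumber varchar(20)       NOT NULL,   -- business key
    CustomerID  int               NOT NULL,
    EnteredOn   datetime          NOT NULL DEFAULT GETDATE(),
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);

-- Secondary indexes stay narrow, because each entry carries only its
-- own key plus the 4-byte clustering key (instead of a wide one).
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);
```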
Clifford Heath, Data Constellation.
April 12, 2007 at 4:49 am
In my experience, Clifford is spot-on. Good advice.
April 12, 2007 at 7:11 am
Thanks Clifford - an excellent summary, explains the issues very clearly.
Ian
April 12, 2007 at 7:18 am
One issue Clifford skipped was that heap tables cannot be reorganized except by truncating and reloading them. Tables with clustered indexes can be optimized by using DBCC DBREINDEX or DBCC INDEXDEFRAG on the clustered index.
The clustered index fields are duplicated in every non-clustered index, so creating a large multi-field clustered index could result in worse performance than using only non-clustered indexes. Using a small clustered index, such as a single int or datetime field, should usually perform at least as well as a heap. A bad clustered index design is using a field that varies randomly for each insertion, such as names or a GUID filled by NEWID(), which causes new rows to be inserted in random locations, which causes page splits and extreme fragmentation. (This can be avoided by regular reindexing using a fill factor large enough to leave room on each page for the insertions.) On the other hand, using an int identity field or a datetime filled by GETDATE() for the clustered index causes new rows to be inserted at the end of the table.
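The random-insertion problem can be sketched with two hypothetical tables; the names are made up for illustration:

```sql
-- Fragmentation-prone: NEWID() produces random values, so each insert
-- lands at a random position in the clustered index, splitting pages.
CREATE TABLE dbo.EventsBad (
    EventID uniqueidentifier NOT NULL DEFAULT NEWID(),
    Payload varchar(100)     NOT NULL,
    CONSTRAINT PK_EventsBad PRIMARY KEY CLUSTERED (EventID)
);

-- Insert-friendly: an ever-increasing key means every insert appends
-- at the end of the table, so pages fill in order with no splitting.
CREATE TABLE dbo.EventsGood (
    EventID int IDENTITY(1,1) NOT NULL,
    Payload varchar(100)      NOT NULL,
    CONSTRAINT PK_EventsGood PRIMARY KEY CLUSTERED (EventID)
);
```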
The best performance benefit of clustered indexes is seen when you use a field that appears in a lot of range or inequality filter conditions, assuming they are written correctly. For instance, "WHERE CONVERT(char(7), EnteredOn, 111)='2005/06'" would cause a table scan no matter what the index structure, but "WHERE EnteredOn BETWEEN CAST('2005-06-01' as datetime) AND CAST('2005-06-30' as datetime)" would run very quickly if the table is clustered by EnteredOn. Identity fields are very convenient for primary keys but are usually used to locate single records, whereas dates are more likely to be used in ranges.
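Assuming a hypothetical dbo.Orders table clustered on EnteredOn, the two styles of predicate above look like this side by side:

```sql
-- Non-sargable: wrapping EnteredOn in a function hides the column from
-- the index, so this forces a full scan regardless of index structure.
SELECT * FROM dbo.Orders
WHERE CONVERT(char(7), EnteredOn, 111) = '2005/06';

-- Sargable: a plain range on the clustering key becomes a clustered
-- index seek over the contiguous June pages. The half-open range also
-- catches rows with times after midnight on June 30, which a BETWEEN
-- ending at '2005-06-30' would silently miss.
SELECT * FROM dbo.Orders
WHERE EnteredOn >= '20050601' AND EnteredOn < '20050701';
```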
April 12, 2007 at 7:35 am
Scott and Cliff have really answered well, so listen to them and thanks for the great answers.
The only other thing I'll add is that the query optimizer works better with clustered indexes on every table. The folks on the SQL Server team expect tables to have clustered indexes when they write their routines. So put a clustered index on - not necessarily on the PK or an FK, but on something.
April 12, 2007 at 8:31 am
Also, be sure to consider how the data gets populated. I'm lucky in the fact that most of my tables are inserted into daily and I have a char(10) YYYY-MM-DD that's my first column of any of the clustered indexes. That way the hit on insert performance is minimized because all of the inserts happen at the end of the table, thus minimizing page splits.
You might want to play around a little with the fill factor if the table doesn't see many deletes. Most of the time you can leave it at the default. I have saved some space by setting it to 100% on my insert-only tables that have the clustered index on my date column.
Remember, the clustered index actually IS the table, not just a separate 'table' that references the real data. The columns you specify in the clustered index affect the order in which the data is stored. If the data for a row is changed and one of the changed columns is in the clustered index, then you may take a performance hit because the row has to be moved.
It all boils down to how the data is accessed and modified. I try and put clustered indexes on all of my tables, but as everyone else said - it depends.
April 13, 2007 at 12:56 am
This thread contains a very good synthesis regarding clustering.
Also keep in mind that, as a best practice, you should check that your clustering index key is a unique key!
If it isn't unique, SQL Server will "uniquify" it by adding a hidden 4-byte integer (the uniquifier) to duplicate key values.
That enlarges the key portion that is carried in all nonclustered indexes!
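A sketch of the difference, with made-up table names; declaring the full unique key avoids the hidden uniquifier entirely:

```sql
-- Non-unique clustered index: SQL Server silently appends a hidden
-- 4-byte uniquifier to duplicate OrderID values, widening the key
-- carried by every nonclustered index as well.
CREATE TABLE dbo.OrderLines (
    OrderID int NOT NULL,
    LineNo  int NOT NULL,
    Qty     int NOT NULL
);
CREATE CLUSTERED INDEX CIX_OrderLines ON dbo.OrderLines (OrderID);

-- Better: declare the combination that really is unique, so no
-- uniquifier is needed.
DROP INDEX CIX_OrderLines ON dbo.OrderLines;
CREATE UNIQUE CLUSTERED INDEX CIX_OrderLines
    ON dbo.OrderLines (OrderID, LineNo);
```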
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
July 10, 2007 at 12:11 am
Just to add: MS has released a nice article, "Comparing Tables Organized with Clustered Indexes versus Heaps", in the SQL 2005 best practices.
you can find it at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/clusivsh.mspx
Johan
July 10, 2007 at 3:41 am
Don't follow contrived "best practice" if it doesn't suit your business requirements. In the document referenced by Alzdba, under the heading "Test Methodology", they basically admit to cooking the tests because a real-life workload didn't suit.
I've yet to find a system that has been effectively tuned using a different workload than what's expected in the production system.
You might find some of the entries on this blog interesting,
http://blogs.sqlserver.org.au/blogs/Greg_Linwood/
Disclaimer: I do work with Greg and we're both heavily involved in tuning real-life production systems on a day-to-day basis.
--------------------
Colt 45 - the original point and click interface
July 10, 2007 at 5:42 am
As stated in Greg's blog, many times we just overlook the fact that, most of the time, a clustered index is not chosen for a specific reason, but just to have one because of best-practice guidelines.
Especially in those cases, you should consider and evaluate whether changing the clustered index would have the effect(s) you seek.
Document these choices and reasons case by case.
Johan