March 4, 2004 at 7:37 am
I have a question relating to the efficiency of non-clustered index reads as they relate to both Clustered and Heap tables.
I have a table with over a billion rows. Yes, that’s right… Billion with a “B.” This table is inside of a data warehouse and is appended to every night during the nightly warehouse load.
Currently the table is clustered on a non-unique key and has two non-clustered keys.
My question is “Is it more efficient on large tables to maintain a heap table with all non-clustered indexes?”
I ask this specifically because if a non-clustered index’s row locator is a pointer to the row then by transversing the non-clustered index we can go directly to the data rows in question verses getting the clustered key and then transversing the clustered index into to the data.
It sounds right to me but there may be something I don’t know about the internal access mechanisms of SQL Server.
This is directly out of BOL.
“The structure of the row locator depends on whether the data pages are stored in a heap or are clustered. For a heap, a row locator is a pointer to the row. For a table with a clustered index, the row locator is the clustered index key.”
Any comments?
*****************************************************************
Below are BOL extracts relating to Heap Tables and Indexes
Heaps have one row in sysindexes with indid = 0. The column sysindexes.FirstIAM points to the first IAM page in the chain of IAM pages that manage the space allocated to the heap. Microsoft® SQL Server™ 2000 uses the IAM pages to navigate through the heap. The data pages and the rows within them are not in any specific order, and are not linked together. The only logical connection between data pages is that recorded in the IAM pages.
Table scans or serial reads of a heap can be done by scanning the IAM pages to find the extents holding pages for the heap. Because the IAM represents extents in the same order they exist in the data files, this means that serial heap scans progress uniformly down each file. This is more efficient than following the data page chains used in earlier versions of SQL Server, in which the data page chain often takes a somewhat random path through the files of a database. Using the IAM pages to set the scan sequence also means that rows from the heap are not typically returned in the order in which they were inserted.
Index Allocation Map (IAM) pages map the extents in a database file used by a heap or index. IAM pages also map the extents allocated to the ntext, text, and image page chain for any table that has columns of these types. Each of these objects has a chain of one or more IAM pages recording all the extents allocated to it. Each object has at least one IAM for each file on which it has extents. They may have more than one IAM on a file if the range of the extents on the file allocated to the object exceeds the range that an IAM can record.
IAM pages are allocated as needed for each object and are located randomly in the file. sysindexes.dbo.FirstIAM points to the first IAM page for an object, and all the IAM pages for that object are linked in a chain.
An IAM page has a header indicating the starting extent of the range of extents mapped by the IAM. The IAM also has a large bitmap in which each bit represents one extent. The first bit in the map represents the first extent in the range, the second bit represents the second extent, and so on. If a bit is 0, the extent it represents is not allocated to the object owning the IAM. If the bit is 1, the extent it represents is allocated to the object owning the IAM page.
When Microsoft® SQL Server™ 2000 needs to insert a new row and no space is available in the current page, it uses the IAM and PFS pages to find a page with enough space to hold the row. SQL Server uses the IAM pages to find the extents allocated to the object. For each extent, SQL Server searches the PFS pages to see if there is a page with enough free space to hold the row. Each IAM and PFS page covers a large number of data pages, so there are few IAM and PFS pages in a database. This means that the IAM and PFS pages are generally in memory in the SQL Server buffer pool, so they can be searched quickly.
SQL Server allocates a new extent to an object only when it cannot quickly find a page in an existing extent with enough space to hold the row being inserted. SQL Server allocates extents from those available in the filegroup using a proportional allocation algorithm. If a filegroup has two files, one of which has twice the free space of the other, two pages will be allocated from the file with more empty space for every one page allocated from the other file. This means that every file in a filegroup should have a similar percentage of space used.
Microsoft® SQL Server™ 2000 supports indexes defined on any column in a table, including computed columns.
If a table is created with no indexes, the data rows are not stored in any particular order. This structure is called a heap.
The two types of SQL Server indexes are:
Clustered indexes sort and store the data rows in the table based on their key values. Because the data rows are stored in sorted order on the clustered index key, clustered indexes are efficient for finding rows. There can only be one clustered index per table, because the data rows themselves can only be sorted in one order. The data rows themselves form the lowest level of the clustered index.
The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. If a table has no clustered index, its data rows are stored in a heap.
Nonclustered indexes have a structure completely separate from the data rows. The lowest rows of a nonclustered index contain the nonclustered index key values and each key value entry has pointers to the data rows containing the key value. The data rows are not stored in order based on the nonclustered key.
The pointer from an index row in a nonclustered index to a data row is called a row locator. The structure of the row locator depends on whether the data pages are stored in a heap or are clustered. For a heap, a row locator is a pointer to the row. For a table with a clustered index, the row locator is the clustered index key.
The only time the rows in a table are stored in any specific sequence is when a clustered index is created on the table. The rows are then stored in sequence on the clustered index key. If a table only has nonclustered indexes, its data rows are stored in a unordered heap.
Indexes can be unique, which means no two rows can have the same value for the index key. Otherwise, the index is not unique and multiple rows can share the same key value.
There are two ways to define indexes in SQL Server. The CREATE INDEX statement creates and names an index. The CREATE TABLE statement supports the following constraints that create indexes:
When you create an index on SQL Server 2000, you can specify whether the keys are stored in ascending or descending order.
SQL Server 2000 supports indexes defined on computed columns, as long as the expression defined for the column meets certain restrictions, such as only referencing columns from the table containing the computed column, and being deterministic.
A fill factor is a property of a SQL Server index that controls how densely the index is packed when created. The default fill factor usually delivers good performance, but in some cases it may be beneficial to change the fill factor. If the table is going to have many updates and inserts, create an index with a low fill factor to leave more room for future keys. If the table is a read-only table that will not change, create the index with a high fill factor to reduce the physical size of the index, which lowers the number of disk reads SQL Server uses to navigate through the index. Fill factors are only applied when the index is created. As keys are inserted and deleted, the index will eventually stabilize at a certain density.
Indexes not only speed up the retrieval of rows for selects, they also usually increase the speed of updates and deletes. This is because SQL Server must first find a row before it can update or delete the row. The increased efficiency of using the index to locate the row usually offsets the extra overhead needed to update the indexes, unless the table has a lot of indexes.
This example shows the Transact-SQL syntax for creating indexes on a table.
USE pubs
GO
CREATE TABLE emp_sample
(emp_id int PRIMARY KEY CLUSTERED,
emp_name char(50),
emp_address char(50),
emp_title char(25) UNIQUE NONCLUSTERED )
GO
CREATE NONCLUSTERED INDEX sample_nonclust ON emp_sample(emp_name)
GO
Deciding which particular set of indexes will optimize performance depends on the mix of queries in the system. Consider the clustered index on emp_sample.emp_id. This works well if most queries referencing emp_sample have equality or range comparisons on emp_id in their WHERE clauses. If the WHERE clauses of most queries reference emp_name instead of emp_id, performance could be improved by instead making the index on emp_name the clustered index.
Many applications have a complex mix of queries that is difficult to estimate by interviewing users and programmers. SQL Server 2000 provides an Index Tuning Wizard to help design indexes in a database. The easiest way to design indexes for large schemas with complex access patterns is to use the Index Tuning Wizard.
You provide the Index Tuning Wizard with a set of SQL statements. This could be a script of statements you build to reflect a typical mix of statements in the system, but it is usually a SQL Profiler trace of the actual SQL statements processed on the system during a period of time that reflects the typical load on the system. The Index Tuning Wizard analyzes the workload and the database, and then recommends an index configuration that will improve the performance of the workload. You can choose to either replace the existing index configuration, or to keep the existing index configuration and implement new indexes to improve the performance of a slow-running subset of the queries.
March 4, 2004 at 8:02 am
Please avoid double posts.
Although this seems to be the more apprpriate category, we should leave the
thread here
--
Frank Kalis
Microsoft SQL Server MVP
Webmaster: http://www.insidesql.org/blogs
My blog: http://www.insidesql.org/blogs/frankkalis/[/url]
Viewing 2 posts - 1 through 1 (of 1 total)
You must be logged in to reply to this topic. Login to reply