Pages/Extents: Clustered vs. Heap

  • I noticed that heaps SEEM to take more I/O for inserts, and require more pages for the same data.  Check out the stats from my process below.  Anyone know off the top of their head what the deal is?

    SET NOCOUNT ON
    GO

    CREATE TABLE [Clustered] ([ID] int identity(1,1), ID2 int)
    GO
    ALTER TABLE [Clustered] ADD CONSTRAINT PK_Clustered PRIMARY KEY CLUSTERED ([ID])
    GO

    CREATE TABLE Heap ([ID] int identity(1,1), ID2 int)
    GO
    ALTER TABLE Heap ADD CONSTRAINT PK_Heap PRIMARY KEY NONCLUSTERED ([ID])
    GO

    SET STATISTICS IO ON
    GO

    /*
    Notice that the I/O stats are going to show an average of 2 reads per insert on the Clustered table
    and an average of 3 reads per insert for the heap.  Why?
    */
    DECLARE @counter int
    SET @counter = 0

    BEGIN TRAN
    WHILE @counter < 100000 BEGIN
        INSERT INTO [Clustered] (ID2) VALUES (@counter)
        SET @counter = @counter + 1
    END
    COMMIT TRAN
    GO

    DECLARE @counter int
    SET @counter = 0

    BEGIN TRAN
    WHILE @counter < 100000 BEGIN
        INSERT INTO Heap (ID2) VALUES (@counter)
        SET @counter = @counter + 1
    END
    COMMIT TRAN
    GO

    /*
    - Notice that the heap requires more pages than the clustered index.  Why?
    - Notice that the Extent Scan Fragmentation shows >80%, indicating MANY gaps between extents.  Why?
    */
    DBCC SHOWCONTIG ([Clustered])
    DBCC SHOWCONTIG (Heap)

    /*
    Notice that neither INDEXDEFRAG nor DBREINDEX changes the page counts for the heap...it still requires
    more than the clustered table.
    Also notice that the extent scan fragmentation has not been resolved through either method.
    */
    DBCC INDEXDEFRAG (tempdb, Heap, PK_Heap)
    DBCC DBREINDEX (Heap)

  • In a heap, the data and indexes are processed independently. 

    When you are doing any DML on a heap table, you will get I/Os to any index you use, and then a new stream of I/Os to the data.  If data pages in a heap end up partially empty, there is no maintenance tool in SQL Server that can reclaim that space.  If you aim an index rebuild at a heap table, you will reclaim space in the indexes but not in the data.
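
    If it helps, here is a rough way to watch this on your test tables (this assumes SQL Server 2005 or later, where the sys.dm_db_index_physical_stats DMV is available; on 2000 the DBCC SHOWCONTIG output in your script gives similar figures):

    -- How full the data pages actually are: index_id 0 = heap data, index_id 1 = clustered index.
    -- The rebuilds in your script tidy the index rows but leave the heap's index_id 0 pages alone.
    SELECT OBJECT_NAME(ps.[object_id]) AS table_name,
           ps.index_id,
           ps.index_type_desc,
           ps.page_count,
           ps.avg_page_space_used_in_percent,
           ps.forwarded_record_count
    FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, 'DETAILED') AS ps
    WHERE ps.[object_id] IN (OBJECT_ID('Heap'), OBJECT_ID('[Clustered]'))
      AND ps.index_level = 0     -- leaf / data level only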

    If you have a clustered index, then the lowest-level leaf pages of the CI are also the data.  If your access is via the CI, then you often 'save' some I/Os compared to a heap, because the leaf pages you need are often retrieved in the same extent (i.e. the same I/O) as a higher-level index page.  Also, when you aim an index rebuild at a clustered table, the entire table gets rewritten into new extents, so any spare space is reclaimed.
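
    For example, rebuilding the clustered primary key on your [Clustered] table rewrites the data pages along with the index pages (ALTER INDEX needs SQL Server 2005 or later; the DBCC form works on earlier versions):

    -- Rebuilding the clustered index rewrites the whole table into new extents,
    -- so free space inside the data pages is reclaimed.
    ALTER INDEX PK_Clustered ON [Clustered] REBUILD
    -- Pre-2005 equivalent:
    -- DBCC DBREINDEX ('[Clustered]', 'PK_Clustered')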

    If your access to a clustered table is via a nonclustered index, then you can sometimes see more I/O than if the table were a heap.  This is because you have the I/Os to navigate the nonclustered index, the bottom level of which now holds the clustered index key, and then you have I/Os to navigate the clustered index to get to the data.
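
    A quick way to see the extra hop is to add a nonclustered index and query a column it does not cover.  The wider table below is hypothetical, purely for illustration - on your two-column tables the nonclustered index would already carry the clustering key, so the query would be covered and no lookup would show up:

    -- Hypothetical wider table, just to demonstrate the bookmark/key lookup.
    CREATE TABLE ClusteredWide ([ID] int identity(1,1), ID2 int, Payload char(200))
    ALTER TABLE ClusteredWide ADD CONSTRAINT PK_ClusteredWide PRIMARY KEY CLUSTERED ([ID])
    CREATE NONCLUSTERED INDEX IX_ClusteredWide_ID2 ON ClusteredWide (ID2)
    GO
    -- Seeks IX_ClusteredWide_ID2 on ID2, then uses the clustering key ([ID]) stored in its
    -- leaf level to navigate the clustered index down to the row for Payload: two sets of I/O.
    -- If ClusteredWide were a heap, the nonclustered leaf would hold a RID pointing straight
    -- at the data page instead.
    SELECT ID, ID2, Payload FROM ClusteredWide WHERE ID2 = 50000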

    The actual I/Os experienced by an application using a clustered index or a nonclustered index will depend on the application design and the actual data values.  You need to decide if a CI makes good sense for a given table, and if so, what the CI key columns should be.  Most people with strong SQL Server experience will say you should assume a CI is needed unless either the table is very small (under approx 1 MB) or you can show a CI actually harms performance.


  • Yep, I completely spaced on the PK constraint check...  It was traversing PK_Heap for every insert to verify its uniqueness.  Also, thanks for pointing out that CIs sometimes have multiple page types per extent - in my mind this could have both positive and negative effects.  I will have to look at that more closely.

     

    Anyway, I removed the PK from the heap and the average I/Os per insert dropped from 3 to 1, since the only I/O was reading in the page that was to be modified.
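
    (For reference, dropping the constraint looks like this; the TRUNCATE is just my way of resetting the table before re-running the insert loop:)

    -- Remove the nonclustered PK so each insert touches only the data page it modifies.
    ALTER TABLE Heap DROP CONSTRAINT PK_Heap
    -- Reset the heap before re-running the insert loop.
    TRUNCATE TABLE Heap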

     

    I'm still curious why the heap requires more pages for storage.  The clustered table requires 211 pages; the heap with no indexes requires 221.  Without any constraints there should only be data pages, so why more (I would think it should be fewer)?
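
    (For anyone following along, sp_spaceused is another way to pull the same size figures, reported as reserved/data/index/unused KB:)

    -- Compare reserved / data / index / unused space for the two tables.
    EXEC sp_spaceused '[Clustered]'
    EXEC sp_spaceused 'Heap'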

     

    Also, regarding the 80% extent scan fragmentation...  This means that a gap existed on disk between two adjacent extents.  Now, I get that heaps cannot be defragmented (logically) without a CI.  Yet I'm still curious about this number.  Just as IAM chains are used to traverse heaps, they could also be used to write heap extents to disk contiguously, based at least on the order of the IAM pages... I would think.  Plus, the insert routine running on a dev server seems as though it should produce contiguous extents, yet it does not.  So my questions become: 1) Is there an idiosyncrasy in the way free extents are allocated to heaps that results in the fragmentation?  2) Or is this calculation naturally flawed for a heap, because there is no official ordering to a heap yet the algorithm for extent scan fragmentation tries to interpret one?  Again, extent scan fragmentation is based on gaps between adjacent extents, and maybe determining "adjacency" is inherently flawed for heaps.
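
    (A side note for anyone reading along: on SQL Server 2005 or later, sys.dm_db_index_physical_stats reports extent fragmentation for a heap along with fragment counts and sizes, which may be an easier way to reason about "adjacency" than the SHOWCONTIG percentage:)

    -- For a heap (index_id = 0), avg_fragmentation_in_percent is extent fragmentation,
    -- and fragment_count / avg_fragment_size_in_pages show how contiguous the extents are.
    SELECT ps.index_type_desc,
           ps.avg_fragmentation_in_percent,
           ps.fragment_count,
           ps.avg_fragment_size_in_pages
    FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('Heap'), 0, NULL, 'LIMITED') AS ps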

     

    You may wonder why I am even worried about this...  I'm really just trying to understand heaps better.  I have a vendor that uses heaps for about 75% of its tables, and their explanation is that CIs introduce unnecessary overhead in most cases.  So, while I don't agree with them, I want to understand where they are coming from.  Hence my interest in the physical implementation and low-level use of heaps.

     

    Ryan

  • I am not sure about the answers to some of your questions, but yes - heaps could be defragmented in the way you suggest; it is just that Microsoft has not chosen to implement anything that does this.

    I have heard vendors say some strange things about what works best with various types of database.  It normally says more about their lack of understanding of that particular DBMS than any deep insight into how it works.  People coming from an Oracle background can have a hard time coping with the difference between Oracle clusters and SQL Server clustered indexes - Oracle mostly uses the term cluster for spreading a database across multiple servers, while SQL Server, DB2 and Sybase use it to mean physically placing data in the sequence of the keys.

    It is certainly the case that with SQL Server clustered indexes the DB designer needs to think about what the best key to cluster on is.  Thinking takes time, which can delay speed to market, so in that way a vendor can claim that putting clustered indexes on SQL tables adds complexity.  All I can suggest is stick with it - if you can show your management and the vendor that your SLAs are at risk without clustered indexes, you have a good lever to get the vendor to add them.  If all you can show is poor response time that is still within SLAs, it could be hard to get much done.

