Question on index column ordering

  • Please read the entire post, most of this is just background information:

    I have a very large table that data in frequently inserted into and it get fragmented extremely fast, we do the typical maintenance, but I've been tasked with looking at how to reduce fragmentation so it doesn't get so fragmented.

    I didn't design any of this, so I wouldn't have picked the index keys the way they are.

    The current clustered index is an aid int, bid int, date datetime, those three columns are the one the application most frequently queries against, this becomes extremely fragmented, extremely fast and it is a very large table where "minute by minute" is stored.

    The reason for the fragmentation is while datetime is always increasing, the other two int columns (aid and bid are enums) are repeated, e.g. you would have 60 aid(s), 50 bid(s) for 60 unique dates. (think of it like record, (time interval), date).

    We use SQL Standard so partitioning is not so much of an option.

    When I change the index to be date, aid, bid and insert the same amount of records the fragmentation drops to near nothing (going from 99% to < 1%). I was like ok, problem solved!

    However, I also ran some queries (the same on both tables)

    Example:

    t1 is the old index t2 is the new index, both inserted with the same data

    /* Flush dirty pages from the buffer to the database files. */

    CHECKPOINT;

    GO

    /* Flush the data cache and procedure cache, respectively. For DEV environments only! */

    DBCC DROPCLEANBUFFERS;

    GO

    DBCC FREEPROCCACHE;

    GO

    /* Enable statistics tracking for IO and timings. Remember, SET commands remain enabled during a session until disabled. */

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    GO

    -

    select * from t1 where aid = 3 and bid = 3 and date between '12/22/2011 13:00' and '12/31/2012 23:59'

    GO

    select * from t2 where aid = 3 and bid = 3 and date between '12/22/2011 13:00' and '12/31/2012 23:59'

    GO

    -- End SQL Code

    /* Disable statistics tracking for IO and timings. Remember, SET commands remain enabled during a session until disabled. */

    SET STATISTICS IO OFF;

    SET STATISTICS TIME OFF;

    GO

    ------------------

    The results of statistics I/O (I suppose Logical Reads is a good indicator or throughput?) The CPU stayed pretty much the same, but the logical reads almost doubled!

    (540580 row(s) affected)

    Table 't1'. Scan count 1, logical reads 4538, physical reads 31, read-ahead reads 9104, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 686 ms, elapsed time = 12205 ms.

    (540580 row(s) affected)

    Table 't2'. Scan count 1, logical reads 9041, physical reads 5, read-ahead reads 9042, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 640 ms, elapsed time = 10283 ms.

    So...my question is:

    Why does the order of the index cause the logical reads to increase? The only thing I could think of is that the first column (though it's more selective) is bigger? 4 int versus 8 datetime?

    The size and number of pages are the exact same.

    For index performance tuning am I even on the correct track in looking at logical IO? (We see that the physical IO is less, because it's less fragmented and contiguous.

    Just thought i'd ask for any thoughts!

  • The current clustered index is an aid int, bid int, date datetime, those three columns are the one the application most frequently queries against, this becomes extremely fragmented, extremely fast and it is a very large table where "minute by minute" is stored.

    Please remove the datetime column from your clustered index. I don’t think it adds any value for your queries but it may cause fragmentations with the data load frequency you mentioned.

  • The order of the index causes the reads to increase because SQL can no longer seek optimally. SQL can only seek on left-based subsets of the index key, but as soon as it encounters an inequality (like your date), it can't seek on the columns further in the index. I have a blog post on this, link at the end.

    What I would recommend you do is the following:

    Create the clustered index just on the datetime column. Ascending datetime columns are exceptionally good clustered indexes, low fragmentation, reasonably narrow. (adding an identity as a 2nd column makes them just about perfect).

    Then, add nonclustered indexes for your queries to use (and, if at all possible, get rid of the * in your selects, select just the columns you need, no more.)

    With the queries you've posted, I would create a nonclustered index (maybe even a nonclustered primary key) to replace what the cluster used to be. (aid, bid, date) and include the columns that the queries use (that's why I said get rid of the select *). It will fragment, but given that it's a lot, lot, log smaller than the cluster, it's much easier to fix (and probably less of a problem anyway). It will also likely be faster than using the clustered index (again, providing you can stop that SELECT *)

    Yes, you're on the right track with logical reads, physical reads are from disk, SQL does that once then keeps the data in cache as long as possible. For performance tuning look at logical reads, duration and cpu and make your decisions based on all three. Also, be sure to run multiple tests and discard the first (due to data and plan caching). Your CPU and duration are very similar, so, if you can't fix that select * then go with your revised clustered index (date, aid, bid) as it certainly appears to be good enough.

    Blog post with info on the inequalities: http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/

    Also maybe this, lots of links and some high-level info http://sqlinthewild.co.za/index.php/2011/11/11/sql-university-advanced-indexing-indexing-strategies/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Dev (12/23/2011)


    I don’t think it adds any value for your queries but it may cause fragmentations with the data load frequency you mentioned.

    Errr, no.

    It is needed for his queries (it's used in a predicate in the where clause) and fragmentation of indexes is predominately a function of the values of the leading column of the index. Ascending datetime columns have very little fragmentation (as the OP stated based on his tests changing the order of the keys in the cluster)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (12/23/2011)


    Dev (12/23/2011)


    I don’t think it adds any value for your queries but it may cause fragmentations with the data load frequency you mentioned.

    Errr, no.

    It is needed for his queries (it's used in a predicate in the where clause) and fragmentation of indexes is predominately a function of the values of the leading column of the index. Ascending datetime columns have very little fragmentation (as the OP stated based on his tests changing the order of the keys in the cluster)

    You may be right but I hesitate to include datetime columns in Clustered Index. Queries are based on 3 columns so even if we create an index on 2 columns (integer) they will pick the same index.

  • I'm pretty sure I need to have datetime in "some" index, I agree with GilaMonster as it's the main part of the query to retrieve a set of rows. I'll try it both ways just for giggles and post the results.

    Gila, thanks, I forgot about that it's basically going to get dates first. I was really pulling my hair out on that one.

  • dfrome (12/23/2011)


    I'm pretty sure I need to have datetime in "some" index

    You do, yes. Otherwise you'll get secondary filters and that's not optimal. (or, if it's a nonclustered index, key lookups and then filters, which is just horrid)

    Gila, thanks, I forgot about that it's basically going to get dates first. I was really pulling my hair out on that one.

    NP, it catches a lot of people, that's why I wrote the post.

    Can you fix the queries not to do SELECT * and, if so, are they selecting a fairly small subset of the table's columns?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Dev (12/23/2011)


    You may be right but I hesitate to include datetime columns in Clustered Index.

    Then you're ignoring one of the best clustered index options there is

    Queries are based on 3 columns so even if we create an index on 2 columns (integer) they will pick the same index.

    Which is more optimal?

    1) Seek to the start of the range of Aid = @a, bid = @b-2, date between @start and @end and then read until the end of the necessary date range and return the rows

    2) Seek to the start of the Aid = @a, bid = @b-2, read all the rows that match those two values and discard all the read rows that don't match the date filter?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail I agree to you on SEEK operation but I care about following.

    Clustered indexes are not a good choice for the following attributes:

    Columns that undergo frequent changes

    This causes in the whole row to move, because the Database Engine must keep the data values of a row in physical order. This is an important consideration in high-volume transaction processing systems in which data is typically volatile.

    Wide keys

    Wide keys are a composite of several columns or several large-size columns. The key values from the clustered index are used by all nonclustered indexes as lookup keys. Any nonclustered indexes defined on the same table will be significantly larger because the nonclustered index entries contain the clustering key and also the key columns defined for that nonclustered index.

    With Datetime column I doubt on item#1 and partially on item #2.

  • A datetime is 8 bytes, that's the same as a bigint, twice the width of an integer. That's not wide. Multi-column clustered indexes are generally not a good idea, that's why I recommended the OP go with the clustered index that I did. Even his original clustered index is only 16 bytes (2 ints and a datetime), same as a uniqueidentifier.

    A datetime that stores the date inserted (as in this case) is not going to change, that's part of why it's such a good clustered index.

    For the OP's queries he needs an index with aid, bid, date in the index key. That would be best a nonclustered index, but it depends on whether the queries need the entire table or not.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail,

    The nonclustered index is currently a PK ID, as for the table, there are many columns, but it's not OVERLY wide (there's one column that I've already suggested we need to change, because it is updated when the application runs based on certain aid,bid settings. Sometimes the query gets everything from the data (the big picture look), other times if the user only wants to see a "value" the query only selects the needed columns, but regardless the where clause is generally where aid,bid, datetime (between) is used. Now, also to keep in mind, I did a pretty extreme example, generally the rowsets returned are based on hour data, this what the user will frequently query again, so there are less rows returned. This same table (through a view) is also used for reports in which some reports may only return certain columns, where as other reports does the "big picture", in all cases the columns describe that particular bid at that point in time (from datetime). I think i've answered most of your questions and I was thinking along the same lines with having the PK be the clustered index, and the datetime,aid,bid be the nonclustered (unique) index, I might have actually already done that and wasn't satified with the performance (based on logical reads), too much Christmas stuff going on to remember! I'll post the results and let you know. Thanks for all the information especially your blog post, it was very helpful!

    Here's the current table (with nonclustered key), since I already desribed the current clustered key I won't post that again.

    CREATE TABLE [dbo].[t1](

    [Id] [bigint] IDENTITY(1,1) NOT NULL,

    [aid] [int] NOT NULL,

    [bid] [int] NOT NULL,

    [Date] [datetime] NOT NULL,

    [Value] [float] NULL,

    [String] [varchar](32) NULL,

    [Reason] [int] NULL,

    [Action] [int] NULL,

    [Method] [int] NULL,

    [col1] [bit] NOT NULL,

    [col2] [bit] NOT NULL,

    [col3] [bit] NOT NULL,

    [col4] [bit] NOT NULL,

    [col5] [bit] NOT NULL,

    [col6] [bit] NOT NULL,

    [col7] [bit] NOT NULL,

    [col8] [bit] NOT NULL,

    [col9] [bit] NOT NULL,

    [col10] [bit] NOT NULL,

    [col11] [bit] NOT NULL,

    [col12] [float] NULL,

    [col13] [bit] NOT NULL,

    [col14] [bit] NOT NULL,

    [Locked] [bit] NOT NULL,

    [col15] [bit] NOT NULL,

    [col16] [bit] NOT NULL,

    [col17] [bit] NOT NULL,

    CONSTRAINT [t1PK] PRIMARY KEY NONCLUSTERED

    (

    [Id] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

  • Primary key to clustered and a NC on the 3 columns is probably best, but the problem is that if that NC is not covering (and if all columns are selected it won't be), it'll only be used for very selective queries (read < 0.5% of the table). You may have to go with your revised clustered index (which is not idea, but about the best trade off around)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Yea, if i put those in the NC index, then the number of logical reads increases and it won't use the NC index.

    so PK ID - clustered

    date,aid,bid or aid,bid,date (NC) both have more reads (but it's not using the non-clustered index anyways).

    So it's almost like if the goal is to reduce fragmentation then moving date to be first column of clustered index is the solution.

    However, it increases logical reads.

    A 100% fragmented (thanks to Shrink!, see it is good for something!), still has less logical I/O than a completely defragmented Date,aid,bid clustered index...so maybe at least in this case fragementation really isn't an issue, since most of the queries themselves select all columns in a table? (bad design I know).

    Trying to balance size of db (more indexes may = better performance, but it increases maintenance as well as overall size of the db, then you have to worry about cpu/performance...work is never done.

    My main question was answered though, about the order of the index.

  • Don't just focus on logical reads. Look at logical reads, duration and CPU. If the duration and CPU haven't changed much then maybe that increase in logical reads is an acceptable for low fragmentation, especially if you have lots of memory (fragmentation doesn't cause higher logical reads except from low page density, which shrink won't cause. It's mainly a hit on inserts when pages split and heavier disk activity when fetching the pages from disk the first time)

    It's a trade-off to be sure, there's no perfect solution here if you can't get those queries using few enough columns that a covering nonclustered index is an option.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Dev (12/23/2011)


    Gail I agree to you on SEEK operation but I care about following.

    Clustered indexes are not a good choice for the following attributes:

    Columns that undergo frequent changes

    This causes in the whole row to move, because the Database Engine must keep the data values of a row in physical order. This is an important consideration in high-volume transaction processing systems in which data is typically volatile.

    Wide keys

    Wide keys are a composite of several columns or several large-size columns. The key values from the clustered index are used by all nonclustered indexes as lookup keys. Any nonclustered indexes defined on the same table will be significantly larger because the nonclustered index entries contain the clustering key and also the key columns defined for that nonclustered index.

    With Datetime column I doubt on item#1 and partially on item #2.

    Like Gail said, one of the best uses for a Clustered Index is on a DateTime column especially when it's a fairly static column like "TransactionDate", "StartDate" or some such. Even if you have duplicate dates, it's still great because it will SEEK on the start of a range and scan "consecutive adjacent" rows from there. Of course, if the dates aren't unique, it's good to make the first column a date column and have a second column in the CI containing some unique identifier such as an IDENTITY column.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply