Indexes & statistics interactions

  • I have a DB that is used in ways that are "not normal" for a DB and that has lead to issues under SQL 2005 related to its statistics handling. Turning off auto-create stats and auto-update stats lead to over a 50% performance improvement in the application (this was done in consultation with Microsoft). Since no automatic updating occurs, the application has an intra-day procedure that runs which executes stats updates after a user has made data changes. All tables have clustered indexes on them, a handful of tables have additional non-clustered indexes as well. Now I am trying to go in and figure out what, if any, stats outside of the indexes would actually help the application. But, I'm struggling a bit with my understanding of the interactions between indexes and stats – how do I determine when a stat, outside of the indexes, might help more than it might hurt?

    Examples:

    1.)There is a field, RS, that exists in every table; it is defined as an int. This field only holds one of 3 values, typically 90% or more of records in a table will have the value set to 1 – this becomes 100% during nightly maintenance activities, but during the day additional record additions can change that percentage. The field is included in virtually every query, limiting to RS = 1. The other fields in the WHERE and JOIN clauses in a typical query would be using fields from whatever the clustered and/or non-clustered indexes are on that table. SQL complains about missing stats on the RS field. But, with what I know of the actual distribution of the data, I suspect that this would be a field that a stat on it will cost more to keep updated than any value it may add.

    2.)When a query includes "SomeField IS NOT NULL", SQL typically complains that it has no stats on that field. This is not going to be a field that is in any of the existing indexes on a table. Since the query isn’t looking for a specific value and I know that the vast majority of records have a value there (but for some business rule reasons those without a value are being excluded here), is there any performance/efficiency gain by having stats for such a field?

    (This is similar to case 1, where the typical query is looking for the majority value case for the field at issue, not the minority case).

    3.)A query that did use a clustered index seek, but the query plan also showed missing stats on one of the fields. The field is included in the clustered index.

    Clustered index on: Table_ID, Field_ID, Record_ID, L_ID

    Query: FROM dbo.KeyValueTable WHERE Table_ID = 22 AND Field_ID = @FieldID AND L_ID = @LID AND Record_ID IN (1222, 451) AND RS = 1

    SQL complained of missing stats on Record_ID.

    For this one I can see a case for stats on Record_ID – but this is probably the single largest table in the system so I am hesitant to add additional overhead on maintaining data for this table.

    None of the fields included in the index are modified after the record is created. The Table_ID field is the best "grouping" field for the index, there are roughly 500 Table_ID values for the application; this does not vary. Record_IDs are only unique within a Table_ID, the number belonging to a Table_ID can vary from as few as 2 records to many thousands, they may or may not repeat in other Table_IDs. Field_IDs are unique, there are many Field_IDs, each belongs to only one Table_ID, and the number of Field_IDs can vary as the application modifies field information. For a typical installation there is only one L_ID, there are some installations with 2 - 3, but there aren't any with a large number of different L_IDs.

    A typical query is looking for multiple Record_IDs that belong to a specified Table_ID, Field_ID, L_ID combination. Rarely the Record_ID is determined in an IN clause or based on a variable (Record_ID = @Var), most often it is part of the INNER JOIN criteria linking this table to some other set of data (Record_ID = OtherTable.SomeFieldName).

    I'm not really looking for a black & white answer, more for some general guidelines to limit the number of things I try.

    Other considerations (and yes, I am very aware this is not ideal, but it is what it is) –

    I am on a very tight timeline (about a week), so really don't have much time to do trial & error approach.

    I have a single processor machine with 2Gig of RAM and a single hard drive where I can try different approaches. This does not have enough room for more than 1-3 examples of installations. For testing with a larger installation, then its just 1. I have some limited access to a test environment with dual processors, but it presents set-up challenges that take up time I don't have much of. Production is typically quad-processor, lots of RAM, efficient disk arrays, etc. (i.e. looks nothing like where I am testing).

  • After doing more reading on selectivity, I think i understand more, but am still not entirely sure of how to address my third case.

    Since stats are used to calculate the cost of various query operators, estimating rowcounts, that leads me to conclude that adding stats on any column where I know the "typical" query looks at the majority value is not going to be of real use (my examples 1 & 2).

    My third case I'm still unsure of, since Record_ID alone could, for some cases, actually be more selective than the left most fields of the clustered index. But, the cluster d index as a whole is equally selective (posisbly more for some installations) since only one index entry will likely meet the query criteria.

    Is there some measure of density/selectivity that I can use as a general guide to try to determine if a separate stat on Record_ID will really help SQL to make better estimates?

  • I'm not aware of a formula that will do what you want, although if there is one, I'd suggest swinging by Adam Machanic's blog to look for it. By & large, I haven't found that adding statistics all by themselves, makes that much of a difference in execution plans and query performance. However, experimentation, trial & error, seem to be the way to go, along with some educated guesses (like your idea that 1 & 2 just aren't going to have good distribution of statistics in any way that will prove helpful). The one suggestion I'd make at this point would be to look at the possibility of compound indexes, more than one column to help with the issue. Then, you can use the 1&2 valued column in combination with something else, making the two more selective than either would have been on their own. But that assumes there are combinations that will benefit your queries.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I'm not sure how much it will help, but you could consider the following -

    If the optimizer says that a stat is missing then it would have liked to use it - meaning that there is a benefit to adding it. I'm not sure if you can tell how big the benefit will be.

    The downsides of having the stat are

    a) maintenance (in your case the time it takes to update the stat daily)

    b) non-optimal plans when the stat is outdated

    c) non-optimal plans when the stat is misleading due to the maximum of 200 histogram steps

    So I would say that if you can live with (a) and don't think (b) or (c) will be an issue then add it.

  • "meaning that there is a benefit to adding it"

    Not really. SQL usually wants stats on anything that "might" have benefit. It wants the stats on the data distribution in that column to decide whether it would benefit from using it or not. From what I had been seeing, for SQL that will be any field used in a JOIN or WHERE clause that is not indexed (or one of the left most fields in an index).

    In one case it created stats on a column that was just returned in a select and was not ordered or part of any condition (that is one I'm mildly curious about, but honestly know I won't have time to go back and turn on auto-create stats to see if that happens again or not - that was 2 service packs ago).

  • Kimberly.Hahn (4/8/2009)


    "meaning that there is a benefit to adding it"

    Not really. SQL usually wants stats on anything that "might" have benefit. It wants the stats on the data distribution in that column to decide whether it would benefit from using it or not. From what I had been seeing, for SQL that will be any field used in a JOIN or WHERE clause that is not indexed (or one of the left most fields in an index).

    Doesn't the fact that optimizer thinks it might be useful mean that it is, even if it is purely to allow the optimizer to make a fully-informed decision rather than having to guess? This really comes back to why you had so many stats being auto-created that a problem resulted; just guessing but do you have a lot of adhoc queries each running against different columns?

  • This system doesn't have many ad-hoc queries, reporting is a minor aspect of the day-to-day use. What is there is actually highly consistent, does a lot of data manipulation that includes often emptying key tables, and is just rather complex. The fact that tables are emptied is a big issue for SQL in terms of stats.

    The high level big issue was having over 500 tables in a DB and over a hundred DBs on a server in production simply can't support maintaining the number of stats SQL wanted. If I recall correctly it took only a couple of DBs, each with only a couple of users simultaneously to cause real issues (with both auto-stat settings to on). This testing was done with a Microsoft Field Engineer present to help address the problems that appeared with the migration to 2005.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply