How are RANGE_HI_KEYs chosen?

  • I've been poking around at statistics, in an effort to be less dumb.

    One thing that always strikes me is how there's seemingly no uniformity in the histograms for my tables. I keep stats pretty up to date with nightly maintenance.

    What I'm not sure about is how histograms get created. I can't pick up a pattern in my data that makes this histogram make sense. Can anyone explain, or point me to a resource that does? I can't find anything that answers my question the way I'm phrasing it, so maybe I'm asking the wrong question. I've read that there's a 200-bucket max for a table/index. Does SQL do something to minimize the number of buckets in a histogram based on... something? This seems like remarkably few buckets for the size of the table.

    The histogram below is from a table with 6,149,132 rows, on the PK/clustered index; a quick note on how I'm reading the steps follows the table.

    Thanks

    RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
    1003663       0           38       0                    1
    1048182       22057       1        11973                1.842228
    1188504       279944      4        140321               1.995026
    1201233       49815       10       12728                3.913812
    1242621       162301      3        41387                3.921546
    1524233       1076179     3        281222               3.826795
    1532024       27542       2        7790                 3.535558
    1542402       32332       1        10377                3.115737
    6041219       4498234     1        4498231              1.000001
    6041827       661         4        607                  1.088962
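
    Just to show how I'm reading those steps (the table and column names below are placeholders, not my real schema): a lookup on a value that is itself a RANGE_HI_KEY should get that step's EQ_ROWS as its row estimate, and a value that falls between two boundaries should get the step's AVG_RANGE_ROWS.

    -- Rough sketch only; dbo.MyTable and Id are placeholder names.
    -- Id = 1188504 is a RANGE_HI_KEY above, so the estimate should come from
    -- that step's EQ_ROWS (4 rows).
    SELECT * FROM dbo.MyTable WHERE Id = 1188504;

    -- Id = 1300000 falls inside the step whose upper bound is 1524233, so the
    -- estimate should come from that step's AVG_RANGE_ROWS (about 3.8 rows).
    SELECT * FROM dbo.MyTable WHERE Id = 1300000;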

  • Quick thought: the histogram represents information on data distribution and frequency. The optimizer preserves what it regards as the most useful of that information and reduces the number of buckets (steps) it keeps, even for larger tables. Check out DBCC SHOW_STATISTICS (BOL) for details.
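
    Something along these lines will dump just the histogram steps; the object and statistics names here are placeholders, swap in your own:

    -- Placeholder names; WITH HISTOGRAM limits the output to the steps.
    DBCC SHOW_STATISTICS ('dbo.MyTable', 'PK_MyTable') WITH HISTOGRAM;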

    😎

  • Yeah! I was reading that, and this part would totally make the non-uniformity make sense if it were grouping on ALL the columns in the index, not just the first:

    Histogram

    A histogram measures the frequency of occurrence for each distinct value in a data set. The query optimizer computes a histogram on the column values in the first key column of the statistics object, selecting the column values by statistically sampling the rows or by performing a full scan of all rows in the table or view. If the histogram is created from a sampled set of rows, the stored totals for number of rows and number of distinct values are estimates and do not need to be whole integers.

    To create the histogram, the query optimizer sorts the column values, computes the number of values that match each distinct column value and then aggregates the column values into a maximum of 200 contiguous histogram steps. Each step includes a range of column values followed by an upper bound column value. The range includes all possible column values between boundary values, excluding the boundary values themselves. The lowest of the sorted column values is the upper boundary value for the first histogram step.

    But, since it just uses the first key column, why doesn't it shoot for a more even distribution across the histogram? If my brain were the query engine, I would have a much easier time writing down a query plan based on evenly distributed histogram values, especially if the query required values from overlapping ranges. It just seems peculiar to me.
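
    To make that concrete, this is roughly the picture I had in my head: 200 buckets with the same number of rows in each. To be clear, this is NOT how SQL Server builds its histogram, just an illustration, and dbo.MyTable / Id are placeholder names.

    -- Mock up 200 equal-height steps over the key column and report each
    -- step's upper bound and row count.
    SELECT  MAX(Id)  AS range_hi_key,
            COUNT(*) AS rows_in_step
    FROM    (SELECT Id, NTILE(200) OVER (ORDER BY Id) AS step_no
             FROM dbo.MyTable) AS t
    GROUP BY t.step_no
    ORDER BY range_hi_key;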

    Thanks

  • The average range rows and the equality rows seem to be pretty accurately distributed. They're roughly the same across the steps, except for that 38 outlier.

    The statistics are supposed to be balanced, generating up to 200 steps that are evenly distributed across the data. But, the data is never evenly distributed, so the statistics never are either. But, in your case, it's pretty darned close. To my knowledge, the exact mechanism for arriving at the 200 steps is secret sauce inside the cardinality estimation process within SQL Server.
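
    If you want to check step counts and sampling across your stats without eyeballing DBCC output, something like this should work on 2008 R2 SP2 / 2012 SP1 and later (the table name is a placeholder):

    -- Step count, row counts, and sample size for every statistics object on a table.
    -- dbo.MyTable is a placeholder name.
    SELECT  s.name AS stats_name,
            sp.rows,
            sp.rows_sampled,
            sp.steps,
            sp.last_updated
    FROM    sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE   s.object_id = OBJECT_ID('dbo.MyTable');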

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (10/9/2014)


    The average range rows and the equality rows seem to be pretty accurately distributed. They're roughly the same across the steps, except for that 38 outlier.

    The statistics are supposed to be balanced, generating up to 200 steps that are evenly distributed across the data. But, the data is never evenly distributed, so the statistics never are either. But, in your case, it's pretty darned close. To my knowledge, the exact mechanism for arriving at the 200 steps is secret sauce inside the cardinality estimation process within SQL Server.

    Turtles all the way down, I'm sure.

  • sqldriver (10/9/2014)


    Grant Fritchey (10/9/2014)


    The average range rows and the equality rows seem to be pretty accurately distributed. They're roughly the same across the steps, except for that 38 outlier.

    The statistics are supposed to be balanced, generating up to 200 steps that are evenly distributed across the data. But, the data is never evenly distributed, so the statistics never are either. But, in your case, it's pretty darned close. To my knowledge, the exact mechanism for arriving at the 200 steps is secret sauce inside the cardinality estimation process within SQL Server.

    Turtles all the way down, I'm sure.

    Since it's all a series of guesses, that's as good a guess as any.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning
