February 24, 2009 at 7:28 pm
Hey everyone,
Got (what I think is) a tricky issue for any takers:
Without warning performance dropped off a cliff on one of our servers this week. Further analysis revealed that all STATS objects on one of our main tables were...um...corrupted is the best word I can use to describe it. Therefore, the optimizer was choosing terrible plans. I fixed the objects by rebuilding the offending clustered index and performance became acceptable. Now, I'm trying to determine why the stats objects (and the row count for the clustered index) were so messed up, and what might have potentially caused the scenario. I've got a backup of the DB in the erroneous state, so I can do any debugging you all can point me to...
Here's the output of a query against sys.dm_db_partition_stats for the table in question:
partition_id object_id index_id partition_number in_row_data_page_count row_count
-------------------- ----------- ----------- ---------------- ---------------------- --------------------
72057594160414720 624721278 1 1 13039 0
72057594161070080 624721278 2 1 861 254965
72057594160283648 624721278 3 1 973 254965
the non-clustered row counts are right...the clustered one is obviously not. SELECT COUNT(*) confirms the row count at 250K.
In addition, the Statistics objects aren't right: e.g.
select STATS_DATE(object_id('dbo.
produces NULL (and I'm sysadmin so I definitely have rights).
DBCC SHOW_STATISTICS produces no output:
Name Updated Rows Rows Sampled Steps Density Average key length String Index
-------------------------------------------------------------------------------------------------------------------------------- -------------------- -------------------- -------------------- ------ ------------- ------------------ ------------
_WA_Sys_00000002_253C7D7E NULL NULL NULL NULL NULL NULL NULL
(1 row(s) affected)
All density Average Length Columns
------------- -------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(0 row(s) affected)
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ------------- ------------- -------------------- --------------
(0 row(s) affected)
This output of STATS_DATE and SHOW_STATISTICS is true for any stats object on the table, whether index-created or auto-created by SQL for non-indexed columns.
Running UPDATE STATISTICS against the stats objects finishes immediately but doesn't actually affect the stats object; i.e., STATS_DATE still returns NULL and SHOW_STATISTICS still returns this (essentially empty) result set.
I wonder if the UPDATE STATISTICS command was actually doing anything...I speculate that SQL saw that the table had "0 rows" via the DMV (or rather, the internal data structure that the DMV exposes) and just aborted the UPDATE STATISTICS run without really doing anything.
I ultimately resolved this via an ALTER INDEX...REBUILD of the clustered index, which I assume flushed out any of the internal metadata objects the DMVs are built on, and then re-constructed them with correct counts.
(by the way, we are running a nightly maintenance job that consists of a REBUILD INDEX on all the indexes in the database, then an update of all stats. This maint job did not fail, but as I demonstrated with the manual UPDATE STATISTICS, didn't really catch anything).
This bug, fixed by MS in 2008, is similar, but our table is not partitioned.
http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=329164
So, anyways, I guess I'm looking for "root cause", or for more advanced troubleshooting that I can do, in order to tell the decision-makers that we have root-cause and a way to prevent it. Any ideas you have would be great!
Thanks,
Aaron
February 24, 2009 at 7:39 pm
Correction:
The maintenance plan starts around 11:00pm, nightly.
First, the REORGANIZE Index task runs. (we could change to Rebuild if appropriate)
Then, the Update Statistics task runs
then backup job runs.
Sunday Night's backup job contains the "corrupt" state...Friday night's backup contains a correct state. Minimal work happens on the weekend.
I suppose it is possible the Reorg step introduced a problem...or it could have been one of our Bulk Imports that run on the weekend. Again, looking for any ideas you might have for root cause or advanced troubleshooting rather than an authoritative answer, since that will be pretty difficult in this situation
February 24, 2009 at 7:43 pm
What happens when you run UPDATE STATISTICS {tablename} WITH FULLSCAN ?
* Noel
March 3, 2009 at 10:49 am
Hey Noel,
thanks for answering...
no difference.
March 3, 2009 at 1:01 pm
Sorry I should have posted: UPDATE STATISTICS {tablename} WITH FULLSCAN,ALL
* Noel
March 3, 2009 at 1:42 pm
thanks again...yeah, I ran that command as soon as I realized all the stats objects were "corrupted" or "non-operational" or whatever.
because SQL showed the clustered index as having 0 rows, I speculate that the UPDATE STATISTICS command took some exit branch early in its code before it actually started scanning the table rows...obviously, don't have a way to prove that however...
March 3, 2009 at 1:49 pm
Is the value updated now and not zero. Interesting thread.
You could update the database and set to auto update statistics sync set.
March 3, 2009 at 1:54 pm
Thanks Tracey,
yeah, the DB is set for Auto Update Stats (synchronous) already... again, I don't think it is a problem with the UPDATE STATISTICS logic (either manually executed or executed by SQL's optimizer) as much as it is a problem with some internal metadata object somewhere...(the metadata object that sys.dm_db_partition_stats and dbo.sysindexes pull from)
I'm totally at a loss to even explain how that corruption would occur in the first place...seems like one of those "perfect storm" situations where things must have happened in just the right order...
If it happens again, we'll go to Microsoft...but because of PHI in our dbs, we have a bunch of legal hoops to jump through before that can be an option.
March 3, 2009 at 2:48 pm
Glad it working now for you...and you figured it out what problem was good find...
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply