September 15, 2009 at 1:42 pm
One of servers reported this, when running DBCC Checktable locally:
Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information for page (1:218034) in object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). Expected value 50_PCT_FULL, actual value 80_PCT_FULL.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 2, text ID 65938128896 is referenced by page (1:318334), slot 0, but was not seen in the scan.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 3, text ID 65938128896 is referenced by page (1:215148), slot 26, but was not seen in the scan.
Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:318334), slot 0, text ID 65938128896 is not referenced.
Msg 8929, Level 16, State 1, Line 1
Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 72057594705870848 (type In-row data): Errors found in off-row data with ID 65938128896 owned by data record identified by RID = (1:215148:26)
CHECKTABLE found 0 allocation errors and 5 consistency errors in table 'aaaaaaa' (object ID 85575343).
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (aaaaaaa.dbo.aaaaaaa).
this is a SQL 2005 server, SP2.
Oddly enough, if I run DBCC Checktable, using my desktop's SSMS (SQL 2008) and remote connect to that server, I don't get any error. Why?
The table has a clustered index on a varchar(18) column as PK, a non-clustered unique index on a datetime column, and an image type column, among other columns.
Based on Paul Randal's blog:
I tried:
alter index PK_abcde on dbo.aaaaaaaa REORGANIZE WITH ( LOB_COMPACTION = ON )
The problem still persists.
What's the proper way to fix this? Since I don't get consistent results, I now wonder if the error is actually real.
Thanks,
Jason
September 15, 2009 at 2:15 pm
The 2005/2008 SSMS difference must be an SSMS thing - but that would be a bug that you get no errors. Do you get any results when running from 2008 SSMS? Try not having the NO_INFOMSGS switch and see if you get any informational messages even.
There were a couple of LOB corruption bugs fixed in SP2 but neither results in what you're seeing (see here for details).
Can you run the following:
DBCC CHECKDB (yourdb) WITH ALL_ERRORMSGS, NO_INFOMSGS
and post the results?
I suspect there's other corruption that's linked to these localized problems.
Thanks
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
September 15, 2009 at 2:45 pm
It might not be SSMS thing. I ran DBCC CHECKDB and DBCC CHECKTABLE several times afterwards, both remotely and locally, 80% of time it returns error, but 20% of time it doesn't. Weird.
This is one of the good runs for DBCC CHECKTABLE('namemasked'), with no parameters:
DBCC results for 'aaaaaaaa'.
There are 208419 rows in 1718 pages for object "namemasked".
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
And here's the result of a bad run of DBCC CHECKDB WITH NO_INFOMSGS, ALL_ERRORMSGS
Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information for page (1:218034) in object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). Expected value 50_PCT_FULL, actual value 80_PCT_FULL.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 2, text ID 65938128896 is referenced by page (1:318334), slot 0, but was not seen in the scan.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 3, text ID 65938128896 is referenced by page (1:227396), slot 26, but was not seen in the scan.
Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:318334), slot 0, text ID 65938128896 is not referenced.
Msg 8929, Level 16, State 1, Line 1
Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 72057594705936384 (type In-row data): Errors found in off-row data with ID 65938128896 owned by data record identified by RID = (1:227396:26)
CHECKDB found 0 allocation errors and 5 consistency errors in table 'maskedname' (object ID 85575343).
CHECKDB found 0 allocation errors and 5 consistency errors in database 'masked'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (masked).
I've scheduled to have the server rebooted tonight, and have server admin to run chkdsk. The inconsistency of the DBCC runs is really annoying.
The hardware is HP ProLiant ML350 G5, probably not best for SQL, but it's in one of our field location, not in the datacenter.
September 15, 2009 at 2:58 pm
This says to me that there's an issue with the I/O subsystem randomly returning corrupt data, or maybe a weird stale-read issue from a disk or RAID controller. After the server reboot (including an I/O subsystem reboot), it'll be interesting to see whether the corruption has gone away or not.
Thanks
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
September 15, 2009 at 3:10 pm
I've asked the server admin to update all HP drivers and agents on the box. Will report the finding.
The odd thing is although when running DBCC CheckDB/Table, the errors don't show up 100% of time, everytime it generated an error, the error always pointed to the same page numbers.
At least it's consistent in this regard;-)
September 16, 2009 at 11:38 am
Here's the status: last night the server admin applied OS and HP driver patches, and ran chkdsk on all hard drives. He then rebooted the box.
The nightly DBCC checkdb run still reported the same errors that I posted previously.
This morning, I manully ran the DBCC CHECKDB on the database and DBCC CHECKTABLE on the table. I ran them 4 times. Two times neither checkdb nor checktable reported any error, while the other two times, they reported the same errors.
The sys.dm_db_index_physical_stats reports there's no fragmentation on the table, so running "alter index with rebuild" has no effect on the table.
Is there anything else that I can try?
September 16, 2009 at 11:50 am
This is very odd.
Can you do an experiment?
1) Run CHECKDB 3 times in a row and report the results.
2) Run CHECKDB 3 times in a row, but run DBCC DROPCLEANBUFFERS right before running CHECKDB. And report the results.
I'm trying to figure out whether CHECKDB is seeing the same version of the page each time. I can't think of any non-determinism in the text checking part of CHECKDB that would account for the weird behavior.
Thanks
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
September 16, 2009 at 11:59 am
This is a production server so I can't run DBCC DROPCLEANBUFFERS during the working hours. I will try these tonight when the store closes.
Thanks,
Jason
September 16, 2009 at 12:13 pm
Thought that might be the case.
One other thing to try - can you backup the database and restore it on a different server and see if the same behavior is there? That would say that the non-determinism is in CHECKDB.
I wish I could get a backup of the database to poke around in! 🙂
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
December 7, 2011 at 3:03 pm
I recall a similar experience about 9 months back. CheckDB would run successfully on DB A, then Fail on B, then fail on A then succeed on C & B. I had about 16 DBs ranging from around 1GB up to around 350GB for a total of 1.5 TB. The client had a large SAN (don't recall the configuration) and we had about 6 logical drives mapped for dedicated database uses (TempDB, Data1, ..Datan, Log).
Turned out there was something on the SAN side related to volume shadows or SAN replication. I don't recall the details but once the SAN admin looked at the results we were getting it took about 30 min to fix. Then DBCC ran fine every time.
I seem to recall Paul R. mentioning that CheckDB uses a modified DB Snapshot and our conjecture was that something on the SAN side was moving/updating pages during the process. Anyway, it won't hurt to get the SAN folks involved.
Good Luck.
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply