DBCC CHECKTABLE found data consistency errors

Question

DBCC CHECKTABLE found data consistency errors

mojo-168709

SSCommitted

Points: 1825
More actions
September 15, 2009 at 1:42 pm

#209933

One of servers reported this, when running DBCC Checktable locally:
Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information for page (1:218034) in object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). Expected value 50_PCT_FULL, actual value 80_PCT_FULL.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 2, text ID 65938128896 is referenced by page (1:318334), slot 0, but was not seen in the scan.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 3, text ID 65938128896 is referenced by page (1:215148), slot 26, but was not seen in the scan.
Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:318334), slot 0, text ID 65938128896 is not referenced.
Msg 8929, Level 16, State 1, Line 1
Object ID 85575343, index ID 1, partition ID 72057594701611008, alloc unit ID 72057594705870848 (type In-row data): Errors found in off-row data with ID 65938128896 owned by data record identified by RID = (1:215148:26)
CHECKTABLE found 0 allocation errors and 5 consistency errors in table 'aaaaaaa' (object ID 85575343).
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (aaaaaaa.dbo.aaaaaaa).
this is a SQL 2005 server, SP2.
Oddly enough, if I run DBCC Checktable, using my desktop's SSMS (SQL 2008) and remote connect to that server, I don't get any error. Why?
The table has a clustered index on a varchar(18) column as PK, a non-clustered unique index on a datetime column, and an image type column, among other columns.
Based on Paul Randal's blog:
http://www.sqlskills.com/BLOGS/PAUL/post/Corruption-bug-that-people-are-hitting-Msg-8914-PFS-free-space.aspx
I tried:
alter index PK_abcde on dbo.aaaaaaaa REORGANIZE WITH ( LOB_COMPACTION = ON )
The problem still persists.
What's the proper way to fix this? Since I don't get consistent results, I now wonder if the error is actually real.
Thanks,
Jason

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Paul Randal One Orange Chip Points: 29544 More actions · Answer 1

The 2005/2008 SSMS difference must be an SSMS thing - but that would be a bug that you get no errors. Do you get any results when running from 2008 SSMS? Try not having the NO_INFOMSGS switch and see if you get any informational messages even.

There were a couple of LOB corruption bugs fixed in SP2 but neither results in what you're seeing (see here for details).

Can you run the following:

DBCC CHECKDB (yourdb) WITH ALL_ERRORMSGS, NO_INFOMSGS

and post the results?

I suspect there's other corruption that's linked to these localized problems.

Thanks

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

mojo-168709 SSCommitted Points: 1825 More actions · Answer 2

It might not be SSMS thing. I ran DBCC CHECKDB and DBCC CHECKTABLE several times afterwards, both remotely and locally, 80% of time it returns error, but 20% of time it doesn't. Weird.

This is one of the good runs for DBCC CHECKTABLE('namemasked'), with no parameters:

DBCC results for 'aaaaaaaa'.

There are 208419 rows in 1718 pages for object "namemasked".

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

And here's the result of a bad run of DBCC CHECKDB WITH NO_INFOMSGS, ALL_ERRORMSGS

Msg 8914, Level 16, State 1, Line 1

Incorrect PFS free space information for page (1:218034) in object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). Expected value 50_PCT_FULL, actual value 80_PCT_FULL.

Msg 8965, Level 16, State 1, Line 1

Table error: Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 2, text ID 65938128896 is referenced by page (1:318334), slot 0, but was not seen in the scan.

Msg 8965, Level 16, State 1, Line 1

Table error: Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:218034), slot 3, text ID 65938128896 is referenced by page (1:227396), slot 26, but was not seen in the scan.

Msg 8964, Level 16, State 1, Line 1

Table error: Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 71781727326896128 (type LOB data). The off-row data node at page (1:318334), slot 0, text ID 65938128896 is not referenced.

Msg 8929, Level 16, State 1, Line 1

Object ID 85575343, index ID 1, partition ID 72057594701676544, alloc unit ID 72057594705936384 (type In-row data): Errors found in off-row data with ID 65938128896 owned by data record identified by RID = (1:227396:26)

CHECKDB found 0 allocation errors and 5 consistency errors in table 'maskedname' (object ID 85575343).

CHECKDB found 0 allocation errors and 5 consistency errors in database 'masked'.

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (masked).

I've scheduled to have the server rebooted tonight, and have server admin to run chkdsk. The inconsistency of the DBCC runs is really annoying.

The hardware is HP ProLiant ML350 G5, probably not best for SQL, but it's in one of our field location, not in the datacenter.

Paul Randal One Orange Chip Points: 29544 More actions · Answer 3

This says to me that there's an issue with the I/O subsystem randomly returning corrupt data, or maybe a weird stale-read issue from a disk or RAID controller. After the server reboot (including an I/O subsystem reboot), it'll be interesting to see whether the corruption has gone away or not.

Thanks

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

mojo-168709 SSCommitted Points: 1825 More actions · Answer 4

I've asked the server admin to update all HP drivers and agents on the box. Will report the finding.

The odd thing is although when running DBCC CheckDB/Table, the errors don't show up 100% of time, everytime it generated an error, the error always pointed to the same page numbers.

At least it's consistent in this regard;-)

mojo-168709 SSCommitted Points: 1825 More actions · Answer 5

Here's the status: last night the server admin applied OS and HP driver patches, and ran chkdsk on all hard drives. He then rebooted the box.

The nightly DBCC checkdb run still reported the same errors that I posted previously.

This morning, I manully ran the DBCC CHECKDB on the database and DBCC CHECKTABLE on the table. I ran them 4 times. Two times neither checkdb nor checktable reported any error, while the other two times, they reported the same errors.

The sys.dm_db_index_physical_stats reports there's no fragmentation on the table, so running "alter index with rebuild" has no effect on the table.

Is there anything else that I can try?

Paul Randal One Orange Chip Points: 29544 More actions · Answer 6

This is very odd.

Can you do an experiment?

1) Run CHECKDB 3 times in a row and report the results.

2) Run CHECKDB 3 times in a row, but run DBCC DROPCLEANBUFFERS right before running CHECKDB. And report the results.

I'm trying to figure out whether CHECKDB is seeing the same version of the page each time. I can't think of any non-determinism in the text checking part of CHECKDB that would account for the weird behavior.

Thanks

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

mojo-168709 SSCommitted Points: 1825 More actions · Answer 7

This is a production server so I can't run DBCC DROPCLEANBUFFERS during the working hours. I will try these tonight when the store closes.

Thanks,

Jason

Paul Randal One Orange Chip Points: 29544 More actions · Answer 8

Thought that might be the case.

One other thing to try - can you backup the database and restore it on a different server and see if the same behavior is there? That would say that the non-determinism is in CHECKDB.

I wish I could get a backup of the database to poke around in! 🙂

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

Ray Herring SSCertifiable Points: 5533 More actions · Answer 9

I recall a similar experience about 9 months back. CheckDB would run successfully on DB A, then Fail on B, then fail on A then succeed on C & B. I had about 16 DBs ranging from around 1GB up to around 350GB for a total of 1.5 TB. The client had a large SAN (don't recall the configuration) and we had about 6 logical drives mapped for dedicated database uses (TempDB, Data1, ..Datan, Log).

Turned out there was something on the SAN side related to volume shadows or SAN replication. I don't recall the details but once the SAN admin looked at the results we were getting it took about 30 min to fix. Then DBCC ran fine every time.

I seem to recall Paul R. mentioning that CheckDB uses a modified DB Snapshot and our conjecture was that something on the SAN side was moving/updating pages during the process. Anyway, it won't hurt to get the SAN folks involved.

Good Luck.