Roving corruption

Question

Post reply

Roving corruption

Paul Nevlud

Old Hand

Points: 329
More actions
December 20, 2006 at 1:52 pm

#175300

I recently detected some broken page errors when running DBCC CHECKDB. Here's the actual error
Server: Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 1093578934, index ID 0, page ID (1:717295). The PageId in the page header = (1:717239).
Server: Msg 8928, Level 16, State 1, Line 1
Object ID 1093578934, index ID 0: Page (1:717295) could not be processed. See other errors for details.
There are 560565 rows in 3352 pages for object 'Import'.
I've tried to repair this error using DBCC CHECKTABLE ('Import', REPAIR_ALLOW_DATA_LOSS ). Unfortunately, the CHECKTABLE command doesn't actually detect any error, thus it doesn't repair anything.
Running DBCC CHECKDB again reveals the same problem. So the next thing I tried is to drop & recreate the table. After doing this the corruption seemed to migrate to a different table!?! The first two tables this happened on were relatively minor, but eventually I'll hit some of the major tables. How can this roving corruption be stopped? The source of the problem was a RAID volume rebuild that occurred because one of the volumes became degraded.
Side note: I've pulled a copy of the DB over to a test server (with different RAID), so I can play around and work on this problem without damaging my production system. I have the ability to restore from backup, but I'd like to avoid that if possible. We're only doing simple backups here (flaming & slander are welcome on this point so that I have some additional ammo when I try to get approval to change the backup strategies.)
All help and support on this topic would be greatly appreciated.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

MohammedU SSCoach Points: 19644 More actions · Answer 1

Do you see any errors/warning is system log?

Looks like HW issue...

1. Backup your corrupted db and restore on to another server.

2. Run DBCC CHECKDB against restored db.

3. If you find any corruption repair it as you did before..

4. Run the CHECKDB again against restored db.

5. If you don't see any errors could be HW issue.

6. Run check disk (CHKDSK) on the problematic server.

Move the db to different disk if you have any to prove it is HW issue......

MohammedU
Microsoft SQL Server MVP

Paul Nevlud Old Hand Points: 329 More actions · Answer 2

The production server gives me

Error: 823, Severity: 24, State: 2

I/O error (bad page ID) detected during read at offset 0x0000015e3de000 in file 'D:\MSSQL2000\MSSQL\data\nyseg_main.mdf'.

I know the error is a result of an online RAID array rebuild. I restored a recent backup (that includes the corruption) onto my test server. The test server has a different hard disks, so I know there isn't a problem there.

What is confusing me is that I can't get the corruption to go away, it just migrates from one table to another. Can this mean that the problem is in sysobjects or something?

MohammedU SSCoach Points: 19644 More actions · Answer 3

Because it is disk problem... if you remove the one table from that problematic area on the disk by running the CHECKTABLE...

but when you insert the new data it writes to the same problematic area on the disk again then corrupts the new table or the same table..

You can't control the data where to be written..

Did you try check disk (CHKDSK)?

MohammedU
Microsoft SQL Server MVP

Paul Nevlud Old Hand Points: 329 More actions · Answer 4

I did try CHKDSK, but it didn't identify any data in bad sectors. I'm not sure how this works on RAID volumes, but it didn't identify any problems.

Also the Dell array manager shows all disks are healthy. The trigger point for the array rebuild was that the volume got degraded. After the rebuild, I haven't seen any more hardware related problems. That's not to say they aren't there, I just can't find them.

Jeff Gray SSChampion Points: 10667 More actions · Answer 5

Hi,

Assuming you understand the impact, you might try a DBCC DROPCLEANBUFFERS and re-run the DBCC CHECKDB. I've had to do that with an MSA1000 SAN until I could upgrade the firmware.

In my case, I would get very scary errors like the ones you listed, and the repair operation would strangely not do anything to fix them.

A restart of the SQL SERVER service or DROPCLEANBUFFERS made the problem go away. For a while

jg

MohammedU SSCoach Points: 19644 More actions · Answer 6

Did you use any switches while running CHKDSK?

Is your disk are SAN attached?

MohammedU
Microsoft SQL Server MVP

Paul Nevlud Old Hand Points: 329 More actions · Answer 7

Hi guys,

Sorry for the delay in replying.

1) Yes, I've tried DROPCLEANBUFFERS, I've even restarted the server. This didn't resolve the problem

2) I didn't use any switches when using ChkDsk. I just wanted to run in read-only mode to identify bad sectors on the disk. Nothing showed up.

I should mention that I am troubleshooting on completely different hardware than were the problem originate. I've pulled the DB over to my test server to troubleshoot the issue.

UPDATE: I've identified a torn page in the database that is the source of the problem (I think). The PageID and PageNo don't match. When I run

dbcc traceon(3604) --instructs DBCC commands to display to the screen

dbcc page (dbname, 1,717295,3)

I get a page that identifies itself as ID 717239, not 717295. This page has no data in it.

I'd like to delete this page from the DB file. DBCC commands aren't getting rid of it though when I run DBCC CHECKDB('DBName', REPAIR_ALLOW_DATA_LOSS). Any other suggestions about how to get rid of this page?

Thanks so much for the help so far!

Paul

MohammedU SSCoach Points: 19644 More actions · Answer 8

You can't get rid of a page...either checktable should remove or copy the data to new table and delete the problematic table to resolve this page issue

MohammedU
Microsoft SQL Server MVP

alen teplitsky SSC-Dedicated Points: 30014 More actions · Answer 9

what about running checkdb with allow data loss?

around 6 months back we had corruption and I think that's what fixed it. checktable didn't pick anything up

Balmukund SSCertifiable Points: 6648 More actions · Answer 10

Agree with Md. Uddin. You cannot get rid of the page unless table is dropped (i think).

Are you able to select data from the table? If yes, feel happy, move the data to new table and drop the "bad" table. I feel that you may not as you are getting error for index id = 0 which is data page.

CheckTable is smart enough to identify that if it can be fixed. If there are no recommendations then you are hitting the wall.

Minaz Amin SSChampion Points: 11052 More actions · Answer 11

Agreed with Lakhani, DTS the data from the database to another and ran CHECKDB on it.

Error 823 is generally a logical I/O error.

You can always check the problematic disk by running SQLIOStress utility which will report for the I/O Error.

http://support.microsoft.com/kb/231619/

Minaz Amin

"More Green More Oxygen !! Plant a tree today"

Paul Nevlud Old Hand Points: 329 More actions · Answer 12

Hi everyone, sorry for keeping quiet for the past week on this topic. I really appreciate all the advice. Here is the status so far.

I'm working from Mr. Uddin's guidance mostly. He is right that I can't drop the specific page that I want to. The bummer is that I can query data from the table, but I can't DTS or BCP the whole batch out. As soon as I try to do any large data pump task, it fails on me. Even when I try to avoid the area of data where the bad page is!

The solution that I am working on is to do an object level restore of the single corrupt table (thank you SQL LiteSpeed!). The torn page exists in my interval data table, which is no big deal because I can reload the raw data. (We keep the raw data for a period of time for this reason). By doing this, I won't loose any of the user interactions with the database (like login history, preference changes, setup of reports, etc...).

I'm testing this now in my staging environment. I'll post an update when succeed (or I get stumped again).

To answer a few other posts: 1) Unfortunately CheckTable doesn't give me any help, I don't know why. 2) The physical disks are OK, however this problem was caused by a logical volume rebuild. That problem is getting addressed by the HW guys. 3) DBCC DropCleanBuffers and CheckDB (..., REPAIR_ALLOW_DATA_LOSS) didn't help me either.

Again, thanks to everyone for the support.

Paul