November 16, 2021 at 6:29 am
Hello
We have this issue on one of our Prod databases.
The database is fully operational but our DBCC checkdb job is failing with the error below:
Msg 8921, Level 16, State 1, Line 8
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
Tempdb is 100% empty and hasn't moved at all. It is set to autogrow in 64 MB increments on all 4 data files, and the log is set to unlimited growth.
When we run DBCC CHECKDB manually, it always gets to 46.9% and then fails.
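In case it helps, this is roughly how we checked the tempdb side of it (file sizes/growth settings, plus the tempdb space CHECKDB estimates it needs). Untested as pasted here, and the database name is the one from the error below:

-- Current size and growth settings of the tempdb files
SELECT name, type_desc,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1 THEN growth ELSE growth * 8 / 1024 END AS growth_mb_or_pct,
       max_size
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');

-- Estimate how much tempdb space CHECKDB would need for this database
DBCC CHECKDB ('Somedbnamehere') WITH ESTIMATEONLY;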
We see a lot of errors in the Event Viewer > System log, as follows:
The device, \Device\Harddisk1\DR1, has a bad block.
There was a problematic disk in the vSAN storage but we have replaced it and checked all storage devices for errors.
In msdb we see 3 suspect pages.
One of the jobs for this database fails like this:
Time-out occurred while waiting for buffer latch type 2 for page (1:488894), database ID 11.
Another job fails with this:
The operating system returned error 23(Data error (cyclic redundancy check).) to SQL Server during a read at offset 0x000000eeb7c000 in file 'D:\MSSQL\MSSQL13.MSSQLSERVER\MSSQL\Data\Somedbnamehere.mdf
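If it helps narrow things down, dividing the failing offset by the 8 KB page size maps it to a page number, and it lands on the same page as the latch timeout above:

-- Data pages are 8 KB, so offset / 8192 = page number within the file.
-- 0x000000EEB7C000 = 4,005,019,648 bytes; 4,005,019,648 / 8192 = 488894,
-- i.e. the same page (1:488894) from the buffer latch timeout.
SELECT CONVERT(bigint, 0x000000EEB7C000) / 8192 AS page_in_file;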
Any suggestions on this?
Thanks in advance!
November 16, 2021 at 3:58 pm
To me, that sounds like the disk had bad sectors, which resulted in data getting corrupted in the system databases and possibly the user databases.
First, I would put the database into single-user mode and make sure you are that single user. You don't want people going in and changing data while you are trying to fix things.
Next, I would run CHECKDB on master and HOPE it comes back clean, then repeat for model and msdb. I would run it manually just so you can verify the results yourself and begin working on recovery.
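Something along these lines (untested here, and swap in your real database name - I'm using the one from the mdf path in your error):

-- Keep everyone else out of the damaged user database while you work on it
ALTER DATABASE [Somedbnamehere] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Check the system databases first and hope they come back clean
DBCC CHECKDB ('master') WITH NO_INFOMSGS, ALL_ERRORMSGS;
DBCC CHECKDB ('model')  WITH NO_INFOMSGS, ALL_ERRORMSGS;
DBCC CHECKDB ('msdb')   WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Then the user database itself
DBCC CHECKDB ('Somedbnamehere') WITH NO_INFOMSGS, ALL_ERRORMSGS;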
You have 3 suspect pages in msdb, so I see a few options here. You could restore msdb from a backup, you could run CHECKDB with REPAIR_ALLOW_DATA_LOSS, or you could rebuild msdb and then recreate all the jobs that would disappear with the rebuild. Which option you go with is entirely up to you. I would not recommend REPAIR_ALLOW_DATA_LOSS, because you could end up with jobs that have missing steps. I like the restore from backup OR the rebuild and recreate the jobs from the copies of the scripts you have in source control... you do have your job scripts in source control (or in scripts somewhere) so you can recreate them, right?
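Before picking one, it is worth confirming exactly which database and file those 3 flagged pages actually belong to. Something like this (untested) lists them:

-- Pages SQL Server has flagged as suspect, with the database and file they belong to
SELECT DB_NAME(sp.database_id) AS database_name,
       sp.file_id,
       sp.page_id,
       sp.event_type,   -- 1 = 823 error, 2 = bad checksum, 3 = torn page, 4 = restored, 5 = repaired
       sp.error_count,
       sp.last_update_date
FROM msdb.dbo.suspect_pages AS sp
ORDER BY sp.last_update_date DESC;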
With the user database, that one is going to be more challenging to fix. I would start by restoring a known (or suspected) good backup onto a test system and re-running CHECKDB to ensure it is in a good state. At that point you may have lost some data due to data changes throughout the day. If you can recreate the data, do so on the test system, then restore it to live (after taking a backup of live).
If too much data has changed, OR you don't have a way to put the data back in, then I would be looking at other methods. Since you are getting a data error rather than an error on a specific page, that may be harder to correct, or may not be possible to correct. I've never had success fixing cyclic redundancy check errors inside or outside of SQL Server; usually I end up just removing the file/folder with the error if I can. You MAY have success running chkdsk (the Windows tool, not a SQL Server tool) while the database is offline, but it will take time to complete and MAY leave the database more corrupt than before. So take a backup and cross your fingers you can recover the bad data.
Worst case, there are a few consultants out there who specialize in database data recovery. They MAY be able to get the database going again with minimal data loss (possibly none), but they do cost a bit of money, and they may not be able to recover everything.
It MAY be that your bad pages are JUST on nonclustered indexes, and an index rebuild may fix it. BUT if the corruption is on the clustered index/heap and you rebuild your index, you may have just blown away the only good copy of the data.
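If you want to check whether a flagged page belongs to a nonclustered index before deciding on a rebuild, DBCC PAGE will show you the object and index it belongs to (undocumented but widely used). The file/page numbers below are just the ones from your latch error, so treat this as a sketch:

-- Send DBCC PAGE output to the client instead of the error log
DBCC TRACEON (3604);

-- DBCC PAGE (database, file_id, page_id, print_option)
-- Look at the Metadata: ObjectId and Metadata: IndexId lines in the page header.
-- IndexId 0 = heap, 1 = clustered index, > 1 = nonclustered index.
DBCC PAGE ('Somedbnamehere', 1, 488894, 0);

DBCC TRACEOFF (3604);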
Database data recovery can be tricky and needs to be done carefully to prevent permanent data loss, so step 1 is to take a backup of the current state so you can get things back to how they are now if you make them worse while trying to fix them. And before making any changes, I recommend restoring that backup to a test system and making your changes on the copy, so you can always get back to your current state. Once you have fixed the problems, you can restore your "fixed" database over the top of the broken one, OR you can take the database offline and swap the data files. But I would not make any actual changes on the LIVE broken database, just so you don't cause further harm.
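A rough sketch of that "snapshot the current state and work on a copy" step. The paths and logical file names here are placeholders - check the real ones with RESTORE FILELISTONLY first:

-- Snapshot the broken database as it is right now, without disturbing the backup chain
BACKUP DATABASE [Somedbnamehere]
TO DISK = N'D:\Backups\Somedbnamehere_broken.bak'
WITH COPY_ONLY, CHECKSUM, CONTINUE_AFTER_ERROR;

-- Restore it under a different name (ideally on a test server) to experiment on
RESTORE DATABASE [Somedbnamehere_repair]
FROM DISK = N'D:\Backups\Somedbnamehere_broken.bak'
WITH MOVE 'Somedbnamehere'     TO N'D:\TestData\Somedbnamehere_repair.mdf',
     MOVE 'Somedbnamehere_log' TO N'D:\TestData\Somedbnamehere_repair.ldf',
     RECOVERY;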
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
December 7, 2021 at 7:30 am
We managed to fix the issue with a restore from the last full backup and around 160 transaction log backups (since full backups were failing due to the corruption).
The data file itself we couldn't copy or archive as it was giving us errors, so we had to delete it... the root cause remained unclear.
No data loss in this case, but we had a couple of hours of downtime due to the large number of log backups that had to be restored one by one, each taking 40 seconds to 2 minutes...
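For anyone hitting the same thing, the restore was basically this pattern (file names and paths here are placeholders, not our real ones):

-- Full backup first, leaving the database ready to accept log restores
RESTORE DATABASE [Somedbnamehere]
FROM DISK = N'D:\Backups\Somedbnamehere_full.bak'
WITH NORECOVERY, REPLACE;

-- Then each transaction log backup in sequence (we had ~160 of these)
RESTORE LOG [Somedbnamehere] FROM DISK = N'D:\Backups\Somedbnamehere_log_001.trn' WITH NORECOVERY;
RESTORE LOG [Somedbnamehere] FROM DISK = N'D:\Backups\Somedbnamehere_log_002.trn' WITH NORECOVERY;
-- ... repeat for the remaining log backups ...

-- Finally bring the database online
RESTORE DATABASE [Somedbnamehere] WITH RECOVERY;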
Thanks for the suggestions, Mr. Brian Gale! We did indeed restore the DB on a test server first to check that everything was okay and that we did not have any corruption from the log backups or other issues... 😉
December 7, 2021 at 2:16 pm
My opinion - every time I've seen an "unable to delete" error, or unable to copy, move, archive, etc., it has been due to disk failure. I would run chkdsk to make sure the disk is not faulty, and make sure to run the check that scans for bad sectors (chkdsk /R). If bad sectors appear, I would replace the disk ASAP! Even if the disk comes back clean, I would be suspicious of the disk or the storage controller.
Database corruption is usually a sign of some form of hardware failure.
EDIT - I was just re-reading the original post, and it definitely says there is a bad block. I would replace that disk ASAP. One bad block can very quickly turn into more, and once you start getting bad blocks, it isn't long before the disk is full of them and unusable.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.