January 15, 2009 at 3:58 am
hello all
I'n getting errors trying to access a database:
The operating system returned error 21(error not found) to SQL Server during a read at offset 0x00000000296000 in file 'i:\MSSQL\DATA\xxxxxxxx_Data.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
When I ran a DBCC CHECKDB with physical_only I get the same error again. I'm assuming I've got some bad sectors on this disk as we recently had a multiple disk failure on this array.
Any chance of recovering this? I doubt it.
thanks
Alan Cranfield
thanks
SQL_EXPAT
January 15, 2009 at 4:26 am
Google tells me that Operating System error 21 is "The device is not ready." It's saying that the file can't be accessed at all, not that a couple sectors are damaged
Check the server, make sure that all of the drives that the SQL files are on are accessible and present.
Do not detach that database. Do not restart SQL or reboot the server.
Do you have good backups of that database?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 15, 2009 at 5:32 am
Hi Gail
The I: drive is there and accesible and I can write a file to it. I can even change the name of the data and log files so it looks like SQL hasn't attached to the files:
xp_fixeddrives
GO
xp_cmdshell 'dir i:\mssql\data\'
drive MB free
----- -----------
C 7469
F 80001
G 1065908
I 1146876
output
------------------------------------
Volume in drive I is I_Drive
Volume Serial Number is 94F5-10F0
NULL
Directory of i:\mssql\data
NULL
15/01/2009 12:25 .
15/01/2009 12:25 ..
09/01/2009 23:30 1,572,864,000,000 EQMCLOGTower_Data.mdf
15/11/2008 12:00 524,288,000,000 EQMCLOGTower_Log.ldf
2 File(s) 2,097,152,000,000 bytes
2 Dir(s) 1,202,587,013,120 bytes free
NULL
(12 row(s) affected)
I was thinking of renaming the files, dropping the database and then atttempting to attach the files.
There is the following warning message in the disk controller for this drive. Could be related. I'm thinking that maybe the SQL engine came up before the drive was ready. We had to cold boot this server as well as all the drive shelves after our disk failures last week. This database is the only one that is on the I: drive.
cheers
Alan
thanks
SQL_EXPAT
January 15, 2009 at 5:33 am
oops forgot the put the array controller warning:
"The current array controller had valid data stored in its battery backed write cache the last time it was reset or was powered up. This indicates that the system may not have been shut down gracefully. The array controller has automatically written, or has attempted to write, this data to the drives. This message will continue to be displayed until the next reset or power-cycle of the array controller. "
tks
Alan
thanks
SQL_EXPAT
January 15, 2009 at 6:21 am
-- Cranfield (1/15/2009)
I was thinking of renaming the files, dropping the database and then atttempting to attach the files.
No! Absolutely not. If you do that, there's a good chance that the database will not come back. Do not rename the files, do not delete the database, do not detach the database.
I'm thinking that maybe the SQL engine came up before the drive was ready. We had to cold boot this server as well as all the drive shelves after our disk failures last week. This database is the only one that is on the I: drive.
It could be, but in that case I would expect the DB to be suspect. Is it?
Can you check the error log, see if there are any more messages relating to that database, especially from when the system started up.
Do you have a backup?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 15, 2009 at 7:06 am
as far as I can see from the SQL event log this database recovered successfully after the reboot after the failed disks:
Message
Recovery of database 'EQMCLOGRestore' (8) is 100% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
Message
2 transactions rolled back in database 'EQMCLOGRestore' (8). This is an informational message only. No user action is required.
Message
Recovery is writing a checkpoint in database 'EQMCLOGRestore' (8). This is an informational message only. No user action is required.
Message
Recovery is complete. This is an informational message only. No user action is required.
then an hour later I see loads of these:
Message
Error: 8966, Severity: 16, State: 2.
Message
Unable to read and latch page (1:169729024) with latch type SH. 21(error not found) failed.
A DBCC ran successfully after that:
Message
DBCC CHECKCATALOG (EQMCLOGRestore) WITH no_infomsgs executed by CONNECT\SQLservice found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 3 seconds.
Then the database 'disappeared' and any attempted access results on this message in QA and the Error Log:
Message
The operating system returned error 21(error not found) to SQL Server during a read at offset 0x000002ece7a000 in file 'i:\MSSQL\DATA\EQMCLOGTower_Data.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
The weird thing is I can see the DB in SSMS and also see some of the tables (but not all???). If I right click on the database and select properties then I get the above error again in a message box. Same if I try to connect to that database.
We dont have a backup of this database! But we can rebuild it and bulk insert all the data from flat files, which is a pain.
cheers
Alan
thanks
SQL_EXPAT
January 15, 2009 at 7:14 am
So the drive was accessible when SQL started, but something happened a little later. Can you check the windows event log around the same time as the latch errors started? See if there's anything to hint what happened to start the problems
Right now, any restart of SQL, the server or the database will probably result in the database going suspect, which is significantly harder to fix.
It shouldn't be the case, but can you check the permissions on the database files, make sure that the SQL service account has full rights on those files.
Why no backup?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 15, 2009 at 8:26 am
I can see these controller errors around the time we lost the database:
SCSI bus fault occurred on Storage Box box 0, , Port 1 of
Array Controller in slot 1.
This may result in a "downshift" in transfer rate for one or more hard drives on the bus.
SCSI bus fault occurred on Storage Box box 0, , Port 0 of
Array Controller in slot 1.
This may result in a "downshift" in transfer rate for one or more hard drives on the bus.
On the question of why we dont have backups you could say we have made the executive decision to not backup certain databases. Reasons usually range from a) we dont have the space on disk/tape or tape b) we don't have a long enough maintenance window c) we run LIVE/LIVE e) we can rebuild from script f) we forgot (most likely :D). I can assure we do have backups for our critical databases. We have been caught out on occasion. With 300 SQL servers we tend to have the odd database not backed up - but I'm working on it. In this case the database being down is not critical as its a database thats used to restore archived data for historical queries. The other databases on this box are critical/live databases.
Appreciate the replies
cheers
Alan
thanks
SQL_EXPAT
January 15, 2009 at 8:41 am
You've got corrupt pages in some of the critical system tables - this is why CHECKDB fails right away. These were most likely cause by the bad RAID controller you mentioned earlier. Without a backup there's nothing you can do except to create a new database and export as much data as you can into the new one.
Thanks
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
January 15, 2009 at 8:42 am
-- Cranfield (1/15/2009)
SCSI bus fault occurred on Storage Box box 0, , Port 1 ofArray Controller in slot 1.
This may result in a "downshift" in transfer rate for one or more hard drives on the bus.
SCSI bus fault occurred on Storage Box box 0, , Port 0 of
Array Controller in slot 1.
This may result in a "downshift" in transfer rate for one or more hard drives on the bus.
I don't know enough about drives to know what those mean. Any server people there that you can bounce this off. Especially around how to resolve it.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 15, 2009 at 10:15 am
Thanks, Gail and Paul
I will prepare for the long and arduous rebuild of the database..
I think we've learned our lesson now with spanned volumes i.e. dont use them.
cheers
Alan Cranfield
thanks
SQL_EXPAT
January 21, 2009 at 8:38 am
Your system table is corrupt.
Restore from backup or export as much data as possible 🙁
January 21, 2009 at 9:00 am
actually, rebooting the server and disk shelves brought the database back.. Ha!
--cranfield
thanks
SQL_EXPAT
January 21, 2009 at 4:57 pm
ok - but you've got something going wrong in the IO subsystem that you need to fix - so next time it happens the corruption isn't permanent and debilitating.
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
January 22, 2009 at 1:45 am
Agreed. We've now migrated the DB from that old DL740/6400 to a new DL580 G5 with P800 controllers.
cheers
Alan
thanks
SQL_EXPAT
Viewing 15 posts - 1 through 15 (of 16 total)
You must be logged in to reply to this topic. Login to reply