May 14, 2009 at 9:59 pm
Well all the errors look to be caused by the 6 pages that the IO subsystem trashed (the 6 8966 errors). This was definitely caused by the IO subsystem - the errors are that the disk sectors themselves are corrupt at the disk level. No possibility of this being anything other than the IO subsystem.
So - choices. You need to get the database onto a different IO subsystem, and then repair it or restore another copy of it. Your choice is going to depend on your RTO and RPO. If you don't have backups, you're going to have to run repair which is going to delete a bunch of data because of the corrupt pages. You're also going to need to figure out what's wrong with the IO subsystem.
Btw - forgot to say - for very urgent problems, forums aren't the best way to go as you need to wait for a response. Calling Product Support is the fastest way to go.
Let us know what you choose to do!
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
May 15, 2009 at 6:09 am
Thanks a lot paul,
We did the same, we called Microsoft support. Problem resolved. Following were our steps to resolve the problem.
1. We tried running checkdb against the corrupted database(database a) which resides on D:, but it was running so long.
2. We waited for about more than 1 day and finally we got some output.
3. We didn't want to mass with our current database, and we were also not able to do backup of the database thus we shutdown the sql server and tried to copy data file from d: to another h:, but it failed.
4. We used Tivoli storage manager to copy that file from d: to tape, but that also failed.
5. Then we tried to use a third party tool called "Secure Copy" to copy that file from d: to h: and finally that worked. We also copied our log file to h: too.
5. we took sql server up.
6. We tried to attach that files on h: as a new database(database a_1) name but it failed.
7. We have created a new database(database a_2) on h: with different file name. then we took sql server down and renamed those files of database a_1 on h:, same as database a_2 on h:.
8.We start the sql server then and database a_2 went to suspect mode
9.We put database a_2 in emergency mode.
10. Thus we had copy of database a (which is called a_2)
11. As per recommendation by microsoft, we tried to run checkdb with repair_allow_data_loss on database a_2 but it was taking so long..
12. Thus we have cancelled that process and tried to go table by table.
13. we ran checktable against all the tables one by one and found four corrupted tables.
14. Among four, two of them we found the corruption in nonclustered indexes thus we ran rebuild index on them and that fixed it.
15. one another table was not much important for business and our application owners told that they can rebuild it without any problem thus we have truncated that table and ran checktable again on it to confirm.
16. on the last corrupted table we ran checktable with repair allow data loss option and that fixed it...we lost 66 out of 80 million rows on that table..
17. As a precaution, we switched drives from d: to h:.i.e. we shutdown sql services copy all data on d: and moved to h: and then they renamed our h: to d:. then we start sql services.
18. we renamed database a_2 as a and deleted our old corrupted a database.
19. We are in good shape now.
--
Thanks all for your help.
One more question to all Gurus...
As a root cause analysis, we ran checkdisk against h:(which was before our d:) but we found 0 errors. Though our checkdb gave us CRC errors which can be caused by bad block in hardware...Now, as our checkdisk gave 0 errors, how can we do our root cause for CRC errors??
Any help will be highly appreciated.
Thanks again.
May 15, 2009 at 9:19 am
Glad you got it sorted with minimal data loss, and thanks for summarizing.
chkdsk - nothing to say the IO problems aren't transient, although it's strange that the errors have gone. I'd also run disk-level diagnostics rather than just chkdsk, as chkdsk doesn't stress the IO subsystem.
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
May 17, 2009 at 3:31 pm
We are currently doing RCA on that problem.
What kind of disk level diagnostic are you talking about? We talk to our server and storage admins and they said that they can't find any problem in disk.
What kind of problem do you think might happened? How can I do diagnostics for that.
Once again here's the error we got when we were trying to backup that database.
2009-05-08 03:33:37.82 Backup Error: 18210, Severity: 16, State: 1.
2009-05-08 03:33:37.82 Backup BackupIoRequest::WaitForIoCompletion: read failure on backup device 'D:\Microsoft SQL Server Data\MSSQL$SQL1\Data\LES.mdf'. Operating system error 23(Data error (cyclic redundancy check).).
2009-05-08 03:33:37.82 Backup Error: 3041, Severity: 16, State: 1.
2009-05-08 03:33:37.82 Backup BACKUP failed to complete the command BACKUP DATABASE LES. Check the backup application log for detailed messages.
2009-05-08 03:33:37.82 Backup Error: 18210, Severity: 16, State: 1.
2009-05-08 03:33:37.82 Backup BackupIoRequest::WaitForIoCompletion: read failure on backup device 'D:\Microsoft SQL Server Data\MSSQL$SQL1\Data\LES.mdf'. Operating system error 23(Data error (cyclic redundancy check).).
2009-05-08 03:33:37.83 spid104 Error: 18210, Severity: 16, State: 1.
2009-05-08 03:33:37.83 spid104 BackupMedium::ReportIoError: write failure on backup device 'TDPSQL-00000BAC-0000'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2009-05-08 03:33:37.85 spid104 Error: 18210, Severity: 16, State: 1.
2009-05-08 03:33:37.85 spid104 BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device 'TDPSQL-00000BAC-0000'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2009-05-08 03:33:37.85 Backup Error: 18210, Severity: 16, State: 1.
Thanks a lot in advance for any help.
May 18, 2009 at 8:11 am
Get your disk admins to check out the SQLIOSim tool which will simulate a very heavy IO workload and tell them when errors occur.
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
Viewing 5 posts - 31 through 34 (of 34 total)
You must be logged in to reply to this topic. Login to reply