June 6, 2014 at 2:29 am
Hi,
I have a 2 node Windows 2008 R2 SP1 cluster with a single SQL Server 2008 R2 SP2 + CU9 instance installed. It's on VMWARE 5.1. The storage is on virtual disks on a 3PAR SAN datastore. (the Windows guys couldn't do RDM).
With increasing frequency, I'm seeing "Operating system error 170" errors at the primary site and DR. They started happening at similar times.
When DR has the problem, the database mirrors for the affected database go suspended. It randomly affects each of the 5 databases in this instance. We also get a short stack dump. I see several single read or write failures (825), before we inevitably get an 823 error.
When production has the problem, the SQL cluster resource restarts. We don't get a stack dump, and the mirroring resumes okay.
Windows doesn't see the read/write problem. There's nothing in the event logs other than what SQL Server reports. The cluster doesn't go over/restart or anything, just the SQL Server resource. The SAN admin has checked 3PAR logs at production and DR and can find no errors. There are also several other servers sharing the same disks.
So far I've checked:
DBCC CHECKDB (we run a full one every night on all databases) - no errors found
Anti Virus - all exclusions are set
VSS backups - none happening on these servers
Disk keeper - not turned on on these servers
Loaded filter drivers - checked online to see if there are any known conflicts, but can't see any
Presented 3 new drives from a new datastore at production, moved the data/logs/tempdb drives to these new drives. Other drives stayed on the old datastore. This seemed to have worked, but the errors returned last night.
Run a repair on VM tools on each vm.
The 2 SANs are not synched in anyway - we only have database mirroring. They're in completely different geographic locations. DR is identically configured to PROD, but the VMs weren't cloned or anything.
Here's some of the errors:
Prod side:
BackupIoRequest::ReportIoError: write failure on backup device 'T:\MSSQL10_50.sqlvault\MSSQL\Backup\UserDBs\EVVSGxxxxx\20140605_200752_EVVSGxxxxx_1_11.BAK'. Operating system error 170(failed to retrieve text for this error. Reason: 15105).
'TCP://xxxxx.xx.xxxxxxx.co.uk:5022', the remote mirroring partner for database 'EVVSGxxxxxxxx', encountered error 823, status 6, severity 24. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.
DR side:
Error: 823, Severity: 24, State: 6.
Database mirroring will be suspended. Server instance 'xxxxxx\xxxxxx' encountered error 823, state 6, severity 24 when it was acting as a mirroring partner for database 'EVHexxxx'. The database mirroring partners might try to recover automatically from the error and resume the mirroring session. For more information, view the error log for additional error messages.
DBCC CHECKDB results for database reported here:
DBCC CHECKDB (EVVSGxxxxxxxxx) WITH no_infomsgs executed by xxxxx\xxxxx found 0 errors and repaired 0 errors. Elapsed time: 0 hours 7 minutes 0 seconds. Internal database snapshot has split point LSN = 00009429:000e97b5:0004 and first LSN = 00009429:000decf2:0001.
Thanks for taking the time to look.
If any more info is required let me know.
Andrew
June 6, 2014 at 2:46 am
Andrew
The backup failure on T:\MSSQL10_50.sqlvault\MSSQL\Backup\UserDBs\EVVSGxxxxx\20140605_200752_EVVSGxxxxx_1_11.BAK - is that a backup that you have scheduled? At what intervals does this error occur?
John
June 6, 2014 at 2:48 am
Hi John,
That's a scheduled full back we take daily at 8pm. It's a script I wrote that we use on all our SQL Server 2008R2 servers, clustered or not. However, we get I/O failures unrelated to backups too. Sorry, should have been clearer on that one - it's just showing the type of error we get.
thanks,
Andrew
June 6, 2014 at 2:51 am
Looks like SAN issue.
---------------------------------------------------
"Thare are only 10 types of people in the world:
Those who understand binary, and those who don't."
June 6, 2014 at 2:59 am
Andrew
Do you have any other examples of Operating system error 170, please? Whenever you get an operating system error xxx message, go to a command prompt and type [font="Courier New"]net helpmsg xxx[/font]. In this case, you'll see the error means The requested resource is in use. I would guess that you're trying to back up more than one database to the same file at the same time.
Of course, the above isn't going to help you with the other problems you're having.
John
June 6, 2014 at 3:00 am
I would normally be inclined to agree, but it's happening on 2 different SANs (not interconnected), the SAN logs are clean and Windows doesn't seem to notice it
June 6, 2014 at 3:04 am
Andrew
Yes, you will get exactly the same problem if you're using the same script or maintenance plan or whatever to back up your databases. Like I said, this is only going to solve the OS error 170 that you posted.
By the way, I strongly recommend that you back up your databases to a drive separate to that on which your SQL Server binaries, system databases, log files and so on are held.
John
June 6, 2014 at 3:09 am
Hi,
I only get OS error 170. You've just reminded me, I ran procmon and handle.exe to monitor the server and caught the "resource is in use" error, but it didn't show what the other process was. I'd originally suspected AV scanning the file(s), but all SQL extensions and paths are definitely excluded, and auto close isn't on for any of these databases.
It's not just backups though - it happens at random times through the day. This is a dedicated SQL box with only 5 db's on.
Cheers,
Andrew
June 6, 2014 at 3:10 am
John,
The T: drive is dedicated to SQL backups. They're then shipped off to a secure location.
Thanks,
Andrew
June 6, 2014 at 3:14 am
adb2303 (6/6/2014)
I only get OS error 170.
OK, then this is only going to help with the one you posted. Do you have any other examples, please?
John
June 6, 2014 at 3:53 am
Here's a few more
At prod:
05/06/2014 15:41:15
LogWriter: Operating system error 170(failed to retrieve text for this error. Reason: 15105) encountered.
Write error during log flush.
Error: 9001, Severity: 21, State: 4.
The log for database 'EVHxxxx' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
Database EVHxxxx was shutdown due to error 9001 in routine 'XdesRMFull::Commit'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.
Recovery completed for database EVHxxxx (database ID 7) in 1 second(s) (analysis 60 ms, redo 602 ms, undo 56 ms.) This is an informational message only. No user action is required.
05/06/2014 16:07:19
'TCP://xxxxxxxdr.xx.xxxxxx.co.uk:5022', the remote mirroring partner for database 'EVxxxxxx', encountered error 823, status 6, severity 24. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.
Error: 1474, Severity: 16, State: 1.
05/06/2014 19:07:31
'TCP://xxxxxxxdr.xx.xxxxxx.co.uk:5022', the remote mirroring partner for database 'EVVSGxxxxxxxxx', encountered error 823, status 6, severity 24. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.
5/06/2014 19:37:32
'TCP://xxxxxxxdr.xx.xxxxxxx.co.uk:5022', the remote mirroring partner for database 'EVxxxxxx', encountered error 823, status 6, severity 24. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.
DR logs:
05/06/2014 22:37:49
Error: 823, Severity: 24, State: 6.
The operating system returned error 170(failed to retrieve text for this error. Reason: 15100) to SQL Server during a write at offset 0x00000041768e00 in file 'E:\MSSQL10_50.xxxxT\MSSQL\Logs\EVxxxxxx.ldf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
(there's then 2 more identical messages before a short stack dump and SQL restart)
05/06/2014 22:07:46
Error: 823, Severity: 24, State: 6.
The operating system returned error 170(failed to retrieve text for this error. Reason: 15100) to SQL Server during a write at offset 0x00000036f84c00 in file 'E:\MSSQL10_50.xxxxQL\Logs\EVxxx.ldf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
(there's then 2 more identical messages before a short stack dump and SQL restart)
That's the pattern at the moment.
Cheers,
Andrew
June 6, 2014 at 3:58 am
Andrew
To me, that looks serious enough to raise a case with Microsoft and send them the stack dump. Other than that, maybe someone who knows more than I do about SAN problems and database corruption will weigh in.
John
June 6, 2014 at 4:00 am
Thanks John, I've just done that very thing. Will update later incase someone else finds it helpful
August 5, 2014 at 8:32 am
Any update? I'm experiencing a very similar issue.
Thanks
August 7, 2014 at 1:30 am
Not a satisfactory resolution, no. Microsoft recommended moving off the clustered VM to a standalone. We haven't had any more disk related problems, despite the server build/sql server version/datastore, etc. being identical. The only difference being, the new one isn't clustered. The root cause wasn't identified.
Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply