April 19, 2013 at 7:30 am
Hi All,
I had an unexpected automatic failover from the principal to the mirror server.
We saw a network spike from 32MB to 1117MB during that period in the reports, but a spike like that is normal during business hours.
Mirroring is configured in high-safety mode (synchronous) with automatic failover and a witness server.
One task running at that time was the copy of a 1.8GB compressed backup to the principal server.
Could that copy have caused the network spike? We do this all the time, so I don't expect it to be the issue.
I could not find any errors in the log that point to a specific cause.
The errors we found are below:
I would like to know exactly why the failover happened. Can someone please help me analyse the root cause of this failover?
Error 1:
The command failed because the database mirror is busy. Reissue the command later.
Error 2:
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\templogfile\templog.ldf] in database [tempdb] (2). The OS file handle is 0x0000000000000514. The offset of the latest long I/O is: 0x000000000b5200
Error 3:
The mirroring connection to "TCP://XXXXXXX:5022" has timed out for database "dbname" after 10 seconds without a response. Check the service and network connections.
April 19, 2013 at 8:16 am
The spike may have delayed communication between the principal and witness servers. You may want to increase the failover timeout from 10 seconds to 30 seconds. We had to do this at a previous employer where I had set up database mirroring, as we had issues with our network. It was not the most stable of networks and we had periodic glitches during high-volume times.
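For reference, a minimal T-SQL sketch of that timeout change, assuming it is run on the current principal. The database name [dbname] is only the placeholder taken from the redacted error message above, and 30 is the value suggested in this reply, not a tested recommendation for this environment.

```sql
-- Check the current mirroring role, state, and partner timeout (in seconds)
-- for every mirrored database on this instance.
SELECT DB_NAME(database_id)        AS database_name,
       mirroring_role_desc,
       mirroring_state_desc,
       mirroring_safety_level_desc,
       mirroring_witness_state_desc,
       mirroring_connection_timeout
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;

-- Raise the partner timeout from the default 10 seconds to 30 seconds.
-- [dbname] is a placeholder; run this on the principal.
ALTER DATABASE [dbname] SET PARTNER TIMEOUT 30;
```

Re-running the first query afterwards should show the new value in mirroring_connection_timeout.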
April 19, 2013 at 11:19 am
But increasing the timeout might not give us the actual root cause of why it happened.
I am looking more into the I/O error we received; it looks like a disk I/O issue. I ran Perfmon and saw that Avg. Disk sec/Transfer is >0.015 seconds during the file copy.
Can you direct me on this? Thanks.
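For anyone following up on the disk latency angle, here is a rough sketch of how per-file I/O stalls could be checked from inside SQL Server, alongside the Perfmon counter. It assumes VIEW SERVER STATE permission, and the figures are cumulative since the instance last started, so they average over the whole uptime rather than isolating the file-copy window.

```sql
-- Average read/write latency per database file, from cumulative stall counters.
-- Values cover the time since the last SQL Server restart, not just the incident.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms * 1.0 / vfs.num_of_reads
       END AS avg_read_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms * 1.0 / vfs.num_of_writes
       END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id
ORDER BY avg_write_ms DESC;
```

Taking one snapshot before the file copy and one after, then comparing the differences, would give a clearer picture of what the copy does to templog.ldf and the mirrored database's log file.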
April 19, 2013 at 11:21 am
One more thing to add: all three servers (principal, mirror and witness) are virtual.
April 19, 2013 at 11:35 am
muthyala_51 (4/19/2013)
But increasing the timeout might not give us the actual root cause of why it happened. I am looking more into the I/O error we received; it looks like a disk I/O issue. I ran Perfmon and saw that Avg. Disk sec/Transfer is >0.015 seconds during the file copy.
Also, during a copy of a file of around 4GB to one of the disk drives, SQL Server hung and everything was frozen for a couple of minutes. The databases on the mirror server showed as (Disconnected / In Recovery) and returned to a normal state after a few minutes. Can you direct me on this? Thanks.
Root cause? Your principal and witness servers were unable to communicate during the timeout period, which resulted in the witness determining that the principal server was down and a failover to the mirror being initiated.
Why? Not enough network bandwidth to communicate due to the large data transfer(s) occurring.
Once again, I had this issue at a previous employer; the resolution was to increase the timeout period before a failover occurred. This solved the issue of our somewhat unstable network causing a failover when there really wasn't a problem. Our automatic failover worked fine when there were real problems with our servers.
April 19, 2013 at 1:51 pm
Lynn Pettis is right. But one variable here is the virtualization of SQL Server. If you have vMotion enabled and the principal or mirror VM is moved because of memory/CPU ballooning, this can happen.
I have seen this in our environment, and we have now disabled DRS for the SQL VMs for that reason alone.