April 4, 2007 at 7:54 am
Two-node Active/Active MS SQL Server cluster
Windows Server 2003 Enterprise Edition, SP1
SQL Server 2000 SP4
Once or twice a month, one of the cluster nodes unexpectedly tries to restart and hangs. Sometimes it is the first node, sometimes the second. This time node B hung, and its SQL error log doesn't show anything useful. The node A SQL error log:
2007-04-04 13:32:17.83 spid2 LogWriter: Operating system error 21(error not found) encountered.
2007-04-04 13:32:17.85 spid2 Write error during log flush. Shutting down server
2007-04-04 13:32:17.85 spid18 Error: 9001, Severity: 21, State: 4
2007-04-04 13:32:17.85 spid18 The log for database 'ABC' is not available..
2007-04-04 13:32:17.86 spid18 Error: 9001, Severity: 21, State: 1
2007-04-04 13:32:17.86 spid18 The log for database 'ABC' is not available..
2007-04-04 13:32:17.88 spid18 Database 'ABC' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
2007-04-04 13:32:17.88 spid18 Database 'ABC' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
2007-04-04 13:32:17.97 spid18 Error: 3314, Severity: 21, State: 4
2007-04-04 13:32:17.97 spid18 Error while undoing logged operation in database 'ABC'. Error at log record ID (956831:19771:58)..
2007-04-04 13:32:18.24 spid205 BackLinkLogBlockReadAheadAsync: Operating system error 21(error not found) encountered.
2007-04-04 13:32:23.68 spid2 LogWriter: Operating system error 21(error not found) encountered.
2007-04-04 13:32:23.68 spid2 Write error during log flush. Shutting down server
The thing is, when one node hangs, the cluster resource groups located on that node don't fail over to the other node, and the Cluster service becomes unavailable too.
I found an interesting MS article, but I'm not sure whether it applies to our case:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;838765
We have 8 GB of RAM on each server and use the following switches in the boot.ini file:
"/fastdetect /3GB /PAE /NoExecute=OptOut"
Does anyone know what is happening here?
April 4, 2007 at 8:36 am
System error 21 indicates that a device was not ready when the LogWriter tried to write to the log. Here is a link for the system errors. To find more information on your problem, you could search for the key phrase "the device is not ready".
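To make the lookup concrete, here is a minimal Python sketch that pulls the "Operating system error N" codes out of error log lines and maps them to their Win32 meaning. The function name `explain_os_errors` and the small `WIN32_ERRORS` table are made up for illustration; the table is only a hand-picked subset of the Windows system error codes, not a complete list.

```python
import re

# Hand-picked subset of Win32 system error codes (illustrative, not complete).
WIN32_ERRORS = {
    21: "ERROR_NOT_READY: The device is not ready.",
    23: "ERROR_CRC: Data error (cyclic redundancy check).",
    112: "ERROR_DISK_FULL: There is not enough space on the disk.",
    1117: "ERROR_IO_DEVICE: The request could not be performed because of an I/O device error.",
}

# Matches the phrase SQL Server writes to its error log, e.g.
# "LogWriter: Operating system error 21(error not found) encountered."
OS_ERROR_RE = re.compile(r"Operating system error (\d+)")


def explain_os_errors(log_lines):
    """Return (timestamp, code, meaning) for every line that reports an OS error."""
    results = []
    for line in log_lines:
        match = OS_ERROR_RE.search(line)
        if match is None:
            continue  # line does not mention an operating system error
        code = int(match.group(1))
        # The first two whitespace-separated fields are the date and time.
        timestamp = " ".join(line.split()[:2])
        meaning = WIN32_ERRORS.get(code, "unknown (not in this table)")
        results.append((timestamp, code, meaning))
    return results


# Lines copied from the node A error log posted above.
sample = [
    "2007-04-04 13:32:17.83 spid2 LogWriter: Operating system error 21(error not found) encountered.",
    "2007-04-04 13:32:17.85 spid2 Write error during log flush. Shutting down server",
    "2007-04-04 13:32:18.24 spid205 BackLinkLogBlockReadAheadAsync: Operating system error 21(error not found) encountered.",
]

for row in explain_os_errors(sample):
    print(row)
```

Both matching lines resolve to "device not ready", which points at the storage path (disk, HBA, SAN) rather than at SQL Server itself.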
April 5, 2007 at 5:57 am
Hi,
It looks to me like there is a problem with the relation between the SQL cluster group and the physical drives.
Is each physical disk a member of only one cluster group?
I once had a case where tempdb and the error logs were on a disk that somehow later became a member not of the SQL Server cluster group but of the Backup Exec cluster group. When the Backup Exec group went offline or switched nodes, SQL Server lost the disk, and so lost tempdb.
That configuration was created by a Backup Exec cluster specialist...
regards
karl
April 10, 2007 at 11:41 pm
Hi,
Yes, they are. Each disk belongs to only one cluster group. As I remember, you cannot add the same disk to different cluster resource groups.
Over the weekend we updated the firmware on these servers and on our SAN. I hope this helps. I'll come back here if the problem recurs.
Thanks
April 5, 2008 at 11:44 am
Did this error go away? If yes, how?
Thanks in advance.
April 10, 2008 at 6:18 am
I'm interested in this error too. We have it in our SAP cluster.
In our case, I suppose the disk fails because of a chain of failures starting from a SAP failure.
April 10, 2008 at 6:23 am
Well, a lot of time has passed. We solved the problem in the end. The cause was firmware. After updating it a few times, the problem went away.
April 10, 2008 at 9:02 am
Thank you for the clue.
But we recently updated all of the server firmware...
Was it the firmware of the FC adapter?