SQL Cluster stops working

Question

Vidas Zabinskas

Hall of Fame

Points: 3258
April 4, 2007 at 7:54 am

#68903

Two-node Active/Active MS SQL Server cluster
Win2k3 enterprise edition, SP1
SQL2k SP4
Once or twice a month, one of the cluster nodes unexpectedly tries to restart and hangs. Sometimes it is the first node, sometimes the second. This time node B hung, and its SQL error log doesn't show anything useful. The node A SQL error log:
2007-04-04 13:32:17.83 spid2     LogWriter: Operating system error 21(error not found) encountered.
2007-04-04 13:32:17.85 spid2     Write error during log flush. Shutting down server
2007-04-04 13:32:17.85 spid18    Error: 9001, Severity: 21, State: 4
2007-04-04 13:32:17.85 spid18    The log for database 'ABC' is not available..
2007-04-04 13:32:17.86 spid18    Error: 9001, Severity: 21, State: 1
2007-04-04 13:32:17.86 spid18    The log for database 'ABC' is not available..
2007-04-04 13:32:17.88 spid18    Database 'ABC' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
2007-04-04 13:32:17.88 spid18    Database 'ABC' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
2007-04-04 13:32:17.97 spid18    Error: 3314, Severity: 21, State: 4
2007-04-04 13:32:17.97 spid18    Error while undoing logged operation in database 'ABC'. Error at log record ID (956831:19771:58)..
2007-04-04 13:32:18.24 spid205   BackLinkLogBlockReadAheadAsync: Operating system error 21(error not found) encountered.
2007-04-04 13:32:23.68 spid2     LogWriter: Operating system error 21(error not found) encountered.
2007-04-04 13:32:23.68 spid2     Write error during log flush. Shutting down server
The thing is that when one node hangs, the cluster resource groups located on that node don't fail over to the other node, and the Cluster service becomes unavailable too.
I found one interesting MS article, but I'm not sure whether it applies to our case:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;838765
We have 8 GB of RAM on each server and use the following switches in the boot.ini file:
"/fastdetect /3GB /PAE /NoExecute=OptOut"
Does anyone know what is happening here?

SQL ORACLE One Orange Chip Points: 27807 · Answer 1

System error 21 indicates that a device was not ready when the LogWriter tried to write to the log. Here is a link to the list of system error codes. To find more information on your problem, you could search for the key phrase "the device is not ready".

http://msdn2.microsoft.com/en-us/library/ms681382.aspx
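
To make the lookup concrete, here is a minimal Python sketch that pulls the operating system error codes out of SQL Server error log lines like the ones in the question and attaches the Windows description. The table is a small hand-picked subset of the Windows system error codes, and the function name `explain_os_errors` is made up for this illustration; it is not part of any SQL Server or Windows tool.

```python
import re

# Hand-picked subset of Windows system error codes (see the link above).
# Extend as needed; this is not a complete table.
WIN32_ERRORS = {
    5: "ERROR_ACCESS_DENIED: Access is denied.",
    21: "ERROR_NOT_READY: The device is not ready.",
    23: "ERROR_CRC: Data error (cyclic redundancy check).",
    1117: "ERROR_IO_DEVICE: The request could not be performed "
          "because of an I/O device error.",
}

# Matches the phrase SQL Server writes, e.g. "Operating system error 21(...)".
OS_ERROR = re.compile(r"Operating system error (\d+)")


def explain_os_errors(log_lines):
    """Yield (timestamp, code, description) for each line naming an OS error."""
    for line in log_lines:
        match = OS_ERROR.search(line)
        if match is None:
            continue
        code = int(match.group(1))
        # The first two whitespace-separated fields are the date and time.
        timestamp = " ".join(line.split()[:2])
        yield timestamp, code, WIN32_ERRORS.get(code, "not in this table")


if __name__ == "__main__":
    sample = [
        "2007-04-04 13:32:17.83 spid2     LogWriter: Operating system "
        "error 21(error not found) encountered.",
        "2007-04-04 13:32:17.85 spid2     Write error during log flush. "
        "Shutting down server",
    ]
    for timestamp, code, description in explain_os_errors(sample):
        print(timestamp, code, description)
```

The "(error not found)" text in the log only means SQL Server could not resolve the message itself; the numeric code is still the standard Windows one.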

Karl Klingler SSCertifiable Points: 5869 · Answer 2

Hi,

Looks to me like there is a problem with the relationship between the SQL cluster group and the physical drives.

Is each physical disk a member of only one cluster group?

I once had a case where tempdb and the error logs were on a disk that somehow ended up as a member of the Backup Exec cluster group instead of the SQL Server cluster group. When the Backup Exec group went offline or switched nodes, SQL Server lost the disk, and so lost tempdb.

That configuration was created by a Backup Exec cluster specialist...
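
A rough way to run this check is to write down which disks the SQL Server instance needs and which cluster group owns each disk, then compare the two. The Python sketch below assumes you have collected that information by hand (for example from Cluster Administrator); the function name, drive letters, and group names are all hypothetical.

```python
def misplaced_disks(required_disks, disk_to_group, sql_group):
    """Return {disk: owner} for required disks not owned by sql_group.

    An owner of None means the disk is not in any cluster group.
    """
    problems = {}
    for disk in required_disks:
        owner = disk_to_group.get(disk)
        if owner != sql_group:
            problems[disk] = owner
    return problems


if __name__ == "__main__":
    # Hypothetical inventory: T: holds tempdb but sits in the wrong group.
    required = ["S:", "L:", "T:"]
    ownership = {
        "S:": "SQL Group A",
        "L:": "SQL Group A",
        "T:": "Backup Exec Group",
    }
    print(misplaced_disks(required, ownership, "SQL Group A"))
```

Any disk it reports will go away from SQL Server whenever its owning group goes offline or moves, which is what happened in the case above.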

regards

karl

Best regards
karl

Vidas Zabinskas Hall of Fame Points: 3258 · Answer 3

Hi,

Yes, they are. Each disk belongs to only one cluster group. As far as I remember, you cannot add the same disk to different cluster resource groups.

Over the weekend we updated the firmware on these servers and on our SAN. I hope this will help. I'll come back here if the problem happens again.

Thanks

Hitesh-335315 Old Hand Points: 346 · Answer 4

Did this error go away? If yes, how?

Thanks in advance.

henry_the_master Valued Member Points: 74 · Answer 5

I'm interested in this error too. We have it in our SAP cluster.

In our case, I suppose the disk fails because of a chain of failures starting from a SAP failure.

Vidas Zabinskas Hall of Fame Points: 3258 · Answer 6

Well, much time has passed. We solved the problem in the end. The cause was firmware. After updating it a few times, the problem went away.

henry_the_master Valued Member Points: 74 · Answer 7

Thank you for the clue.

But we have recently updated all server firmware...

Was it the firmware of the FC adapter?