September 6, 2011 at 2:18 pm
Hi!
Here's a "nice" little situation I've been dealing with for the last couple of weeks.
We (a couple of contractors) installed a "standard" 2-node (active/passive) cluster (SQL Server 2008 Enterprise on Windows Server 2008 Enterprise) as we've done many times before.
This time, the client has a web application that will eventually access hundreds of databases on the SQL cluster, but currently only a few test databases are on it. Even now, with only a couple of testers clicking through the web app, they get random "timeout" or "network-related ..." errors (you know the kind, when a SQL connection cannot be established), but as soon as they hit reload, all is good. The errors show no pattern and appear completely at random.
We've tried all the tricks, fixes and workarounds we know and also found on the internet ... and here's where we are currently:
-The errors show up only when the cluster services (MSDTC and SQL) are on Node 1. When we fail over to Node 2, all is good. (BTW, the two nodes are identical servers, set up identically.)
-Failing over to Node 1 takes considerably longer than failing over to Node 2.
-And I guess we mostly suspect a network "hardware" problem, so we replaced all the Ethernet cables with brand new Cat6 ones. Of course, we're left with trying a different NIC altogether, but I was hoping you guys have some other ideas and/or experiences before we go and recommend spending $400 on a quad-port gigabit NIC.
Thanks, and sorry if all this is confusing; I'll gladly provide more details if you feel it would help with the "diagnosing".
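P.S. In case anyone wants to see exactly what the testers are hitting, here is a rough sketch of the kind of loop that could log the failures from a client machine. It's just an illustration, not what we actually ran: the driver name, the virtual server name (SQLCLUSTER01) and the database (TestDb) are placeholders, and it assumes pyodbc is installed.

    import time
    import pyodbc  # assumes the pyodbc module and a SQL Server ODBC driver are installed

    # Placeholder connection string -- server and database names are hypothetical.
    CONN_STR = ("DRIVER={SQL Server Native Client 10.0};"
                "SERVER=SQLCLUSTER01;DATABASE=TestDb;Trusted_Connection=yes")

    while True:
        started = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            # timeout=5 is the login timeout in seconds
            cn = pyodbc.connect(CONN_STR, timeout=5)
            cn.close()
        except pyodbc.Error as err:
            # Log every transient failure with a timestamp so it can be
            # correlated with cluster, switch or teaming events later.
            print("%s  CONNECT FAILED: %s" % (started, err))
        time.sleep(10)  # one attempt every 10 seconds

Leaving that running for a few hours on each node would at least give timestamps to line up against the switch and cluster logs.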
September 7, 2011 at 2:05 am
What about port configs at the switch level?
How is the storage configured (local, SAN, NAS, etc.)?
During a manual failover, which resources seem to take the longest to come online on Node 1?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
September 7, 2011 at 2:25 am
Thanks for your interest.
Port configs are all set at 1 gig,
and the storage is SAN (EMC Clariion CX700).
During a manual failover, all resources come online pretty much instantly; it's just the SQL service that takes a while (and only when failing over to Node 1).
Also, I forgot to mention in my previous post: all cluster validation tests pass 100%.
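If it would help, something like the sketch below could time exactly how long the instance is unreachable during a failover to each node, so we'd have numbers instead of a gut feeling. The virtual name (SQLCLUSTER01) and port 1433 are assumptions on my part:

    import socket
    import time

    HOST, PORT = "SQLCLUSTER01", 1433  # hypothetical virtual name, default SQL port

    def sql_port_open():
        try:
            with socket.create_connection((HOST, PORT), timeout=2):
                return True
        except OSError:
            return False

    # Start this just before initiating the manual failover.
    down_since = None
    while True:
        if sql_port_open():
            if down_since is not None:
                print("SQL was unreachable for %.1f seconds" % (time.time() - down_since))
                break
        else:
            if down_since is None:
                down_since = time.time()
        time.sleep(1)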
September 7, 2011 at 1:24 pm
I've also seen strange behavior with NIC teaming before, which is why we took that out of the equation too. So yes, we HAD NIC teaming initially, but we removed it. Unfortunately (and obviously) that didn't resolve our issue ...
Thx for pointing that out anyway.
September 7, 2011 at 6:55 pm
I've never had a problem using NIC teaming for the public NICs.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
September 28, 2011 at 11:21 am
It looks like the SQL instance has some delay reaching the SAN storage on Node 1. Check the Fibre Channel or iSCSI connection.
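One quick way to see whether the instance is actually waiting on storage is to compare the I/O stall counters while running on each node. A rough sketch (connection details are placeholders; it just runs a query against sys.dm_io_virtual_file_stats):

    import pyodbc  # assumes pyodbc is available; server name below is a placeholder

    QUERY = """
    SELECT DB_NAME(database_id) AS db,
           file_id,
           num_of_reads, io_stall_read_ms,
           num_of_writes, io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL)
    ORDER BY io_stall_read_ms + io_stall_write_ms DESC;
    """

    cn = pyodbc.connect("DRIVER={SQL Server Native Client 10.0};"
                        "SERVER=SQLCLUSTER01;Trusted_Connection=yes")
    for row in cn.cursor().execute(QUERY):
        # High stall-to-operation ratios while on Node 1 (but not on Node 2)
        # would point at the FC/iSCSI path rather than the LAN.
        print(row)
    cn.close()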
September 28, 2011 at 11:55 am
I'm curious: DTC no longer has to be a clustered resource in SQL 2008; in 2005, yes.
Are both nodes on the same switch?
CEWII
September 28, 2011 at 1:24 pm
Did you check the HBA card?
September 28, 2011 at 2:00 pm
My thinking was that if it were a matter of the connection to the SAN (fibre cable and/or a 'bad' HBA), we would have the problems all the time, not just randomly and infrequently. As for DTC, it is set up as a clustered resource right now.
The servers have their LAN connections going into two switches. I'm not sure exactly how they're spread out, but the sysadmin set them up following standard two-node cluster recommendations, and he has also changed the ports several times, just to rule out the switch ports as a factor.
Thanks for your interest in this. I really hope we can pin this down to something, because it's still bothering us.
September 28, 2011 at 2:07 pm
Could you swap the cables between the two nodes and test? Or swap the NICs?
September 28, 2011 at 2:18 pm
Please post what you discover after you resolve this issue; I'm curious about this too. I have several SQL Server clusters running on physical boxes, and I am going to build a new one on VMware shortly.
September 28, 2011 at 3:00 pm
Latest info:
Since we also strongly suspected the NIC itself was faulty, and the NIC is integrated with the motherboard, we ended up replacing the whole motherboard on the 'problematic' node.
Unfortunately, after we installed it and finished setting up the LAN connections, NIC teaming and all that, we still had the same issue.
BUT,
we removed all NIC teaming on that node and the random errors seem to have stopped appearing!! (At least for now, for the last couple of hours.)
I'll definitely close this discussion with my conclusions once this strange issue is finally over, but in the meantime, maybe you guys can think of a reason why NIC teaming would cause so many problems on only one of the nodes and not the other (remember, the two nodes are identical).
Thanks again.
September 29, 2011 at 5:47 am
What brand are the onboard NICs: Broadcom, Intel, etc.?
Have you ensured you are using the latest teaming drivers for your hardware from the manufacturer?
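If it helps, here's a quick sketch for dumping the network driver info on each node so you can diff them; it just shells out to wmic, so run it (or the equivalent wmic command) on both nodes:

    import subprocess

    # List network-class driver names, versions and dates via WMI.
    # Run on each node and compare -- "identical" servers often turn out
    # to have different teaming driver revisions.
    out = subprocess.run(
        ["wmic", "path", "win32_pnpsigneddriver",
         "where", "DeviceClass='NET'",
         "get", "DeviceName,DriverVersion,DriverDate"],
        capture_output=True, text=True, check=True)
    print(out.stdout)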
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
September 29, 2011 at 7:42 am
I'm still thinking it could be SAN-related if the SAN is managed out-of-band over the (teamed) NICs. There may even still be a pathing/host-mapping issue. That kind of problem would present on just one node if you had separate FC switches.