September 6, 2011 at 2:18 pm
Hi!
Here's a "nice" little situation I've been dealing with for the last couple of weeks.
We (a couple of contractors) installed a "standard" 2-node (active/passive) cluster (SQL Server 2008 Enterprise on Windows Server 2008 Enterprise) as we've done many times before.
This time, the client has a web application that will eventually access hundreds of databases on the SQL cluster, but currently only a few test databases are on it. Even now, with only a couple of testers clicking through the web app, they get random "timeout" or "network-related ..." errors (you know the kind, when a SQL connection cannot be established), but as soon as they hit reload, all is good. The errors show no pattern and appear completely at random.
We've tried all the tricks, fixes and workarounds we know and also found on the internet ... and here's where we are currently:
-The errors show up only when the cluster services (MSDTC and SQL) are on Node 1. When we fail over to Node 2, all is good. (BTW, the two nodes are identical servers, set up identically.)
-Failing over to Node 1 takes considerably longer than failing over to Node 2.
-And I guess we mostly suspect a network "hardware" problem, so we replaced all the Ethernet cables with brand new Cat6 ones. Of course, we're left with trying a different NIC altogether, but I was hoping you guys have some other ideas and/or experiences before we go and recommend spending $400 on a quad-port gigabit NIC.
Thanks, and sorry if all this is confusing; I'll gladly provide more details if you feel it would help with the "diagnosing".
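P.S. In case anyone wants to see exactly what the testers are hitting, here is a rough sketch of the kind of loop that could log the failures from a client machine. It's just an illustration, not what we actually ran: the driver name, the virtual server name (SQLCLUSTER01) and the database (TestDb) are placeholders, and it assumes pyodbc is installed.

    import time
    import pyodbc  # assumes the pyodbc module and a SQL Server ODBC driver are installed

    # Placeholder connection string -- server and database names are hypothetical.
    CONN_STR = ("DRIVER={SQL Server Native Client 10.0};"
                "SERVER=SQLCLUSTER01;DATABASE=TestDb;Trusted_Connection=yes")

    while True:
        started = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            # timeout=5 is the login timeout in seconds
            cn = pyodbc.connect(CONN_STR, timeout=5)
            cn.close()
        except pyodbc.Error as err:
            # Log every transient failure with a timestamp so it can be
            # correlated with cluster, switch or teaming events later.
            print("%s  CONNECT FAILED: %s" % (started, err))
        time.sleep(10)  # one attempt every 10 seconds

Leaving that running for a few hours on each node would at least give timestamps to line up against the switch and cluster logs.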
September 7, 2011 at 2:05 am
What about port configs at the switch level?
How is the storage configured (local, SAN, NAS, etc.)?
During a manual failover, which resources seem to take the longest to come online on Node 1?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
September 7, 2011 at 2:25 am
Thanks for your interest.
Port configs are all set at 1 gig,
and the storage is SAN (EMC Clariion CX700).
During a manual failover, all resources come online pretty much instantly; it's just the SQL service that takes a while (and only when failing over to Node 1).
Also, I forgot to mention in my previous post: all cluster validation tests pass 100%.
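If it would help, something like the sketch below could time exactly how long the instance is unreachable during a failover to each node, so we'd have numbers instead of a gut feeling. The virtual name (SQLCLUSTER01) and port 1433 are assumptions on my part:

    import socket
    import time

    HOST, PORT = "SQLCLUSTER01", 1433  # hypothetical virtual name, default SQL port

    def sql_port_open():
        try:
            with socket.create_connection((HOST, PORT), timeout=2):
                return True
        except OSError:
            return False

    # Start this just before initiating the manual failover.
    down_since = None
    while True:
        if sql_port_open():
            if down_since is not None:
                print("SQL was unreachable for %.1f seconds" % (time.time() - down_since))
                break
        else:
            if down_since is None:
                down_since = time.time()
        time.sleep(1)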
September 7, 2011 at 1:24 pm
I've also seen strange behavior with NIC teaming before, which is why we took that out of the equation too. So yes, we HAD NIC teaming initially, but we removed it. Unfortunately (and obviously) that didn't resolve our issue ...
Thx for pointing that out anyway.
September 7, 2011 at 6:55 pm
I've never had a problem using NIC teaming for the public NICs.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
September 28, 2011 at 11:21 am
It looks like the SQL instance has some delay reaching the SAN storage on Node 1. Check the Fibre Channel or iSCSI connection.
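One quick way to see whether the instance is actually waiting on storage is to compare the I/O stall counters while running on each node. A rough sketch (connection details are placeholders; it just runs a query against sys.dm_io_virtual_file_stats):

    import pyodbc  # assumes pyodbc is available; server name below is a placeholder

    QUERY = """
    SELECT DB_NAME(database_id) AS db,
           file_id,
           num_of_reads, io_stall_read_ms,
           num_of_writes, io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL)
    ORDER BY io_stall_read_ms + io_stall_write_ms DESC;
    """

    cn = pyodbc.connect("DRIVER={SQL Server Native Client 10.0};"
                        "SERVER=SQLCLUSTER01;Trusted_Connection=yes")
    for row in cn.cursor().execute(QUERY):
        # High stall-to-operation ratios while on Node 1 (but not on Node 2)
        # would point at the FC/iSCSI path rather than the LAN.
        print(row)
    cn.close()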
September 28, 2011 at 11:55 am
I'm curious: DTC no longer has to be a clustered resource in SQL 2008; in 2005, yes.
Are both nodes on the same switch?
CEWII
September 28, 2011 at 1:24 pm
Did you check the HBA card?
September 28, 2011 at 2:00 pm
My thinking was that if it were a matter of the connection to the SAN (fibre cable and/or a 'bad' HBA), we would have the problems all the time, not just randomly and infrequently. As for DTC, it is set up as a clustered resource right now.
The servers have their LAN connections going into two switches. I'm not sure exactly how they're spread out, but the sysadmin set them up following standard two-node cluster recommendations, and he has also changed the ports several times, just to rule out the switch ports as a factor.
Thanks for your interest in this. I really hope we can pin this down to something, because it's still bothering us.
September 28, 2011 at 2:07 pm
Could you swap the cables between the two nodes and test? Or swap the NICs?
September 28, 2011 at 2:18 pm
Please post what you discover after you resolve this issue; I'm curious about this too. I have several SQL Server clusters running on physical boxes, and I am going to build a new one on VMware shortly.
September 28, 2011 at 3:00 pm
Latest info:
Since we also strongly suspected the NIC itself was faulty, and the NIC is integrated with the motherboard, we ended up replacing the whole motherboard on the 'problematic' node.
Unfortunately, after we installed it and finished setting up the LAN connections, NIC teaming and all that, we still had the same issue.
BUT,
we removed all NIC teaming on that node and the random errors seem to have stopped appearing!! (At least for now, for the last couple of hours.)
I'll definitely close this discussion with my conclusions once this strange issue is finally over, but in the meantime, maybe you guys can think of a reason why NIC teaming would cause so many problems on only one of the nodes and not the other (remember, the two nodes are identical).
Thanks again.
September 29, 2011 at 5:47 am
What brand are the onboard NICs: Broadcom, Intel, etc.?
Have you ensured you are using the latest teaming drivers for your hardware from the manufacturer?
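If it helps, here's a quick sketch for dumping the network driver info on each node so you can diff them; it just shells out to wmic, so run it (or the equivalent wmic command) on both nodes:

    import subprocess

    # List network-class driver names, versions and dates via WMI.
    # Run on each node and compare -- "identical" servers often turn out
    # to have different teaming driver revisions.
    out = subprocess.run(
        ["wmic", "path", "win32_pnpsigneddriver",
         "where", "DeviceClass='NET'",
         "get", "DeviceName,DriverVersion,DriverDate"],
        capture_output=True, text=True, check=True)
    print(out.stdout)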
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
September 29, 2011 at 7:42 am
I'm still thinking it could be SAN-related if the SAN is managed out-of-band over the (teamed) NICs. There may even still be a pathing/host-mapping issue. That kind of problem would present on just one node if you had separate FC switches.