Strange SQL Cluster 2008 situation

  • Hi!

    Here's a "nice" little situation I've been dealing with for the last couple of weeks.

    We (a couple of contractors) installed a "standard" 2-node (active/passive) cluster (SQL Server 2008 Enterprise on Windows Server 2008 Enterprise) as we've done many times before.

    The company we did this for this time has a web application that will eventually access hundreds of databases on the SQL cluster, but currently only a few test databases are on it. And even now (with only a couple of testers clicking through the web app) they get random "time out" or "network-related ..." errors (you know the kind, when a SQL connection cannot be established), but as soon as they hit reload, all is good. These errors have no pattern and show up completely at random.
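    One way to quantify the randomness described above, before and after each configuration change, is to hammer the SQL listener's TCP port and count failed connection attempts while the group sits on each node in turn. A minimal sketch in Python (the host name and port below are placeholders, not from the original setup):

```python
# Hypothetical probe: repeatedly attempt a TCP connection to the SQL
# listener and count how many attempts fail. Running it with the cluster
# group on Node 1 and then on Node 2 would show whether Node 1 really
# fails more often, and roughly how often.
import socket
import time

def probe(host, port, attempts=20, timeout=2.0, delay=0.0):
    """Try `attempts` TCP connections; return (successes, failures)."""
    ok = fail = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            fail += 1
        if delay:
            time.sleep(delay)
    return ok, fail

# Example usage (placeholder names):
#   ok, fail = probe("sqlclustername", 1433, attempts=100, delay=0.5)
#   print(f"{ok} ok, {fail} failed")
```

    A nonzero failure count only while the group is on Node 1 would confirm the pattern the testers are seeing, independent of the web application.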

    We've tried all the tricks, fixes and workarounds we know and also found on the internet ... and here's where we are currently:

    -The errors show up only when the cluster services (MSDTC and SQL) are on Node 1. When we fail over to Node 2, all is good. (BTW, the two nodes are identical servers, set up completely identically.)

    -Failing over to Node 1 takes considerably longer than failing over to Node 2.

    -And I guess we mostly suspect a network hardware problem, for which we've already replaced all the Ethernet cables with brand-new Cat6 ones. Of course, we're left with trying a different NIC altogether, but I was hoping you guys have some other ideas and/or experiences before we go and recommend spending $400 on a quad-port gigabit NIC.

    Thanks, and sorry if all this is confusing; I'll gladly provide more details if you feel it would help with the "diagnosing".

  • What about port configs at the switch level?

    How is the storage configured (local, SAN, NAS, etc.)?

    During a manual failover, which resources seem to take the longest to come online on Node 1?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Thanks for your interest.

    Port configs are all set at 1 gig, and the storage is SAN (EMC Clariion CX700).

    During a manual failover all resources come online pretty much instantly; it's just the SQL service that takes a while (and only when failing over to Node 1).
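    To put a number on that "takes a while", one could poll the listener port right after initiating the failover and record the time until the first successful connection, once per node. A small sketch (Python; the host and port in the usage comment are placeholders, not from the original post):

```python
# Sketch: measure how long after a failover the SQL listener starts
# accepting TCP connections, so the Node 1 vs. Node 2 delay can be
# compared with actual numbers instead of "considerably longer".
import socket
import time

def seconds_until_listening(host, port, poll=1.0, give_up=600.0):
    """Poll the port; return seconds until the first successful connect,
    or None if it never comes up within `give_up` seconds."""
    start = time.monotonic()
    while time.monotonic() - start < give_up:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return time.monotonic() - start
        except OSError:
            time.sleep(poll)
    return None

# Example usage (placeholder names), run right after starting the failover:
#   t = seconds_until_listening("sqlclustername", 1433)
#   print("listener up after", t, "seconds")
```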

    Also, I forgot to mention in my previous post: all cluster validation tests pass 100% OK.

  • You don't by any chance use teaming of network ports? I've seen this strange behavior a few times, where the problem was unstable teaming.



    Ole Kristian Velstadbråten Bangås - Virinco - Facebook - Twitter

    Concatenating Row Values in Transact-SQL

  • I've also seen strange behavior with NIC teaming before, and that's why we took it out of the equation too. So yes, we HAD NIC teaming initially, but we removed it. Unfortunately (and obviously) that didn't resolve our issue ...

    Thanks for pointing that out anyway.

  • I've never had a problem using NIC teaming for the public NICs.


  • It looks like the SQL instance has some delay reaching the SAN storage on Node 1. Check the Fibre Channel or iSCSI connection.

  • I'm curious: DTC no longer has to be a cluster resource in SQL 2008 (in 2005 it did).

    Are both nodes on the same switch?

    CEWII

  • Did you check the HBA card?

  • My thought was that if it were a matter of the connection to the SAN (fibre cable and/or a bad HBA), we would have the problems all the time, not just randomly and infrequently. As for the DTC, it is set up as a clustered resource right now.

    The servers have their LAN connections going into two switches; I'm not sure exactly how they're spread out, but the sysadmin set them up per standard two-node cluster recommendations, and he has also changed the ports several times, just to rule out the switch ports as a factor.

    Thanks for your interest in this. I really hope we can pin this down to something, because it's still bothering us.

  • Could you switch the cables between these two nodes and test? Or switch the NIC cards?

  • Please post your findings after you resolve this issue; I'm curious about this too. I have several SQL Server clusters running on physical boxes, and I'm going to build a new one on VMware shortly.

  • Latest info:

    Since we also strongly suspected the NIC as faulty, we had to replace the entire motherboard on the 'problematic' node, because the NIC is integrated into it.

    Unfortunately, after we installed it and finished setting up the LAN connections, NIC teaming and all that, we still had the same issue.

    BUT,

    we removed all NIC teaming on that node, and the random errors seem to have stopped appearing! (At least for now, for the last couple of hours.)

    I'll definitely close this discussion with my conclusions when this strange issue is finally over, but in the meantime, maybe you guys can think of a reason why NIC teaming would cause so many problems on only one of the nodes and not the other (remember, the two nodes are identical).

    Thanks again.

  • What brand are the onboard NICs: Broadcom, Intel, etc.?

    Have you ensured you are using the latest teaming drivers for your hardware from the manufacturer?


  • I'm still thinking it could be SAN-related if the SAN is managed out-of-band using the (teamed) NICs. There may even still be a pathing/host-mapping issue. This kind of problem would present on just one node if you had separate FC switches.
