March 13, 2013 at 12:46 pm
We moved our server racks in our colo facility over the weekend, upgraded to a new firewall, and installed new switches.
I've got one SQL server (SQL 2008 R2 SP2, Windows 2008 R2 Enterprise) in that facility. Before the move, it was behaving just fine. After the move, I'm seeing blocking and deadlocks on a regular basis. Note that this SQL server is very underutilized - far more hardware than necessary for the load it has.
The interesting part of this is that if I compare SQL server stats from before the move and after, the only stat that's changed is wait times have started to creep up. But I can't spot any good reason for that on the server itself.
What we *are* seeing is data getting *to* the SQL server fine, but a huge delay in the data get back from the server to the calling application. My favorite test case - I fire off a very small update from one application (it updates four records in a table) against a SQL server that isn't in the colo facility, and response time is the same as prior to the move. If I set that same application to look at the SQL server in the colo facility and send the exact same update, it takes 42+ seconds to finish.
But you don't see that 42+ seconds on the SQL server. And if the application times out, the data is still getting written to the server.
We've run hardware diagnostics on the SQL server - nothing reporting an issue.
We've swapped out network cables, and have tested changing switches and even moving back to the old switches that were in place prior to the move. No change in behavior.
I'm seeing far more ASYNC_NETWORK_IO waits than I did prior to the move, but nowhere near enough to account for the sheer lag we're experiencing.
I've checked baseline SQL Server configuration from before and after - no changes to the config.
We did deploy quite a few windows updates - this server hadn't been offline for maintenance in several months.
My network guys are telling me they don't see anything unusual, and me and my SQL folks aren't finding anything pointing to the SQL server directly - everything I'm seeing is pointing to dropped connections over the network.
I'm perfectly willing to hire a consultant to come in and look at the situation - but I'm not even sure what type of consultant to call at this point - a network one or a SQL one.
Any suggestions for where to hunt next would be appreciated. Short of physically removing the server from the colo facility and racking it in another office, I'm kind of out of things to test.
-Ki
March 14, 2013 at 8:48 am
Update: we found the problem. A faulty NIC driver that was updated at the same time. Would have been much easier to find if we hadn't had so many moving parts.
Figured I'd post this in case anyone else runs into something similar.
-Ki
March 14, 2013 at 9:49 am
ASYNC_NETWORK_IO is a clear indicator of network problems, NIC issues, or large data sets being processed over the network. 😉
I am not surprise the Network team said there was no network or NIC problem, lol ... usually, "it is a DBA or database issue" for them ... just kidding ... 🙂
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply