February 1, 2008 at 2:14 pm
We have several hundred servers contacting each of our two national data centers once a minute to run a very simple query. Suddenly a few days ago, the SQL Server instance at one of the centers stopped dropping connections. Within a few hours, we reached our maximum of roughly 32K connections and things ground to a halt.
Restarting SQL Server of course cleaned everything out and things were fine again - until about 12 hours later connections once again stopped tearing down. We patched things up with a regular job that kills the old sessions, but once the storm starts, it doesn't go away until we restart SQL Server.
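(For readers following along: the poster doesn't show the cleanup job, but on SQL Server 2000 a job like this is typically a cursor over master..sysprocesses that KILLs idle user SPIDs. The one-hour threshold and all names below are our assumptions, not the actual job:)

```sql
-- Hypothetical sketch of an orphaned-session cleanup job for SQL Server 2000.
-- Walks master..sysprocesses and KILLs user SPIDs that have been idle
-- (sleeping, no batch) for more than an hour.
DECLARE @spid INT, @cmd VARCHAR(20)

DECLARE idle_spids CURSOR LOCAL FAST_FORWARD FOR
    SELECT spid
    FROM master..sysprocesses
    WHERE spid > 50                                      -- skip system SPIDs
      AND status = 'sleeping'                            -- awaiting command
      AND DATEDIFF(minute, last_batch, GETDATE()) > 60   -- idle over an hour

OPEN idle_spids
FETCH NEXT FROM idle_spids INTO @spid
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @cmd = 'KILL ' + CAST(@spid AS VARCHAR(10))
    EXEC (@cmd)        -- KILL does not accept a variable, so use dynamic SQL
    FETCH NEXT FROM idle_spids INTO @spid
END
CLOSE idle_spids
DEALLOCATE idle_spids
```

Scheduled every few minutes from SQL Agent, something like this keeps the connection count under control, though as discussed below it only treats the symptom.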
We note that:
The persistent connection storm only occurs at one of the data centers. They both run SQL Server 2000. The 'bad' instance is running 8.00.760, the 'good' one 8.00.818 (one security patch away).
In terms of network setup, etc., the two centers are very similar.
We know of no recent upgrades or other changes to code or network configuration at the 'bad' center.
The connection (and other) code run by the servers is exactly the same, whatever their 'default' database.
It certainly seems as if some type of handshake suddenly decides to stop working. The connection persists on the server, the client calls back one minute later, that connection persists, and so on, and so on. Then we restart SQL Server and everything's fine for a while.
This has just started happening and so we are still running traces, etc., to see if we can home in on an undropped connection. But in the meantime, we could use a few good opinions! Our situation is a little rare, I suppose, given the large number of individual servers and their connection rate.
Though this has been working well for years, the number and rate of connections has gradually increased over time.
Thanks!
Larry R.
February 3, 2008 at 5:45 pm
Did you check timeout setting?
February 4, 2008 at 6:52 am
The timeout is still at the default of 20 minutes. We've considered lowering it to under a minute, since all the normal queries the machine receives are quite small, but we were waiting to see if we could spot where the connection was dropping. At a guess, it looks like the query is responded to properly and in time, but that the handshaking fails at some point after that.
We'd suspect network issues, but once the failures start, they never clear up until we restart SQL.
We also note that the maximum response times in both data centers are the same.
There's apparently some new ability to detect and discard orphaned connections in SQL Server 2005, but installing that would take a round of testing, etc., and would still be merely symptomatic. If we can, we naturally want to know what's causing these things in the first place!
Regards,
Larry R.
February 4, 2008 at 11:32 am
Have there been any changes in the client software? (Installations, configuration changes, etc.) Also, are your clients segmented so that certain clients only contact certain servers? This sounds awfully like the client is not properly closing the connection, and while cutting off 'dead' connections on the server is really only treating the symptoms, sometimes you need to save the patient while you try to diagnose the problem! 😀
February 4, 2008 at 12:10 pm
No recent changes to client or server software. Certain clients do contact certain servers by default (basically, we split the country in two).
This did seem to occur after a reboot on the affected server, but we don't believe any upgrades kicked in at that time. And the so-called 'good' server looks like it is also experiencing the same issue, albeit to a much lesser extent.
We've been able to determine that TCP/IP thinks the sessions are open, which tends to let SQL Server off the hook. But we're still trying to establish definitively that the client in an orphaned session actually originally connected in that session. It rather looks like it does, and that TCP/IP and the client application disagree about whether or not the session is still valid.
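(On SQL Server 2000, one way to tie each suspect session back to a specific client box is to read master..sysprocesses directly - a sketch, with the column selection being our choice:)

```sql
-- Our sketch: list sleeping user sessions with enough client detail
-- (hostname, NIC address) to match a SPID back to the machine that opened it.
SELECT spid,
       hostname,              -- client machine name, as reported at login
       program_name,          -- client application name
       net_address,           -- client NIC address, ties the SPID to a box
       login_time,
       last_batch,            -- last command; far in the past = likely orphaned
       DATEDIFF(minute, last_batch, GETDATE()) AS idle_minutes
FROM master..sysprocesses
WHERE spid > 50               -- user SPIDs only
  AND status = 'sleeping'
ORDER BY last_batch
```

Comparing hostname/net_address here against the client's own view of its open sockets (netstat on the client) is one way to establish whether the two sides disagree about the session.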
At the moment, we're hoping to get lucky with traces while we schedule some upgrades.
We are managing the traffic by occasional manual restarting of SQL Server, and by a stored procedure that kills the orphaned sessions after an hour. I was just wondering if others had experienced the same kind of behavior. I think the key is that some hours after each restart of SQL Server, some type of threshold is reached. Orphaned processes begin to show up and then proliferate until we are again forced to restart.
February 4, 2008 at 3:39 pm
Yes, I agree that the problem seems to start after a certain threshold is reached - the question is which system in the chain is having the problem?
Some things you've probably already thought of: Are the same clients causing the open sessions? Were the open sessions routed through a particular switch (or brand of switch) that might not be catching the session close? Could you have a separate issue with a high volume of activity on a switch or link in the network where packets may be getting dropped?
I'd suggest including the NIC in the affected server in your list of suspects, even though you mention that your 'good' server is also affected.
One final thought: run your 'open session killer' at a much smaller interval (a few minutes) and have it count the number of sessions it closes. Whether that count is constant, slowly increasing, or spiking (and when!) can help you understand what might be going on.
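(The counting idea above could be bolted onto the killer proc with a small log table - a sketch, where the table and its name are our invention:)

```sql
-- Hypothetical log of how many idle sessions each run of the killer finds,
-- so trends and spikes stand out over time.
CREATE TABLE dbo.OrphanKillLog (
    run_time DATETIME NOT NULL DEFAULT GETDATE(),
    killed   INT      NOT NULL
)
GO
-- Inside the killer procedure, before the KILL loop:
DECLARE @n INT
SELECT @n = COUNT(*)
FROM master..sysprocesses
WHERE spid > 50
  AND status = 'sleeping'
  AND DATEDIFF(minute, last_batch, GETDATE()) > 60   -- same idle threshold

INSERT INTO dbo.OrphanKillLog (killed) VALUES (@n)
```

Graphing run_time against killed would show whether the orphans accumulate steadily after each restart or arrive in bursts once the threshold is hit.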
Steve G.
February 5, 2008 at 1:37 am
We did have something similar, albeit on a smaller scale. The software, WAN, servers and clients had not changed. However, we discovered that there was a software 'error' which had only revealed itself once the database had reached a certain size.
The network guy found it when he put a 'sniffer' on the network, checking the stations that had the most trouble. The SQL Server was doing fine; it had just got battered and was dropping connections.
Madame Artois
February 5, 2008 at 7:51 am
Some interesting thoughts, guys, thanks. Here's what we can see thus far:
The clients affected vary more or less randomly.
Putting a count in our SPID-killer sp is a reasonable idea, I'll look into that. But just manually looking at the SPIDs, they definitely grow rapidly in quantity over time. Eventually they start spawning off at the rate of 15 or 20 a minute, and we have to restart SQL Server just in order to clean them out and maintain reasonable performance. After which, like I said, things are OK for another 10-12 hours.
At both sites the db is the same size (it's actually replicated from the 'good' site to the 'bad' one; forgot to mention this, but it could be a factor). And the code is the same on all the clients.
We've moved over a few bunches of servers and it doesn't seem to affect things much. So right now it doesn't seem to be related to the fact that the 'bad' site has a little more traffic than the 'good' one.
Still working, appreciate your interest.
Regards,
Larry R.