September 12, 2012 at 11:23 am
I am looking to see if anyone has ideas on where to start looking for problems, or can spot something I may have missed.
Problem: a few times per day, our internal traffic to our SQL Server will "freeze", and our web interface and VB6 Windows application will also "freeze". No external web surfing or IP phone communications are affected. The "freeze" lasts about 2 to 5 minutes. Every workstation is affected at the same time, for the same duration. The problem occurs at AROUND 15-minute intervals, but typically does not start directly on the mark; it is usually a minute or two off. Example: 9:00:50, 9:17:10, 9:47:05, 10:00:30. These times have been consistent for the past 3 days. When the "freeze" is over, everything just resumes as though nothing happened.
If anyone has suggestions on things I should look at, it would be much appreciated. Below are a few things I have been looking into, with no evidence so far of what is causing the problem.
I have Confio Ignite 8 and am working with their support. We don't really see anything, other than that during these "freezing" times the hardware data collection, and sometimes the SQL statement collection, stops. Nothing is recorded and Ignite shows chunks of missing data. When it can pick up some data, there is nothing showing strain on the server.
I have perfmon counters set for CPU, Memory, and Disk. During the "freezing", I see no anomalies or large strain. I see nothing to indicate the server is physically struggling.
I have captured sp_who2, sp_who2 'active', and custom "running SPIDs" query results during these "freezes" and see nothing hitting the server hard. There is a normal number of open connections, nothing being dropped, a normal number of running connections, and normal activity. Nothing is putting strain on the server. There is no blocking, and only a very small number of transactions in the queue.
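For reference, the custom capture is roughly along these lines; this is a simplified sketch against the standard DMVs, not the exact script I run:

-- Rough sketch of the "running SPIDs" capture: who is connected, what each
-- session is currently running, and whether anything is blocked or waiting
SELECT s.session_id,
       s.login_name,
       s.host_name,
       s.status,
       r.command,
       r.wait_type,
       r.blocking_session_id,
       t.text AS current_statement
FROM sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r
       ON r.session_id = s.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1;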
I have checked the SQL Server error log and found nothing of value for this case.
I even went to the trouble of running Profiler against production during times I knew there would be a "freeze", to pull ANY errors from SQL Server. Nothing but some log completion events; no errors.
I have switched some SQL Agent timers to execute at different times.
I have worked through a few things with Red Gate support for log shipping, since we ship on 15-minute intervals (but exactly on the 15-minute marks).
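To line the Agent activity up against the freeze times, a query along these lines against msdb shows when each job actually ran (the date filter here is just an example):

-- When did each SQL Agent job actually run, to compare against the freeze windows
SELECT j.name,
       h.run_date,      -- stored as an int, yyyymmdd
       h.run_time,      -- stored as an int, hhmmss
       h.run_duration,  -- stored as an int, hhmmss
       h.run_status     -- 1 = succeeded
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
  ON j.job_id = h.job_id
WHERE h.step_id = 0              -- job-outcome rows only
  AND h.run_date >= 20120910     -- this week, adjust as needed
ORDER BY h.run_date, h.run_time;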
Only once did a user ever report an actual error. As reported by the user:
9/11/2012:
9:00 am 2 mins
9:32 am 2 mins
9:46 am 5 mins
This error came up once it unfroze:
Error at: 9/11/2012 9:49:19 AM
Error: [DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.
I have had the Network Admin check the event logs on the SQL server, and the switch logs. Nothing came up during the problematic times.
We have an audit mechanism that shows us how long individual queries take. During these freezing periods, the recorded execution duration SOMETIMES spikes while things are frozen. As in, the server seems to keep the connection, just not do anything with it, then resume after 2-5 minutes, and the execution is then logged as having taken longer. The mechanism is built into the app: the application takes a timestamp, runs the query, takes another timestamp, then logs what was executed and how long it took. There are spikes, but this doesn't really give detail as to WHY. Our main app has a built-in timeout of 30 minutes; the web application, where no errors are seen, has page timeouts of between 2 and 5 minutes depending on what is being used. Whatever is being used, no one is reporting timeouts or thrown/shown errors.
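For the duration spikes, I pull them out of the audit log with a query along these lines (table and column names here are illustrative, not our real schema):

-- Longest-running statements logged by the app during one of the freeze windows
SELECT TOP 50
       a.executed_at,
       a.duration_ms,
       a.statement_text
FROM dbo.AppQueryAudit AS a           -- hypothetical audit table name
WHERE a.executed_at BETWEEN '2012-09-11 09:45' AND '2012-09-11 09:55'
ORDER BY a.duration_ms DESC;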
September 12, 2012 at 2:15 pm
Is the database set to auto close?
September 12, 2012 at 3:23 pm
Auto Close is false on all system and user databases.
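For reference, a quick check against sys.databases confirms it across the instance:

-- AUTO_CLOSE (and AUTO_SHRINK while we're at it) for every database on the instance
SELECT name,
       is_auto_close_on,
       is_auto_shrink_on
FROM sys.databases
ORDER BY name;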
September 13, 2012 at 8:53 am
At 9 am, I could run sp_who2 and get query results. CPU was at 40-ish percent, and I saw nothing I don't regularly see when there is no blocking (file attached).
At 9:30 am I could NOT run sp_who2 (the query froze until the server was done doing whatever it was doing, and finally returned results when it was done). The CPU level dropped to about 5% on all cores. When the connection came back there was a CPU spike, but with the spike came actual activity again.
Sometimes the CPU spike is there and that causes a block. I just had a "freeze" at 9:50, a time at which I have never had a freeze before, and watching Task Manager and all my other alert mechanisms, nothing was visibly any different from 9:49 or 9:48.
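Next time it freezes, something along these lines, run from a connection opened before the freeze starts (since new queries seem to hang), might show what everything is actually waiting on, which sp_who2 couldn't; a sketch:

-- What every user task is currently waiting on, and who (if anyone) is blocking it
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.blocking_session_id,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id > 50         -- skip system sessions
ORDER BY wt.wait_duration_ms DESC;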
September 13, 2012 at 11:17 am
Not every time, but a couple of times this morning, I captured a small group (10-15) of ATTENTION events in Profiler AFTER the freezes were over. I very seldom see the Attention event otherwise, and it's coming from IIS, so I figure those are timeouts from someone running a report. But sometimes, when the freeze happens and ends, a few of those events get dumped in all at once.
Saw this:
"The Attention event class indicates that an attention event, such as cancel, client-interrupt requests, or broken client connections, has occurred. Cancel operations can also be seen as part of implementing data access driver time-outs"
Also, one time out of all of these freezes, I got an error message from a user (this error has only ever been seen this once across all of the freezes):
"Error: [DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.
Called: frmMain:SearchTran"
We sometimes get this kind of error when our switch reboots. That error has only been seen once, though, out of all the countless freezes.
September 13, 2012 at 11:21 am
Sometimes, but not always, when the freeze occurs these two messages appear multiple times in the SQL Agent log:
[382] Logon to server 'MANNASQL2' failed (ConnAttemptCachableOp)
[165] ODBC Error: 0, Unable to complete login process due to delay in opening server connection [SQLSTATE 08001]
I have found little written about these, except that domain controller pass-through authentication is perhaps suspect.
No other messages are seen in any event logs or server logs.
September 13, 2012 at 11:24 am
I have to ask, is SQL Server the only component being monitored at this time? Just a gut feeling, but I am having a hard time believing this is a SQL Server problem, at least on its own. Something tells me there is more going on with this.
September 13, 2012 at 11:31 am
The box is a dedicated machine, and other processes have been checked with no variance seen on the box when the problem occurs. The box has 36 GB of memory total, of which I allocated 20 GB to the SQL Server service. SSAS is also on the box, but it is used very minimally and our cube is re-processed once nightly. No one touches it during the day other than to occasionally query the cube.
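The memory cap is just the standard max server memory setting; for completeness, this is roughly how it is checked and set (20480 MB corresponding to the 20 GB above):

-- Show (and, if needed, set) the memory ceiling for the SQL Server service
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';            -- currently 20480 = 20 GB
-- EXEC sp_configure 'max server memory (MB)', 20480; RECONFIGURE;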
Since day 2, domain controller pass-through authentication, internal firewalls, and our switches have all been suspects.
Each of these has been checked, as well as the system and application event logs, with nothing conclusive.
I have also started to monitor our DR server, which has log shipping going to it. I have been looking at PerfMon collections from our file servers and other servers, and nothing pops out to say those servers are having any issues at the same time (though I have not looked as thoroughly as on the SQL server, since I really need my network admin to be doing that).
The machine itself does not have any other processes taking any additional resources that I can see (though I admit I could very well be missing something).
However, all of this is now leading me to think it's pass-through authentication... but at this time I have nothing to prove it's not the SQL Server, since the SQL box has such inconsistent data captures when a freeze occurs.
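Since pass-through authentication is the leading suspect, one quick thing to check from the SQL side is whether connections are coming in over Kerberos or NTLM:

-- Breakdown of current connections by authentication scheme
SELECT c.auth_scheme,            -- KERBEROS or NTLM for Windows logins, SQL for SQL logins
       COUNT(*) AS connection_count
FROM sys.dm_exec_connections AS c
GROUP BY c.auth_scheme;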
September 13, 2012 at 11:33 am
Is this server the only server running SQL Server?
September 13, 2012 at 11:44 am
We have a DR server for log shipping, sitting dormant aside from restores. It's another dedicated box that was our production server until 4 years ago; when my company upgraded the production server, the older one was sent to a DR facility, where it receives and restores logs. We have a "stage"/testing box with a CPU license, and a "development"/coding box, sitting on a VM, with 5 or so dev CALs and connections.
September 13, 2012 at 11:47 am
I had not thought to connect our application to both production and stage to see if both connections freeze. The wheels are turning on more things to test.
September 13, 2012 at 11:48 am
When did the system start to freeze? I would need to go back and check to be sure, but IIRC you said nothing had changed around the time it started happening? All I can say is that something changed somewhere: a patch to a server, a switch, a network driver, something. The other option is that something is starting to fail somewhere.
September 13, 2012 at 12:15 pm
"When did it start", is still in the information gathering process.
I first knew about this as an issue on Monday, when I was putting dashboards on a TV and happened to talk to an end user of the problem applications, sitting there next to the TV. I have been asking people and my results are "it's always been slow", "this is the first time I ever saw it" (though they knew about others having the problem), and "about 3 weeks it started showing up" or "about 3 weeks it started getting worse but it's been there for a while".
Apparently, everyone in the company using the applications that freeze has known about this for at least 3 weeks and reportedly longer. That's pretty much everyone but IT, so we have no real track record of it here.
Two weeks ago I reset all log shipping. Two months ago I restarted the server. There have been no patches to the production server this year, but our DR server probably had patches about 5 months ago. A restart of the server is on my list, but has not been done during this business week. The instance is SQL Server 2005 SP4 and has been since SP4 was released.
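For completeness, the patch level is easy to confirm straight from the instance:

-- Version, service pack level, and edition of the instance
SELECT SERVERPROPERTY('ProductVersion') AS product_version,   -- 9.00.5000 = 2005 SP4
       SERVERPROPERTY('ProductLevel')   AS product_level,     -- e.g. SP4
       SERVERPROPERTY('Edition')        AS edition;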
September 13, 2012 at 12:44 pm
Well, that's part of it. What about DNS servers, routers, switches, etc?
September 13, 2012 at 12:51 pm
The UPS switch for our batteries went in 2 weeks ago. The network switches on the servers were changed about 11 months ago, when we put in new switches for our phone system. DNS has been unchanged for an unknown time, over a year, and the same goes for the router, firewall, and external communication providers.