January 7, 2014 at 1:38 pm
Hi Guys,
We are suffering serious performance issues at work which come on suddenly and require a failover or reboot of the server. When it happens the server is totally unresponsive an exec requests DMV shows that there is one particular stored proc in queued up about 70 times, all with recompile wait resources.
I ran perfmon from yesterday to today and caught the issue in the act so I have some pretty good data to go on.
If anybody wants I can send you the .blg perfmon file to inspect but if not then here is a general overview of the data that I collected.
Looking at the data I can see that this isn't a gradual thing, it looks like it comes on hard and fast. Here is what happens:
Average temp tables: Drop to nearly 0 from 50 Temp Tables For Destruction: Skyrocket Compilations/sec: Jump from averAverage temp tables: Drop to nearly 0 from 50
Temp Tables For Destruction: Skyrocket
Compilations/sec: Jump from average of about 9/s to about 70/sec
Active Parallel threads: Skyrocket off the chart briefly and then go to 0. Previous to the crash they were averaging about 10
Average Latch Wait Time (ms): Averaging about 7 and then shoots up to about 40-50
Read and Write Disk Queue Length: Lots of activity until the crash. They then go down to basically 0.
Batch Requests per second: Stable until the point of the crash and then a sharp blip up and then drop drastically.
Blocked Tasks: Huge Spikes off the chart after the crash.
Bulk Copy Throughput/sec: Frequent spikes before the crash for a couple of hours and then one single spike right at the point of the crash. Another shortly after the crash.
Cache Hit Ratio: Drops by about 40%
Connection Resets/Sec: Sharp blip up before crash and then drop off to very low levels:
Extended procedures: A few tiny blips 15 mins before the crash. Huge spikes after the crash
Failed Auto-Params/sec: Few blips and spikes for about an hour before the crash. Big spikes after the crash.
Free-List Stalls/sec: Big spikes for about an hour before the crash. Big spike on crash and then 0
Latch Waits/sec: Steadily increase until the crash and then drop right off to nearly 0
Lazy Writes/Sec: Huge spikes consistently for about an hour before the crash and then 0
Lock Timeouts/Sec: Quite big spikes before the crash. Drop right off to 0 during the crash
Lock Wait Time(ms): Starts going haywire around about the time of the crash, during and after the crash.
Log Flush Wait Time: Quite big spikes for about an hour before the crash. 40 mins before the crash wait time increases significantly (80%) for about 20 mins and then drops back to normal levels. Spikes back up again at about the time of the crash.
Memory Grants/Sec: Stable until the crash and then drop right off
Memory Grants Pending: Around about 0 until the crash and then increase significantly
Non page latch waits: Consistently spiky until the crash and then drop down to 0
Page IO Latch Waits: Consistent medium sized spikes until the crash and then increases significantly
Page Lookups/Sec: Stable until crash, small spike during crash, gradual reduction after crash
Page Reads/Sec: Consistent medium sized spikes until the crash and then drops to 0
Page Splits/Sec: Consistent massive spikes and then drop to 0 immediately after the crash.
Query Optimizations/Sec: Consistent mini spikes leading up to crash and then huge surge during/after crash
Readahead pages/sec: Consistent big spikes until crash and then 0
SQL Attention Rate: Basically non existant until the crash and then increases quite significantly
SQL Compilations/Sec: consistent medium spikes until crash and then an absolutely massive surge in the number
SQL ReCompilations/Sec: A Few big spikes 30 mins before the crash. More big spikes about 10 mins before the crash. One spike during the crash
Stolen Pages: Very Stable until the crash and then drops off to nearly 0
Table Lock Escaltions/Sec: Very consistent mini spikes until about 10 mins before the crash. Right before the crash this increases significantly and remains high after the crash.
Temp table creation rate: Very low leading up to crash and then increases significantly after the crash
Temp Tables for destruction: 0 right up until the crash and then huge spikes during and after the crash
Total App Domains Unloaded: Very stable until crash and then increases during and after the crash significantly
Total Latch Wait Time(ms): Medium sized consistent spikes for an hour before the crash. Big spike during crash. 0 After crash
Transactions/Sec: Stable until the crash. Big spike right before the crash. Drop off to low levels immediately after the crash.
Unsafe Auto-Params/Sec: 0 until after the crash. 3 Big Spikes after the crash.
Usage: Korean_Wansung_Unicode: Stable level until after the crash. Increases significantly after the crash
Usage: Table Hint Without With: Stable until shortly after the crash when it starts to increase significantly
Usage: XP_API: Stable until right after the crash when this skyrockets off the chart
Usage: DataTypes: text ntext or image: Low levels and stable before crash. Increases a lot shortly after crash
Usage: Syslogins: Stable until crash. Skyrockets to top of chart after the crash.
I want to just add to this that we have a job that runs every 10 minutes and does bulk loads of CSV files for any clients who have sent new data over to us. It does this by using a CLR procedure to check if there are any new folders on the network share and then bulk loads them using a format file if there are. I don't know how relevant this is but I have a sneaking suspicion that this might have something to do with the problems we are getting.
I've also checked on the server for any potential scheduled tasks that could cause problems and there basically aren't any. The only thing installed on there as well is EMC Powerpath 5.3. I don't know what this is but could this potentially cause issues?
One more thing to note is that we are running on a fail over cluster and basically it just isn't configured correctly. It sounds as though the guy who set it up originally (who has now left the company) didn't do too great a job at it and the guy that looks after it at the moment just doesn't have any experience looking after these things and doesn't want to touch it in case he breaks anything. He tried to fail it over the other day and nothing happened so that is the state that it is in. I had a quick look into it today and there are error messages such as the folowing:
"Cluster node '****' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges."
"The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges."
"Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible."
"The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data."
"Cluster resource 'Cluster Disk 5' in clustered service or application 'Cluster Group' failed."
Perhaps this has something to do with the issues that we are seeing as well. If you could add some comments in about this as well then that would be great. Hopefully this will be enough information for you to go on. I just hope I can get to the bottom of this soon!
Again if you want I can send you the .blg perfmon file to inspect if that would help. I really look forward to your response.
Thanks again.
January 7, 2014 at 6:52 pm
Seems like one of the node is removed from the cluster, normally, you should escalate this to windows folks, if you are not a cluster expert.
You should add the node back to the cluster and then try to failover the group.
Regards,
SQLisAwe5oMe.
January 8, 2014 at 2:03 am
Hmmm...
I would suspect the statistics are going out of date and leading to a bad plan which is locking everything up, but that's a guess. There's not quite enough information from what you've given.
You've isolated it down to a particular query and a particular blocking process, recompile. That's a bottleneck on CPU. So, questions are, how big is this query? A recompile operation is normally a few milliseconds to a few hundred milliseconds, so for it to completely lock the system implies a longer recompile. I had an 86 table join query that took about 3.5 minutes to recompile, just as a measuring stick, but it didn't require rebooting the machine to clear it.
Instead of a reboot, have you tried killing the offending process that is blocking everything? What happens if you do this?
Since you know the query or queries involved, I'd focus my information gathering efforts around those rather than system metrics. In particular, get an execution plan before the problem occurs and one during the problem on the query that is particularly locking things up.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply