December 5, 2014 at 11:45 am
I am wondering whether the SQL Server service on that host has eaten up most of the available RAM, to the point where the other processes have to page their memory out to disk. It's worth checking if you are nearly out of RAM.
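A quick way to check that from the SQL side is to compare how much physical memory the SQL Server process is holding against what the OS still has free. A minimal sketch using the standard memory DMVs (SQL Server 2008 and later; requires VIEW SERVER STATE):

    -- Physical memory the SQL Server process is currently using
    SELECT physical_memory_in_use_kb / 1024 AS sql_memory_in_use_mb,
           memory_utilization_percentage
    FROM sys.dm_os_process_memory;

    -- Physical memory the OS still has available, and whether it reports pressure
    SELECT available_physical_memory_kb / 1024 AS os_available_mb,
           system_memory_state_desc
    FROM sys.dm_os_sys_memory;

If os_available_mb is tiny and the memory state description shows pressure, the paging theory is worth pursuing; otherwise it points back at hardware.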
December 5, 2014 at 11:54 am
Thanks, Martin. In our case even a warm boot didn't clear the issue. The machine took forever to come back up and was very slow right away. This leads me to believe it's a hardware problem, rather than anything that was running.
It came back up normally after a cold boot (pulling the power cables).
December 5, 2014 at 12:15 pm
The issue was definitely the RAID controller in our case. SQL Server's max server memory is set to 64GB (the edition limit, since we use 2008 R2 Standard), and the OS has plenty of RAM available above that 64GB level. We also use SQL Monitor and would have been notified of any kind of memory issue. At any rate, Dell confirmed that it was an issue with their controller; they just couldn't come up with a solution that actually stopped the problem from occurring.
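For anyone comparing setups, a cap like that is normally applied with sp_configure; a minimal sketch, where 65536 is simply 64GB expressed in MB:

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Cap the instance at 64GB; the value is in MB
    EXEC sp_configure 'max server memory (MB)', 65536;
    RECONFIGURE;

With the instance capped well below the physical RAM in the box, an OS-level memory squeeze is unlikely, which again points at the controller.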
January 2, 2015 at 4:36 pm
Have you solved this issue yet? I've seen something like this before and it was a driver issue.
Watch my free SQL Server Tutorials at:
http://MidnightDBA.com
Blog Author of:
DBA Rant – http://www.MidnightDBA.com/DBARant
January 5, 2015 at 6:09 am
We updated the firmware as instructed by Dell, but we were never able to completely resolve the issue. We did set up CacheCade using an SSD, which basically replaces the on-board controller cache with the SSD in its place. We haven't seen the same issue occur on the only server where we tried this setup. Depending on how many servers you have, though, this could be a cost-prohibitive solution from my perspective. I would say it's still a better bet to look at replacing the RAID controller with a different brand if possible.
January 9, 2015 at 6:58 am
If a warm boot doesn't solve it but a cold boot does, I'd start looking for hardware problems. I've had systems in the past (long, long ago, on old hardware) that had a cache disabled by firmware due to high error counts. CPU utilization numbers would go through the roof because the CPU got painfully slow. Going to memory for every instruction fetch takes TIME! On those systems (running OpenVMS, not Windows), there would be a message on the console and entries in the error log indicating the cache had been disabled.
Check all the logs for your system. Look for hardware errors, even correctable events, and look for anything that might have been disabled due to errors. If this is an EFI-based server, check the console event logs (SEL) as well.
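On the SQL Server side, a complementary check (not a substitute for the hardware and SEL logs above) is to scan the SQL Server error log for long-I/O warnings, which tend to appear when a controller loses its cache and disk latency spikes. A rough sketch using the undocumented but widely used xp_readerrorlog procedure, searching for the standard 15-second I/O warning text:

    -- 0 = current error log, 1 = SQL Server error log (2 would be the Agent log)
    EXEC xp_readerrorlog 0, 1, N'I/O requests taking longer than', NULL;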
January 9, 2015 at 7:06 am
I'd really suggest that everyone read the entire thread before commenting. This issue is the result of the cache getting disabled on the RAID controller under high I/O. Dell has no known fix for this issue when it occurs. We gave them all of our logs, and their only response was to have all of the firmware on the controller updated, which, by the way, didn't fix the problem. The only workaround we have found is to use CacheCade, and we are still not 100% sure it actually fixes the problem, but the one server we configured this way hasn't had the issue since. The other option is to change to a different RAID controller if possible. We have since decided to go a different route on our server builds, using the Intel server configurator and building the servers in-house.