August 29, 2012 at 9:54 am
Hi,
Background:
I have a webapp using sql server as the back end.
It is monitored using an external monitoring service to measure response times.
The db is sql server 2005 mirrored on 2 identical machines + witness.
I need to freeproccache every week as it uses ad-hoc queries which fill up the proc cache and gradually slow the response time. freeproccache instantly fixes that issue.
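As a rough illustration of why ad-hoc queries fill the proc cache: SQL Server matches ad-hoc statements to cached plans by their exact text, so statements that differ only in an embedded literal each get their own entry (leaving aside simple or forced parameterization), while a parameterized statement reuses one entry. The Python sketch below is a toy model that only counts distinct statement texts. The `Orders` table, the `CustomerId` column, and all function names are hypothetical and do not come from this system.

```python
def adhoc_statements(customer_ids):
    """Build one statement per value, with the literal embedded in the text."""
    return [f"SELECT * FROM Orders WHERE CustomerId = {cid}" for cid in customer_ids]


def parameterized_statements(customer_ids):
    """Build the same statement text every time; the value travels separately."""
    return [("SELECT * FROM Orders WHERE CustomerId = @id", cid) for cid in customer_ids]


def distinct_cache_entries(statement_texts):
    """Toy stand-in for a plan cache keyed on exact statement text."""
    return len(set(statement_texts))


ids = range(1, 1001)
adhoc_entries = distinct_cache_entries(adhoc_statements(ids))
param_entries = distinct_cache_entries(text for text, _ in parameterized_statements(ids))
print(adhoc_entries, param_entries)  # 1000 distinct texts versus 1
```

If the application could send parameterized statements instead, the cache would likely grow far more slowly, which might reduce the need for a weekly freeproccache.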
The normal website response rate is average 0.2 seconds.
Problem...
For the last 10 days at random times - including times where the system is barely being used, the response time suddenly increases to average 0.4 seconds. i.e. double.
It will stay at average 0.4 seconds for 13.5 hours and then will suddenly return to 0.2 seconds.
Failing over to the mirror when the primary is at 0.4 seconds returns to 0.2 seconds, but failing back to original goes back to 0.4 seconds.
The primary has 2 other databases on it, not mirrored, which are not being used very much. I have looked at the IIS logs of the sites calling the other databases and they are not being used when this happens.
For example this morning it went to 0.4 seconds at 3.30 am. The system was barely being used, backups were much earlier.
They are dedicated machines with no other software on them.
When the primary was running at 0.4 seconds, I tried failing over (the mirror ran at 0.2 seconds), rebooting the 0.4 second primary, and failing back. It ran at 0.4 seconds again.
dbcc freeproccache has no effect on this. It stays at 0.4 seconds.
This has really stumped me.
I switched off all calls to the database apart from the one used by the monitoring service. It remained at 0.4 seconds.
Something must happen on this machine to cause this. There is nothing in the event logs or sql error logs of any significance. (i.e. nothing at all at the time this occurs).
I appreciate that a response time of 0.4 seconds is still very fast however what if one day it jumps to 4 seconds instead of 0.4. That would be a complete disaster and so I need to understand exactly why this is happening.
Even if a long running query was started to kick this off, it shouldn't increase server response time for 13 hours!
Any ideas would be welcome - I am currently lost with this and might just have to get used to the bizarre behaviour.
Currently I am using the Mirror as primary to see if that also does this after 11 hours!
Many Thanks
Simon
August 29, 2012 at 9:57 am
One suggestion I have is to set up a server side trace and see what is running when this occurs.
August 29, 2012 at 10:06 am
Hi,
I have done a lot with the sql profiler, but it has not got me any further.
One problem with the profiler is that the system runs about 1000 queries a second for most of the day. It is impossible to find the 20 or so queries which are triggered when the monitoring service hits the page to test it otherwise I could compare those.
I ran it for a while when it was 'slow' and there were no queries over 100ms - to be fair though the 'slow down' I am investigating is not huge, which makes it less likely there will be some huge query hanging.
If the queries were the problem then I would expect it to continue to run slow when I fail over to the mirror.
Let me know if I am wrong/overlooking something!
Many Thanks
Simon
August 29, 2012 at 10:12 am
At this point, there really isn't enough information to make any educated guesses, just shots in the dark. Have you looked at other aspects of server performance: disks, network, etc.?
August 29, 2012 at 10:19 am
Hi,
I have not looked into disk or network activity - I have played around with perfmon in the past, but tbh don't find it very useful at all. (Perhaps due to my lack of experience - or my actual experience that there is always a db tweak which will solve this type of issue).
I originally presumed it was a search spider hitting the website, found a load and blocked them, but it wasn't that.
It's like there is something else running on the server at that time, but there simply isn't.
It often starts this during times when there is really not a lot (relatively) going on on the server. If there were network or disk problems, I would expect it to get worse during peak times, but it doesn't. It jumps to 0.4 seconds for 13.5 hours then returns to 0.2 seconds! Really strange I know.
Thanks
Simon
August 29, 2012 at 3:31 pm
Since it's a website response time, have you already eliminated the possibility of the issues with web app or network latency between the service and the website?
August 29, 2012 at 3:39 pm
Hi,
Failing over to the mirror speeds up the database to 0.2s.
Having said that though, I have a new idea....
At around the same time this started, one of the data center's DNS servers started intermittently failing.
This caused a problem on one of the web servers which couldn't resolve certain domain names.
I just noticed that both the SQL Server machines are pointed at those DNS servers. I'm thinking that the db server that has these slow downs may be having problems with the DNS server - I didn't tell the DC I had problems with DNS, I just moved the web server to use Google DNS. With this in mind, I suspect the DNS servers are still broken.
I have no idea why a DNS server would slow down a DB server, however maybe there is something on the machine which is doing something with DNS and failing for the slow period. This all seems far too coincidental to ignore, however unlikely it seems.
The problem is though - as the machines are mirrored I'm a bit concerned about changing the DNS server addresses in case it breaks the mirror. It shouldn't do as it uses the hosts file to resolve each machine and the D/C DNS server is probably broken anyway.
Any thoughts on this!?
Many Thanks
Simon
August 31, 2012 at 1:06 pm
Hi,
I have found the cause of this for anyone who is interested.
It is a faulty RAID battery on the mirror server. This causes the RAID on the mirror to use disk caching rather than battery-backed memory.
It makes the response from the mirror slightly slower, which in turn slows the overall response time of the primary.
Thanks
Simon
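A rough way to see how a slow mirror can slow the primary: with a witness configured, mirroring normally runs in high-safety (synchronous) mode, where the principal does not acknowledge a commit until the mirror has hardened the log records to disk. The Python sketch below is a back-of-the-envelope model only. It assumes the local and remote log writes overlap, and every number in it (write latencies, round trip, commits per page, base time) is hypothetical, picked so the totals land near the 0.2 s and 0.4 s figures in this thread rather than measured on these servers.

```python
def commit_latency_ms(principal_log_write_ms, network_round_trip_ms, mirror_log_write_ms):
    """Synchronous commit latency under the assumption that the principal
    writes its own log while the mirror receives and writes its copy, so the
    commit waits for whichever path finishes last."""
    remote_path_ms = network_round_trip_ms + mirror_log_write_ms
    return max(principal_log_write_ms, remote_path_ms)


def page_time_ms(base_ms, commits_per_page, per_commit_ms):
    """Page response time = fixed work + time spent waiting on commits."""
    return base_ms + commits_per_page * per_commit_ms


# Hypothetical figures: 20 commits per monitored page, 170 ms of other work.
healthy = commit_latency_ms(1.0, 0.5, 1.0)    # mirror write absorbed by controller cache
degraded = commit_latency_ms(1.0, 0.5, 11.0)  # mirror write goes straight to disk

print(page_time_ms(170, 20, healthy), page_time_ms(170, 20, degraded))
```

Under these made-up numbers, about 10 ms of extra write latency on the mirror per commit is enough to double the page time, even though no single query looks slow in a trace.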
September 1, 2012 at 11:56 am
simon4132-806507 (8/31/2012)
Hi, I have found the cause of this for anyone who is interested.
It is a faulty RAID battery on the mirror server. This causes the RAID on the mirror to use disk caching rather than battery-backed memory.
It makes the response from the mirror slightly slower, which in turn slows the overall response time of the primary.
Thanks
Simon
Oh my!!! That's the second time in a month that I've heard of such a thing. That's just insane. You would think the hardware vendors would have some way of providing an alert for such a failure. Maybe they do and people just aren't setting it up during the hardware installation.
--Jeff Moden
Change is inevitable... Change for the better is not.