May 29, 2007 at 8:29 am
I assume that you mean the Kernel Memory section of the Performance tab in Windows Task Manager?
May 29, 2007 at 10:53 am
Here's the only message that raises a flag for me:
SQL Server has encountered 15 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:\xxxx_Log.LDF] in database [xxxx] (7). The OS file handle is 0x000005A8. The offset of the latest long IO is: 0x000000feed5a00
There are over 20 of this type within the last 24 hours. Before that, there were none within the last week.
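For anyone chasing the same warning: on SQL Server 2005 and later you can measure cumulative I/O stalls per database file with the `sys.dm_io_virtual_file_stats` DMV. A rough sketch (standard DMV columns; run it against your own instance):

```sql
-- Cumulative I/O stalls per database file since the last SQL Server restart.
-- A high average stall per I/O points at slow storage, which matches the
-- "IO requests taking longer than 15 seconds" messages in the error log.
SELECT DB_NAME(vfs.database_id)                 AS database_name,
       mf.physical_name,
       vfs.num_of_reads + vfs.num_of_writes     AS total_io,
       vfs.io_stall                             AS total_stall_ms,
       vfs.io_stall
         / NULLIF(vfs.num_of_reads + vfs.num_of_writes, 0)
                                                AS avg_stall_ms_per_io
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_stall_ms_per_io DESC;
```

On SQL Server 2000 the equivalent information comes from `::fn_virtualfilestats`, with a similar shape but fewer columns.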
May 30, 2007 at 9:56 am
Just an update. I have been working with Dell, and their DSET scan found no problems. They had me download an ISO image that contains updates for all the drivers on my server. I plan on applying it this weekend and then doing a low-level scan of the hard drives. I cannot do this while in production, for obvious reasons.
I don't see how new drivers will fix this since the current drivers were working fine before this and I didn't change anything. This seems to be a 'blanket' fix because they don't see a problem.
I'm going to make sure my backups have been moved off the machine before I run the updates!
May 30, 2007 at 10:14 am
Will,
Very strange and thanks for the updates. I'm doubtful on new drivers myself, but you never know.
If you have the space, I'd ask for new drives from DELL as a test, add them as a new array, and move the data and log files over there. That would at least eliminate THESE drives.
May 30, 2007 at 10:50 am
That would be nice if we had the money to do that.
May 30, 2007 at 9:39 pm
Take a look at your MSDB database. All the backup job logging is stored in four tables (backup*). If these get too large, I have seen a dramatic decrease in performance.
Also, do you have any other jobs that would run at the same time as your backup job? This contention could also cause problems.
What backup software are you using?
SJ
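For reference, checking the size of those msdb history tables is straightforward, and msdb ships with a stored procedure to prune them. A sketch (the 90-day cutoff is just an example; pick a retention that suits your shop):

```sql
-- Row counts for the backup history tables in msdb.
SELECT 'backupset'         AS table_name, COUNT(*) AS row_count FROM msdb.dbo.backupset
UNION ALL
SELECT 'backupmediaset',                  COUNT(*) FROM msdb.dbo.backupmediaset
UNION ALL
SELECT 'backupmediafamily',               COUNT(*) FROM msdb.dbo.backupmediafamily
UNION ALL
SELECT 'backupfile',                      COUNT(*) FROM msdb.dbo.backupfile;

-- Prune history older than 90 days.
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(day, -90, GETDATE());
EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = @cutoff;
```

Note that `sp_delete_backuphistory` deletes one date range at a time; on a badly bloated msdb it can run long, so trim in small increments during a maintenance window.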
May 30, 2007 at 9:45 pm
Sorry, got to reading the other pages....
We had an IBM server do the same thing. We had the beginnings of a logical hard drive failure in our RAID 5 array. The RAID diagnostics wouldn't report anything, and then boom, crash... drive #0 or another would be dead.
IBM told us to update all the drivers, firmware and such, and it helped for a short while until another drive would do the same. Once we got through all five drives in our RAID config, it stopped happening. Best I can say is that all of the disks were purchased at the same time, and they all decided to die around the same time.
SJ
May 31, 2007 at 6:06 am
I'm just using SQL Server for my backups. Dump to disk, zip, then copy off the machine daily. Then burned once a week to DVD-Rs.
I will take a look at my backup* tables in MSDB. Thanks.
All of our drives were installed at the same time except one. I am pretty sure, but not positive, that we have at least one backup. I don't handle that, but I will verify today. The Dell person I talked to said that there was a 'critical' firmware update for the one drive that wasn't like the others on the array. We'll see what happens this Saturday.
I always hate that feeling of dread that comes as the machine is rebooting (hopefully) after a major upgrade/change. Hopefully everything will be fine.
June 5, 2007 at 8:25 am
Update :
1. Updated drivers and new maintenance software from Dell; the problem still exists.
2. Drives seem to go into 'slow mode' when the system is stressed for a long period.
3. If the system is not stressed for a long period, the drives don't go into 'slow mode'.
I have not had enough system down-time to run a full scan on all the hard drives. When I run the scan it puts the drives into 'slow-mode' and requires a reboot to clear, so I can't run the scan while in production. I will have to do one drive at a time this weekend and see how much I can get finished. I only have maintenance windows on the weekend, so I've only had one chance to run the tests since installing the new drivers and scanning software.
When I say 'stressed for a long period', I mean a process that handles tens of millions of rows and runs for at least 20-30 minutes. When I say 'slow-mode', that's my name for it. I don't have enough info on the problem to give it a better name. My other names for it aren't fit for this forum anyway. ha ha.
June 27, 2007 at 7:26 am
I hate to take a chance on jinxing my server, but I think we may have solved the problem.
Turns out that the battery on the PERC 3 card that controls the array was slowly going dead. It would go into recharge mode and never come out, because it wouldn't charge to full capacity. During the recharge it would operate without caching. Hence the slow response time.
We replaced the PERC 3 card with a PERC 4 card on Monday and haven't gone into 'slow-mode' since. Keeping my fingers crossed that it doesn't come back. We are also seeing about a 20-25% increase in speed with the new card.
We will wait a couple of weeks and then replace the PERC 3 card that controls the internal array with a PERC 4. The PERC 3 operates at 160 MB/s (Ultra160 SCSI) and the PERC 4 at 320 MB/s (Ultra320), so we should see an increase in speed on the internal array also. That array holds the operating system, so I'm looking forward to that upgrade!
June 28, 2007 at 7:58 am
I've been reading recently that hard drive failure rates are up, especially for higher-capacity units. In my home office I have six PCs with a total of eight physical drives; over the last couple of years I've had two drives go bad.
Some programs like automatic defraggers and anti-virus scanners can really shorten the life of a disk.
The way I understand the inner workings of a hard drive, it will write and verify; if the verify fails, it will automatically retry up to four more times before the disk reports an error. So I suppose you might need a low-level repetitive scan utility to discover that.
June 28, 2007 at 10:37 am
My problem didn't have anything to do with a hard drive going bad. I should have made it clear that the cache was on the PERC 3 card, not the hard drive. Not sure if you responded to the correct post, or maybe I wasn't clear on my post on the solution. When the battery went dead, it stopped using the cache on the card, not on the hard drive. I replaced no hard drives for the solution.
June 28, 2007 at 12:20 pm
Will, on 5/23 you said
"As I said originally hard drives were failing."
but re-reading the entire thread I can see that was not actually the problem. Sorry for butting in.
June 28, 2007 at 1:55 pm
"As I said originally hard drives were failing." - that was supposed to read:
"As I said originally, hard drives were NOT failing." - I forgot about that correction.
No problem. Didn't mean to come across harsh. I've just been on a lot of forums where people are trying to get their post counts up and they do a drive-by posting without reading the thread properly.
When I reread my posts I realized that I wasn't clear that it was the card's caching that was turned off, not the hard drive's cache.
My bad.