June 26, 2011 at 12:06 pm
Hi guys,
Sorry in advance about the long post. I ran into a little bit of a head scratcher yesterday and I am wondering if any of you have similar behavior. Here is my system config:
Clustered HP ML 380s
Two physical CPUS/8 logical
20 GB RAM - 16GB dedicated to SQL
Databases stored on SAN
SQL Server 2005 sp2
Windows Server 2003 sp2
At about 8:00 pm 2 nights ago I started seeing some latch timeouts in the event and SQL log. Minutes before that was happening, my transaction log backups started to report failures. Even though the maintenance plan was reporting failure, it looks like the backup files were of proper size and being generated correctly. The full backups just after midnight reported failure as well but SQL server was stating the last successful backup occurred at that time. Here are the errors I was getting. I understand that they are related to IO problems but at this time the disk looks healthy. I am going to re check with our network guys to confirm.
First error after transaction logs from 2 separate databases failed:
Timeout occurred while waiting for latch: class 'ACCESS_METHODS_SCAN_RANGE_GENERATOR', id 63151800, type 4, Task 0x00D80B68 : 4, waittime 300, flags 0x1a, owning task 0x00DADC48. Continuing to wait.
I then got a stack dump of the stored procedure that was waiting.
Following those errors I got the similar latch time out error:
A time-out occurred while waiting for buffer latch -- type 4, bp 06085BF4, page 1:2332056, stat 0x6c00009, database id: 12, allocation unit Id: 0, task 0x00D6A5C8 : 0, waittime 300, flags 0x3a, owning task 0x00DAC3E8. Not continuing to wait.
Database 12 is one of the databases in which the backup failed prior to all of this starting. The database is also setup for transactional replication. I don't replicate any tables that get huge batch updates against them and replication has been working perfectly for 2 years.
These latch time out and database transaction log backup failures continued through the night. I have never seen this happen as this cluster system has been running without incident for years except for the power supply issue we had 2 days ago which I somewhat describe below.
By the time the morning rolled around simple processes that usually take 2 seconds were taking 30 seconds. One process that usually takes seconds was running for 2 hours and was blocking all transactions from being inserted into a particular table. The transactions were piling up and replication stopped. Once we killed that job, transactions started rolling in and replication started up again but processing was very, very slow. We ended up failing over to the other node on the cluster and everything has been perfect since then. That was yesterday morning.
I guess that pretty much sums things up. The other major thing that happened recently was we lost a bunch of power supplies on our servers at the same time. Both machines in the cluster had this happen and we were down for 9 hours while the parts came. They seem to be working fine now but there was a warning yesterday morning that one of the redundant parts had failed. I don't think it is related to what happened here but I thought I would mention it.
It seems like the transaction log backup failures and the latch timeouts are related. I am trying to piece them together but outside a bad page or faulty ram it is tough to say at this point. I am going to try and dig up any perf mon data we have as well as get the network guys to do a very thorough check of the data center hardware.
Once again I apologize for the long post so I won't write anymore detail. If any of you have experienced this problem or what it might be I would appreciate the feedback and we can detail it from there. Thanks for reading.
Corey
April 30, 2012 at 1:41 am
See if this http://www.sqlmonster.com/Uwe/Forum.aspx/sql-server/55909/LATCH-EX-waits-of-Class-ACCESS-METHODS-SCAN-RANGE-GENERATOR can help you
Hemantgiri S. Goswami | SQL Server Specialist & Consultant
SQL Server Citation[/url] | Follow me on Twitter
Viewing 2 posts - 1 through 1 (of 1 total)
You must be logged in to reply to this topic. Login to reply