SQL 2008 database in recovery for 22+ hours!!!?

Question

SQL 2008 database in recovery for 22+ hours!!!?

Viewing 15 posts - 16 through 30 (of 36 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

Philip Millwood-419646 (10/19/2009)
By the same token, I'm guessing that replication is part of why the log file recovery is taking so long based on a couple things I've found.

Shouldn't be. Replication may make the log larger, it shouldn't affect the size of the active portion of the log, and that should be all that SQL needs to look at for recovery.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Philip Millwood-419646 SSC Eights! Points: 873 More actions · Answer 2

Philip Millwood-419646

SSC Eights!

Points: 873

October 20, 2009 at 1:03 am

#1067986

Phase 2 is running now. @ 60% for the past half hour.

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 3

One thing I think of is fragmentation. I am by no means an expert on SAN, but restores I have done before on local disks have taken much longer than usual because of fragmentation on the disk you want to restore to.

N 56°04'39.16"
E 12°55'05.25"

Philip Millwood-419646 SSC Eights! Points: 873 More actions · Answer 4

May have found the culprit. Will post more once I can think clearly, as I'm going on 22 hours here, but it looks like one of the 3 data files is corrupted as my restore is having the same problem once it tries to restore the logs. There was a great deal of purging in the logs from a specific file group over the weekend, and I'm seeing "misaligned log IOs which required falling back to synchronous IO."

Steve Jones - SSC Editor SSC Guru Points: 736494 More actions · Answer 5

That does sound like an issue with the blocks or files. I would definitely have the SAN guys double check and be sure that all firmware/drivers are up to date.

Also, once you get a system, be sure that you get a good CHECKDB run.

Jonathan Kehayias One Orange Chip Points: 26778 More actions · Answer 6

Steve Jones - Editor (10/19/2009)
That seems like a lot of VLFs, but I'd like to know what Gail and others think.

It is roughly 2.5 times the number of VLF's that would be recommended using Kimberly Tripps size calculations:

Transaction Log VLF's too many or too few

For a 240GB Log it should ideally be created in 30 8GB chunks which would have 16 VLF's each, .5GB per VLF. This would be 480 VLF's instead of 1200.

Jonathan Kehayias One Orange Chip Points: 26778 More actions · Answer 7

Steve Jones - Editor (10/20/2009)
That does sound like an issue with the blocks or files. I would definitely have the SAN guys double check and be sure that all firmware/drivers are up to date.
Also, once you get a system, be sure that you get a good CHECKDB run.

+1, though if you have real corruption that is slowing down recovery to this point I'd have expected that you'd have other errors being logged at some point.

Philip Millwood-419646 SSC Eights! Points: 873 More actions · Answer 8

2.1 billion transactions rolled forward. 34 rolled back.

It finally finished around noon. I have had a few hours sleep now, and will be taking a fresh look at it tomorrow.

Thanks everyone for your help and suggestions. I probably won't be able to do much with it until the weekend.

Steve Jones - SSC Editor SSC Guru Points: 736494 More actions · Answer 9

Wow, that's something.

Let us know if you find anything. This has been very interesting.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 10

Philip Millwood-419646 (10/20/2009)
2.1 billion transactions rolled forward. 34 rolled back.

If I may suggest...

Do some ongoing monitoring on the frequency and duration of the checkpoint operations. With a default recovery interval, the checkpoints should have been running often enough that 2.1 billion committed transactions should not have been able to accumulate in memory and tran log only. There could very well have been something preventing/slowing the checkpoint operations, maybe IO bottleneck.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Steve Jones - SSC Editor SSC Guru Points: 736494 More actions · Answer 11

What about a 2B row transaction that occurred during the backup? Wouldn't that end up in rollforward?

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 12

I could easily understand a 2.1 billion row transaction rolling back during recovery, that would mean it hadn't hit the COMMIT at the time that SQL was stopped and had to be completely undone. That makes sense.

Having that much roll forward means that the commit(s) did occur before the shutdown, but that the dirty pages hadn't been written to disk. Since checkpoint should be running regularly, writing all dirty pages to disk (even those where the transaction hasn't committed), having that much roll forward suggests that something was inhibiting the checkpoint before the shutdown. Could be that it was blocked (though I don't know what blocks a checkpoint), could be trouble logging it's competion, could be IO contention.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Steve Jones - SSC Editor SSC Guru Points: 736494 More actions · Answer 13

But doesn't the roll forward include the log records in the full backup? The backup starts, this 2.1B row trans then hits a table that's got some pages in the backup, and when the backup is done, the log records committed since the start of the backup are written out.

That was what I was reading from here: http://msdn.microsoft.com/en-us/library/ms191455.aspx

Am I missing something?

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 14

Steve Jones - Editor (10/21/2009)
Am I missing something?

I thought we were talking about the restart-recovery, not the restore from backup.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

HowardW One Orange Chip Points: 29109 More actions · Answer 15

I thought the issue was with a recovery after reboot, not recovery after restore?

It was my understanding that regardless of recovery interval, the data files are not written to until the transaction is committed?

So if there was a single gigantic transaction that was committed just before the reboot, it would still have to do a roll forward on the entire transaction if a commit had reached the log. It's hard to see that taking 30 odd hours on decent hardware even if an entire 200+GB log needed rolling forward, so it does sound like there's some I/O issue.

However if individual transactions are relatively small, in the event of an I/O bottleneck, it should throttle the speed at which dirty pages can be written to memory until the buffer cache has been flushed.