SQL 2008 database in recovery for 22+ hours!!!?

  • Philip Millwood-419646 (10/19/2009)


    By the same token, I'm guessing that replication is part of why the log file recovery is taking so long based on a couple things I've found.

    Shouldn't be. Replication may make the log larger, it shouldn't affect the size of the active portion of the log, and that should be all that SQL needs to look at for recovery.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Phase 2 is running now. @ 60% for the past half hour.

  • One thing I think of is fragmentation. I am by no means an expert on SAN, but restores I have done before on local disks have taken much longer than usual because of fragmentation on the disk you want to restore to.


    N 56°04'39.16"
    E 12°55'05.25"

  • May have found the culprit. Will post more once I can think clearly, as I'm going on 22 hours here, but it looks like one of the 3 data files is corrupted as my restore is having the same problem once it tries to restore the logs. There was a great deal of purging in the logs from a specific file group over the weekend, and I'm seeing "misaligned log IOs which required falling back to synchronous IO."

  • That does sound like an issue with the blocks or files. I would definitely have the SAN guys double check and be sure that all firmware/drivers are up to date.

    Also, once you get a system, be sure that you get a good CHECKDB run.

  • Steve Jones - Editor (10/19/2009)


    That seems like a lot of VLFs, but I'd like to know what Gail and others think.

    It is roughly 2.5 times the number of VLF's that would be recommended using Kimberly Tripps size calculations:

    Transaction Log VLF's too many or too few

    For a 240GB Log it should ideally be created in 30 8GB chunks which would have 16 VLF's each, .5GB per VLF. This would be 480 VLF's instead of 1200.

    Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
    My Blog | Twitter | MVP Profile
    Training | Consulting | Become a SQLskills Insider
    Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]

  • Steve Jones - Editor (10/20/2009)


    That does sound like an issue with the blocks or files. I would definitely have the SAN guys double check and be sure that all firmware/drivers are up to date.

    Also, once you get a system, be sure that you get a good CHECKDB run.

    +1, though if you have real corruption that is slowing down recovery to this point I'd have expected that you'd have other errors being logged at some point.

    Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
    My Blog | Twitter | MVP Profile
    Training | Consulting | Become a SQLskills Insider
    Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]

  • 2.1 billion transactions rolled forward. 34 rolled back.

    It finally finished around noon. I have had a few hours sleep now, and will be taking a fresh look at it tomorrow.

    Thanks everyone for your help and suggestions. I probably won't be able to do much with it until the weekend.

  • Wow, that's something.

    Let us know if you find anything. This has been very interesting.

  • Philip Millwood-419646 (10/20/2009)


    2.1 billion transactions rolled forward. 34 rolled back.

    If I may suggest...

    Do some ongoing monitoring on the frequency and duration of the checkpoint operations. With a default recovery interval, the checkpoints should have been running often enough that 2.1 billion committed transactions should not have been able to accumulate in memory and tran log only. There could very well have been something preventing/slowing the checkpoint operations, maybe IO bottleneck.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • What about a 2B row transaction that occurred during the backup? Wouldn't that end up in rollforward?

  • I could easily understand a 2.1 billion row transaction rolling back during recovery, that would mean it hadn't hit the COMMIT at the time that SQL was stopped and had to be completely undone. That makes sense.

    Having that much roll forward means that the commit(s) did occur before the shutdown, but that the dirty pages hadn't been written to disk. Since checkpoint should be running regularly, writing all dirty pages to disk (even those where the transaction hasn't committed), having that much roll forward suggests that something was inhibiting the checkpoint before the shutdown. Could be that it was blocked (though I don't know what blocks a checkpoint), could be trouble logging it's competion, could be IO contention.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • But doesn't the roll forward include the log records in the full backup? The backup starts, this 2.1B row trans then hits a table that's got some pages in the backup, and when the backup is done, the log records committed since the start of the backup are written out.

    That was what I was reading from here: http://msdn.microsoft.com/en-us/library/ms191455.aspx

    Am I missing something?

  • Steve Jones - Editor (10/21/2009)


    Am I missing something?

    I thought we were talking about the restart-recovery, not the restore from backup.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I thought the issue was with a recovery after reboot, not recovery after restore?

    It was my understanding that regardless of recovery interval, the data files are not written to until the transaction is committed?

    So if there was a single gigantic transaction that was committed just before the reboot, it would still have to do a roll forward on the entire transaction if a commit had reached the log. It's hard to see that taking 30 odd hours on decent hardware even if an entire 200+GB log needed rolling forward, so it does sound like there's some I/O issue.

    However if individual transactions are relatively small, in the event of an I/O bottleneck, it should throttle the speed at which dirty pages can be written to memory until the buffer cache has been flushed.

Viewing 15 posts - 16 through 30 (of 36 total)

You must be logged in to reply to this topic. Login to reply