SQL Server 2012 - log drive filled up by a runaway transaction - tips to investigate?

  • So this morning I was greeted by a batch of error messages from my SQL Server: jobs had failed and drives were full. Happy Monday…

    I've not had an event like this before, so I've done some initial digging, but I'm a little stumped. If you have any tips, suggestions, resources I ought to look at, or post-event cleanup advice, I'm all ears. I am not finding much information to determine the exact query / job / ?? that caused the issue - assuming it's possible.

    So one of the first things I did after an initial poke through the logs was to run DBCC CHECKDB and then take a full backup.

    To cross a few things off the list:

            SQL Server Agent - Error Logs -> nothing in the current set

            Cross-server jobs - checked other servers - no issues or prolonged durations.

            Event Log - the only entries in there are that the drive is full.

    So this seems to be related to just this box and isn't caused by an OS issue. The server appears to have recovered gracefully, which is a plus. As shown below, I think the issue started around 5:45am, and the 6am log backup was successful, but the rogue transaction continued to run until it filled the drive around 6:15.

    So I've been able to figure out the when, but not what transaction caused the problem or why it didn't complete and instead filled the log drive.

    Management - Server Log

    1:35am        part of the nightly restart of services

    4:00am         [otherDB] Diff Backup + backup success message

    5:00am         [BrokenDB] log backup - no success message; guess there isn't one for log backups.

    6:00am         [otherDB] diff backup + success message

    6:02:49        [BrokenDB] Log backup

    6:03:44        autogrow for [BrokenDB] taking 61556 milliseconds (~61 seconds)

    6:06am - 6:12am        more autogrows

    6:14:02

            Error: 17053, Severity: 16, State: 1.

            G:\Sql-Logs\[BrokenDB].ldf: Operating system error 112(There is not enough space on the disk.) encountered.

            Error: 9002, Severity: 17, State: 4.

            The transaction log for database '[BrokenDB]' is full due to 'ACTIVE_TRANSACTION'.

    6:14:43

            The transaction log for database '[BrokenDB]' is full due to 'ACTIVE_TRANSACTION'.

            Error: 3314, Severity: 21, State: 3.

            During undoing of a logged operation in database '[BrokenDB]', an error occurred at log record ID (1513969:67816:427). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

            Database [BrokenDB] was shutdown due to error 3314 in routine 'XdesRMReadWrite::RollbackToLsn'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.

            Error during rollback. shutting down database (location: 1).

            Error: 9001, Severity: 21, State: 5.

            The log for database '[BrokenDB]' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

            Error: 3314, Severity: 21, State: 3.

            During undoing of a logged operation in database '[BrokenDB]', an error occurred at log record ID (1514032:13151:195). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

            Error: 3314, Severity: 21, State: 5.

            During undoing of a logged operation in database '[BrokenDB]', an error occurred at log record ID (1513692:9003:840). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

    6:14:45        Starting up database '[BrokenDB]'.

    6:14:47        Recovery of database '[BrokenDB]' (5) is 0% complete. Phase 1 of 3. This is an informational message only. No user action is required.

    6:17:42

            Recovery of database '[BrokenDB]' (5) is 0% complete. Phase 2 of 3. This is an informational message only. No user action is required.

            1 transactions rolled forward in database '[BrokenDB]' (5:0). This is an informational message only. No user action is required.

    6:17:43        CHECKDB for database '[BrokenDB]' finished without errors on 2014-11-30 23:45:43.823 (local time). This is an informational message only; no user action is required.

    6:22:37        Recovery of database '[BrokenDB]' (5) is 99% complete. Phase 3 of 3. This is an informational message only. No user action is required.

    6:22:40

            1 transactions rolled back in database '[BrokenDB]' (5:0). This is an informational message only. No user action is required.

            Recovery is writing a checkpoint in database '[BrokenDB]' (5). This is an informational message only. No user action is required.

    6:22:51        Recovery completed for database [BrokenDB] (database ID 5) in 483 second(s) (analysis 14031 ms, redo 161150 ms, undo 297861 ms.) This is an informational message only. No user action is required.

    7:04:41        Log was backed up. (22 GB! vs the normal 10s to 100s of MB)

    Disk Usage Report

    The timestamps do not match the log above: the report shows autogrow happening at 5:46, whereas in the log it first appears at 6:03. My guess is that the 6:03 entry is the culmination of a number of back-to-back autogrows, so the log doesn't record each individual one.

    The log had been at approx 600MB and is set to 10% autogrow, so the 5:46 request of 61MB matches expectations. Therefore there aren't any missing requests - the issue started at 5:46, not at 6:03 as the server log suggests.

  • The first thing to look at is log_reuse_wait_desc in sys.databases. Then you can run DBCC OPENTRAN or query the sys.dm_tran_% DMVs to see what the transaction is.
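
    In T-SQL that boils down to something like this (a sketch - substitute the real database name):

        -- What is preventing log reuse right now?
        SELECT name, log_reuse_wait_desc
        FROM sys.databases
        WHERE name = N'BrokenDB';

        -- Oldest active transaction in the database, if one is still open:
        DBCC OPENTRAN (N'BrokenDB');

        -- Active transactions and the sessions that own them:
        SELECT st.session_id,
               at.transaction_id,
               at.name,
               at.transaction_begin_time
        FROM sys.dm_tran_active_transactions AS at
        JOIN sys.dm_tran_session_transactions AS st
            ON st.transaction_id = at.transaction_id
        ORDER BY at.transaction_begin_time;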

    As a side note, you probably shouldn't keep the log autogrow set to 10%. First find the biggest size you'll normally need the log to be, set it to that, and then determine a good autogrowth increment (a fixed size, not a percent). I point people to these 2 blog posts by Kimberly Tripp for dealing with transaction logs (see the VLF check sketched after the links):

    http://www.sqlskills.com/blogs/kimberly/8-steps-to-better-transaction-log-throughput/

    http://www.sqlskills.com/blogs/kimberly/transaction-log-vlfs-too-many-or-too-few/
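
    On SQL Server 2012, a quick way to see the VLF count Kimberly talks about is DBCC LOGINFO - each row returned is one VLF:

        USE [BrokenDB];  -- a sketch; substitute the real database name
        DBCC LOGINFO;    -- one row per VLF; hundreds or more suggests too many small autogrows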

  • Thanks for the links on the autogrow.

    The log_reuse_wait_desc and DBCC OPENTRAN checks didn't return anything interesting - are these more likely to be useful right when / right after the issue occurs? It was about 2 hours between the event and my seeing the alerts.

  • Yes, all these troubleshooting options are useful when the issue is occurring.

    It sounds like the database was restarted and whatever transaction caused the log to grow was lost (the 1 transaction rolled back in recovery). You don't happen to have an ETL process or index rebuild job running at that time, do you? I don't know how big your database is, but your log file started out at about 600MB and grew to over 45GB in 30 minutes.

    Are all the database settings the same as they were prior to the weekend (CDC or change tracking enabled, for example)? A quick check is sketched below.
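
    Something along these lines would confirm both (a sketch):

        -- CDC and Change Tracking status for every database:
        SELECT d.name,
               d.recovery_model_desc,
               d.is_cdc_enabled,
               CASE WHEN ct.database_id IS NOT NULL THEN 1 ELSE 0 END AS change_tracking_on
        FROM sys.databases AS d
        LEFT JOIN sys.change_tracking_databases AS ct
            ON ct.database_id = d.database_id;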

  • No changes had been made to the server over the weekend, and we perform most of our maintenance between 11pm and 1am, so I don't think it was related to that.

    There are some ETL scripts that run on the 1st of the month, but my coworker in charge of those scripts wasn't aware of any failing. I looked at the job history table and there were a handful of jobs that ran between 5:30 and 6:30 - the longest duration was 451 (Agent stores duration as HHMMSS, so 4 minutes 51 seconds), just shy of 5 minutes. All the others ran from a few seconds to 2-3 minutes. I would expect that if one had held a transaction open for 30 minutes, its duration would have reflected that.
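
    For reference, pulling those durations from msdb looks something like this (a sketch - the run_date literal is assumed from the thread's timeline):

        -- Job outcomes between 5:30am and 6:30am. run_date / run_time / run_duration
        -- are INTs in YYYYMMDD / HHMMSS format, hence the modulo math.
        SELECT j.name,
               h.run_time,
               (h.run_duration / 10000) * 3600
             + ((h.run_duration / 100) % 100) * 60
             + (h.run_duration % 100) AS duration_seconds
        FROM msdb.dbo.sysjobhistory AS h
        JOIN msdb.dbo.sysjobs AS j
            ON j.job_id = h.job_id
        WHERE h.step_id = 0                       -- job outcome rows only
          AND h.run_date = 20141201
          AND h.run_time BETWEEN 53000 AND 63000  -- 5:30am - 6:30am
        ORDER BY duration_seconds DESC;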

  • Considering that today is the first day of a new month, perhaps some eager beaver was trying a new ad hoc query for month-end reporting and ended up with an accidental CROSS JOIN (an accidental many-to-many join).

    I'm not sure if the dynamic management views would contain anything about the query that was running, because the database did bounce; but since the whole server didn't bounce, it might be worth taking a look.
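
    Something like this shows what's still in the plan cache (a sketch - the cache survives a database restart, but not an instance restart):

        -- Top cached statements by total elapsed time since the instance started.
        -- total_elapsed_time is in microseconds.
        SELECT TOP (20)
               qs.total_elapsed_time / 1000000.0 AS total_elapsed_seconds,
               qs.execution_count,
               qs.last_execution_time,
               st.text AS query_text
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY qs.total_elapsed_time DESC;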

    --Jeff Moden



  • D'oh, I didn't think about checking the DMVs for the longest-running queries. Except now it's too late, as the server has been restarted. (It restarts every day at 1am - 3rd party software requirements, hands are tied.)

    What would happen if I set a limit on the log size for this DB? It generally sits around half a GB; if I put a 40GB cap on it and this rogue transaction were to run again, would it bomb when it hit the cap, or would it loop around and start overwriting the log file? I don't like that it broke the DB, but on the other hand I don't like the thought of that transaction continuing to run, corrupting log files, until it was caught and killed.

  • JustABob (12/2/2014)


    ...if this rogue transaction were to run again would it bomb when it hit the cap, or would it loop around and start overwriting the log file?

    Log files don't "wrap" unless they've been properly backed up (or you're in the SIMPLE recovery model, which I don't recommend for anything other than a sandbox DB) and there's no long-winded transaction still holding all the space - and that's true whether you let the log grow or cap it. With a cap, a runaway transaction would hit the 9002 "log full" error and be rolled back rather than overwrite anything.

    The key here would be to change to a fixed growth increment (say 100-500MB) so that when a runaway does occur you might have more time to catch it, instead of having a percentage suddenly allocate all the rest of the disk once the file gets big enough. You might also want to set up an alert for growth so that you know something is going on. Something like the sketch below would do both.
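
    A sketch - the logical file name is assumed (check sys.master_files for the real one), and on a named instance the performance counter object is MSSQL$InstanceName:Databases rather than SQLServer:Databases:

        -- Fixed 200MB growth increment instead of 10%:
        ALTER DATABASE [BrokenDB]
        MODIFY FILE (NAME = N'BrokenDB_log', FILEGROWTH = 200MB);

        -- Fire an alert when the log passes 90% used; pair it with an
        -- operator via sp_add_notification to actually get told about it.
        EXEC msdb.dbo.sp_add_alert
             @name = N'BrokenDB log nearly full',
             @performance_condition = N'SQLServer:Databases|Percent Log Used|BrokenDB|>|90';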

    There might be something left in one of the old default trace files to help you isolate what happened. See the following article for more details on how to check old default trace files after a restart, and the sketch after the link.

    https://www.simple-talk.com/sql/performance/the-default-trace-in-sql-server---the-power-of-performance-and-security-auditing/
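
    Reading them looks something like this (a sketch; sys.traces points at the current file - to see events from before the restart, swap in the path of the oldest surviving log_N.trc rollover file in that folder):

        -- Autogrow events from the default trace:
        DECLARE @path NVARCHAR(260);
        SELECT @path = [path] FROM sys.traces WHERE is_default = 1;

        SELECT tg.StartTime,
               te.name AS event_name,
               tg.DatabaseName,
               tg.FileName,
               tg.Duration / 1000 AS duration_ms,     -- Duration is in microseconds
               tg.IntegerData * 8 / 1024 AS growth_mb -- IntegerData is 8KB pages
        FROM sys.fn_trace_gettable(@path, DEFAULT) AS tg
        JOIN sys.trace_events AS te
            ON te.trace_event_id = tg.EventClass
        WHERE te.name IN (N'Log File Auto Grow', N'Data File Auto Grow')
        ORDER BY tg.StartTime;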

    --Jeff Moden



  • Thanks for the link. I've only been able to skim it so far, and while it may not solve my immediate issue, there's a lot of great info there.

