TLOG is full

  • Hi SQLTeam,

    Today we have been receiving continuous log full 9002 alerts. The drive is completely full with 10mb free space.

    Identified an open transaction and associated sql statement which is been run from past 2 days and in sleeping mode doing nothing.

    The database is in FULL recovery models and transaction log backup is scheduled every 1 hour, db size is around 3TB. Log file grown to 1TB.

    Here are some clarifications I am looking for :

    When transaction log is full,

    1) What happens for the incoming transactions for that particular database?

    2) Are there any side affects when always on and log shipping is configured for this particular database ?

    Thanks,

    Bob

  • Is there a proper way of capturing sql stmts which are causing the log file to grow abnormally ?

  • I can answer question 1 for you - other transactions on that database that select data may have no issues.  Ones that change data will fail.

    For question 2 - I am not sure as I don't have those features turned on and thus haven't researched the limitations on them.  As log shipping doesn't send across the LIVE transaction log, but the tlog backup, I expect that a full transaction log is going to have no effect on log shipping except that once that transaction is cleaned up, you are going to have a HUGE transaction log backup file to push across the network.  AlwaysOn essentially relies on shared disk for it to work (if my understanding is correct) so if the disk is full on server A, it is going to be full on server B so having it fail over to secondary won't solve any problems but shouldn't cause any new ones.

    The fun part about your problem now though is that a 2 day running query is going to be a pain in the butt to roll back (may take another 2 days to roll that back AND you are going to need some disk space for it).  I would check out this article:

    https://www.sqlservercentral.com/articles/full-transaction-log

    as it MAY help you get some transaction log space back so you can roll that long running query back.  I would definitely be looking into what caused it to run so long.  I would also be looking into some sort of disk free space monitoring and alerting tools.  I like to be alerted when I have 15% disk space free so I can investigate and remedy without surprises.

     

    EDIT - changed the "will" in the first sentence to a "may" as it isn't guaranteed to run successfully and depends on a few other factors.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • What a mess!

    (1) as above

    (2) yes; you might consider breaking those links until after you're cleaned this up.  Then again, that has its own issues as well.  What a mess!

    And to add:

    (3) How do you get the db working again?

    If all the UPDATEs from the last two days are all part of the same transaction, none of your options are good.  If at all possible, you need to find some way to stop the long-running trans without having to cancel it.  If it's a single UPDATE statement, that's likely not possible.  Rollback is typically twice as long as the original modification, so you're looking at roughly a 4-day rollback.

    Ultimately you'll likely need to add another log file on a different drive.  But, if the UPDATE is still active, that could add to the problem.  The UPDATE is likely sleeping now because it can't log activity.  When you add a log file, it will be able to run again.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Hi All,

    Thanks for sharing your inputs. Some observations before we resolved the issue.

    The log_reuse_wait_desc was initially showing "Active_Transaction". Then we had to check with the application team and saw some of the spids are running over 4 days with sleeping status and open_tran = 1. So, we had to kill those spids. Then it showed up LOG_BACKUP , we have taken the log backup, then shows "AVAILABILITY_REPLICA".

    I checked the AG Dashboard on all the 2 replicas and same it was all green. Meaning everything was in Sync. I even checked by add/remove columns in the AG Dashboard to check to see if any sendqueue is there or not. I dont see any lag.

    But question is, it was showing up AVAILABILITY_REPLICA" log_reuse_description for almost more than 5 hours. I have no clue what as going on. Is there a way to tell if some sync is happening ? In AG Dashboard it is all green and not delay. Don't know what was exactly happening during that 5 hours. is there a dmv or query to check if anything is pending or something was happening during those 5 hours ? We didnt see any errorlogs though. Eventually, we had to failover to the other replica to check to see any issue. Even then it was showing 'AVAILABILITY_REPLICA' for more than 40 mins and then it allowed us to SHRINK the log file to release some space to OS.

    The log drive is separate 2TB disk. usually the log size used to be 350GB , all of a sudden it has grown to 2TB filling entire disk space.

     

    Attachments:
    You must be logged in to view attached files.
  • There are 2 processes you need to check - the send queue and the redo queue.  The redo queue on the secondary has to apply all of the transactions to the destination databases before the transaction log can be cleared.

    Those 5 hours are almost certainly related to the redo queue being applied on the secondary.

    In normal operation (no network issues) - the send queue will take less time than the redo queue.  The send queue only has to send the data to the secondary - where the redo queue has to hold that data until the transaction can be applied and hardened.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Hi Jeffrey,

    Thanks for the explaination.

    Is there a DMV or a way to confirm that redo is happening ?

    -Bob

     

  • I would start here - this article explains how to setup a policy to monitor for RTO/RPO and alert if exceeded.

    https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/monitor-performance-for-always-on-availability-groups?view=sql-server-ver15

    This article shows how to capture data using Extended Events:

    https://techcommunity.microsoft.com/t5/sql-server-support/troubleshooting-redo-queue-build-up-data-latency-issues-on/ba-p/318488

    I would also setup AG specific alerts - which can be done using a script like this:

        Set Nocount On;

    Declare @alertName sysname
    , @thisErrorNumber varchar(6)
    , @sqlCommand nvarchar(max) = ''
    , @operatorName sysname = 'Your Operator Name Here';

    Declare @errorNumbers Table (ErrorNumber varchar(6), AlertName varchar(50));

    Insert Into @errorNumbers
    Values ('1480' , 'AG Role Change (failover)')
    , ('976' , 'Database Not Accessible')
    , ('983' , 'Database Role Resolving')
    , ('3402' , 'Database Restoring')
    , ('19406', 'AG Replica Changed States')
    , ('35206', 'Connection Timeout')
    , ('35250', 'Connection to Primary Inactive')
    , ('35264', 'Data Movement Suspended')
    , ('35273', 'Database Inaccessible')
    , ('35274', 'Database Recovery Pending')
    , ('35275', 'Database in Suspect State')
    , ('35276', 'Database Out of Sync')
    , ('41091', 'Replica Going Offline')
    , ('41131', 'Failed to Bring AG Online')
    , ('41142', 'Replica Cannot Become Primary')
    , ('41406', 'AG Not Ready for Auto Failover')
    , ('41414', 'Secondary Not Connected');

    Declare cur_ForEachErrorNumber Cursor Local fast_forward
    For
    Select *
    From @errorNumbers;

    Open cur_ForEachErrorNumber;
    Fetch Next From cur_ForEachErrorNumber Into @thisErrorNumber, @alertName;

    While @@fetch_status = 0
    Begin

    If Not Exists(Select *
    From msdb.dbo.sysalerts s
    Where s.message_id = @thisErrorNumber)
    Begin

    Execute msdb.dbo.sp_add_alert
    @name = @alertName
    , @message_id = @thisErrorNumber
    , @severity = 0
    , @enabled = 1
    , @delay_between_responses = 0
    , @include_event_description_in = 1
    , @job_id = N'00000000-0000-0000-0000-000000000000';

    Execute msdb.dbo.sp_add_notification
    @alert_name = @alertName
    , @operator_name = @operatorName
    , @notification_method = 1;

    Raiserror('Alert ''%s'' for error number %s created.', -1, -1, @alertName, @thisErrorNumber) With nowait;
    End

    Fetch Next From cur_ForEachErrorNumber Into @thisErrorNumber, @alertName;
    End

    --==== Close/Deallocate cursor
    Close cur_ForEachErrorNumber;
    Deallocate cur_ForEachErrorNumber;

    To view the redo queue - open the dashboard on the secondary and add the redo queue size and redo send rate columns.  You can open the dashboard on the primary and that should show both primary and secondary.

    You can use the DMV sys.dm_hadr_database_replica_states to view the replica states - but the above will alert you to an issue and allow you to review and address before it becomes a large problem.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply