AlwaysOn Missing Redo Thread

  • Morning Guys,

    I have one particular database that isn't large or overly active but has stopped redoing logs on ONE of the replicas.

    We have 3 replica servers . This database lives in its own AG.

    It stopped Synchronising when it ran out of log drive space. More space was given and data movement was resumed.

    However, the log file cannot be shrunk becuase of 'availability_replica' because one of the replicas is massively behind.

    When I look at the redo statistics the last commit time is about a month ago, meanwhile the last sent is upto date.. so Logs are being transferred they just aren't being commited.

    ... from sys.dm_os_wait_stats:
    HADR_DATABASE_FLOW_CONTROL 13974 1344629 4087 5449

    And more confusingly and worryingly:

    SELECT db_name(database_id) as DBName,
        session_id FROM sys.dm_exec_requests
        WHERE command = 'DB STARTUP'

    ... doesn't return a SPID or Thread for the database in question on the replica in question. But does for the other successful ones.

    So, I am a bit confused how one causes such a thread to be spawned?  

    I am hoping someone can shed something. I do have the option of breaking the replica and recreating but I'd rather not, and I'd rather learn something from this.

    Cheers
    Alex

  • Well, curiously removing the database from the availability group caused all the redo's to start happening. Then I simply added back the database into the group. So far its working. I've got an eye on it... but for the last 90 minutes or so the redo queue is zero.

    Though saying that I do not understand why I Am not seeing a redo thread (as per the previous query) starting up on that replica. The only one I see it on is the primary/principle replica.

    So, still a bit confused and haven't learned much more about what caused the problem.

  • Is this SQL Server 2016?

    Presuming from the wait of "HADR_DATABASE_FLOW_CONTROL" and the symptoms you described it is 2016. I would be willing to bet you are running into a thread contention issue. That wait is often related to a thread contention issue and that is due to the parallel redo. Why jump to this conclusion? Well, the removal of the db from the AG fixed it. Another fix is a restart of sql services on the secondary nodes.

    I prefer not to venture into either of those bandaids though.

    If on 2016, I recommend getting on SP1 and the latest CU. There were various fixes for things such as this. If it still is a problem after that, there is a trace flag that can be used.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Oh and fwiw, here are some additional related waits for the issue I suspect you may have (you may need to capture via Extended Events.

    parallel_redo_flow_control
    dirty_page_table_lock
    parallel_drain_redo_worker

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Hi Jason,

    This is a SQL Server 2014 (Enterprise) box. Has been running fine for 18 or so months and suddenly that on our smallest database, ironically.

    I'll keep an eye on those XE counters you mention, cheers!

    Do you observations only pertain to SQL Server 2016?

    Cheers
    Alex

  • Those three waits are new in 2016 and apply to the parallel redo that comes with 2016. 2014 doesn't have that particular enhancement.

    I'd be curious to see other metrics or symptoms that may have been around during your issue on 2014.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Hello

    Just thought I'd loop back and post my stackoverflow question which was solved by your post here:

    https://dba.stackexchange.com/questions/175039/diagnosing-slow-always-on-commits

    Thank you very much for your help, I couldn't find anything concrete for weeks and then your post was a massive help.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply