Suspect database on Secondary due to Log Redo issue

  • I have SQL Server 2016 in an AG with 2 read-only secondaries, and I'm on a recent CU.

    Multiple times in the last 2 months, one of the secondary databases has gone into suspect mode.

    "Could not redo log record" and "suspend from redo" are what I keep seeing.

    Also, it's the same database every time.

    I enabled TF 3459 as well, but it's still happening.
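
    In case it helps, the trace flag and the suspend reason can be confirmed with something like this on the affected secondary (a rough sketch against sys.dm_hadr_database_replica_states; adjust as needed):

        DBCC TRACESTATUS (3459, -1);   -- confirm the trace flag is enabled globally

        -- Suspend reason and sync state for the local database copies
        SELECT d.name,
               drs.synchronization_state_desc,
               drs.database_state_desc,
               drs.is_suspended,
               drs.suspend_reason_desc
        FROM sys.dm_hadr_database_replica_states AS drs
        JOIN sys.databases AS d
            ON d.database_id = drs.database_id
        WHERE drs.is_local = 1;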

    Any ideas on this?

     

  • First, I should clarify that I don't use AGs, so my advice may be way out in left field, but I don't think it's bad advice.

    Have you checked the logs?  That would be my first place to look.

    Next, I would try Google to see if others have had similar issues.  My googling led me to this Microsoft KB article -

    https://support.microsoft.com/en-us/topic/kb3173471-fix-could-not-redo-log-record-error-and-replica-is-suspended-in-sql-server-2016-2014-or-2012-5085019b-19a7-97cd-e407-de2375d56cda

    where it indicates it MAY be due to a bug that is fixed in a patch.  The bug is related to shrinking the database on the primary, which screws things up on the secondary.  If you are shrinking the primary (auto-shrink, manual, or a scheduled task), that could be causing the problem (I also recommend NOT shrinking databases).
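
    If you want a quick sanity check that nothing has auto-shrink turned on, something like this run on the primary should show it (just a sketch):

        -- Databases with auto-shrink enabled (ideally this returns nothing)
        SELECT name, is_auto_shrink_on
        FROM sys.databases
        WHERE is_auto_shrink_on = 1;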

    Next, try to find a window to run DBCC CHECKDB on the primary and secondary systems to make sure there is no corruption.
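
    Something along these lines is what I'd run, with YourDatabase as a placeholder for the database in question (adjust the options to taste):

        -- Replace YourDatabase with the actual database name
        DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;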

    Next, I'd try to find a window to run CHKDSK on the system to make sure this isn't a sign of drive or other hardware failure.

     

    But logs are ALWAYS the first place I look when I run into issues - SQL Server logs (always), SQL Agent logs (if applicable), SSIS logs (if applicable), SSRS logs (if applicable), Windows logs (always, UNLESS the SQL Server log tells me what is wrong).  Once you know what is wrong, it is much easier to figure out how to fix it.  And that is what logs are for - to capture what went wrong.
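
    If scrolling through the Log File Viewer gets tedious, the (undocumented but widely used) xp_readerrorlog proc can search the current SQL Server error log for the relevant text - for example:

        -- 0 = current log file, 1 = SQL Server error log, then the search string
        EXEC xp_readerrorlog 0, 1, N'redo';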

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Hey Brian, thanks for your reply!

    I've checked the logs and don't see anything other than what I already posted. I don't shrink anything, and I've run CHECKDB every time this has happened with no corruption found either.

    We've had Microsoft involved as well, and so far they can't pinpoint it; neither can our server group, who have been testing the hardware.

    Just trying to find something else that might pinpoint this. The only common denominator is that it always seems to happen during a large job on a very large table.

  • To confirm - it only happens on 1 of the 2 read-only secondary systems, correct?

    If so, it may not hurt to investigate any differences between the 2 systems.

    Probably a shot in the dark, but what about autogrow on the secondary system?  Or free disk space?  Or (probably a dumb suggestion) compatibility level?

    IF it only happens on 1 of the 2 secondary systems and it is always the same secondary, I would be checking out the configs on that system: at the database level (possibly something here, but not likely), the instance level (maybe?), the OS level (maybe?), and the SAN.
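
    For the database-level comparison, something like this run on each secondary may highlight differences (a rough sketch - compare the output side by side):

        -- Database-level settings worth comparing between the two secondaries
        SELECT name, compatibility_level, page_verify_option_desc, is_auto_shrink_on
        FROM sys.databases;

        -- File sizes and autogrow settings (size/growth are in 8 KB pages unless growth is a percentage)
        SELECT DB_NAME(database_id) AS database_name, name, size, growth, is_percent_growth
        FROM sys.master_files;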

    Are you able to replicate this on a second system?  The reason I ask is that if you CAN, it gives you the opportunity to do testing to see what (if anything) fixes things.

    Since Microsoft is involved, they are likely going to be the best "bang for your buck" on this one.  They have access into your system and to the internals of SQL Server.  It is likely that they will be able to fix it.

    Any chance it could be a hardware issue (disk failure, memory failure, NIC failure, server failure)?  Since it seems to happen when a large job runs on a very large table, and if you have the disk space and a suitably large file, it might not hurt to copy a large file from the primary server to the secondary (initiating the copy from the secondary) and then compare MD5 hashes of the two copies to make sure the file arrives intact.  Maybe not the most efficient test, but it would exercise the disk and the NIC, as long as you can afford the bottleneck that may occur and you have enough disk to pull across a file that is large enough to be interesting for the test.

     

    On a slightly different thought, is it predictable when it will fail?  Does it fail every time that job runs, or every 5th time, or anything like that?  If you can identify a pattern, it may not hurt to see what else happens during that window - OS-level schedules as well as SQL-level schedules.  If there is no pattern and it seems to go suspect at random, I would test some hardware - CHKDSK, for example, or a memory-checking tool.  Even watching the metrics on the server (CPU, memory, disk I/O, network I/O) during that window may be interesting to note.
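
    If you do watch metrics during that window, the redo side can be watched from T-SQL as well (a sketch - run it on the affected secondary while the big job is going):

        -- Redo queue and redo rate for the local database copies (values are KB and KB/sec)
        SELECT DB_NAME(drs.database_id) AS database_name,
               drs.log_send_queue_size,
               drs.redo_queue_size,
               drs.redo_rate,
               drs.suspend_reason_desc
        FROM sys.dm_hadr_database_replica_states AS drs
        WHERE drs.is_local = 1;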

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
