April 23, 2024 at 8:11 am
Long time listener, first time caller.
I have a problem that my usual bag of tricks isn't helping with, and I'm hoping someone else might have some insight into it.
I am running SQL Server 2017 Enterprise CU31-GDR on a Windows 2016 Datacenter server with 8 cores and 64 GB RAM, hosted on-prem in VMware. There are two boxes of this nature in two disparate datacentres, with WSFC-based Availability Groups providing HA.
I receive a run of three TDE encrypted database backup files from a third party vendor each day - a full on a Saturday, and a differential each other day - and restore each of those onto a named instance with Latin1_General_BIN2 collation (don't ask!) on the Primary of the AG, then on the Secondary in turn.
Let's call the databases A, B and C.
A is a single 858 GB data file, with a 48 GB CDC data file and a 114 GB transaction log with a decent but not problematic number of VLFs (again, this is a third party vendor provided database and certainly not something that would be considered best practice in its structure).
B is about half that size, and C is about half that again. Both have their own issues but that is another story.
This restore process has been working nicely (or as nicely as could be expected given the obvious shortcomings) for three years, but last week, I ran into some trouble.
I received the full backups on Saturday and differentials on Sunday and Monday and restores for all 3 databases went smoothly.
When the Tuesday differential for A was restored however, SQL Server threw errors 824 and 825 at severity 21, then started BugCheck dumps every second until I was able to restart the instance to quieten it down.
The text of the errors indicates consistency and/or disk issues:
SQL Server detected a logical consistency-based I/O error: decryption failure (expected: 0x693a0592; actual: 0x47d00434). It occurred during a read of page (1:8088) in database ID 6 at offset 0x00000003f30000 in file 'A.mdf'
and
SQL Server Assertion: File: <page.cpp>, line=2670 Failed Assertion = 'GetGhostRecCount () > 0'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.
and
Rollback encountered a page with a log sequence number (LSN) less than the original log record LSN. Could not undo log record (72756:56484:766), for transaction ID (0:-1612785671), on page (1:30160152), database 'A' (11:0). Page information: LSN = (72756:56484:735), type = 11. Log information: OpCode = 4, context 11. Restore or repair the database.
Running DBCC CHECKDB shows a massive number of errors, and the A database ultimately ends up in SUSPECT mode on the Primary box.
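For reference, the check that surfaces these errors is just a plain consistency check along these lines (a minimal sketch - 'A' stands in for the real database name):

DBCC CHECKDB (N'A') WITH NO_INFOMSGS, ALL_ERRORMSGS;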
The restore fails identically on the Secondary box, even though they are working independently of each other at this point and share no hardware.
The B and C databases restore fine however, and likewise, restoring only A up to the Monday differential continues to work fine.
The vendor said that they could not see any issues on their side, and we went back and forth over a few days: getting new full backups, retrying differentials, verifying file checksums, checking the backups with RESTORE VERIFYONLY, restoring onto other servers, adding more memory, running ChkDsk, and vMotioning onto other hypervisors and storage, all without any luck.
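For completeness, the backup verification side of that was along these lines (a sketch with placeholder paths; the CHECKSUM clause assumes the vendor's backups were taken with checksums). RESTORE HEADERONLY at least lets you confirm the differential's DifferentialBaseLSN lines up with the full you restored:

RESTORE VERIFYONLY FROM DISK = N'D:\Restore\A_diff.bak' WITH CHECKSUM;
RESTORE HEADERONLY FROM DISK = N'D:\Restore\A_diff.bak';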
On Friday, that day's differential was able to be restored successfully as if by magic, and all was back to normal over the weekend and the Monday of this week... but Tuesday again blew up in the same way as last week.
This time I have at least been able to restore the differential onto a newly built Windows 2019 box running SQL Server 2017 Developer CU31, which leads me to think there must be something very bizarre lurking on the boxes where I actually need the databases restored, and it isn't jumping out and biting me.
Has anyone else seen anything like this? Can anyone think of any other troubleshooting options?
Thanks in advance!
April 23, 2024 at 8:44 am
If it bugchecks, I would consider contacting MS Support, since you are already on the latest CU.
April 24, 2024 at 6:04 pm
Need more info on the source A database - anything obvious in the logs?
Here are some articles on the 824 and 825 errors that may help.
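It's also worth a quick look at the suspect pages table in msdb, which records the pages behind 824/825 errors - something along these lines:

SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;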
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" đŸ˜‰
April 24, 2024 at 7:21 pm
April 25, 2024 at 12:29 am
Thanks all for the replies.
The differential has now restored successfully on two successive days. I am not confident that the root cause has been resolved, but at least it is a significant improvement on where things were.
As the differential was able to be restored onto a new machine per the original message, I did a repair install of SQL Server against the instance where the database resides, on both nodes of the AG.
I also spotted that the managed service offsite backup vendor we use had changed their scheduling slightly in the last week, and that their backup schedule was getting exceptionally close to my restore schedule - on a couple of days they were overlapping. They use Veritas NetBackup agents to back up databases to removable virtual media, plus VMware snapshots (without quiescence, so there should be no service interruption) of the boot volume. Maybe their backup and my restore of the full were somehow changing the LSN and confusing the differential restore, leading to corruption? Maybe even without quiescence there was still enough of an interruption to disk access to cause read and write issues? Regardless, I have had them radically change their scheduling for these servers so there is no chance of their backups ever coming close to my restores going forward. I am leaning towards this as a potential root cause.
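One thing that would help confirm or rule this out: if their NetBackup agent or the snapshot ever goes through VSS, SQL Server logs freeze/thaw messages for each database, and those can be fished out of the error log - a sketch (the first parameter is the error log number, 0 being the current one):

EXEC sp_readerrorlog 0, 1, N'I/O is frozen';
EXEC sp_readerrorlog 0, 1, N'I/O was resumed';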
In terms of the CDC, the actual data that I am restoring will not change - this is solely for the third party provider's benefit - and there isn't a particular need for CDC to be available on my side of the fence given the use case. Databases A, B and C all have CDC enabled on the vendor's side and all have been able to be restored on my side without issue up until last week. I haven't needed to have keep_cdc set in the restore process to date, but it and the capture and cleanup jobs are definitely something that I will consider going forward.
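If I ever did need CDC to survive on my side, my understanding is that KEEP_CDC goes on the restore that brings the database online, and the capture/cleanup jobs then need recreating in the restored database - roughly like this (a sketch only, with placeholder paths and logical file names; I haven't run this in anger):

RESTORE DATABASE A
    FROM DISK = N'D:\Restore\A_full.bak'
    WITH RECOVERY, REPLACE, KEEP_CDC,
    MOVE N'A_Data' TO N'E:\Data\A.mdf',
    MOVE N'A_Log' TO N'F:\Log\A.ldf';

-- the CDC agent jobs are instance-level and are not restored with the database
USE A;
EXEC sys.sp_cdc_add_job @job_type = N'capture';
EXEC sys.sp_cdc_add_job @job_type = N'cleanup';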
In terms of the A database restore, there is nothing specific logged beyond what was mentioned in the original message. The BugChecks show "Exception raised, major=38, minor=64, state=1, severity=23" with the short stack dump showing entry points in KERNELBASE, sqllang, sqldk, sqlmin, KERNEL32 and ntdll.
When the issue occurs, DBCC shows a lot of consistency issues of a similar nature, but nothing really jumps out as being a definitive cause. Examples:
Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 6 pages from (1:8088) to (1:16175). See other errors for cause.
The Index Allocation Map (IAM) page (1:12928050) is pointed to by the next pointer of IAM page (1:11247326) in object ID 60, index ID 1, partition ID 281474980642816, alloc unit ID 71776119065149440 (type LOB data), but it was not detected in the scan.
Table error: Object ID 60, index ID 1, partition ID 281474980642816, alloc unit ID 71776119065149440 (type LOB data), page (1:102760846). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -13.
Table error: Object ID 99, index ID 0, partition ID 0, alloc unit ID 6488064 (type System allocation data), page (1:26949216). Test (hdr->m_freeCnt == freeCnt) failed. Values are 2 and 7896.
Extent (1:3288072) in database ID 6 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
When the differential is working fine as it currently is, DBCC shows no issues at all.
The transaction log for database A is currently 114 GB and 8.16% used. It comprises 368 VLFs, with 65 in use totalling 9.5 GB.
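For anyone wanting to check the same on their side, the log DMVs in 2017 give these figures directly; something along these lines (sys.dm_db_log_stats exists from SQL Server 2017 onwards, and 'A' is a placeholder name):

SELECT total_vlf_count, active_vlf_count,
       total_log_size_mb, active_log_size_mb
FROM sys.dm_db_log_stats(DB_ID(N'A'));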
The single data file for database A is currently 865 GB (it has grown 7 GB since the original message), and the CDC data file is 61 GB (it has grown 13 GB since the original message).
The vendor supplying database A has the data and log files stored on the same disk and path, whereas I have them on different disks (at both a Windows and VMware level), so restoring requires using MOVE on my side. Per above, they provide a full on a Saturday and a differential each other day.
As the data isn't changing on my side, each day I put the database into single-user mode with immediate rollback to force any open connections closed (there generally aren't any, but better safe than sorry), then restore the full with NORECOVERY, REPLACE and MOVE of the individual components. For each day other than Saturday, I then restore the most recent differential with NORECOVERY, REPLACE and MOVE, before bringing the database back online with RECOVERY, putting it back into multi-user mode and setting my permissions.
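For clarity, each day's run boils down to something like this (a minimal sketch; the paths and logical file names are placeholders, not the real ones):

ALTER DATABASE A SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

RESTORE DATABASE A
    FROM DISK = N'D:\Restore\A_full.bak'
    WITH NORECOVERY, REPLACE,
    MOVE N'A_Data' TO N'E:\Data\A.mdf',
    MOVE N'A_CDC'  TO N'E:\Data\A_CDC.ndf',
    MOVE N'A_Log'  TO N'F:\Log\A.ldf';

-- every day except Saturday (the live process also passes REPLACE and MOVE here):
RESTORE DATABASE A
    FROM DISK = N'D:\Restore\A_diff.bak'
    WITH NORECOVERY;

RESTORE DATABASE A WITH RECOVERY;

ALTER DATABASE A SET MULTI_USER;
-- ...then reapply permissions.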
The logic for this hasn't changed for three years, and is identical for databases A, B and C, with B and C continuing to work in the same manner whilst A is problematic.
It is a very bizarre scenario, but at least it has been working again for the last two days. Hopefully it stays that way!
Thanks again all for your replies.
April 25, 2024 at 10:43 am
I think the overlap with the offsite backups has a lot to do with it - well done, and thanks for sharing.
As for the CDC, I've worked with CDC loads and am confident it wasn't that; in any case, when restoring to a new server without the KEEP_CDC clause, all the CDC metadata is removed.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" đŸ˜‰