September 19, 2016 at 3:45 am
Hi
We have recently migrated from SQL Server 2012 to SQL Server 2016 on new, more powerful servers. Since then we have experienced a huge "Estimated Recovery Time" and "Redo Queue Size" that take most of the day to drain. Estimated Recovery Time is often over 30,000 seconds.
We have set up a job that monitors the "Redo Queue Size" and know that the problem appears shortly after a nightly index reorganize job starts.
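For reference, a monitoring job like this can read the redo queue directly from the availability-group DMVs. A minimal sketch, assuming it runs on the primary; `redo_queue_size` and `log_send_queue_size` are reported in KB and `redo_rate` in KB/sec, so their ratio approximates the dashboard's "Estimated Recovery Time":

```sql
-- Sketch of a redo-queue monitoring query against the AlwaysOn DMVs.
-- redo_queue_size is in KB, redo_rate in KB/sec; the ratio approximates
-- the "Estimated Recovery Time" shown on the AG dashboard.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)        AS database_name,
    drs.redo_queue_size             AS redo_queue_kb,
    drs.redo_rate                   AS redo_rate_kb_per_sec,
    CASE WHEN drs.redo_rate > 0
         THEN drs.redo_queue_size / drs.redo_rate
    END                             AS est_recovery_seconds,
    drs.log_send_queue_size         AS send_queue_kb
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0;  -- remote (secondary) replicas, as seen from the primary
```

Logging these values on a schedule makes it easy to correlate the redo queue growth with the start time of the index maintenance job.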
We never used to see this on the old servers, so we just don't understand why it happens.
Maybe it's an SQL 2016 issue??
Setup information:
HP Proliant DL580, 4 x CPU E7-8893 V3, 2TB RAM, 2 x 2TB Intel P3700 Solid-State card. Dedicated 10gbe network for Sync.
September 21, 2016 at 7:52 am
Are you in async mode?
Can you describe your AlwaysOn setup?
September 22, 2016 at 12:42 am
Hi,
The setup as follows:
There are 2 servers in the setup, placed approx 100 meters apart in 2 different buildings.
Both the client traffic and the sync traffic are on a 10GbE network.
Availability mode is Synchronous commit
Failover mode is Automatic, with Allow all connections and readable Secondary.
There are 12 databases in one availability group, two of which are very large (4 TB).
On the dashboard everything is "Green", except that it shows "Estimated Recovery Time" at around 40,000 seconds and "Redo Queue Size" at maybe 80 GB (it varies from day to day).
Using various performance tools we cannot find any sign of a delay. We have looked at storage and network, but nothing. It's as if the secondary server is just idling, doing nothing; we suspect the data is all there, only the redo hasn't applied it yet.
September 27, 2016 at 4:48 am
Have you benchmarked the disks where log used to be and where the log file is now?
Can you find out from the SAN admin how many channels the previous log disk had, and how many the new one has?
Are the locations the same or are there more log files sharing fewer disks?
With the migration, is the index rebuild job exactly the same?
September 27, 2016 at 5:20 am
Hi,
The log drives are faster 4 TB Intel P3700 NVMe flash cards, so I doubt they are the problem.
The index jobs are the same (Ola Hallengren).
In the meantime I have re-established the cluster and restored a copy of the largest DB. We can actually force the problem on the old setup as well, although not as badly. The difference between the old backup and the current production DB is a newly introduced table that already has 2.5 billion rows.
Info: The 2 largest tables are both partitioned (monthly) and have 35 billion rows each. These 2 tables are reorganized every night (if more than 15% fragmented), so if this job collides with the newly introduced table, the replica ends up 4-5 hours behind.
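For context, the 15% reorganize threshold described here maps onto the fragmentation-level parameters of Ola Hallengren's IndexOptimize procedure. A sketch of what such a call might look like, with the database list and thresholds as placeholders rather than the actual job settings:

```sql
-- Sketch of an Ola Hallengren IndexOptimize call with a 15% reorganize threshold.
-- Parameter names are from his MaintenanceSolution; the values are placeholders,
-- not the poster's actual configuration.
EXECUTE dbo.IndexOptimize
    @Databases           = 'USER_DATABASES',
    @FragmentationLow    = NULL,               -- below 15%: leave alone
    @FragmentationMedium = 'INDEX_REORGANIZE', -- 15-30%: reorganize
    @FragmentationHigh   = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationLevel1 = 15,
    @FragmentationLevel2 = 30;
```

Note that reorganizing huge partitioned tables is fully logged, so every page moved must also be shipped to and redone on the secondary, which is why the redo queue spikes when the job runs.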
We have looked intensely at storage, but we just can't see anything. It seems as if there is no strain on the system at all, yet it just doesn't catch up.
September 27, 2016 at 9:04 am
September 27, 2016 at 11:17 am
Wow, checked last week - nothing. Great stuff, I'll do that.
September 27, 2016 at 1:28 pm
Have you experienced the same redo latency if you fail over between nodes?
September 27, 2016 at 1:38 pm
Yes, it makes no difference.
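For anyone repeating this test: a planned failover in a synchronous-commit AG is issued on the secondary replica. A minimal sketch, where `[MyAG]` is a placeholder for the actual availability group name:

```sql
-- Sketch of a planned manual failover; run on the secondary replica.
-- [MyAG] is a placeholder AG name. Requires synchronous commit and a
-- SYNCHRONIZED secondary for a failover without data loss.
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;
```

If the redo queue still builds up after swapping roles, that points away from hardware on either node and toward the workload or the redo process itself.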