Always On/Failover Cluster Issues

  • We've made the switch to SQL Server 2019 and have implemented Always On Availability Groups. We have two nodes (primary and secondary SQL Servers) and a file share witness. Both the primary and secondary SQL Servers are on VMs, and they are the only things on those VMs.

    Everything seemed to be going great until we migrated our data warehouse SSIS project. There's one step that randomly causes the following error to appear:

    SQL Server hosting availability group 'xxxxx' did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.

    The step in question loads over 17 million records and can take 12 or more minutes to run. Due to third-party constraints, we have to reload this piece of the warehouse in its entirety every night; it's not possible to get only the updated, inserted, and deleted records. Sometimes this step runs fine, and when it does fail, we are able to re-run the package from the point of failure without issue, and it runs fine after that.

    We've increased the session timeout from 10 seconds to 20 seconds, and last night, after it bombed again, we increased the lease timeout from 20 to 30 seconds and the health check timeout from 30 to 40 seconds.
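
    For anyone following along, the session and health check timeouts can be adjusted and verified in T-SQL along these lines - the AG and replica names below are placeholders, and the lease timeout itself is a property of the availability group's cluster resource, so it has to be changed through Failover Cluster Manager or PowerShell rather than T-SQL:

    -- Placeholder AG/replica names; values from the post above
    ALTER AVAILABILITY GROUP [AG_xxxxx]
        MODIFY REPLICA ON N'SQLNODE1' WITH (SESSION_TIMEOUT = 20);   -- seconds

    ALTER AVAILABILITY GROUP [AG_xxxxx]
        SET (HEALTH_CHECK_TIMEOUT = 40000);                          -- milliseconds

    -- Verify the current settings
    SELECT ag.name, ag.health_check_timeout, ar.replica_server_name, ar.session_timeout
    FROM sys.availability_groups AS ag
    JOIN sys.availability_replicas AS ar ON ar.group_id = ag.group_id;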

    We are checking with our network team to see if anything else, such as backups, runs on the VMs in the timeframe this step generally runs in.
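
    (One quick check, sketched below with a made-up time window, is whether any native SQL Server backups overlap that step - this won't catch VM-level snapshots, which the network team would have to confirm:)

    -- Backups that overlapped an illustrative 01:00-03:00 load window
    SELECT database_name, type, backup_start_date, backup_finish_date
    FROM msdb.dbo.backupset
    WHERE backup_start_date  < '2019-01-02 03:00'   -- window end (illustrative)
      AND backup_finish_date > '2019-01-02 01:00'   -- window start (illustrative)
    ORDER BY backup_start_date;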

    What else should we be looking at?

  • There are a lot more questions that need to be researched and answered:

    1. Are your secondaries set to read-intent or read-only?
    2. What is the source and destination for that SSIS project?
    3. What node of the cluster is that SSIS package running on?
    4. Are you using the Integration Services Catalog - if not, how is the package deployed and configured?

    Additionally, extracting and loading 17 million rows shouldn't take 12 minutes on a decently sized system. I would focus on identifying what is taking so long, either to extract that data or to load it. Some options to review:

    1. In the source - are you extracting directly from a table/view or through a SQL command? If directly from the table/view, then change it to a command. There are known performance issues with extracting directly from a table/view.
    2. In the destination - did you change the commit/batch sizes, or did you leave them at the default? If using the default, the load will be done in a single batch with a single commit, which can (and usually does) take much longer. Try setting a reasonable batch/commit size - for example, 200,000 - and see if that improves the performance (a rough T-SQL sketch of the same batching idea follows this list).
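
    The batch/commit sizes themselves are set on the OLE DB Destination in the package, but as a rough T-SQL illustration of the same idea (table and column names are made up, and it assumes Col1 is a unique key), committing in chunks instead of one giant transaction looks something like this:

    -- Illustrative only: copy staging rows in batches of 200,000
    DECLARE @BatchSize int = 200000,
            @Rows      int = 1;

    WHILE @Rows > 0
    BEGIN
        INSERT INTO dbo.TargetTable (Col1, Col2)
        SELECT TOP (@BatchSize) s.Col1, s.Col2
        FROM dbo.Staging_Table AS s
        WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Col1 = s.Col1);

        SET @Rows = @@ROWCOUNT;   -- in autocommit mode each INSERT is its own transaction
    END;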

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • The secondary is non-readable. Its sole purpose is failover and becoming the primary.

    Source and destination are SQL Server.

    Package is running on Server A which is the primary server.

    Yes, packages are in the Integration Catalog.

    We are using a SELECT command (INSERT INTO ... SELECT FROM Staging_Table A, etc.).

    I'm not aware of setting a batch size when doing INSERT INTO ... SELECT FROM.

  • Hi, check resource utilization, especially IO (disk latency, disk queue length), CPU, or SQL dumps. This error is often caused by extremely high IO utilization by SQL Server - so high that Windows has trouble servicing other processes (e.g., you can't RDP to the server, or it can't respond to the cluster). Most of the time it's the disk read queue length being overloaded by SQL Server during, for example, a backup configured with too many files, DBCC CHECKDB, or a query doing many reads.

    If so, then check whether any other heavy processes are running at the same time, or find some way to optimize this process.
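
    One way to get a rough read on disk latency from inside SQL Server (just a sketch - the averages are since the last restart, so also watch perfmon counters during the load window):

    -- Per-file average read/write latency since the last SQL Server restart
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
           vfs.num_of_writes,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_latency_ms DESC;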

    Is the whole process just loading stg_tab and inserting into tab from stg_tab, with no data validation, etc.?

  • tjcavins wrote:

    The secondary is non-readable. Its sole purpose is failover and becoming the primary.

    Source and destination are SQL Server.

    Package is running on Server A which is the primary server.

    Yes, packages are in the Integration Catalog.

    We are using a SELECT command (INSERT INTO ... SELECT FROM Staging_Table A, etc.).

    I'm not aware of setting a batch size when doing INSERT INTO ... SELECT FROM.

    Is the cluster the source system - or the destination?

    Are the source/destination systems in the same data center - same VM farm?

    Now, you stated you are using an INSERT INTO?  If that is correct - why are you doing that?  You should be using an OLEDB Source - connected directly to the source system and an OLEDB destination, again connected directly.  On the OLEDB destination, you need to set the batch/commit sizes and not use the default values.

    If you are actually using INSERT INTO - it seems you are using SSIS to run code that pulls data across a linked server.  That just doesn't make any sense - if you are going to do that then why involve SSIS?

    Since you are using the catalog - you can review the reports for the project and see what step is taking the most time.  Once you have that you can start reviewing the project to determine what can be done to improve the performance.
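
    If it helps, something along these lines against SSISDB pulls the slowest tasks straight from the catalog instead of going through the reports (the project name is a placeholder; execution_duration is reported in milliseconds):

    -- Longest-running executables from recent catalog executions
    SELECT TOP (20)
           e.execution_id,
           e.package_name,
           es.execution_path,
           es.execution_duration / 1000.0 AS duration_seconds,
           es.execution_result             -- 0 = success, 1 = failure
    FROM SSISDB.catalog.executions AS e
    JOIN SSISDB.catalog.executable_statistics AS es
      ON es.execution_id = e.execution_id
    WHERE e.project_name = N'DataWarehouse'   -- placeholder project name
    ORDER BY es.execution_duration DESC;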

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs
