Locking issue with the replication queue table for updateable subscriptions

  • Hi,

    I have inherited a SQL estate in a recent new role and have come across something I have never had an issue with before.
    We run bi-directional transactional replication using queued updateable subscriptions, implemented a few years ago as I understand it, to split front-end and back-end traffic across two instances for HA etc.
    Very recently, we have been experiencing timeouts on the front end when trying to connect to SQL and run certain processes. After a lot of deep-dive investigation using Profiler, the blocked process report, various DMVs and the replication logging... it appears we have hit quite an unusual problem. Basically, the transaction flow goes like this...

    Table A on server B is part of the two-way replication. Changes to it are picked up by the continually running Queue Reader Agent on server A, which is also the Publisher for this replication and also acts as the push Distributor (I know this is not ideal, but it was in place when I started). When a record is inserted into table A on server B, a corresponding record is inserted into the MSreplication_queue table, to be picked up by the Queue Reader and applied to the mirror of table A on server A. The insert into the queue table on server B happens within the same transaction scope as the insert into table A on server B, which means that if the insert into the queue table fails, the entire transaction is rolled back. All code/transactions are fired at the databases via NHibernate.
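
    For anyone not familiar with queued updateable subscriptions, the queueing is done by replication-generated triggers on the subscriber tables (named along the lines of trg_MSsync_ins_<table>, if memory serves). The sketch below is purely conceptual, not the real Microsoft-generated trigger, and the table/column names are hypothetical; it is only there to illustrate why a blocked insert into the queue table stalls the whole application transaction:

        -- Conceptual sketch only: the real trigger and the real MSreplication_queue
        -- schema are generated by replication and look different; names here are
        -- hypothetical.
        CREATE TABLE dbo.TableA_sketch
        (
            id      int IDENTITY(1, 1) PRIMARY KEY,
            payload varchar(100) NOT NULL
        );

        CREATE TABLE dbo.ReplicationQueue_sketch
        (
            queue_id    int IDENTITY(1, 1) PRIMARY KEY,
            tranid      uniqueidentifier NOT NULL,
            commandtype varchar(10)      NOT NULL,
            insertdate  datetime         NOT NULL
        );
        GO

        CREATE TRIGGER trg_TableA_queue_sketch ON dbo.TableA_sketch
        AFTER INSERT
        AS
        BEGIN
            SET NOCOUNT ON;

            -- One queue entry per inserted row, written inside the same transaction
            -- as the application's INSERT. If this INSERT is blocked by a shared
            -- lock held on the queue table, the application's transaction waits,
            -- times out and rolls back.
            DECLARE @tranid uniqueidentifier = NEWID();

            INSERT INTO dbo.ReplicationQueue_sketch (tranid, commandtype, insertdate)
            SELECT @tranid, 'INS', GETDATE()
            FROM inserted;
        END;
        GO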

    What we are seeing is that inserts into the queue table (driven by inserts into table A on server B) are being blocked by a SELECT on the queue table that holds a shared lock and originates from the Queue Reader Agent on server A. The blocked insert forces a SQL timeout on the original transaction, followed by a rollback, and surfaces as a zombied transaction in NHibernate.
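
    For anyone wanting to watch this live, a generic blocking query along these lines (nothing replication-specific about it), run on server B while the timeouts are occurring, shows the blocked insert, the session blocking it and the contested resource:

        SELECT
            r.session_id,
            r.blocking_session_id,
            r.wait_type,          -- typically an LCK_M_* wait for the blocked insert
            r.wait_resource,      -- e.g. a PAGE resource belonging to the queue table
            r.wait_time AS wait_time_ms,
            t.text      AS running_sql
        FROM sys.dm_exec_requests AS r
        CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
        WHERE r.blocking_session_id <> 0;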

    The replication queue table on server B is only a couple of pages in size, and the database runs under the read committed isolation level, as do all the other databases in this scenario. When you mimic this issue in a test system, you can see that both processes are clearly after the same page in the queue table, the table being so small.
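
    For reference, confirming just how small the queue table is can be done with something like the following on the subscriber database (assuming the queue table is dbo.MSreplication_queue):

        -- Quick size check
        EXEC sp_spaceused N'dbo.MSreplication_queue';

        -- Or page counts per index, to confirm it really is only a handful of pages
        SELECT  OBJECT_NAME(ps.object_id) AS table_name,
                ps.index_id,
                ps.in_row_data_page_count,
                ps.used_page_count,
                ps.row_count
        FROM    sys.dm_db_partition_stats AS ps
        WHERE   ps.object_id = OBJECT_ID(N'dbo.MSreplication_queue');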

    So, my questions are...

    Is this a known issue with queued updateable subscriptions that experience a relatively heavy workload?
    Is there any way around this behaviour given the small size of the queue table, bearing in mind that we cannot change the replication processes or system processes in this scenario? (I have toyed with the idea of turning on read committed snapshot, as this seemed the ideal answer and needs no application changes... however, it still carries a TEMPDB overhead as well as increasing the size of each record stored... so, given we are intending to move away from this topology anyway, I was interested to see if there is something else out there that could be done.)
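
    For completeness, the read committed snapshot option I was considering is the database setting below (database name hypothetical). The overheads mentioned are the 14-byte row versioning tag added to modified rows and the version store activity in TEMPDB, and the ALTER either has to be the only open connection in the database or has to forcibly roll back everything else:

        ALTER DATABASE [SubscriberDB]        -- hypothetical database name
            SET READ_COMMITTED_SNAPSHOT ON
            WITH ROLLBACK IMMEDIATE;         -- rolls back any other open transactions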

    Any help would be gratefully received.

    Moving forward, I propose to re-architect the entire topology to use Always On Availability Groups under SQL 2017, with read-only routing to split some of the IO. This would remove our need for replication and allow a single database to be written to, and therefore no queued updateable subscriptions. This is a must given that this type of replication is not supported after SQL 2008 R2 to my knowledge, but in the meantime... I need to get this resolved to get us through the next few months.
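
    As a rough illustration of the read routing piece of that future design (availability group and replica names entirely hypothetical), read-intent connections from the front end would be routed to a readable secondary while all writes go to the single primary copy:

        -- Allow read-intent connections on the secondary and give it a routing URL
        ALTER AVAILABILITY GROUP [AG_Main]
        MODIFY REPLICA ON N'SQLNODE2' WITH
        (
            SECONDARY_ROLE
            (
                ALLOW_CONNECTIONS = READ_ONLY,
                READ_ONLY_ROUTING_URL = N'TCP://sqlnode2.corp.local:1433'
            )
        );

        -- Tell the primary where to send read-intent connections
        ALTER AVAILABILITY GROUP [AG_Main]
        MODIFY REPLICA ON N'SQLNODE1' WITH
        (
            PRIMARY_ROLE
            (
                READ_ONLY_ROUTING_LIST = (N'SQLNODE2', N'SQLNODE1')
            )
        );

    The application would then connect to the availability group listener with ApplicationIntent=ReadOnly in the connection string for the read workload.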

    Lead level SQL DBA currently working for a well established Insurance organisation in London.
    www.linkedin.com/in/hadenkingslandleaddba

  • Just to add to this... would turning off page locks for the queue table fix this? As this is a system table, is this something that would be considered supported?
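
    For anyone testing the same idea, the command I had in mind is the index option below, run only on a test system given the question above about whether changing index options on a replication system table is even supported:

        -- Force row-level locking only on the queue table's indexes (test system only)
        ALTER INDEX ALL ON dbo.MSreplication_queue
            SET (ALLOW_PAGE_LOCKS = OFF);

    Even if it is allowed, it would presumably only help if the contention really is at page level rather than the two sessions needing the same rows.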

    Lead level SQL DBA currently working for a well established Insurance organisation in London.
    www.linkedin.com/in/hadenkingslandleaddba

