Article by Paul Ibison (www.replicationanswers.com)
Introduction
Occasionally in the replication newsgroups, questions come
up regarding transactions in transactional replication. People want to know what
happens if a transaction on the publisher fails - does it still get applied at
the subscriber? Or what if it succeeds on the publisher and fails on the subscriber
- is it always/ever rolled back? Then there are questions regarding
transaction size. If an update statement affects say 100000 rows on the
publisher, this will be logged as 100000 updates in the transaction log. In some
such cases the log reader agent may not be able to cope, and even if the log
reader succeeds, there may be similar problems for the distribution agent. Users
are interested in the possible workarounds for these situations. There is only a
little information in BOL to help clarify and answer these types of question, so I
have tied my various findings together into an article to hopefully help out.
Replication of Simple Transactions
We start with a simple case. Suppose we add 4 records in a
transaction at the publisher.
set xact_abort on
begin tran
-- dbo.ReplTest is a hypothetical published table: (col1 int primary key, col2 varchar(10))
insert into dbo.ReplTest (col1, col2) values (1, 'a')
insert into dbo.ReplTest (col1, col2) values (2, 'b')
insert into dbo.ReplTest (col1, col2) values (3, 'c')
insert into dbo.ReplTest (col1, col2) values (4, 'd')
commit tran
The log reader agent reads the transaction log and adds these records to the
distribution database in the MSrepl_commands and MSrepl_transactions tables. If
we use sp_browsereplcmds to check the actual text of these inserts we see that
the transactional nature of this batch is maintained and we'll see something
like this:
xact_seqno | command  | command_id
seq_aaa    | Command1 | 1
seq_aaa    | Command2 | 2
seq_aaa    | Command3 | 3
seq_aaa    | Command4 | 4
The constant xact_seqno identifies the transaction, and
the incrementing command_id indicates the commands in the batch. So,
transactions on the publisher are applied as
transactions on the subscriber.
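For reference, a minimal way to browse these commands yourself is shown below. sp_browsereplcmds is run in the distribution database; with no parameters it returns every stored command, and optional parameters such as @xact_seqno_start and @xact_seqno_end can narrow the range on a busy system:
use distribution
go
-- decode the entries in MSrepl_commands into readable commands
exec sp_browsereplcmds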
Failing Transactions on the Publisher
What happens to the replication setup when a transaction
fails at the publisher - does it still get replicated? Suppose we perform the
same transaction as above, but "Command4" fails. On the publisher an error
message is raised, the transaction is rolled back because of the
xact_abort setting, and no rows are added. Using sp_browsereplcmds, none of these
commands are visible, and they are not to be found on the subscriber, showing
that only committed transactions are replicated to the
subscriber.
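As a concrete sketch of this test, assuming the primary key on col1 of our hypothetical dbo.ReplTest table makes the final insert fail:
set xact_abort on
begin tran
insert into dbo.ReplTest (col1, col2) values (5, 'e')
insert into dbo.ReplTest (col1, col2) values (6, 'f')
insert into dbo.ReplTest (col1, col2) values (7, 'g')
insert into dbo.ReplTest (col1, col2) values (7, 'h') -- duplicate key: xact_abort rolls back the whole transaction
commit tran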
Failing Transactions on the Subscriber
What if the transaction completes at the publisher and
subsequently fails at the subscriber? This could occur because we haven't been
vigilant, and someone has been allowed to edit the subscriber data (yes - it
happens :)). In this case the distribution agent fails and raises an error
message. The failure itself is expected, but the inbuilt error handler has a
hard-coded exception command: "if @@trancount > 0 rollback tran"
(which is actually independent of the xact_abort setting on the publisher), meaning that
transactions failing at the subscriber are automatically
rolled back. So, until the problem is fixed, the error will prevent
the distribution agent from processing any separate part of the transaction -
the ACID properties are maintained. Note that this is the default behaviour
which may be overridden (see later).
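A minimal sketch of how this can arise, again using the hypothetical dbo.ReplTest table:
-- at the subscriber: someone edits the data directly
delete from dbo.ReplTest where col1 = 3
-- at the publisher: a later update of the same row is replicated...
update dbo.ReplTest set col2 = 'z' where col1 = 3
-- ...and the distribution agent fails with error 20598 ("The row was
-- not found at the Subscriber when applying the replicated command"),
-- rolling back the replicated transaction at the subscriber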
Large Transactions which cause Log Reader Agent Timeouts
The next thing to look at is the issue of large
transactions. Remember that all commands are implicitly run as transactions, so
although a TSQL statement might look innocuous at first glance - eg a
simple update statement - if it updates 100000 rows, this will be logged as:
begin tran
update row 1
update row 2
update row 3
...
update row 100000
commit tran
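To illustrate, the single statement below (against a hypothetical dbo.BigTable) generates one such logged transaction, and sp_repltrans run in the publication database lists the transactions which are marked for replication but which the log reader has not yet processed:
-- one statement, but logged as one row-level update per affected row
update dbo.BigTable set SomeCol = SomeCol + 1
-- show transactions marked for replication but not yet marked as distributed
exec sp_repltrans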
The log reader agent will try to read each of these commands and transfer them
to the distribution database. During this process, what problems can occur?
Basically we might have some sort of timeout of the log reader agent. The
resulting messages are various and include:
"The process could not execute 'sp_replcmds' on 'xxxxxx'."
"Status: 2, code: 0, text: 'Timeout expired'"
"A time out occurred while waiting for memory resources to execute the query."
"Agent 'xxx' is retrying after an error. 0 retries attempted. See agent job
history in the Jobs folder for more details."
"The step was cancelled (stopped) as the result of a stop job request"
These errors all relate to the number of records marked for replication in
the log which the log reader has to parse and process. There are a few profile
parameters which can be used to modify the process and avoid the error (a sample
agent command line using these parameters follows the list):
(a) increase the QueryTimeout. This is
the usual solution. It won't decrease the latency of a big transaction, but it
is often sufficient on its own to get things working again.
(b) change the MaxCmdsInTran. If the problem is that the
transaction contains a huge number of commands, then it can be an advantage to
chop it up into several smaller transactions. This will massively decrease
latency and prevent the associated timeouts. However it will cause a problem if
there is a command within the transaction which fails. Remember that a
transaction gets implicitly rolled back on the subscriber when there is an
error. If we use this MaxCmdsInTran parameter, the child batch which contains the
problem row will error, causing the distribution agent to fail. However any
previous child batches will have already been committed, and hence the ACID
properties of our original transaction are broken. Therefore there is a definite risk
associated with using this parameter.
(c) change the ReadBatchSize. This is the maximum number of
transactions read out of the transaction log of the publishing database "per
processing cycle". As there is no supported way to configure the "processing
cycle" (changing -LogScanThreshold is unsupported), we can equate
this to saying "it is the number of transactions read when running the log
reader agent". The lowest this figure can be is 1, so if one huge
transaction is causing the problem, this setting will not help at all to remove a
timeout. However if the timeout is the result of a lot of small transactions
being processed, setting this to a smaller figure than the default of 500 to
temporarily clear the backlog, along with increasing the QueryTimeout, is known
to remove the problem.
(d) change the ReadBatchThreshold. This is very similar to the ReadBatchSize
parameter above, but rather than relating to transactions it relates to commands, ie it
is the number of replication commands to be read from the transaction log in a
processing batch. Likewise, setting this to a small value along with increasing
the QueryTimeout can help remove a backlog, especially if the problem is caused
by large transactions.
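These parameters can be set via the log reader agent profile in Replication Monitor or added directly to the agent's job step command (MaxCmdsInTran, for instance, is typically specified on the command line). A sketch of such a command follows - the server and database names are placeholders and the values are purely illustrative:
-Publisher [PUBSERVER] -PublisherDB [PubDB] -Distributor [DISTSERVER]
-DistributorSecurityMode 1 -Continuous
-QueryTimeout 4000 -ReadBatchSize 100 -ReadBatchThreshold 100 -MaxCmdsInTran 10000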
Large Transactions which cause Distribution Agent Timeouts or Slowdowns
Just like with the log reader agent we can use the QueryTimeout parameter,
and increasing its value will be the first option to try. In the case of the
distribution agent, though, we have 2 extra parameters: CommitBatchSize, which is
"the number of transactions to be issued to the Subscriber before a COMMIT
statement is issued. The default is 100.", and CommitBatchThreshold,
which is "the number of replication commands to be issued to the Subscriber
before a COMMIT statement is issued. The default is 1000." - both definitions
coming from BOL.
These definitions are really quite confusing. Does this mean that the ACID
properties are necessarily broken if we have a transaction that has > 1000
commands in it? What is the "COMMIT" being referred to? Empirically I find
that this is not an option to break up a transaction into a series of smaller
transactions, so the word "COMMIT" in the BOL explanations is misleading and
really doesn't refer to the original transaction (or any transaction!).
Consider an example where 4 rows are added at the publisher and the 4th row
causes an error at the subscriber:
begin tran
insert into dbo.ReplTest (col1, col2) values (11, 'w')
insert into dbo.ReplTest (col1, col2) values (12, 'x')
insert into dbo.ReplTest (col1, col2) values (13, 'y')
insert into dbo.ReplTest (col1, col2) values (14, 'z') -- this row fails at the subscriber
commit tran
If we set the CommitBatchSize and
CommitBatchThreshold both to 2, and the final insert fails, then what
happens? If the ACID properties were broken and replication had produced 2
transactions at the subscriber, there would be 2 rows from the first,
successful batch left at the subscriber. In practice no rows at all are
inserted at the subscriber. So, these 2
parameters are purely ways of configuring throughput for the distribution agent,
and the logic of the original transaction remains. In other words, the "COMMIT"
in the 2 BOL definitions above refers to the application of a batch of statements
and is nothing at all to do with transactions.
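For completeness, a sketch of how these would appear on the distribution agent's job step command line - again, all names and values below are placeholders:
-Publisher [PUBSERVER] -PublisherDB [PubDB] -Distributor [DISTSERVER]
-Subscriber [SUBSERVER] -SubscriberDB [SubDB] -SubscriberSecurityMode 1
-CommitBatchSize 2 -CommitBatchThreshold 2 -QueryTimeout 4000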
It seems to me that the usefulness of these 2
parameters relates to cases involving blocking issues at the subscriber.
Smaller batches mean less locking is held for each separate batch applied at
the subscriber. If there were no subscriber access other than that of the
distribution agent, the transaction would overall take longer to apply if we
used small batches. However in the presence of other processes needing to perform
reads or updates of committed data at the subscriber (ie I'm not referring to
dirty reads), blocking issues might mean that smaller batches will overall
work more quickly. Clearly you'd need to really understand the usage profile of
the subscriber to determine if these parameters are going to improve throughput
and concurrency. In practice I've never used these parameters but I suppose I
know they're there if required.
Conclusions
Hopefully this article helps clarify how transactions are
applied in transactional replication, and how this knowledge can be used to
optimise the system.