Background
SQL Server replication is one of today’s most common and useful tools for replicating data. But even with some handy out-of-the-box logging and monitoring tools, it can be a black box for many DBA’s. The purpose of this article is to help shed light on how the Distribution Agent works, with the goal of identifying and solving the following:
- Replication bottlenecks
- Growing distribution database
- Why and when the “Distribution clean up: distribution” job deletes replicated commands
Distribution Agent
According to BOL, the distribution agent is the executable responsible for delivering snapshots and transactions from the distribution database to the subscriber. It’s one of the key agents involved in SQL Replication, but also one of the most difficult to monitor and troubleshoot. SQL Server provides administrators with insight into general status and performance information using the “Replication Monitor”, but what about the following:
- The state of distribution agents
- The xact_seqno of distributing transactions
- The age of distributing transactions
- The delivery rate of distributing transactions
- The number of commands in distributing transactions
- If and when a transaction is preventing the “Distribution clean up: distribution” job from deleting replicated transactions and commands
These were a few of the questions I asked myself while troubleshooting a rapidly growing distribution database and severely latent replication environment. I decided to explore the replication internals and system stored procedures like sp_MSdistribution_delete, sp_MSmaximum_cleanup_seqno, sp_MSget_repl_commands, etc. When I was finished, I combined logic from the system code into a stored procedure called get_distribution_status.
Get_Distribution_Status: The purpose of get_distribution_status is to query the current state/activity of Distribution Agents for each Publisher. Below is a screenshot of the stored procedure output, column and parameter definitions, and a couple examples of how to interpret the results.
Columns Definitions:
- publisher_db - the publisher database
- subscriber_db - the subscriber database
- dist_agent - the distribution agent
- dist_status - distribution agent status, one of these: 'Start', 'Succeed', 'In Progress', 'Idle', 'Retry', 'Fail'.
- dist_comments - distribution agent comments (status, error msgs, etc)
- xact_type - the transaction type (‘transaction’, ‘snapshot’)
- xact_article - the name of the transaction article
- xact_seqno - thetransaction sequence number
- xact_age - the transaction age
- xac_cmd - the transaction command count
- delivery_rate - the distribution delivery rate
- delivery_est - distribution estimated delivery
- cmd_pending - the commands pending delivery for the distribution agent (commands >= the oldest undistributed transaction for that agent AND < the second oldest undistributed transaction)
- cmd_tbd - the commands to be deleted by the “Distribution clean up: distribution” job. (commands < the oldest undistributed transaction)
- cmd_blk - the commands blocked by the transaction from cleanup. (commands >= the oldest undistributed transaction for all agents AND <= the second oldest undistributed transaction)
The parameters for this procedure:
- @help - displays all parameter options, types, defaults, descriptions and examples
- @publisher_db - filters result set by publisher_db
- @dist_agent_id - filters result set by agent id
- @include_pub - returns publisher column
- @include_sub - returns subscriber column
- @include_cmds - returns cmd columns
- @sort_order - returns sorted result set
- @debug - returns debug data
Example 1: The following example was generated by executing get_distribution_status on my test distribution server with the following parameters:
exec get_distribution_status @dist_agent_id=-1, @include_pub=1, @include_sub=1, @include_cmds=1
Now let’s look at how to interpret the output:
- Publication_01 - The publication is Idle, but the distribution database contains at least one un-replicated transaction with 1000 commands. Once the agents scans the distribution database, it will start delivering transaction 0x0012C0B200221ED80067, which is the oldest undistributed transaction for publication_01
- Publication_02 - The publication is actively delivering transaction 0x0012C216000D92750157 with 4000 commands, for table_02. The transaction was entered in the distribution database 19 minutes ago by the log reader
- Publication_03 - The publication is actively delivering snapshot subscription sequence 0x0012C216000F79BD0191 for table_03 with 61 commands. The snapshot subscription sequence 0x0012C216000F79BD0191 is the oldest undistributed transaction for all agents for ServerA. There are 10 thousand “distributed” commands older than 0x0012C216000F79BD0191 that can be deleted by the “Distribution clean up: distribution” job for ServerA. The snapshot subscription sequence is blocking 10 million replicated commands from being deleted by the “Distribution clean up: distribution” job. Once the publication has completed delivering the 100 pending commands, including the 61 snapshot commands, the “Distribution clean up: distribution” job will delete the outstanding 10 million commands
- Publications_04/05 - These publications have no replicated transactions in the distribution database and are idle
In this example, the long running snapshot is the potential bottleneck. Depending on your environment, 10 million additional commands in the distribution database may lead to increased latency and blocking, or it may not even be noticeable; it depends on the replication architecture, hardware and volume.
In our production data warehouse environment, we replicate a few hundred million transaction per day and start to notice a performance degradation/increased latency at approximately 150 million commands. At that volume, identifying bottlenecks before they become a problem is critical.
Using get_distribution_status in this case, we can quickly identify 1) the oldest active transaction (potential bottleneck) for ServerA, 2) distribution latency for all agents (xact_age) and 3) the number of commands/transactions to be deleted and/or blocked from cleanup.
Example 2: The following example was generated by executing get_distribution_status on my test distribution server with the following parameters:
exec get_distribution_status @dist_agent_id=0, @include_pub=1, @include_sub=1, @include_cmds=1
Let’s look at the results:
- Publication_01 - The publication is actively delivering transaction 0x0012C0B200221ED80067 for table_01 with 1000 commands.
- Publication_02 - The publication has encountered and error delivering transaction 0x0012C216000D92750157 for table_02 with 4000 commands. Transaction 0x0012C216000D92750157 is the oldest “undistributed” transaction for ServerA. There are 0 distributed commands older (cmd_tbd) than 0x0012C216000D92750157 for ServerA. The transaction is blocking 15 million replicated commands from being deleted by the “Distribution clean up: distribution” job
In this example, the potential bottleneck is transaction 0x0012C216000D92750157. According to get_distribution_status and MSrepl_errors, the distribution agent encountered and error trying to update a row in table_02 on the subscriber that does not exist. If this error is not resolved or the article dropped from replication, the transaction will continue to block the cleanup process from deleting distributed commands for publisher ServerA. In a high volume replication environment, this could lead to a server performance issue.
In my experience, performance degrades exponentially as the distribution database increases in size. The reason being, that all replication agents are competing for read/write access to a handful of system tables, like MSrepl_commands and MSrepl_transactions. When these tables increase in size, whether due to normal transaction volume or a transaction bottleneck that’s preventing cleanup, there’s a performance hit to all agents, which causes contention, which leads to latency; almost like an engine seizing.
Using get_distribution_status to monitor distribution agent activity,we can identify potential bottlenecks and errors to resolve replication issues before they become a problem.
How it Works
Get_distribution_status is essentially a reverse engineering of the replication system stored procedures, and based on the assumption that we can use historical distribution data to extrapolate current distribution activity. The publisher is at the top of the replication hierarchy, so the first step is to 1) loop through all publishers. For each publisher, it then 2) captures the most recent distributed transaction or subscription sequence number (snapshot) for each distribution agent. Lastly, it 3) loops through each distributed transaction/agent combination to find the oldest undistributed transaction in MSrepl_transactions, and the corresponding agent and transaction details.
Other Considerations
In addition to monitoring distribution agent activity, DBA’s should also consider SQL Replication configurations like min/max distribution retention and immediate sync. There’s some good articles out there on these topics, so I won’t go into detail, but these configurations also impact if and when distributed transactions are deleted.
Get_distribution_status has been tested in a SQL Server 2008 Data Warehouse environment with multiple publishers and subscribers, that distributes approx. 200-300 million transactions per day. It has not be tested against a SQL Server 2012 instance, but should be compatible. In terms of performance, runtimes can vary between a few seconds to a few minutes depending on the server, and transaction and command volume in the distribution database.
Summary
The distribution agent is a key component to the SQL Server Replication architecture. Using get_distribution_status combined with tools like the Replication Monitor, MSdistribution_status, latency monitoring, etc., DBA’s can really put their thumb on the pulse of replication and hopefully tackle some of the most common replication problems.