Distribution Database commands are not purged on a regular basis with Snapshot and Transactional publications on the same database

  • SQL 2005 Enterprise SP2

    I will start off with that I am relatively new to replication as I just started using regularly within in the past couple of months.

    We have snapshot and transactional replication setup for a single database to feed a couple of subscribers.

    One subscriber only needs a few of the tables with filters and a subset of columns from the database and it only needs this data every 24 hours so we went with snapshot replication for this since it was a small subset of data and only takes minutes to create the snapshot and apply it.

    The other subscriber needs all the tables and columns on a near realtime basis so went with transactional replication for this. There are millions of commands being pushed through here throughout the day and night.

    This seems to work pretty well, but the problem we have seen since adding the snapshot publication is that between the time the snapshot has been generated and the next one is generated, the distributed commands from the transactional replication are not purged by the distribution cleanup history job. With all of these commands sitting in the database, we see an increasing decline in performance distributing new commands to the subscriber. We found this out the hard way at first because we had not generated a snapshot in a few days because it was still being tested and our distribution grew 50GB overnight and the transactional replication pretty much screeched to a halt which was quite alarming. It does appear, however, that once the snapshot is generated the next time, the previous transactional commands do get purged.

    I went through the procedures for the cleanup and it seems to me that having 2 publications on the same database isn't really accounted for.

    I searched the web for posts dealing with this issue and didn't really come across much.

    I'm wondering if anyone has seen this issue before and if they have any suggestions.

  • What is the retention period configured for your distributor? And do you have immediate_sync set for your publications?

    I'm assuming that you're facing the issue outlined here - http://blogs.msdn.com/b/chrissk/archive/2009/07/27/how-replication-setting-immediate-sync-may-cause-transactional-replication-distribution-database-growth.aspx

  • We are using the default values for the retention period. Immediate sync is false.

    Just to be clear, commands are purged just fine for the transactional repliacation by itself, but once the snapshot replication is added for that database, it seems that the purge doesn't differentiate between commands from the snapshot and those from the transactional and holds them all in the database until the next snapshot is generated.

  • hm - interesting problem.

    I'm assuming there are no errors in the distribution cleanup job history - only ideas I have are manually running the sp_MSdistribution_cleanup SP with the appropriate parameters to see if anything happens and debugging the sp_MSdistribution_cleanup SP (getting the part of the code that identifies the transactions/commands that have crossed the max retention period and which should be deleted and checking if that works properly etc).

  • Yes. It is an interesting problem and has been bothering us for a few days now.

    That is correct. There are no errors and the purging works fine when the publications are on their own...it's just when they are together that causes it to hold on to all the commands for a 24 hour period.

    I agree with the approach of reverse engineering it and I did that. It would probably be a rather large post if I detailed every bit of the flow, but basically, it treats the transactions and commands for both publications are part of a whole rather than invididually where keeping the commands for the transactional isn't necessary for the snapshot publication to function correctly.

  • Can you run the following on your distributor DB and check if the snapshot publication has got immediate_sync set to 1? The transactional publication should be default have this set to 0

    SELECT

    immediate_sync,

    allow_anonymous,

    CASE publication_type WHEN 0 THEN 'Transactional'

    WHEN 1 THEN 'Snapshot'

    WHEN 2 THEN 'Merge' END,

    *

    FROM MSpublications

    Also verify that running sp_helpdistributor results in "min distrib retention" having a value of 0?

    I tried to repro this and when I setup the snapshot it was with immediate_sync set to true. Because of this the distribution cleanup procedure seemed to assume that one publication with immediate_sync set should result in applying this setting to all publications

    The issue seems to lie with this code in the procedure sp_MSdelete_publisherdb_trans

    -- Detect if there are immediate_syncymous publications in this publishing db.

    if exists (select * from MSsubscriptions where

    publisher_database_id = @publisher_database_id and

    subscriber_id < 0

    select @has_immediate_sync = 1

    else

    select @has_immediate_sync = 0

    if @has_immediate_sync = 1

    begin

    -- if @max_immediate_sync_seqno is null, no row will be deleted based on that.

    select @max_immediate_sync_seqno = max(xact_seqno)

    from MSrepl_transactions with (nolock)

    where publisher_database_id = @publisher_database_id

    and entry_time <= @max_cutoff_time

    end

    Due to this piece of code even if my transactional replication commands were propogated to the subscriber and with immediate_sync set to 0 for the transactional publication and with a min_distretention of 0 with the distributor data was not being deleted until it was older than the max_distretention.

    If the following query returns results in your distribution DB then you might be facing the same issue:

    DECLARE @publisher_database_id int

    SELECT @publisher_database_id = id FROM MSpublisher_databases

    SELECT * FROM MSsubscriptions WHERE

    publisher_database_id = @publisher_database_id and

    subscriber_id < 0

    Once I set the immediate_sync and allow_anonymous of the snapshot publication to 0 the cleanup started working fine.

  • I went through and checked this before as noted in my second post, but just to be thorough, I went through the queries again.

    immediate_sync allow_anonymous publisher_id publisher_db publication publication_id publication_type thirdparty_flag independent_agent immediate_sync allow_push allow_pull allow_anonymous description vendor_name retention sync_method allow_subscription_copy thirdparty_options allow_queued_tran options retention_period_unit allow_initialize_from_backup min_autonosync_lsn

    -------------- --------------- ------------- ------------ -------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------- -------------- ---------------- --------------- ----------------- -------------- ---------- ---------- --------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ----------- ----------- ----------------------- ------------------ ----------------- ----------- --------------------- ---------------------------- ----------------------------------

    1 1 Transactional 0 HTruckPartManagementPub Inventory - Heavy Trucking 11 0 0 1 1 1 1 1 Transactional publication of database 'HTruckPartManagementPub' from Publisher 'SQASQL05\OECPUB'. Microsoft SQL Server 0 3 0 NULL 0 0 0 0 NULL

    0 0 Transactional 0 AutomotiveInventoryMaster Inventory - Automotive 14 0 0 1 0 1 1 0 Transactional publication of database 'AutomotiveInventoryMaster' from Publisher 'SQASQL05\OECPUB'. Microsoft SQL Server 0 3 0 NULL 0 0 0 0 NULL

    0 0 Snapshot 0 PublishingMaster Claims 15 1 0 1 0 1 1 0 Snapshot publication of database 'PublishingMaster' from Publisher 'SQASQL05\OECPUB'. Microsoft SQL Server 0 0 0 NULL 0 0 0 0 NULL

    0 0 Snapshot 0 AutomotiveInventoryMaster Inventory – Automotive Master 25 1 0 1 0 1 1 0 Snapshot publication of database 'AutomotiveInventoryMaster' from Publisher 'SQASQL05\OECPUB'. Microsoft SQL Server 0 0 0 NULL 0 0 0 0 NULL

    (4 row(s) affected)

    I realize that the first transactional replication has the immediate sync and allow anonymous set to true, but this is our QA environment. In production they are ALL false and the same symptoms exist.

    The 2 publications that we are having trouble with are the transactional and snapshot on the same publisher db. Both are immediate sync and anonymous subscriptions false.

    distributor distribution database directory account min distrib retention max distrib retention history retention history cleanup agent distribution cleanup agent rpc server name rpc login name publisher type

    -------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------- --------------------- ----------------- ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------

    SQASQL05\OECPUB distribution I:\SSREPL 0 72 48 Agent history clean up: distribution Distribution clean up: distribution repl_distributor distributor_admin MSSQLSERVER

    The rentention settings are at the default as noted in my second post. 0 - 72 hours.

    For me, it doesn't even get to the sp_MSdelete_publisherdb_trans procedure before it is decided that nothing will be deleted. It appears to be in the sp_MSmaximum_cleanup_seqno procedure that this is happening.

    This loop and the code following it are where I think the issue is:

    while (@@fetch_status <> -1)

    begin

    --

    --always clear the local variable, next query may not return any resultset

    --

    set @max_agent_hist_xact_seqno = NULL

    --

    --find last history entry for current agent, if no history then the query below should leave @max_agent_xact_seqno as NULL

    --

    select top 1 @max_agent_hist_xact_seqno = xact_seqno from MSdistribution_history where agent_id = @agent_id

    order by timestamp desc

    --

    --now find the last xact_seqno this agent has delivered:

    --if last history was written after initsync, use histry xact_seqno otherwise use initsync xact_seqno

    --

    if isnull(@max_agent_hist_xact_seqno, @min_agent_sub_xact_seqno) <= @min_agent_sub_xact_seqno

    begin

    set @max_agent_hist_xact_seqno = @min_agent_sub_xact_seqno

    end

    --@min_xact_seqno was set to NULL to start with, the first time we get here, it'll gets set to a non-NULL value

    --then we graduately move to the smallest hist/sub seqno

    if ((@min_xact_seqno is null) or (@min_xact_seqno > @max_agent_hist_xact_seqno))

    begin

    set @min_xact_seqno = @max_agent_hist_xact_seqno

    end

    fetch #tmpAgentSubSeqno into @agent_id, @min_agent_sub_xact_seqno

    end

    close #tmpAgentSubSeqno

    deallocate #tmpAgentSubSeqno

    /*

    ** Optimized query to get the maximum cleanup xact_seqno

    */

    /*

    ** If the query below returns nothing, nothing can be deleted.

    ** Reset @max_cleanup_xact_seqno to 0.

    */

    select @max_cleanup_xact_seqno = 0x00

    -- Use top 1 to avoid warning message of "Null in aggregate..." which will make

    -- sqlserver agent job having failing status

    select top 1 @max_cleanup_xact_seqno = xact_seqno

    from MSrepl_transactions with (nolock)

    where

    publisher_database_id = @publisher_database_id and

    (xact_seqno < @min_xact_seqno

    or @min_xact_seqno IS NULL) and

    entry_time <= @min_cutoff_time

    order by xact_seqno desc

    As far as I can tell, the xact_seqno for the transactional pulication is not less than the xact_seqno of the snapshot publication because they happened after the the snapshot earlier that morning and this loop doesn't account for the fact that there are 2 separate publications for this database and just uses the data for the last publication for this database which is the snapshot publication.

    publisher_database_id publisher_id publisher_db publication_id article_id subscriber_id subscriber_db subscription_type sync_type status subscription_seqno snapshot_seqno_flag independent_agent subscription_time loopback_detection agent_id update_mode publisher_seqno ss_cplt_seqno

    --------------------- ------------ -------------------------------------------------------------------------------------------------------------------------------- -------------- ----------- ------------- -------------------------------------------------------------------------------------------------------------------------------- ----------------- --------- ------ ---------------------------------- ------------------- ----------------- ----------------------- ------------------ ----------- ----------- ---------------------------------- ----------------------------------

    (0 row(s) affected)

    Results from the final query.

  • ah well - it was a good theory while it lasted - I can go "hmm...interesting" again 😉

    If you think the issue is with the sp_MSmaximum_cleanup_seqno SP then use the following code to check if it returns a valid xact_seqno

    DECLARE @max_xact_seqno varbinary(16),@min_distretention int,@cutoffdate datetime,@publisher_database_id int;

    SET @min_distretention = 0;

    SET @cutoffdate=dateadd(hour,-@min_distretention, getdate());

    -- the sp_MSdistribution_delete SP uses the following query to get the publisher_database_id

    -- and use it in a cursor

    --SELECT DISTINCT publisher_database_id FROM MSrepl_transactions

    -- if it matches what is below then can go ahead with this

    SELECT @publisher_database_id = id FROM MSpublisher_databases;

    SELECT @cutoffdate AS CutoffDate;

    EXEC sp_MSmaximum_cleanup_seqno @publisher_database_id,@cutoffdate,@max_xact_seqno OUTPUT;

    SELECT @max_xact_seqno AS DeleteAllBeforeThisXactseqno;

  • Yes. This is how I was debugging and got to where I found that nothing was going to be deleted. Here is the output:

    CutoffDate

    -----------------------

    2011-06-24 13:56:30.107

    (1 row(s) affected)

    DeleteAllBeforeThisXactseqno

    ----------------------------------

    0x00

    (1 row(s) affected)

    The 0x00 obviously means there is nothing to delete and is set as the default for the variable at the end of the loop in case no records are returned in the following sql statement.

    I'm getting ready to push some transactions through our environment so I can get some more test data. I'll follow up with more details in a later post.

  • I pushed some transactions through and based on what I can tell, the issue I am seeing is due to the design of the cleanup process.

    Basically, with multiple publications on the same database, the seq number used is based all in the order of how the agents were created so if you create the snapshot after the transactional, then that determines what is going to get purged based on that sequence number. I changed the loop I spoke about in my previous post so that the transactional agent is last in the loop using an order by desc and transactions were purged.

    I am not too keen on changing the MS procedures permanently though.

  • You can post this as an issue on connect.microsoft.com and hopefully someone from Microsoft will look into this and verify if it is indeed an issue - and hopefully also suggest a fix.

  • Thanks for your help with this.

  • Just posting to keep this alive. I received only one response at the MS forum and it wasn't helpful and didn't address the fact that this definitely appears to be a bug. Hopefully, by posting again here, someone else may see this and offer some suggestions or ease my fears of modifying the standard cleanup procs.

  • I'm having the exact same issue! (see my latest reply on http://www.sqlservercentral.com/Forums/Topic1291326-291-1.aspx). If you send me the link to your connect bug report I'll gladly vote it.

    Did you ever get a fix for it, I'm the same as you, I know how to fix the system SP so it works for me, but feel it shouldn't come to that!

  • I apologize. I just saw this response.

    I never did get a Microsoft solution to the problem and I just modified the system procedure to solve the issue. Honestly, I'm not sure I can even find the connect post at this point because it has been so long ago.

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply