replication hangs on one table

  • publisher - sql2000 sp4

    distributor - sql 2005 sp1

    subscribers - mix of sql2005 sp1 and sql 2000 sp4

    we have a database with around 300 tables and replicate it to around 10 different subscribers. some in the same datacenter and others in different offices. We have around 50-60 different publications for this. Some have 20 or so tables others have only one table if they are large tables. Tables range in size from a few hundred rows to over 20 million. Some tables replicate a few commands, others 100000 or more commands on a daily basis.

    Around 6 weeks ago we started having problems with one table. It's 1.4 million rows and replicates around a few thousand commands on a daily basis. We saw a backlog of around 150000 to 400000 commands. We had some replication issues at this time with this distributor and we traced these problems to memory errors and the replication job would not start or stop. After restarting the job it pushed all the commands through to the subscribers.

    A few weeks ago we started noticing another backlog again and this time it wouldn't clear. A few times we just ran another snapshot at night but then we started to investigate. We noticed that some people were doing mass updates of 30000 or more rows and this time they weren't going through. It would just build up and cause blocking on the subscribers along with spiking the CPU to 50% or more on a constant basis. As soon as we turn off the replication for this publication everything goes back to normal.

    Meanwhile all the other publications are happily replicating tens of thousands of rows with no problems. One thing I noticed is that these updates cause a command to transaction ratio of thousands to one where everything else is less than 10 to 1. I thought this was the cause but then the senior DBA updated around 20000 rows in another publication in one update and it went through within a few minutes.

    Microsoft is absolutely no help. They are looking through logs and they tried to tell us to upgrade our hardware but we have some ancient subscribers for other busy publications with no issues.

    Right now we are thinking that it might be corruption on the publisher table and we'll be creating a new one this weekend.

    Any other ideas?

  • did you try changing the distribution agent profile?

    http://msdn2.microsoft.com/en-us/library/ms152515.aspx

     

    MohammedU
    Microsoft SQL Server MVP

  • yes, no luck

    we also deleted and recreated it from scratch and that didn't help either

  • Are there any blocking either on subscriber and/or publisher?

    Any errors in sql log or application log..

    Run the sql profiler and see what is happening...

    MohammedU
    Microsoft SQL Server MVP

  • did some testing this weekend in production and in a test environment with a copy of the db from last week.

    in both cases when replicating the table in question we encountered the same issue and replicated the problem. profiler says that when the commands are applied at the subscriber there are 10300 reads per write.

    other tables don't seem to have this issue. in this table we switched form stored procs to sql statements a few weeks back to troubleshoot this issue and try to figure out where the updates were coming from. we tested with a temp table and same data and it still took a long time. going to retest with a profiler session

  • this is strange

    i use a cursor to build a bunch of update commands and dump them into a temp table to simulate the commands the distributor is sending.

    through management studio they execute fast on the publisher and the subscriber and profiler session shows they go through OK. via the distributor these update commands take at least 100 times longer as shown by the profiler session.

    change the command a little to update to a different value to run on the publisher and when it replicates it then it takes forever again as shown in the profiler

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply