December 15, 2009 at 7:02 pm
Hi all,
Last week I created 6 new publications with mostly reference data and stored procedures/functions. I created snapshots and everything was working as planned. The subscription articles that did contain transactional data were receiving the updates and inserts with no problem. The most active article was receiving rows at only about 5 an hour.
The next night I decided to add 2 new publications. This time they contained much larger, more active tables. Things were working fine until the next morning at about 6:30 am when our servers start to pick up with activity. All queries and transactions on the publication server were slowed right down and there was quite a bit of blocking between processes. Things were so bad I could not even delete the new publications. There were other processes that were blocking this request. I turned off the log reader agent on the two new publications and actually had to delete the distribution DB therefore losing job history. The replication monitor was reporting that the performance for these 2 new publications was "Critical" and all the others were excellent. Here is a breakdown of the articles within the publications:
Publication1
Table A
9225814 Rows
366 rows transferred to subscriber between 2:00 am and 6:30 am
Table B
16486266 Rows
500 rows transferred to subscriber between 2:00 am and 6:30 am
Table C
9222721 Rows
366 rows transferred to subscriber between 2:00 am and 6:30 am
The other publication was about the same. These tables get thousands of transactions a day but the load was fairly light when things went bad. Other important notes to mention are that I prepared these publications using SSIS and not snapshots like I did with the ones the night before because of their size. I made sure the tables were in sync, then I created the subscriptions. The other important note is that the distributor is on its own server. The crude server setup is as follows:
Server A (Publications)
to
Server B (Distributor)
to
Server C (Subscribers)
All servers are on the same domain and have a very fast connection between them. They are all good machines and are running Windows Server 2003 and SQL Server 2005 Enterprise SP2. The biggest thing that surprised me was that I was not able to shrink the transaction log files because there were 12GB in each one of the 2 databases that were not yet marked as being distributed. When cleaning everything up, I had to create a dummy publication and basically flush the log so I could reclaim drive space. It looks to me a first glance that the log reader agent was having troubles keeping up with the activity. The rows in the publications get published at peek times at about 20 per minute and most of the columns are integers and small text. I just don't see that being a problem.
Anyway, I know this is a long winded post but I wanted to see if any of the experienced replication DBAs can add some insight into potential issues here. I know I haven't provided performance data, entries from the replication log or entries from the SQL log but I will try to dig up what did not get deleted when I disabled distribution. I have been a DBA for quite a while so you don't have to worry about me understanding the concepts, I am just fairly new at replication. Any suggestions are greatly appreciated. Thanks.
Corey
December 17, 2009 at 8:01 am
Without more information it is hard to be sure, but I'd suspect that your distribution database is on the same SAN infrastructure as your publisher. We also use a remote distributor (actually one of our subscriber servers) as well, and since the log reader job runs on the distributor and only reads the transaction log from the publisher, there should be no appreciable impact on the publisher performance. UNLESS...
There is a slow network link (not the case here)
The publisher log files are sharing physical disks with other stuff
The publisher was running at full capacity (CPU-wise)
The publisher is having memory issues
In our experience, we had the first happen once, even though all servers had GB NICs, because of a cheap switch that was having issues auto-detecting and kept reverting to 10 MB.
We've also seen the second when all disks were just "out there" in the SAN, with no dedicated physical configuration control. All the bandwidth in the world doesn't help if you have spinning disks that have to keep thrashing their head from place to place to deal with competing demands.
Also check to be sure you don't have any objects in the subscribers that reference back to the publisher. You mentioned using SSIS to initialize the databases, but without details on what exactly you mean. If there are fully qualified references in triggers or procedures, you could end up with a "loop" between the publisher and subscriber.
December 18, 2009 at 12:28 pm
Thanks for the reply Ronzo. This is what I was looking for. As for the points you were inquiring about, the Publisher, Distributor and Subscriber all share the same SAN. Also, I prepared the 6 tables for replication by just a simple transfer from the Publisher to the Subscriber server using the SSIS transfer data task. I did a top up and immediately after created the subscriptions. I did not run a snapshot for these 2 publications.
Before I posted this, my first guess at what happened was a disk I/O problem where the Publisher log files live or server stress at the distributor. The only thing that still puzzles me is that the publications that I created the night before were delivering their data, and those articles live in the same DBs as the ones I created the following night. The second set of publications do have quite a bit more activity than the first but it is not like I am replicating 10000 transactions a second.
I wish I had more to provide but I lost much of the logged data that would have helped me come to a conclusion of what happened here. I am going to try to set up a similar configuration in a test environment and see if I can duplicate the problem. I am also going to mess around with some of the configuration settings of the log reader agent because I think that is where the solution most likely lies. Thanks for your reply once again and if anything else comes to mind please do let me know.
Corey
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply