Transactional Replication Performance Issues After Migration From 2000 to 2008

Question

chris.roddis-ferrari

Mr or Mrs. 500

Points: 566
November 13, 2013 at 2:18 pm

#286150

We currently have 140 SQL 2000 publications replicating to a single 2008 server with about 25 articles. 5 of the articles are quite substantial. We are in the process of migrating to new 2008 servers that, in general, perform significantly better than the old 2000 infrastructure. Previously all replication ran fine, with a latency of no more than a minute at any time, even under peak load. We have migrated about 60 servers to the new 2008 infrastructure, but the latency has now shot up. This makes no sense in terms of performance. We have tried a number of things to resolve this but have so far been unsuccessful. The 2008 and 2000 publications are both going to the same DB and using the same replication procs.
First, the @status parameter in sp_addarticle needed changing from 0 (the value in the original script) to 24. This gave a significant improvement, but the latency is still up to 40 minutes at peak times.
We have tried changing from push to pull. This made the performance worse.
We have changed the PollingInterval on the distribution agent from 5 (2008 default) to 10 (2000 default). This made no noticeable difference.
We have changed the ImmediateSync setting from 1 to 0. This made no noticeable difference.
We have ensured the indexes etc. are OK on the central MSreplication_subscriptions table. This made no noticeable difference.
We have tried lock hints on some of the replication procs. This made no noticeable difference.
Any ideas would be much appreciated.

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 1

For reference all changes except the @status and PollingInterval have been made at 2 or 3 of the 60 new servers to test, not at all 60.

MysteryJimbo SSC-Insane Points: 24203 More actions · Answer 2

Have you established where the latency is?

Log reader to distribution server?

Distribution agent to subscriber?

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 3

Cheers for the response. The publisher is its own distributor; there isn't a separate server for this. I will post some outputs from the Distribution and Log Reader agents shortly.

MysteryJimbo SSC-Insane Points: 24203 More actions · Answer 4

November 14, 2013 at 3:01 am

#1666306

Also, is the latency across all subscribers?

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 5

Some are worse than others, but there is latency across all the new servers. There was a correlation between the number of records in MSrepl_commands and performance. But even when we set ImmediateSync to 0 and ran the Distribution Cleanup job every 30 mins to keep the table size down, this didn't help, which I guess suggests it is an issue with applying the commands at the subscriber rather than getting them off the distributor?

There is no latency on the 2000 boxes

MysteryJimbo SSC-Insane Points: 24203 More actions · Answer 6

chris.roddis-ferrari (11/14/2013)
Some are worse than others, but there is latency across all the new servers. There was a correlation between the number of records in MSrepl_commands and performance. But even when we set ImmediateSync to 0 and ran the Distribution Cleanup job every 30 mins to keep the table size down, this didn't help, which I guess suggests it is an issue with applying the commands at the subscriber rather than getting them off the distributor?
There is no latency on the 2000 boxes

It could be.

For clarity, you have 200 publishers (140/60 2000/2008) delivering to a single subscriber using push transactional replication. All of the 60 2008 publishers are experiencing latency at a currently unknown "bottleneck".

Are the subscriptions going to the same database/objects?

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 7

Yes, all are going to the same database/objects. The split is 80 on 2000 and 60 on 2008.

Cheers

Bhuvnesh SSC Guru Points: 59351 More actions · Answer 8

Have there been any drive (disk) level changes? For example, is a comparatively lower-grade disk being used now?

-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 9

Disk is now significantly better.

Was previously 2 Ultra SCSI 420 72GB drives RAID1-0 and is now 4 SAS 300GB drives in 2 RAID1-0 pairs.

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 10

Distribution Agent Log

************************ STATISTICS SINCE AGENT STARTED ***********************

11-14-2013 13:20:45

Total Run Time (ms) : 394605 Total Work Time : 389457

Total Num Trans : 5194 Num Trans/Sec : 13.34

Total Num Cmds : 8928 Num Cmds/Sec : 22.92

Total Idle Time : 0

Writer Thread Stats

Total Number of Retries : 0

Time Spent on Exec : 25784

Time Spent on Commits (ms): 1622 Commits/Sec : 0.13

Time to Apply Cmds (ms) : 389457 Cmds/Sec : 22.92

Time Cmd Queue Empty (ms) : 157 Empty Q Waits > 10ms: 10

Total Time Request Blk(ms): 157

P2P Work Time (ms) : 0 P2P Cmds Skipped : 0

Reader Thread Stats

Calls to Retrieve Cmds : 2

Time to Retrieve Cmds (ms): 369629 Cmds/Sec : 24.15

Time Cmd Queue Full (ms) : 19843 Full Q Waits > 10ms : 128
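The rates in this log can be cross-checked from the raw counters. The following is a minimal Python sketch, not something posted in the thread: the dictionary keys and function names are made up for illustration, and the numbers are copied from the statistics above. It recomputes the reported per-second rates and shows what share of the run each thread's main timer accounts for.

```python
# Counters copied from the Distribution Agent statistics above (times in ms).
# The key names are illustrative and are not the agent's own field names.
stats = {
    "total_run_time_ms": 394605,
    "total_work_time_ms": 389457,
    "total_trans": 5194,
    "total_cmds": 8928,
    "writer_exec_ms": 25784,
    "reader_retrieve_ms": 369629,
}


def per_second(count, elapsed_ms):
    """Convert a count over a millisecond interval into a per-second rate."""
    return count / (elapsed_ms / 1000.0)


def summarize(s):
    """Recompute the rates the agent reports, plus two time-share ratios."""
    return {
        "trans_per_sec": per_second(s["total_trans"], s["total_work_time_ms"]),
        "cmds_per_sec": per_second(s["total_cmds"], s["total_work_time_ms"]),
        "reader_cmds_per_sec": per_second(s["total_cmds"], s["reader_retrieve_ms"]),
        "cmds_per_trans": s["total_cmds"] / s["total_trans"],
        # Share of the writer's apply time recorded as "Time Spent on Exec".
        "writer_exec_share": s["writer_exec_ms"] / s["total_work_time_ms"],
        # Share of the total run time recorded as "Time to Retrieve Cmds".
        "reader_retrieve_share": s["reader_retrieve_ms"] / s["total_run_time_ms"],
    }


summary = summarize(stats)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The recomputed rates match the log (13.34 transactions/sec, 22.92 commands/sec, 24.15 commands/sec on the reader side). The two share figures are only ratios of the logged timers; what they imply about the bottleneck still has to be judged alongside blocking checks and tracer tokens.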

MysteryJimbo SSC-Insane Points: 24203 More actions · Answer 11

It looks like the delivery of commands is what's taking the time. Have you compared the distribution agent profile settings between the servers?

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 12

The only differences are BCPBatchSize and QueryTimeout, which we haven't changed, and PollingInterval, which changed to 5 by default and which we have changed back to 10.

MysteryJimbo SSC-Insane Points: 24203 More actions · Answer 13

chris.roddis-ferrari (11/14/2013)
The only differences are BCPBatchSize and QueryTimeout, which we haven't changed, and PollingInterval, which changed to 5 by default and which we have changed back to 10.

None of those would make any difference to command delivery. Have you checked for blocking on distribution db and subscription db?

Checked latency using a tracer token?

These are the parameters which modify delivery rate.

[-CommitBatchSize commit_batch_size]

[-CommitBatchThreshold commit_batch_threshold]

[-MaxDeliveredTransactions number_of_transactions]

[-PacketSize packet_size]

[-SubscriptionStreams [1|2|...64]]
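If these switches are being tried across many agents, it can help to generate the argument string rather than type it by hand. The sketch below is a hypothetical Python helper (the name `build_agent_args` and the example values are assumptions, not recommendations from this thread). It accepts only the five switches listed above, assumes each takes a non-negative integer, and enforces the 1 to 64 range shown for -SubscriptionStreams.

```python
# The five delivery-related switches listed above, in the same order.
ALLOWED = (
    "CommitBatchSize",
    "CommitBatchThreshold",
    "MaxDeliveredTransactions",
    "PacketSize",
    "SubscriptionStreams",
)


def build_agent_args(**params):
    """Return a string such as '-CommitBatchSize 200 -SubscriptionStreams 4'.

    Hypothetical helper: it only formats and sanity-checks values; it does
    not talk to SQL Server or change any agent profile.
    """
    unknown = set(params) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unsupported switch(es): {sorted(unknown)}")

    parts = []
    for name in ALLOWED:
        if name not in params:
            continue
        value = params[name]
        # Assumption: every switch here takes a non-negative integer.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
        # Range taken from the list above: [1|2|...64].
        if name == "SubscriptionStreams" and not 1 <= value <= 64:
            raise ValueError("SubscriptionStreams must be between 1 and 64")
        parts.append(f"-{name} {value}")
    return " ".join(parts)


# Placeholder values purely for illustration.
example = build_agent_args(CommitBatchSize=200, SubscriptionStreams=4)
print(example)
```

The resulting string would be appended to the distribution agent's command line or used as a checklist when editing an agent profile. Test any change on a small number of subscriptions first, as was done with the other settings in this thread.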

chris.roddis-ferrari Mr or Mrs. 500 Points: 566 More actions · Answer 14

The tracer token shows Publisher to Distributor (same box) as a couple of seconds, and all the latency is Distributor to Subscriber: 12 minutes in the one I just ran. There is no blocking on the distribution db, just an ASYNC_NETWORK_IO wait of about 2 seconds on the distribution process. There has always been a level of blocking on the subscriber, even before the upgrades started, but this has never caused performance issues. I am not able to tell (or don't know how to tell) whether this has increased.