August 11, 2009 at 8:18 pm
Hi everyone,
I have been working on this for almost two months now. I have set up SQL Server 2000 transactional replication with a dedicated distributor. The publisher is running SQL Server 2000 Enterprise against an MSA1000 SAN (14x140GB in RAID 10), with an eight-core Xeon processor and 16GB of memory.
The distributor has two 1TB SATA hard disks, with 1TB dedicated to SQL 2000. The subscriber has 4x1TB SATA configured as RAID 0. Both the distributor and subscriber have a quad-core processor and 4GB of memory.
My problem is that the subscriber cannot catch up with the publisher. After initialization of the snapshot, I get one day of latency, and it is increasing. I checked the distributor with Perfmon for a disk I/O bottleneck; disk queues are at zero (except during the distribution clean up job, which runs every 10 minutes).
Publisher-to-distributor latency is not a problem; I get zero to 3 seconds. My problem is at the subscriber. I have already tried changing CommitBatchSize and CommitBatchThreshold, with values ranging from 100 to 1000, but I do not get any improvement.
What puzzles me is that the distributor has no trouble pulling transactions from the publisher. Do I need to match the publisher's configuration for my subscriber to catch up?
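For reference, those batch settings are Distribution Agent parameters. One way to experiment with them (the server and database names below are placeholders) is to run the agent with the values passed explicitly on its command line:

```
distrib.exe -Publisher PUBSRV -PublisherDB PubDB -Distributor DISTSRV
            -Subscriber SUBSRV -SubscriberDB SubDB
            -CommitBatchSize 100 -CommitBatchThreshold 1000 -PollingInterval 1
```

CommitBatchSize is the number of transactions committed to the subscriber in one batch, and CommitBatchThreshold is the number of replication commands before a commit; in practice they are usually changed in the agent's job step or profile rather than invoked by hand.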
Please help me.
many thanks.
August 12, 2009 at 1:49 am
What else is the subscriber doing?
I have seen this sort of thing before, where the subscriber was being used for reporting. The activity on the subscriber was blocking the replication process, causing huge latency.
August 12, 2009 at 12:44 pm
I have seen latency of about 5 to 10 minutes max elsewhere; in our environment, 3 seconds is the most I have seen. One day is a bit strange; that's way too much. It seems more as if the data is being delivered by the daily snapshot schedule rather than by the Log Reader sending commands.
-Roy
August 12, 2009 at 1:32 pm
How much data is being sent? I've never had the publisher and subscriber be the same spec; usually the publisher is larger. However, the subscriber needs enough hardware to handle the level of changes coming from the publisher. I'd check tracer tokens, especially once the snapshot is done, to see exactly which transactions are coming through and how long they are taking.
August 13, 2009 at 12:55 am
Ian Scarlett (8/12/2009)
What else is the subscriber doing?
I have seen this sort of thing before, where the subscriber was being used for reporting. The activity on the subscriber was blocking the replication process, causing huge latency.
Nothing. I disabled the jobs related to reporting. I want the subscriber to be in sync with the publisher first, before we deal with the reporting side.
Thanks for the reply, Ian.
August 13, 2009 at 1:04 am
Steve Jones - Editor (8/12/2009)
How much data is being sent? I've never had the publisher and subscriber being the same spec, usually the publisher is larger. However the subscriber needs to have the hardware needed to handle that level of changes coming from the publisher. I'd check tracer tokens, especially once the snapshot is done. See exactly which transactions are coming through and how long it is taking.
Hi Steve,
I'm currently on SQL Server 2000, so I don't know whether tracer tokens are available. The subscriber has RAID 0 with 4x1TB SATA drives at 7.2k rpm, compared to the publisher's 14x145GB SAS RAID 10 at 10k rpm.
Will the write speed increase if I upgrade the memory from 4GB to 16GB?
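For what it's worth, tracer tokens were only introduced in SQL Server 2005, so they are not an option here. On SQL Server 2000, a rough picture of Distribution Agent throughput and latency can be had from the agent history table in the distribution database; a minimal sketch, assuming the default distribution database name:

```sql
-- Recent Distribution Agent history rows; delivery_latency is in milliseconds
SELECT TOP 20
       [time],
       delivered_transactions,
       delivered_commands,
       current_delivery_rate,   -- commands per second
       delivery_latency         -- ms from entering distribution to applied at subscriber
FROM distribution.dbo.MSdistribution_history
ORDER BY [time] DESC
```

If delivery_latency keeps climbing while delivered_commands stays low, the bottleneck is on the apply side at the subscriber rather than in the Log Reader.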
August 13, 2009 at 2:03 am
How good is the network between the subscriber and publisher?
Transactional replication has to commit changes to the subscriber as a whole transaction (or multiple transactions, depending on your settings of CommitBatchSize and CommitBatchThreshold).
If you have a large transaction being applied to the subscriber and you get a network glitch part way through, all that work is rolled back and started again. If the network is particularly flaky, that alone can cause large latency, as the distributor struggles to complete the transactions.
If the network is bad, setting commitbatchsize and commitbatchthreshold to very low values may help.
August 14, 2009 at 4:32 am
Ian Scarlett (8/13/2009)
How good is the network between the subscriber and publisher?
Transactional replication has to commit changes to the subscriber as a whole transaction (or multiple transactions, depending on your settings of CommitBatchSize and CommitBatchThreshold).
If you have a large transaction being applied to the subscriber, and you get a network glitch part way through, all that work is rolled back, and started again. If the network is particularly flaky, then that can also cause large latency, as the distributor struggles to complete the transactions.
If the network is bad, setting commitbatchsize and commitbatchthreshold to very low values may help.
Hi ian,
A continuous ping from the subscriber to the publisher is <1ms. The subscriber is on full-duplex 100Mbps, while the publisher is on full-duplex 1Gbps, the same as the distributor.
The subscriber's network card utilization averages 0.0 to 0.02%. I'm using values of 100 and 1000 for CommitBatchSize and CommitBatchThreshold, and the polling interval is set to 1 second.
I did notice, though, this entry in the SQL Server log on the distributor:
SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [e:\MSSQL\data\master.mdf] in database [master] (1). The OS file handle is 0x0000029C. The offset of the latest long IO is: 0x00000000dd8000
What does this indicate? A disk I/O bottleneck?
Thanks, Ian.
August 14, 2009 at 4:57 am
What does this indicate? a disk io bottleneck?
Possibly... if that's the only entry you've seen then it may be nothing to worry about. If you're seeing a lot of these, then you should definitely start some I/O troubleshooting.
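On SQL Server 2000, a quick way to quantify those stalls is the ::fn_virtualfilestats function, which reports cumulative I/O statistics per file since the instance started. A sketch against master (database ID 1, file ID 1, matching the file named in the error message):

```sql
-- IoStallMS = total time, in ms, that I/O requests have waited on this file
SELECT DbId, FileId, NumberReads, NumberWrites, IoStallMS
FROM ::fn_virtualfilestats(1, 1)   -- master, primary data file
```

Sampling it twice and diffing IoStallMS gives the stall accumulated over the interval; running it against the distribution database's files would be more telling here.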
I've just spotted that you're using an MSA1000. The project I'm on at the moment was having HUGE performance problems a couple of years ago, and that had an MSA1000.
I ran Perfmon for a few days, and was consistently seeing average write times of over 1500ms (yes, 1.5 seconds per write!), which is so far above the recommended figures that it was off the scale. While I was trying to convince the customer that they needed to do something urgently, one of the controllers on the SAN expired. When HP replaced it there was suddenly a 10-fold reduction in write times... still way above the recommended figures, but enough to alleviate a lot of the performance issues.
It may be worth you running Perfmon just to rule out any I/O problems.
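The relevant disk counters can be captured unattended from the command line with typeperf (counter names assume an English Windows install; point the PhysicalDisk instance at the volume holding the distribution database rather than _Total if you can):

```
typeperf "\PhysicalDisk(_Total)\Avg. Disk sec/Read" "\PhysicalDisk(_Total)\Avg. Disk sec/Write" "\PhysicalDisk(_Total)\Current Disk Queue Length" -si 15 -sc 240 -o disk_io.csv
```

Here -si 15 -sc 240 samples every 15 seconds for an hour and writes the results to a CSV; sustained Avg. Disk sec/Write much above roughly 20ms is generally taken as a sign of disk trouble.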
August 14, 2009 at 7:17 pm
Ian Scarlett (8/14/2009)
What does this indicate? a disk io bottleneck?
Possibly... if that's the only entry you've seen then it may be nothing to worry about. If you're seeing a lot of these, then you should definitely start some I/O troubleshooting.
I've just spotted that you're using an MSA1000. The project I'm on at the moment was having HUGE performance problems a couple of years ago, and that had an MSA1000.
I ran Perfmon for a few days, and was consistently seeing average write times of over 1500ms (yes, 1.5 seconds per write!), which is so far above the recommended figures that it was off the scale. While I was trying to convince the customer that they needed to do something urgently, one of the controllers on the SAN expired. When HP replaced it there was suddenly a 10-fold reduction in write times... still way above the recommended figures, but enough to alleviate a lot of the performance issues.
It may be worth you running Perfmon just to rule out any I/O problems.
Hi Ian,
I'm actually seeing a lot of these messages in the SQL logs on the distributor, ever since I started the replication. I'm going to put in 4x1TB RAID 0 SATA disks and move the distribution database to that volume.
As for the MSA1000, we do currently have a performance issue with it. Publisher-to-distributor latency (the MSA1000 is attached to the publisher) ranges from zero to 3 seconds; it's between the distributor and the subscriber that I'm having trouble.
I will let you know the outcome.
Thanks, Ian.
August 15, 2009 at 10:12 am
Just a note, and not to make the performance situation any worse, but RAID 0 is asking for trouble: one failure and everything is lost. You might have to reinitialize the subscriber and the distributor if that happened.
If writes to master are taking that long, I'd lean towards a disk issue in the hardware.