October 18, 2010 at 6:13 am
So I had a replication pull distribution fail because of a primary key error (the subscriber only gets its data from replication so something fishy going on). This is on my 2000 to 2005 setup.
So I checked the last LSN for this article and searched back using browsereplcmds. There was an update (table has no identity column) that had clearly run on the subscriber, but the very next command (the first of a new transaction) was an Insert that had not run as the record was not in the subscriber. Odd. So I checked all records in that transaction and none of them were in the subscriber.
Next I manually entered the next Insert statement - it ran fine.
After many hours of going round in circles I had to skip errors and it skipped 2 - the one I had entered, and 1 other.
I cannot see what this other command could have been!
I can't quite work out how to see (from the usual tables and stored procs) which of the commands in Distribution have actually been distributed so it was hard to troubleshoot. Any help there?
What else would one recommend to troubleshoot such an issue, and does this sound like it could just be a bug caused by a 2000 to 2005 subscription?
Thankyou in advance!!
October 18, 2010 at 7:45 am
Don't know if this is a SQL 2000 to SQL 2005 replication issue - but what we do is something similar to what you've already done. However we either use replication monitor or check the distribution agent job log or enable output logs for the distribution agent to locate the xact_seqno and the command_id for the failing command.
Then use sp_browsereplcmds (passing in the xact_seqno as the @xact_seqno_start and the @xact_seqno_end parameters and the command_id as the @command_id parameter) to locate the exact command that is causing this issue. Normally this should be enough to locate the command causing the problem and to fix the issue.
Also - if you face this error again you can add the -Output clause to the distribution agent to log the error details to a file (and then restart the distribution agent to start logging to a file). From there you should be able to get the appropriate xact_seqno and the command_id which can be used to locate the command (note - turn off the -Output flag once done debugging otherwise it will end up flooding the drive with massive log files if there's a lot of replication activity).
You could also try to check the MSDistribution_history table on the distributor to check if there's any information there that will help - for e.g. are there any entries in this table with the error information that can be used to debug further.
October 18, 2010 at 8:16 am
Thanks for the reply.
I will try using output - good thinking. I normally only turn that on for vague errors - should have guessed it would contain more useful information in this case.
I did search the dist history table and that is where I gathered the last LSN. But in this instance I could not narrow down the troublemaker from that (the next transaction and command after the last LSN ran fine).
Next time it happens I'll check the output and hopefully it will give me a exact transaction and command!
Thanks
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply