November 21, 2014 at 4:36 am
I have a strange issue going on intermittently with one of our SQL servers. The setup we have is:
1. 1 SQL server with 138 Publications
2. Of those publications there is 438 published articles which are tables
3. Some publications have multiple subscribed databases (3 Max), giving a total of 890 subscribed articles.
4. All replication is transactional and push.
5. No articles are filtered
6. All publications are full sync.
Now and again we have an issue where an update \ insert \ delete performed on the published article does not occur on the subscriber.
For example publication A has 3 articles (X,Y,Z) which are tables and the publication is subscribed to in 3 databases, when the change occurs on publication A, article X, none of the subscribers get the change, however changes that happen on Publication A for articles Y and Z the replicate as expected.
I have checked many many things and as yet cannot find why the transaction has not replicated, this includes:
1. Checking the tracer token inserted is sent and received correctly
2. Ensured all replication agents are running and error free
3. Checked the ms_repl_errors table in the distribution database for errors
4. Checked the distribution database for undistributed transactions for the publication
5. Ensured all publications are active and have not expired.
This happens at random times and random articles.
Any suggestions as to where to look next?
MCITP SQL 2005, MCSA SQL 2012
November 21, 2014 at 7:51 am
hmmm thats in interesting one. Are you using Push or Pull replication? What version of SQL Server is this? Are the Subscriptions going to different servers? Is the Distributor Remote or Local on the Publisher?
This next question is important. Are the missing transactions consistently for one article or a group of articles?
i.e is it always consistently Article X that seems to be missing the updates or sometimes do X and Y replicate, but Article Z seems to have missed?
Also how prevalent this across all your Publications and Articles? What percentage of articles are affected?
From your question I have also assumed that if an Article fails to replicate the transaction, then this is affects all Subscribers? i.e if Article X is configured with three Subscribers, there's no chance that two of them will get the update and one will not?
Ok, so if you have answered that the Article(s) always missing the Transaction is consistently the same ones, then I would says script out your Publication containing X,Y,Z Articles and examine all the options specified on sp_addarticle and sp_addsubscription for those Articles to see if they differ.
If they dont differ then next analyse the sp_MSins, sp_MSdel, sp_MSIns stored procs for each related Article to see if they are strangely formatted somehow.
You could turn on detailed Logging for your distribution Agent to see if anything is thrown up in there when an update is not replicated.
You could also check your Distribution and Log Reader Agent profiles to see if they are non default and if so what are the customized settings? I say this because You can set them to ignore consistency errors for example.
Also at your Subscribers does anyone have write access to the DBs (nobody should have)? could it be that your replicated change is simple being updated by some rogue Application process?
If your SQL Server version is a bit out of date go through all the KB article for Cumulative Updates released after your version to see if there are any Replication fixes in them.
November 21, 2014 at 8:12 am
Answers in bold below:
SQL_Baby (11/21/2014)
hmmm thats in interesting one. Are you using Push or Pull replication? What version of SQL Server is this? Are the Subscriptions going to different servers? Is the Distributor Remote or Local on the Publisher?Push, SQL 2008 R2, all on one server.
This next question is important. Are the missing transactions consistently for one article or a group of articles?
No its completely random, different articles in different publications, even where the publication has multiple articles only one is generally an issue at any given time.
i.e is it always consistently Article X that seems to be missing the updates or sometimes do X and Y replicate, but Article Z seems to have missed?
See last answer
Also how prevalent this across all your Publications and Articles? What percentage of articles are affected?
Under 1% at any given time, i.e. yesterday it was 1 article in 1 publication out of all the articles published.
From your question I have also assumed that if an Article fails to replicate the transaction, then this is affects all Subscribers? i.e if Article X is configured with three Subscribers, there's no chance that two of them will get the update and one will not?
All subscribers to the same article are affected.
Ok, so if you have answered that the Article(s) always missing the Transaction is consistently the same ones, then I would says script out your Publication containing X,Y,Z Articles and examine all the options specified on sp_addarticle and sp_addsubscription for those Articles to see if they differ.
Been there and done that, all our publications have been created from the same scripted template, all that changes anytime one is created is the source and destinations
If they dont differ then next analyse the sp_MSins, sp_MSdel, sp_MSIns stored procs for each related Article to see if they are strangely formatted somehow.
Trying to avoid that given there is 438 published articles, and the fact that its not consistently the same ones that fail, to give context it can be months inbetween this issue cropping up, and we never know which publication\article will be affected, the fact nothing gets logged as an issue in replication monitor or the system tables in the distribution database is what really concerns me the most
You could turn on detailed Logging for your distribution Agent to see if anything is thrown up in there when an update is not replicated.
I'll look into this
You could also check your Distribution and Log Reader Agent profiles to see if they are non default and if so what are the customized settings? I say this because You can set them to ignore consistency errors for example.
Again all are created from a template, and we have this exact same setup in another data centre without issue, but the ignore consistency errors option is a very good shout.
Also at your Subscribers does anyone have write access to the DBs (nobody should have)? could it be that your replicated change is simple being updated by some rogue Application process?
Definetly not, the only accounts that have write access are DBA and replication accounts, this syetm is very locked down to meet government security standards, all application access is via stored procedures and I have labourously been through all of them to make sure none of them do anything but read data.
If your SQL Server version is a bit out of date go through all the KB article for Cumulative Updates released after your version to see if there are any Replication fixes in them.
SQL Server 2008 SP2 CU4 from memory
Its a little behind current but when looking at the CU's issued since none of them have documented changes for replication.
MCITP SQL 2005, MCSA SQL 2012
November 21, 2014 at 8:29 am
OK so your server is Publisher, Distributor and Subscriber? Wow that is a LOT of Jobs on one server. Have you got deadlock tracing turned on? Maybe there is contention somewhere.
How much RAM have you got and how much of this is available to the OS?
Do you defrag you Distribution database? How big is it?
Sorry for all the question just trying to get a clear picture.
You know the CommitBatchSize on your Distribution Agent Profile determines how many batches get committed\rolled back should a command fail. And each Subscription has its own Distribution Agent so basically I'm thinking something is killing your trascation (deadlock?) very infrequently. Maybe even a Profiler Trace may be necessary. But first see what your detailed logs throw up and checked for Deadlocks, fragmentation in Distribution DB.
Also I'd be interested to know how you identify an issue in all that traffic given that you get no errors? Do you run some sort of regular data validation checks or something?
hope this helps
November 21, 2014 at 9:01 am
SQL_Baby (11/21/2014)
OK so your server is Publisher, Distributor and Subscriber? Wow that is a LOT of Jobs on one server. Have you got deadlock tracing turned on? Maybe there is contention somewhere.We don't have deadlock tracing turned on, as you don't need to, all deadlock information is captured in the system default extended event which I have a query for obtaining that data, it was a good suggestion though. We checked yesterday and there was 1 deadlock on the server 4 hours prior to the issue caused by a SSIS package, fortunately all tables involved in replication have a DateModified column so I know exactly when the data was updated.
How much RAM have you got and how much of this is available to the OS?
Server is rather powerfull, running as active / passive 2 node failover cluster, 10 CPU's with 8 cores per CPU (80 Cores), 256GB Ram, 12 x 4TB SAN Storage drives, although I'm not sure what the implementation is in terms of number of DIMM's, underlying SAN etc.
Do you defrag you Distribution database? How big is it?
Distribution cleanup jobs are enabled on the DB, no index or statistics maintenance is performed. The DB is sized at 4GB data 1GB log, I would need to look at our in house DBA data warehouse to check space utilisation on the devices over the last 12 months, but we have custom alerting setup when we get under 25% on either device and we have had no alerts.
Sorry for all the question just trying to get a clear picture.
You know the CommitBatchSize on your Distribution Agent Profile determines how many batches get committed\rolled back should a command fail. And each Subscription has its own Distribution Agent so basically I'm thinking something is killing your trascation (deadlock?) very infrequently. Maybe even a Profiler Trace may be necessary. But first see what your detailed logs throw up and checked for Deadlocks, fragmentation in Distribution DB.
Ill need to look at the CommitBatchSize as they may be relevant, we have automated server side trace collection setup for the server but it only captures a 10 minute period once per hour which again we load to our DBA data Warehouse, unfortunately the transaction took place outside the 10 minute window
Also I'd be interested to know how you identify an issue in all that traffic given that you get no errors? Do you run some sort of regular data validation checks or something?
We have a custom solution developed in house written in python which perfoms the validation of all data, columns and row counts for each replicated table, it utilises the native SQL server command line utility tablediff, and we parse the output with some custom python code, tablediff takes about 30 seconds per table comparison we run, so its called in multiply threads (python code identifies number of threads at run time based on current cpu utilisation) so we cover all 890 subscribed articles fairly quickly.
hope this helps
MCITP SQL 2005, MCSA SQL 2012
November 21, 2014 at 9:18 am
I guess you got all bases covered 🙂
The only thing I can think of then (before an MS Support call!) is contention with all your Agent Jobs. Together with the Cleanup jobs that's a lot of connections hitting the same tables in your Distribution database (Have you checked disk waits or any other?). I have found Reorganizing the Indexes on the Distribution database in heavily replicated environments does help with performance as they do get fragmented, but I only do this with custom code that first checks how many commands are queued for replication and does the Reorg at a time of low replication activity.
I hope your server tracing or verbose logging picks up the cause of the issue. Would be interesting to know what it was when you eventually identify it.
November 21, 2014 at 9:25 am
We are hoping we can identify this ourselves as although we are a MS gold partner, this particular environment is so secure only employees of our company and a business partner can get anywhere near the servers. No data is allowed in or out of the DC in anyway other than what the applications display publicly. We cant even copy SQL scripts to the environment outside of a release so anything we code has to be coded up on that system
If we get to the bottom of it I'll post back.
MCITP SQL 2005, MCSA SQL 2012
November 21, 2014 at 9:37 am
yes, please do...and good luck!
November 24, 2014 at 8:58 am
We believe we have identified the root cause of this issue. We have articles published more than once in different publications, not many mind you but its been an oversight when the replication scripts have been created.
Replication in this system has grown arms and legs since its original design and it appears that during the implementation of additional requirements some times people have not checked if a table was already published.
MCITP SQL 2005, MCSA SQL 2012
November 24, 2014 at 9:40 am
ahh cool. At least it was something simple in the end. 🙂
June 5, 2018 at 10:39 pm
@RTaylor2208 - sorry to reopen this old thread - but we are having a very similar set of symptoms. I've documented our issue here:
https://dba.stackexchange.com/questions/206714/sql-transactional-replication-some-tables-frozen-but-others-working-fine
Currently we're working through the issue with Microsoft Support, but so far they've only agreed with us that there is an issue, and haven't yet found any clues as to how it is happening.
In the end of your issue (4 years ago!) - was it definitely the additional publications of articles that was the problem? That doesn't apply in our case - although we are doing a re-publishing scenario which may have some similar elements. [Source >> Republisher >> Subscriber].
June 7, 2018 at 6:09 am
As much as it was a very long time ago, it was indeed the source of our issue, once we removed the duplicated articles the issue never came back.
MCITP SQL 2005, MCSA SQL 2012
June 7, 2018 at 6:20 am
Ok - thanks for that, appreciated.
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply