January 10, 2011 at 8:15 am
We have a nightly snapshot replication process (yes, it has to be snapshot for lots of reasons) that has been running fine for a couple of months now. Of course, the day that I comment that I hadn't heard anything from the business in a while, I get an e-mail asking why the snapshot did not finish the night before. Logs showed
Violation of PRIMARY KEY constraint 'PK_CASE_LOG_HIST'. Cannot insert duplicate key in object 'cusfocus.case_log_history_repl'.
2011-01-05 02:21:25.507 Agent message code 20037. The process could not bulk copy into table '"cusfocus"."case_log_history_repl"'.
also an invalid EOF message (sorry, don't have that text)
Rather odd considering it's snapshot replication with a full nightly refresh. So I figure it's just some odd issue with the bcp file generation and we decide to just let it run the next night and I'll keep an eye on it. The good news is that it ran to completion. The bad news is that a process that was taking around two hours start to finish is now taking 6+ hours.
Nothing has changed on either server that I am aware of. Version of both is 9.0.4053. I am not seeing anything unusual in the logs. Performance metrics look within acceptable range.
I've suggested a server reboot but that has to be scheduled with the business.
Any ideas?
January 10, 2011 at 9:31 am
Is there a complete drop and recreation of the table on the subscriber? I suspect that something is different in the schema, or the refresh is not configured correctly for some reason.
January 10, 2011 at 9:47 am
It's a complete drop and refresh.
January 10, 2011 at 1:03 pm
Then it doesn't make sense if there is a duplicate entry. Are there trigger or some other dependency on the subscriber?
Have you snagged the snapshot file on the subscriber and checked it?
January 10, 2011 at 1:08 pm
Yeah, I know it makes no sense. No triggers, no dependencies. I did try to take a look at the file but couldn't quite figure out how. I didn't sweat that one too much as in subsequent loads I am no longer seeing this error. What I am seeing is way too long run times for the refresh.
January 10, 2011 at 1:30 pm
That is strange. I might run a trace while this is running, look for some clue about how long each step takes. Might help you determine what is broken. I might also check the waitstats while it's running. Look for IO issues.
January 10, 2011 at 2:23 pm
Tell me about it.
We have a couple of different monitoring tools in place (VM Ware's admin/monitoring tool and RedGate's SQL Monitor tool) and I'm not seeing anything too out of the ordinary there. Disk write time is around 300 ms on the drive where tempdb lives, that's about the only spike I see.
January 10, 2011 at 8:33 pm
On the poor performance side of your question - are there any processes that are blocking the snapshot creation ? The snapshot agent does need to get a non exclusive read lock on all of the tables in the publication. Maybe it is waiting for that
January 10, 2011 at 8:46 pm
Nope. The process has a near-exclusive lock on the table.
January 19, 2011 at 12:12 pm
Update:
I rebooted both servers and the first time the job ran it ran within the expected time frame.
Next time, back to taking way too long.
I adjusted the schedules thinking maybe the slightly later start time for the one post-reboot had an impact. No good. Still way too slow.
here's the stats I see in Agent:
Good run:
Date1/13/2011 8:53:42 PM
LogJob History (SQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5)
Step ID3
ServerSQLCFARCH01
Job NameSQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5
Step NameRun agent.
Duration01:09:19
Sql Severity0
Sql Message ID0
Operator Emailed
Operator Net sent
Operator Paged
Retries Attempted0
Message
************************ STATISTICS SINCE AGENT STARTED ***********************
01-13-2011 22:03:01
Total Run Time (ms) : 4156312 Total Work Time : 4153687
Total Num Trans : 1 Num Trans/Sec : 0.00
Total Num Cmds : 596 Num Cmds/Sec : 0.14
Total Idle Time : 0
Writer Thread Stats
Total Number of Retries : 0
Time Spent on Exec : 9531
Time Spent on Commits (ms): 0 Commits/Sec : 0.00
Time to Apply Cmds (ms) : 4153687 Cmds/Sec : 0.14
Time Cmd Queue Empty (ms) : 203 Empty Q Waits > 10ms: 1
Total Time Request Blk(ms): 203
P2P Work Time (ms) : 0 P2P Cmds Skipped : 0
Reader Thread Stats
Calls to Retrieve Cmds : 2
Time to Retrieve Cmds (ms): 172 Cmds/Sec : 3465.12
Time Cmd Queue Full (ms) : 4152422 Full Q Waits > 10ms : 171
*******************************************************************************
2011-01-14 03:03:01.173 Applied the snapshot to the Subscriber.
Slow run:
Date1/18/2011 9:41:01 PM
LogJob History (SQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5)
Step ID3
ServerSQLCFARCH01
Job NameSQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5
Step NameRun agent.
Duration04:47:52
Sql Severity0
Sql Message ID0
Operator Emailed
Operator Net sent
Operator Paged
Retries Attempted0
Message
************************ STATISTICS SINCE AGENT STARTED ***********************
01-19-2011 02:28:53
Total Run Time (ms) : 17261875 Total Work Time : 17260141
Total Num Trans : 1 Num Trans/Sec : 0.00
Total Num Cmds : 596 Num Cmds/Sec : 0.03
Total Idle Time : 0
Writer Thread Stats
Total Number of Retries : 0
Time Spent on Exec : 14578
Time Spent on Commits (ms): 0 Commits/Sec : 0.00
Time to Apply Cmds (ms) : 17260141 Cmds/Sec : 0.03
Time Cmd Queue Empty (ms) : 172 Empty Q Waits > 10ms: 1
Total Time Request Blk(ms): 172
P2P Work Time (ms) : 0 P2P Cmds Skipped : 0
Reader Thread Stats
Calls to Retrieve Cmds : 2
Time to Retrieve Cmds (ms): 344 Cmds/Sec : 1732.56
Time Cmd Queue Full (ms) : 17259578 Full Q Waits > 10ms : 166
*******************************************************************************
2011-01-19 07:28:53.335 Applied the snapshot to the Subscriber.
January 19, 2011 at 3:48 pm
What is the value for pre_creation_cmd in sysarticles for the article in question? If it's 0, then replication isn't configured to get rid of the data before running the bcp files.
January 20, 2011 at 7:23 am
There is no "article in question" as it's the entire process. Once I did have that single error and that appeared to be a precipitating factor for the long times but the error has not reoccurred - just the extremely long run times.
But to answer your question, all are set to 1. I have spent time sitting and actively monitoring the replication and I see all the scripts being created and deployed properly. It's just taking too darned long.
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply