Nightly Snapshot Replication time has more than tripled

Question

Nightly Snapshot Replication time has more than tripled

Pam Brisjar

SSChampion

Points: 12094
More actions
January 10, 2011 at 8:15 am

#245156

We have a nightly snapshot replication process (yes, it has to be snapshot for lots of reasons) that has been running fine for a couple of months now. Of course, the day that I comment that I hadn't heard anything from the business in a while, I get an e-mail asking why the snapshot did not finish the night before. Logs showed
Violation of PRIMARY KEY constraint 'PK_CASE_LOG_HIST'. Cannot insert duplicate key in object 'cusfocus.case_log_history_repl'.
2011-01-05 02:21:25.507 Agent message code 20037. The process could not bulk copy into table '"cusfocus"."case_log_history_repl"'.
also an invalid EOF message (sorry, don't have that text)
Rather odd considering it's snapshot replication with a full nightly refresh. So I figure it's just some odd issue with the bcp file generation and we decide to just let it run the next night and I'll keep an eye on it. The good news is that it ran to completion. The bad news is that a process that was taking around two hours start to finish is now taking 6+ hours.
Nothing has changed on either server that I am aware of. Version of both is 9.0.4053. I am not seeing anything unusual in the logs. Performance metrics look within acceptable range.
I've suggested a server reboot but that has to be scheduled with the business.
Any ideas?

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

Steve Jones - SSC Editor SSC Guru Points: 736568 More actions · Answer 1

Is there a complete drop and recreation of the table on the subscriber? I suspect that something is different in the schema, or the refresh is not configured correctly for some reason.

Pam Brisjar SSChampion Points: 12094 More actions · Answer 2

Pam Brisjar

SSChampion

Points: 12094

January 10, 2011 at 9:47 am

#1271483

It's a complete drop and refresh.

Steve Jones - SSC Editor SSC Guru Points: 736568 More actions · Answer 3

Then it doesn't make sense if there is a duplicate entry. Are there trigger or some other dependency on the subscriber?

Have you snagged the snapshot file on the subscriber and checked it?

Pam Brisjar SSChampion Points: 12094 More actions · Answer 4

Yeah, I know it makes no sense. No triggers, no dependencies. I did try to take a look at the file but couldn't quite figure out how. I didn't sweat that one too much as in subsequent loads I am no longer seeing this error. What I am seeing is way too long run times for the refresh.

Steve Jones - SSC Editor SSC Guru Points: 736568 More actions · Answer 5

That is strange. I might run a trace while this is running, look for some clue about how long each step takes. Might help you determine what is broken. I might also check the waitstats while it's running. Look for IO issues.

Pam Brisjar SSChampion Points: 12094 More actions · Answer 6

Tell me about it.

We have a couple of different monitoring tools in place (VM Ware's admin/monitoring tool and RedGate's SQL Monitor tool) and I'm not seeing anything too out of the ordinary there. Disk write time is around 300 ms on the drive where tempdb lives, that's about the only spike I see.

happycat59 One Orange Chip Points: 29389 More actions · Answer 7

On the poor performance side of your question - are there any processes that are blocking the snapshot creation ? The snapshot agent does need to get a non exclusive read lock on all of the tables in the publication. Maybe it is waiting for that

Pam Brisjar SSChampion Points: 12094 More actions · Answer 8

Nope. The process has a near-exclusive lock on the table.

Pam Brisjar SSChampion Points: 12094 More actions · Answer 9

Update:

I rebooted both servers and the first time the job ran it ran within the expected time frame.

Next time, back to taking way too long.

I adjusted the schedules thinking maybe the slightly later start time for the one post-reboot had an impact. No good. Still way too slow.

here's the stats I see in Agent:

Good run:

Date1/13/2011 8:53:42 PM

LogJob History (SQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5)

Step ID3

ServerSQLCFARCH01

Job NameSQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5

Step NameRun agent.

Duration01:09:19

Sql Severity0

Sql Message ID0

Operator Emailed

Operator Net sent

Operator Paged

Retries Attempted0

Message

************************ STATISTICS SINCE AGENT STARTED ***********************

01-13-2011 22:03:01

Total Run Time (ms) : 4156312 Total Work Time : 4153687

Total Num Trans : 1 Num Trans/Sec : 0.00

Total Num Cmds : 596 Num Cmds/Sec : 0.14

Total Idle Time : 0

Writer Thread Stats

Total Number of Retries : 0

Time Spent on Exec : 9531

Time Spent on Commits (ms): 0 Commits/Sec : 0.00

Time to Apply Cmds (ms) : 4153687 Cmds/Sec : 0.14

Time Cmd Queue Empty (ms) : 203 Empty Q Waits > 10ms: 1

Total Time Request Blk(ms): 203

P2P Work Time (ms) : 0 P2P Cmds Skipped : 0

Reader Thread Stats

Calls to Retrieve Cmds : 2

Time to Retrieve Cmds (ms): 172 Cmds/Sec : 3465.12

Time Cmd Queue Full (ms) : 4152422 Full Q Waits > 10ms : 171

*******************************************************************************

2011-01-14 03:03:01.173 Applied the snapshot to the Subscriber.

Slow run:

Date1/18/2011 9:41:01 PM

LogJob History (SQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5)

Step ID3

ServerSQLCFARCH01

Job NameSQLCFPROD01-CFPROD-CFReportingSnapsho-SQLCFARCH01-CFReports-A9FED3BC-2EEA-4BFC-B4AB-0D75A582B2E5

Step NameRun agent.

Duration04:47:52

Sql Severity0

Sql Message ID0

Operator Emailed

Operator Net sent

Operator Paged

Retries Attempted0

Message

************************ STATISTICS SINCE AGENT STARTED ***********************

01-19-2011 02:28:53

Total Run Time (ms) : 17261875 Total Work Time : 17260141

Total Num Trans : 1 Num Trans/Sec : 0.00

Total Num Cmds : 596 Num Cmds/Sec : 0.03

Total Idle Time : 0

Writer Thread Stats

Total Number of Retries : 0

Time Spent on Exec : 14578

Time Spent on Commits (ms): 0 Commits/Sec : 0.00

Time to Apply Cmds (ms) : 17260141 Cmds/Sec : 0.03

Time Cmd Queue Empty (ms) : 172 Empty Q Waits > 10ms: 1

Total Time Request Blk(ms): 172

P2P Work Time (ms) : 0 P2P Cmds Skipped : 0

Reader Thread Stats

Calls to Retrieve Cmds : 2

Time to Retrieve Cmds (ms): 344 Cmds/Sec : 1732.56

Time Cmd Queue Full (ms) : 17259578 Full Q Waits > 10ms : 166

*******************************************************************************

2011-01-19 07:28:53.335 Applied the snapshot to the Subscriber.

Ben Thul SSChasing Mays Points: 638 More actions · Answer 10

What is the value for pre_creation_cmd in sysarticles for the article in question? If it's 0, then replication isn't configured to get rid of the data before running the bcp files.

Pam Brisjar SSChampion Points: 12094 More actions · Answer 11

There is no "article in question" as it's the entire process. Once I did have that single error and that appeared to be a precipitating factor for the long times but the error has not reoccurred - just the extremely long run times.

But to answer your question, all are set to 1. I have spent time sitting and actively monitoring the replication and I see all the scripts being created and deployed properly. It's just taking too darned long.