I was fighting an error that was very rare to appear. I originally put a post on a MS forum
The issue has delayed our project (with hundreds of people for the project) for one day, the pressure on me is intense, to say the least. Fortunately after 36 hrs (5 hrs sleep included [:D] ), the issue is finally solved.
Simple background introudction
Environment: SQL Server 2K5 EE (CU8 applied) + Win 2K3 SP1
Our replication is of transactional type, and one publication with 5 articles has one "push" subscription. One article, let's call it MyTable has 29 million records.
Publisher and Distributor is on the same box, while subscriber is on another box. The network share used in the replication is called \\MyServer\ReplData
After replication set up, I start the subscription re-init with a new snapshot generated. However I keep getting errors in the replication monitor when the distribution agent (referred to DA hereafter ) tries to replicate MyTable over to the subcriber. When DA tries to read the snapshot-generated files for MyTable, at some time, it always compains and gives out the following error msg.
The process could not bulk copy into table '"dbo"."MyTable"'. (Source:
MSSQL_REPL, Error number: MSSQL_REPL20037)
Get help: http://help/MSSQL_REPL20037
memory mapped file read failed
To obtain an error file with details on the errors encountered when
initializing the subscribing table, execute the bcp command that appears
below. Consult the BOL for more information on the bcp utility and its
supported options. (Source: MSSQLServer, Error number: 20253)
So initial thoughts / guesses were:
(1) the replication is not set up correctly ?
(2) the created file by snapshot agent is not right ?
(3) the folder \\MyServer\ReplData is corrupt ?
(4) the OS memory is corrupt (we have 32 GB on each server) ?
So I have tried various ways, including:
1. Delete MyTable in the publication, and then do a subscription re-init with a new snapshot, when the publication works fine, add back MyTable to the publication and then do a subscription re-init with a new snapshot again to address concern (1)
2. Delete everything in \\MyServer\ReplData, and then restart the snapshot for the publication to address concern (2)
3. Change the network shared folder from its original drive D: to a new drive E: to address concern (3)
4. Reboot the all servers involved in the replication to address the potential OS memory issue to address conern (4)
The error still comes out when DA is trying to replicate MyTable.
Ok, what can be wrong? Yes, I have some other candidates for concern
(5) MyTable corrupts and thus snapshot agent generates some bad files based on the corrupt MyTable ?
(6) Push subscription does not work for MyTable ?
(7) MyTable does not work in a publication when the publication has more than one article?
So let's do something to address these new concerns
5, Run dbcc checktable on MyTable, but there is no error msg reported
6, Drop the subscription, and then recreate a new "Pull" type subscription, Do a subscription re-init with a new snapshot
7. Remove MyTable from its original publication, and then create a new publication which contains one and only one article, i.e. MyTable. Do a subscription re-init with a new snapshot
However the error still appears when DA tries to replication MyTable.
However during these tests, I found that the error always occurs when DA tries to read a specific file (generated by snapshot agent in \\MyServer\ReplData), let's call this file MyTable_File_7.
Now I guess this MyTable_File_7 may contain some row data that DA cannot read after the file was loaded into memory, and thus "memory mapped file read failed"? My logic is that you can corrupt a text file by inserting a EOF in middle of the text file, so if MyTable_File_7 has some weird binary code in it, the file may be corrupted logically.
So I decide to output MyTable to a new table by running
select * into dbo.MyTable_2 from dbo.MyTable
then create a PK for MyTable_2 and then use this new table to replace MyTable in the publication. And the result is :
I Succeeded !!
How excited I am, now what I need to do is:
Truncate table dbo.MyTable
Insert into dbo.MyTable select * from dbo.MyTable2
and then remove MyTable_2 from the publication and replace it with MyTable, and do a subscription re-init. After another 30 min waiting, what? The error comes again? ! ! Come on !!
I cannot believe it. But the success of MyTable2 leads me to believe something must be wrong with the data in MyTable, and doing table truncate may actually only release the pages/extents that the table occupies, and then the insertion will probably re-use those occupied extents again. At this moment, I think the hard-disk may have some bad spots which MyTable may happen to use. So I did another way,
drop table dbo.MyTable -- this is to ensure all GAM, SGAM, PFS aer totally cleaned regarding this table
Create table dbo.MyTable (....)
Insert into dbo.MyTable select * from dbo.MyTable2
Now add MyTable back to the publication and do a subscription re-init.
Finally, it is successful !
The possible reason: some bad tracks on the hard-disk.
Our system adimin group is now involved in checking whether there is anything wrong with the hard-disk drives.