my assorted articles over the past couple years you'll know that I've done a
fair bit with transactional replication. I wouldn't call myself an expert with
it, but I'm comfortable using it in production and dealing with the minor
mishaps that occur. Still, there is always something waiting to surprise you!
Thus, our story begins.
One of the great features of SQL2K over SQL7 is the ability to add/drop
columns on tables that are part of a replication publication without having to
drop all the subscribers. Pretty painless, you can do it through Enterprise
Manager using the publication properties or using the sp_repladdcolumn/sp_repldropcolumn
stored procs. I've used these a lot and they work well. If you're not familiar
with them basically they add/remove the column on the table, update the
publication so that the article includes the change, then replicates the change
to all the subscribers. Note that you can do add/delete only, not actually
change the definition of a column. This only works if you use either of the
methods I mention above - if you just add the column directly to the table it
will be there, but will not be part of the publication or be propagated to the
subscribers. This let's you decide which behavior you prefer.
On this particular day I needed to remove a column that was part of a
publication. I used EM (faster than looking up the rarely used proc syntax),
executed the drop, got an error message. When I went to check the results, I
wound up with the column gone from the article, gone from the publisher, still
present on the subscriber, and the subscriber marked as failed! What the heck,
right? I did a new snapshot and made a note to try to reproduce the behavior
later. Not that painful, but it turned a 1 min chore into 15 minutes.
I finally had some time today to experiment, started by making a copy of
Northwind and adding a column called 'TestColumn' with a default of 0 to the
customers table.
Next I created a publication with the customer table as the only article,
winding up with this:
The next step was to try to drop the column. Retracing my earlier footsteps I
clicked on the 'Filter Columns' tab, selected the column, and clicked the Drop
Selected Column button.
Here is the standard confirmation message - clicked Yes:
I then received this error message:
Which sort of makes sense. The EM table designer is smart enough to handle
stuff like this for you, this is just doing a straight alter/drop. I then
checked the publication expecting to see the column gone, but it was still
there? Was still in the publisher table and in the subscriber table as well.
Losing my mind perhaps? Just to make sure the world was still sane I removed the
default from the column and redid the drop, it worked fine. So either I didn't
accurately note what happened the first time, or it's not reproducible.
The only alternate scenario that I could think of was that the default must
have been on the subscriber. I added the column back with a default as before,
but this time I removed the default from the publisher and verified the default
was present on the subscriber. Ran through the sequence again, no error message,
appeared to complete successfully. Column gone from the article, removed from
the publisher table. Refreshed replication monitor and see the sign of bad
things. Checking the distribution agent revealed this:
The entire text of the last command was as follows:
if exists (select * from syscolumns where
name='TestColumn' and id = object_id('Customers')) begin if exists (select *
from sysobjects where name='syspublications') if exists (select * from
sysarticles where objid=object_id('Customers')) and @@microsoftversion >=
0x07320000 exec sp_repldropcolumn @source_object=N'[Customers]',@column=N'TestColumn'
else alter table [Customers] drop column [TestColumn] else alter table
[Customers] drop column [TestColumn] end
At this point the subscription is broken. Restarting the distribution agent
just returns the same error. Thinking that it might be possible to avoid a
snapshot, I executed sp_browsereplcmds to see what was queued, revealing this:
This is just a readable version of info from the msrepl_commands table in the
distribution database. Xact_seqno 0x00000054000000100038 looks to be the one
that matches the error message, so the next step is to query the table directly.
Wind up with this:
Did some experimenting, wound up deleting all rows that had a type =
1073741859. Don't try this in production! Ran the distribution agent, it ran
without error. Time to verify. I insert a row on the publisher, get this error:
Which makes sense. The publisher doesn't have the column so it's not passing
a value/placeholder for the TestColumn column, but the stored proc that
replication uses on the subscriber still expects it. Here is what the subscriber
proc looks like (autocreated by replication):
ALTER procedure [sp_MSins_Customers] @c1
nchar(5),@c2 nvarchar(40),@c3 nvarchar(30),@c4 nvarchar(30),@c5 nvarchar(60),@c6
nvarchar(15),@c7 nvarchar(15),@c8 nvarchar(10),@c9 nvarchar(15),@c10
nvarchar(24),@c11 nvarchar(24),@c12 int
AS
BEGIN
insert into [Customers]( [CustomerID], [CompanyName], [ContactName], [ContactTitle],
[Address], [City], [Region], [PostalCode], [Country], [Phone], [Fax], [TestColumn]
)
values ( @c1, @c2, @c3, @c4, @c5, @c6, @c7, @c8, @c9, @c10, @c11, @c12 )
I'm going to modify it manually to remove TestColumn and rerun the agent.
Agent runs, row is present on the subscriber. In production you'd want to modify
the update procedure as well. At this point you can drop the TestColumn from the
subscriber.
I did a web search on msrepl_commands, found nothing to indicate what the
type column means. As far as I can tell the value 1073741859 indicates a schema
change, either add or drop. Anyone have more info?
I think what I found out is that at some point either the default existed on
the publisher and subscriber and was deleted from the subscriber, or the default
was just added to the subscriber. Executing the drop column failed because the
default existed, which stopped the distribution agent. At that point the easiest
(and safest) fix is to just execute a new snapshot. Alternatively you can try
removing the rows from msrepl_commands and update the related stored procs
manually. Doable, but has some risk, I'd say mainly that you might remove rows
that aren't part of the problem and not realize it, or perhaps worse, affect
another publication. If you were to goof you could fix it by doing a new
snapshot. The short story is to make sure you've removed defaults from the
column on both publisher and subscriber before you try to drop the column.