May 5, 2011 at 9:14 am
I'm developing my first remote Service Broker project and I'm having a transient issue that I can't figure out. I have Service Broker running on two different servers. All the messaging between the two servers is handled by auto-activation except for the initial message (I drop a message into the queue and the rest are backend processes). In a nutshell it goes like this:
1. Drop a message in a queue on the initiator
2. Autoactivation proc on initiator sends a message to the target
3. Target AutoActivation does some work, ends the conversation
4. Initiator Autoactivation (same proc as #2) receives the EndConversation and also Ends the conversation
It was a happy day when I got all the pieces working and talking to each other. However, about 30% of the time something breaks down and I start getting undeliverable message errors - "This message could not be delivered because it is a duplicate". For some reason the initiator keeps sending the same message to the target, and the target thinks it's a duplicate even though IT HASN'T DONE THE WORK (i.e., it never shows up in the queue). That seems to me to be the odd part: if the message really was a duplicate, why wasn't it put in the queue on the target and the work done at least? What in the world is the target doing with the message?!?
Here are the things I've considered:
1. I'm a bonehead and messed up somewhere in the service broker routing or configuration. Perhaps... but then why do some messages go through and others do not, and why can the exact same message sometimes succeed and sometimes fail?
2. Network issues - the reply to the initiator is getting lost. Perhaps... but then wouldn't the target resend the reply rather than just ignoring the issue?
3. I'm being punished for some unknown crime against society. (gee, I hope not!)
Do you have any ideas what might be wrong, or where I might look to start debugging?
Thanks,
Chad
May 5, 2011 at 10:45 am
May 5, 2011 at 11:39 am
I did, thank you. #3 seems closest to what I am experiencing. My “Message Classify” subclass is “1 – Local” with no TextData. Would that imply that the target thinks that the initiator is local?
I admit that it is very possible that my routes are messed up somehow, and that is the direction I’m leaning, but I am able to get some messages through so I’m not sure why it is flaky. Is there anything about my route that would make it so that some replies would be ok and some would not? Here is the SQL I used to create the route to the initiator from the target (I changed the service name, IP Address and Port so as not to reveal production info):
CREATE ROUTE ROUTENAME
WITH SERVICE_NAME = '//BLAH/BLAH/SERVICENAME'
, ADDRESS = 'TCP://0.0.0.0:0000'
GO
Thanks,
Chad
May 5, 2011 at 11:44 am
By chance do you have a timeout parameter set on your BEGIN CONVERSATION line?
Also, you mention you can't find the message in the queue, but by chance have you seen the correct results from the message get populated, even though it whines about duplication? (I'm assuming this is an asynchronous cross-server trigger)
Edit: Oops, also meant to include this link on tuning Service Broker. It discusses the need for certain waitfors:
http://msdn.microsoft.com/en-us/library/ms166135.aspx
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
May 5, 2011 at 11:47 am
If most of the messages are sent and received successfully, I wouldn't assume any misconfiguration.
Could it be that network latency causes ack messages to be delivered too late back from target to initiator, which again causes the initiator to resend the message?
May 5, 2011 at 11:53 am
Nils Gustav Stråbø (5/5/2011)
If most of the messages are sent and received successfully, I wouldn't assume any misconfiguration. Could it be that network latency causes ack messages to be delivered too late back from target to initiator, which again causes the initiator to resend the message?
I see we're on the same thought pattern Nils. 🙂 I had the same thought, that something was slowing down the acknowledgement receipt in some way until it was too late and it started trying again.
May 5, 2011 at 12:06 pm
Craig Farrell (5/5/2011)
By chance do you have a timeout parameter set on your BEGIN CONVERSATION line? Also, you mention you can't find the message in the queue, but by chance have you seen the correct results from the message get populated, even though it whines about duplication? (I'm assuming this is an asynchronous cross-server trigger)
Edit: Oops, also meant to include this link on tuning Service Broker. It discusses the need for certain waitfors:
Thanks Craig. I don't have the timeout set on the BEGIN CONVERSATION, and that brought to mind an interesting band-aid. I started writing a process on the initiator that would watch sys.transmission_queue for messages sent to this specific service that had been enqueued for more than 1 minute, then end and resubmit them. I'm guessing that if I set the LIFETIME, errors will come back into the queue and I can resubmit them from there? Is this what you were thinking when you asked about the lifetime timeout? That is more elegant than the direction I was going, but I would still rather that everything was delivered right the first time. I'll have to do some research - what happens if the lifetime expires but the target was just slow and is actually in the middle of its work? Would it be possible then for the message to be resent when the work had already been done? Guess I could code for that easily enough.
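For anyone wanting to try the watch-the-transmission-queue idea described above, here is a minimal sketch of the lookup only (not the end-and-resubmit step). It assumes the placeholder service name from the CREATE ROUTE statement earlier in the thread and compares against UTC, since enqueue_time is stored in UTC. The transmission_status column is worth reading first, because it usually carries the text of the last delivery error.

```sql
-- Sketch: messages to one service that have been waiting in the
-- transmission queue for more than one minute.
-- '//BLAH/BLAH/SERVICENAME' is the anonymized placeholder from this thread.
SELECT conversation_handle,
       to_service_name,
       message_type_name,
       enqueue_time,            -- stored in UTC
       transmission_status      -- last delivery error, empty if none yet
FROM sys.transmission_queue
WHERE to_service_name = N'//BLAH/BLAH/SERVICENAME'
  AND enqueue_time < DATEADD(MINUTE, -1, GETUTCDATE())
ORDER BY enqueue_time;
```

Run it in the initiator database; sys.transmission_queue is per database, not per instance.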
No, the results/work that is supposed to be performed on the target does not get done. In fact, I turned off activation on the target queue just to be sure. The message never shows up in the target queue even though the profiler trace is showing duplicate message errors. Something is happening before it gets put in the queue.
I'm using WAITFOR RECEIVE with a timeout of 5000 in the activation procs on both ends. That seemed ok to me, should it be longer or shorter or removed?
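For reference, a minimal sketch of the kind of receive loop being described, with the 5000 ms timeout. This is not the actual procedure from this project: the queue name dbo.InitiatorQueue is hypothetical, and the application-specific work is left as a comment.

```sql
-- Sketch of an activation-procedure body; dbo.InitiatorQueue is a hypothetical name.
DECLARE @handle uniqueidentifier,
        @type   sysname,
        @body   varbinary(max);

WHILE 1 = 1
BEGIN
    BEGIN TRANSACTION;

    -- Wait up to 5 seconds for the next message, then give up.
    WAITFOR (
        RECEIVE TOP (1)
            @handle = conversation_handle,
            @type   = message_type_name,
            @body   = message_body
        FROM dbo.InitiatorQueue
    ), TIMEOUT 5000;

    IF @@ROWCOUNT = 0
    BEGIN
        -- Queue is empty: exit so the activated procedure can end.
        ROLLBACK TRANSACTION;
        BREAK;
    END

    IF @type IN (N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog',
                 N'http://schemas.microsoft.com/SQL/ServiceBroker/Error')
    BEGIN
        -- The other side ended the dialog (or it failed): close this end too.
        END CONVERSATION @handle;
    END
    -- ELSE: application-specific handling of @body would go here.

    COMMIT TRANSACTION;
END
```

The timeout only controls how long an idle procedure lingers before exiting; it does not affect message delivery between the servers.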
Thanks again, I really appreciate having a few directions to look.
Chad
May 5, 2011 at 12:12 pm
Nils Gustav Stråbø (5/5/2011)
If most of the messages are sent and received successfully, I wouldn't assume any misconfiguration. Could it be that network latency causes ack messages to be delivered too late back from target to initiator, which again causes the initiator to resend the message?
That is possible. I'm trying to think of how I could test that, I would probably need a sniffer I think. I'm not too network techie (I have a hard time filtering through and understanding sniffer data), but let me see what I can discover along that route and I'll report back shortly.
Thanks,
Chad
May 5, 2011 at 1:06 pm
When I think about it, I had similar problems on a server-to-server SB implementation a few years ago. This was on SQL 2005, but I can't remember what I did to fix it. I will try to find out tomorrow.
May 5, 2011 at 1:09 pm
<Big Sigh>Yea!</Big Sigh> I got it working. :-):-):-):-):-)
That was the short version. Here's the long one.
I got the sniffer running and was trying to get one good conversation and one bad conversation so I would have an easy comparison, but I was really having trouble. Sometimes it seemed like the sniffer was delayed as much as 15-20 seconds in showing what was happening and sometimes it seemed like it was missing packets completely. So I got a SysAdmin to sit with me, thinking it might be rerouting through a different part of the network and maybe a switch was dropping packets. He looked at my capture and commented, "So this is between your box and the server, right?" I said, "No, this is between the two servers, I'm running promiscuous. Wait..." Then it dawned on me.
I developed the Service Broker initiator solution on my box, going against a dev database on "ServerA" as the target. Two things happened at almost exactly the same time - I moved the initiator code from my box to "ServerB", and the dev database on "ServerA" was copied as a precautionary measure in advance of a major update. As a result, there were two databases on ServerA with routes back to an "initiator", one pointing to ServerB and one pointing to my box. After deleting all the service broker objects in that copy database and on my box everything is working well, no errors. WAHOO!
Thank you, thank you, thank you!
Chad
May 5, 2011 at 2:37 pm
Chad Crawford (5/5/2011)
As a result, there were two databases on ServerA with routes back to an "initiator", one pointing to ServerB and one pointing to my box. After deleting all the service broker objects in that copy database and on my box everything is working well, no errors. WAHOO!
D'oh. That would do it. Glad you found it! Thanks for letting us know. Note to self, always check all databases when troubleshooting really odd SB errors from now on!
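A minimal sketch of that "check all databases" step, assuming SQL Server 2005 or later and permission to read sys.routes in each database. The temp table and cursor names are illustrative only. The first query shows which databases have Service Broker enabled and their broker GUIDs (a copied database can carry the same GUID as its source); the loop then gathers every route on the instance so stray routes to the same service stand out.

```sql
-- 1) Which databases have Service Broker enabled, and with which broker GUID?
SELECT name, is_broker_enabled, service_broker_guid
FROM sys.databases
ORDER BY service_broker_guid, name;

-- 2) Collect the routes defined in every accessible online database.
CREATE TABLE #routes (
    database_name       sysname,
    route_name          sysname,
    remote_service_name nvarchar(256) NULL,
    broker_instance     nvarchar(128) NULL,
    address             nvarchar(256) NULL
);

DECLARE @db sysname, @sql nvarchar(max);

DECLARE db_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = N'ONLINE'
      AND HAS_DBACCESS(name) = 1;

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build a per-database query against that database's sys.routes view.
    SET @sql = N'INSERT INTO #routes
                 SELECT N' + QUOTENAME(@db, '''') + N', name, remote_service_name,
                        broker_instance, address
                 FROM ' + QUOTENAME(@db) + N'.sys.routes;';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM db_cursor INTO @db;
END

CLOSE db_cursor;
DEALLOCATE db_cursor;

-- Routes for the same service with different addresses are the thing to look for.
SELECT database_name, route_name, remote_service_name, broker_instance, address
FROM #routes
ORDER BY remote_service_name, database_name;

DROP TABLE #routes;
```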