November 26, 2013 at 10:05 am
Hello,
I'm trying to find the most efficient way of handling failures in a process that looks like this:
CREATE PROCEDURE SP_CALC
AS
BEGIN
... some logic...
... get a set of values from a queue table ...
-- exec several SP's for the values from the queue
EXEC SP_1
EXEC SP_2
...
EXEC SP_N
... delete the processed values from the queue
END
The issue I noticed is that, since there is no error handling in the calling procedure, one or more of the SP_N can fail while execution of SP_CALC continues and the queue still gets deleted, so those values may never be processed by the failed SP_N. The success of this process is critical to the functionality of the application it supports; it is resource intensive, runs a few times every minute, and each SP_N has its own error handling that rolls back its work and returns control to the caller.
I've tried a few things with the usual TRY..CATCH. My goal is to delete the queue only if all SP_N have succeeded and to rerun only the failed SP (I experimented with @retry in a loop for a preset number of times). I'm not sure that's the best approach, and I'm not sure how to handle the case where an SP_N still doesn't succeed after a couple of retries.
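For reference, the @retry idea I experimented with looked roughly like this (a sketch only; the retry count and the error message are arbitrary, and each SP_N is assumed to roll back its own work on failure, as described above):

DECLARE @retry INT = 0, @done BIT = 0;
WHILE @retry < 3 AND @done = 0
BEGIN
    BEGIN TRY
        EXEC SP_1;
        SET @done = 1;   -- success, leave the loop
    END TRY
    BEGIN CATCH
        SET @retry = @retry + 1;   -- SP_1 rolled back its own work; try again
    END CATCH
END
IF @done = 0
    RAISERROR('SP_1 failed after 3 attempts', 16, 1);   -- surface the failure to the caller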
If anyone has worked on a similar problem before, please share your experience.
November 26, 2013 at 11:16 am
You could try something along these lines (assuming each subroutine signals success with RETURN 1):
... get a set of values from a queue table ...
DECLARE
    @returnValueSP1 INT = 0,
    @returnValueSP2 INT = 0,
    @returnValueSPN INT = 0,   -- initialize to 0, not NULL: NULL <> 1 evaluates to UNKNOWN, so the loops would never run
    @run INT = 0;
-- exec several SP's for the values from the queue, retrying each up to 5 times
WHILE @run < 5 AND @returnValueSP1 <> 1
BEGIN
    EXEC @returnValueSP1 = SP_1;
    SET @run = @run + 1;
END
SET @run = 0;
WHILE @run < 5 AND @returnValueSP2 <> 1
BEGIN
    EXEC @returnValueSP2 = SP_2;
    SET @run = @run + 1;
END
...
SET @run = 0;
WHILE @run < 5 AND @returnValueSPN <> 1
BEGIN
    EXEC @returnValueSPN = SP_N;
    SET @run = @run + 1;
END
IF @returnValueSP1 = 1 AND @returnValueSP2 = 1 AND @returnValueSPN = 1
    ... delete the processed values from the queue
END
But this won't solve the real problem: why do the sprocs fail in the first place?
If this sproc is critical, resource intensive and runs frequently, the inner retries might put you in even more trouble if two runs start to overlap (e.g. the first call doesn't finish within one minute and the next scheduled SP_CALC has already started)...
I strongly recommend fixing the root cause (add the appropriate error handling and/or data validation prior to the processing) and tuning the code to reduce the resources required and make it more robust, instead of adding complexity with code that only deals with the side effects.
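One common way to guard against overlapping runs of SP_CALC is an application lock at the top of the procedure (a sketch; the resource name is an assumption, pick anything unique):

DECLARE @lockResult INT;
EXEC @lockResult = sp_getapplock
    @Resource = 'SP_CALC_run',   -- arbitrary lock name (assumption)
    @LockMode = 'Exclusive',
    @LockOwner = 'Session',
    @LockTimeout = 0;            -- fail immediately instead of waiting
IF @lockResult < 0
    RETURN;                      -- a previous SP_CALC is still running; skip this cycle

-- ... SP_CALC body ...

EXEC sp_releaseapplock @Resource = 'SP_CALC_run', @LockOwner = 'Session';

sp_getapplock returns a value >= 0 when the lock is granted and a negative value when it is not, so the scheduled runs simply skip a cycle instead of piling up.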
November 26, 2013 at 2:40 pm
Hi Lutz,
Thanks for the detailed reply!
This is definitely one way to control SP rerun and whether the queue gets cleaned up. And I completely agree about having any issues handled at the level of SP_N first... it's work in progress.
If the queue doesn't get cleaned up because the records weren't processed by SP_N (assuming it failed the allowed number of reruns), then the next time SP_CALC runs it will need to process all the new records in the queue plus all the old records (that we didn't remove) for all of SP_1, SP_2, ..., SP_N again, which is double the work.
Do you think it is even possible to minimize the impact on the entire process and design for the case where this one failed SP doesn't end up rerunning successfully?
Thank you,
V
November 26, 2013 at 2:50 pm
Do you think it is even possible to minimize the impact on the entire process and design for the case where this one failed SP doesn't end up rerunning successfully?
It depends on the business logic...
What should happen in such a scenario?
a) stop processing any more rows
b) ignore the rows that didn't get processed and continue with the remaining rows
Thinking about it, it sounds more and more like a RBAR process under the hood...
Are there any loops involved within each sproc (e.g. WHILE or CURSOR statements)?
If so, stop thinking about making minor changes to this process and start from scratch with a set-based solution. I'm sure you'll be surprised by the performance and stability gains you'll end up with.
Can you please explain in a little more detail what the code is doing? (Drop me a PM if you don't want to discuss it in public.)
November 27, 2013 at 11:27 am
Lutz,
Luckily nothing RBAR in this process ... I think in sets most of the time 🙂
Failure doesn't happen often - it's just a lot of data, lots of tables, it kicks off lots of procedures, it runs at high frequency, and it can directly affect data seen by the users, so I was hoping for SQL magic... and minimum impact from the change.
But joking aside, you have a good point - I'll completely rethink how this works.
I've decided to start with a minor change: logging the values that don't get processed when one of the SP_N fails, so they can be processed later while I analyze the root causes and improve the rest of the design.
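A minimal version of that logging step might look like this (a sketch only; the table and column names are assumptions, not the real schema):

IF @returnValueSPN <> 1
BEGIN
    -- record the queue rows that SP_N could not process, for later replay and root-cause analysis
    INSERT INTO FailedQueueLog (QueueId, FailedProc, LoggedAt)
    SELECT q.QueueId, 'SP_N', GETDATE()
    FROM QueueTable q;
END

That way the queue can still be cleaned up at the end of SP_CALC, and nothing is lost: the logged rows can be re-queued once the failing sproc is fixed.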
Thanks for giving it a thought and for the great advice!
V