Detecting when a job/sproc has failed

  • How can I detect when a stored procedure fails?

    Background:

    I have two stored procedures that are very similar. One runs as a scheduled job that launches just after midnight and then terminates itself 23 hours later. The other job runs from a workstation under Query Analyzer.

    Each is basically running a dispatch loop.

    It checks a table for an unprocessed entry.

    If an unprocessed entry is there

    it runs a particular stored procedure based on the table entry

    it marks the entry as processed

    else it does a WAITFOR DELAY

    it updates a heartbeat table entry to show the application is still alive

    Loop back to check for another unprocessed entry

    I would like another job to be able to determine if those jobs are still running.

    Initially I thought I could look at the @@SPID entry in the heartbeat table and see if this process was in SYSPROCESSES. There are two problems with this approach:

    1) SPIDs get reused. For a period of time the SPID is not found in sysprocesses, but eventually another job gets assigned that same SPID. So I need to ensure that the SPID is the one I am interested in and not a re-cycle

    2) On the interactive job the stored procedure could fail, with a network error for example, but the process is still alive and would show up in SYSPROCESSES

    What other approaches should I consider?

    (I know, upgrade to 2005 and do error handling in SSIS)

    - Randall Newcomb

  • The job is easy, you set up an operator and set notifications for completion or failure.

    For the workstation, does someone kick this off? When it fails, meaning it stops running, does it error out or just stop running? What I might try is use osql instead and put it in a batch job on the workstation. Click the batch job, let it call osql and the proc, and then add some notification (email, net send, etc) in the batch job for when the job stops. That won't help you if the network dies, but it will if the process dies.

    Alternatively, you could have the proc write a status update to a table before the WAITFOR that has the latest time. Have another job check that table and if there isn't an entry less than xx minutes old, ping you.

  • For the workstation, does someone kick this off?

    Yes, it runs on a secondary desktop system in my cubicle. I'd like the process to run 24x7x365.

    When it fails, meaning it stops running, does it error out or just stop running?

    The most frequent failure (once every few weeks) is a network connection failure. The Query Analyzer session that was running the stored procedure displays the network error message and is then ready for the next command to be run, i.e. if I notice I can just press F5 to restart the stored procedure.

    Alternatively, you could have the proc write a status update to a table before the WAITFOR that has the latest time. Have another job check that table and if there isn't an entry less than xx minutes old, ping you.

    Thanks. That looks like the most workable idea. This is what I was working on when I noticed that the SPID was getting re-used. I was trying to determine if the process had failed vs. was just running a long query. I'll keep working on this approach since it seems that I am on the right track.

    - Randall Newcomb

  • The solution to this particular problem was to compare the date/time that the stored procedure wrote to the heartbeat table showing when the process started vs. the date/time of the last batch for that process in sysprocesses.

    This bit of code may help someone else:

    select h.*,p.*

    from warehouse.dbo.Maint_Heartbeat h

    left join [master].[dbo].[sysprocesses] p

    on h.SPID=p.SPID

    where h.completed is NULL

    and p.last_batch<=h.started

    This joins information from the heartbeat table with information from sysprocesses. The assumption is that if the completed field is still NULL then the process should be running. However, if the Last_Batch in sysprocesses is GREATER than the Started time logged in the heartbeat table then this is a re-used SPID.

    - Randall Newcomb

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply