Tracking Failed Job Steps

Recently a customer was asking for a way to alert on job steps that failed, but the job succeeded. They really wanted a custom metric for SQL Monitor, which I submitted, but this post looks at the query in general, trying to determine if a job step failed.

Note: Let me start by noting that this is based on work by Kendra from her post, SQL Agent Jobs: Checking for failed steps at the end of a job.

When a job runs, by default, any step failure fails the job. You can see this in the Advanced tab for any job step. There are two drop downs on this tab, shown below, with the defaults. Success goes to the next step, failure ends the job.

However, you have choices. This images shows the other choices available for the failure action:

Various options for job step actions

There are jobs people set up that use the “Go to the next step” option instead of failing. People also sometimes have a step skipped if they current step fails, developing a workflow using the action as an IF..THEN..ELSE type of construct.

A Query for Job History

Kendra’s query was a good starting point, and I used most of it in the first CTE shown below. We start by querying msdb.dbo.sysjobactivity and finding the latest execution for all jobs. This result then joins with msdb.dbo.sysjobhistory on the job_id, which is the PK. However, we are only looking at steps, which is anything with a step_id > 0. The step_id = 0 is for the overall job.

WITH cteActivity (job_id, start_execution_date)
AS ( -- get the latest job execution for all jobs
    SELECT job_id,
           MAX(start_execution_date) AS start_execution_date
    FROM msdb.dbo.sysjobactivity
    GROUP BY job_id)
, cteJobStep (Job_ID, Step_Name, run_date, run_time)
    AS (SELECT jh.job_id,
               jh.step_name AS JobStepName,
               jh.run_date,
               jh.run_time
        FROM msdb.dbo.sysjobhistory jh
            INNER JOIN cteActivity ja
                ON jh.job_id = ja.job_id
        WHERE jh.run_status = 0 --step failed
              AND jh.step_id != 0
              --         and jh.job_id = CONVERT(uniqueidentifier, '8C673935-F8C1-4E7D-94D3-1F3CAE50D7DC')
              AND --this block ensures that we just pull information from the most recent job run
            (
            -- The start time of the step, converted to datetime
            CONVERT(DATETIME, RTRIM(jh.run_date))
            + (jh.run_time * 9 + jh.run_time % 10000 * 6 + jh.run_time % 100
               * 10
              ) / 216e4 >= ja.start_execution_date -- the time the job last started
            )

Once we qualify the job steps from the history table, we look at the activity and check only the latest run of the job. This means if I run a job where a step fails, and then run it again and the step succeeds, it doesn’t appear in this result set.

After this, I query that CTE and join back to the sysjobhistory table to look for the overall job status. If the overall job fails, we don’t include the results in here. We only have matches when the job status is 1, which means success.

SELECT COUNT(*)
FROM cteJobStep c
     INNER JOIN dbo.sysjobhistory AS h
         ON h.job_id = c.Job_ID
            AND h.step_id = 0
            AND h.run_status = 1;

Together, this allows us to just get the jobs that might not report a failure, but in which there was one. The full query is at the bottom.

Testing The Code

Of course, I want to test this, so I created a few jobs. One of these is a job where one middle step fails.

five step job with the middle step faiing

The second one is a step where there are two errors:

Job with two steps that fail

There also is a job where the job and the step always fail.

Single step job that always fails

And, of course, a job that just works:

Job always succeeds

For most of these steps, they are simple "select 1" for success or "select 1/0" for a failure. This ensures the steps always succeed or fail.

These are four scenarios, and not completely comprehensive, but give me a good set of situations in which to test my query. Now I want to check and see if my code is working. I'll do that in an automated way. I'll use code to run sp_start_job, and then other code to check if my query returns what I expect.

Automated Testing

I wanted to test this, but not by starting the job manually in SSMS and then running my query. It's easy to forget to run a particular test or not properly reset things when there's a problem. To me, this is best fixed with automated testing. I decided to write some tSQLt tests to cover my cases.

The Action

The first step was to put my query in a stored procedure. This just means I'm not copy/pasting code into multiple tests. Plus, if I really want to add this as an alert, I'd want a procedure. The code is shown here:

CREATE OR ALTER PROC CheckForFailedJobSteps AS;
WITH cteActivity (job_id, start_execution_date)
AS ( -- get the latest job execution for all jobs
    SELECT job_id,
           MAX(start_execution_date) AS start_execution_date
    FROM msdb.dbo.sysjobactivity
    GROUP BY job_id)
, cteJobStep (Job_ID, Step_Name, run_date, run_time)
    AS (SELECT jh.job_id,
               jh.step_name AS JobStepName,
               jh.run_date,
               jh.run_time
        FROM msdb.dbo.sysjobhistory jh
            INNER JOIN cteActivity ja
                ON jh.job_id = ja.job_id
        WHERE jh.run_status = 0 --step failed
              AND jh.step_id != 0
              --         and jh.job_id = CONVERT(uniqueidentifier, '8C673935-F8C1-4E7D-94D3-1F3CAE50D7DC')
              AND --this block ensures that we just pull information from the most recent job run
            (
            -- The start time of the step, converted to datetime
            CONVERT(DATETIME, RTRIM(jh.run_date))
            + (jh.run_time * 9 + jh.run_time % 10000 * 6 + jh.run_time % 100
               * 10
              ) / 216e4 >= ja.start_execution_date -- the time the job last started
            ))
SELECT COUNT(*)
FROM cteJobStep c
    INNER JOIN msdb.dbo.sysjobhistory
    AS
    h
        ON h.job_id = c.Job_ID
           AND h.run_date = c.run_date
           AND h.run_time = c.run_time
           AND h.step_id = 0
           AND h.run_status = 1;
RETURN;
GO

The Tests

I have a template for tests that does an outline of what I need. I've written some articles on tSQLt, and the basic one shows a simple test with the Assemble, Act, Asset pattern. That's what I'll use here, with a simple test. The code is below for one test. This starts with a new test class and then the test.

EXEC tsqlt.NewTestClass @ClassName = N'AdminChecks' -- nvarchar(max)
GO
CREATE PROCEDURE [AdminChecks].[test JobFailureProc with No Failures]
AS
BEGIN
    --------------------------------------------
----- Assemble
--------------------------------------------
    DECLARE @expected INT
      , @actual INT
SELECT @expected = 0
CREATE TABLE #ProcResult (actual INT)
    
    --------------------------------------------
----- Act
--------------------------------------------
-- Clear jobs
    EXEC msdb.dbo.sp_start_job @job_name = 'Second Job with Two Errors', @step_name='Fourth Step'
 WAITFOR DELAY '00:00:02'
    EXEC msdb.dbo.sp_start_job @job_name = 'A Job with a Step Failure', @step_name='Export Data'
 WAITFOR DELAY '00:00:02'
    EXEC msdb.dbo.sp_start_job @job_name = 'Job that always fails', @step_name='Success Step'
 WAITFOR DELAY '00:00:02'
-- run the job
EXEC msdb.dbo.sp_start_job @job_name = 'A Job that Always Succeeds'
 WAITFOR DELAY '00:00:02'

-- GET results
INSERT #ProcResult
EXEC CheckForFailedJobSteps
SELECT @actual = actual FROM #ProcResult AS pr
    --------------------------------------------
----- Assert
--------------------------------------------
EXEC tsqlt.AssertEquals @Expected = @expected, 
                        @Actual = @actual,   
                        @Message = N'Incorrect result'

END    
GO

I start in the assemble area by creating a table for the results and then a variable with my expected result. For this test, there are no failures with this job, so I expect a zero for the number of failed job steps.

In the Act part, I first need to be sure we don't have any failures. I could clear history, but since I know the state of this system. I could run the purge job, but that requires sysadmin, and potentially that's a problem here. In a team environment, this would be a complex system to set up. In a dev area where we might use containers, this is simpler. I'd likely leave a note in the proc itself that there need to be no job failures before running tests. I introduce a delay, to be sure the job completes before running the next line.

In any case, this is a minor flaw in the repeatable testing process. Hopefully someone finding failing tests would be able to determine a problem is the result of either bad code or previously existing failed steps. Since this requires some jobs to exist, this already is a more complex testing scenario than I'd expect.

For me, I run all my test jobs to ensure that the latest execution of all jobs has no failures. This means I'd return a zero by default. Next, I run the job I want to run, which is the "always succeeds" job for this test. I take the result and store it in the @actual variable.

In the asset area, I check if the expected and actual values are the same.

The other tests look just like this, expect the @expected value is set to a different number, depending on what is required. I also call a different job in the "act" section. I've attached the script to this article for the tests if you want to duplicate this. I end up with tests for these cases:

No job failure steps
A single job failure step
A single job with two failure steps
A job that has a failure step, but then runs successfully

To execute the tests, you can use the tsqlt.run or tsqlt.runtestclass commands with the test name or the test class. For me, I can run the tests in SQL Test, which I do when I'm doing things over and over. I like that I can start this and then still check queries in my query window. All four of my tests succeeded (eventually).

Test results in SQL Test

The command behind the scenes would be:

EXEC tsqlt.RunTestClass @TestClassName = N'AdminChecks'

This gives me these results:

Manual test class execution results

Summary

One of the edge cases in monitoring jobs is the places where a job step might fail, but the job succeeds. There are reasons why a particular job might be set up in this manner, but often we still want to know if a job step failed. This article presents a way to get that data for the last run only. I chose that method, as often failures are transient and I don't want to be querying for failures that might have resolved themselves.

We also set up a testing framework to automate testing of our code, which allows us to add new cases or ensure regressions do not occur as we might enhance our code in the future.

A Query for Job History

Testing The Code

Automated Testing

The Action

The Tests

Summary

Resources

Rate

Share

Share

Rate

Tracking Failed Job Steps

A Query for Job History

Testing The Code

Automated Testing

The Action

The Tests

Summary

Resources

Rate

Share

Share

Rate

Related content

Contribute

Privacy Policy

Terms of Use

It Depends

The Basics of Cryptography Part 2

Cookies on SQLServerCentral