Note: Let me start by noting that this is based on work by Kendra from her post, SQL Agent Jobs: Checking for failed steps at the end of a job.
When a job runs, by default, any step failure fails the job. You can see this in the Advanced tab for any job step. There are two drop downs on this tab, shown below, with the defaults. Success goes to the next step, failure ends the job.
A Query for Job History
Kendra’s query was a good starting point, and I used most of it in the first CTE shown below. We start by querying msdb.dbo.sysjobactivity and finding the latest execution for all jobs. This result then joins with msdb.dbo.sysjobhistory on the job_id, which is the PK. However, we are only looking at steps, which is anything with a step_id > 0. The step_id = 0 is for the overall job.
WITH cteActivity (job_id, start_execution_date) AS ( -- get the latest job execution for all jobs SELECT job_id, MAX(start_execution_date) AS start_execution_date FROM msdb.dbo.sysjobactivity GROUP BY job_id) , cteJobStep (Job_ID, Step_Name, run_date, run_time) AS (SELECT jh.job_id, jh.step_name AS JobStepName, jh.run_date, jh.run_time FROM msdb.dbo.sysjobhistory jh INNER JOIN cteActivity ja ON jh.job_id = ja.job_id WHERE jh.run_status = 0 --step failed AND jh.step_id != 0 -- and jh.job_id = CONVERT(uniqueidentifier, '8C673935-F8C1-4E7D-94D3-1F3CAE50D7DC') AND --this block ensures that we just pull information from the most recent job run ( -- The start time of the step, converted to datetime CONVERT(DATETIME, RTRIM(jh.run_date)) + (jh.run_time * 9 + jh.run_time % 10000 * 6 + jh.run_time % 100 * 10 ) / 216e4 >= ja.start_execution_date -- the time the job last started )
Once we qualify the job steps from the history table, we look at the activity and check only the latest run of the job. This means if I run a job where a step fails, and then run it again and the step succeeds, it doesn’t appear in this result set.
SELECT COUNT(*) FROM cteJobStep c INNER JOIN dbo.sysjobhistory AS h ON h.job_id = c.Job_ID AND h.step_id = 0 AND h.run_status = 1;
Together, this allows us to just get the jobs that might not report a failure, but in which there was one. The full query is at the bottom.
Testing The Code
Of course, I want to test this, so I created a few jobs. One of these is a job where one middle step fails.
There also is a job where the job and the step always fail.
And, of course, a job that just works:
For most of these steps, they are simple "select 1" for success or "select 1/0" for a failure. This ensures the steps always succeed or fail.
These are four scenarios, and not completely comprehensive, but give me a good set of situations in which to test my query. Now I want to check and see if my code is working. I'll do that in an automated way. I'll use code to run sp_start_job, and then other code to check if my query returns what I expect.
Automated Testing
I wanted to test this, but not by starting the job manually in SSMS and then running my query. It's easy to forget to run a particular test or not properly reset things when there's a problem. To me, this is best fixed with automated testing. I decided to write some tSQLt tests to cover my cases.
The Action
The first step was to put my query in a stored procedure. This just means I'm not copy/pasting code into multiple tests. Plus, if I really want to add this as an alert, I'd want a procedure. The code is shown here:
CREATE OR ALTER PROC CheckForFailedJobSteps AS; WITH cteActivity (job_id, start_execution_date) AS ( -- get the latest job execution for all jobs SELECT job_id, MAX(start_execution_date) AS start_execution_date FROM msdb.dbo.sysjobactivity GROUP BY job_id) , cteJobStep (Job_ID, Step_Name, run_date, run_time) AS (SELECT jh.job_id, jh.step_name AS JobStepName, jh.run_date, jh.run_time FROM msdb.dbo.sysjobhistory jh INNER JOIN cteActivity ja ON jh.job_id = ja.job_id WHERE jh.run_status = 0 --step failed AND jh.step_id != 0 -- and jh.job_id = CONVERT(uniqueidentifier, '8C673935-F8C1-4E7D-94D3-1F3CAE50D7DC') AND --this block ensures that we just pull information from the most recent job run ( -- The start time of the step, converted to datetime CONVERT(DATETIME, RTRIM(jh.run_date)) + (jh.run_time * 9 + jh.run_time % 10000 * 6 + jh.run_time % 100 * 10 ) / 216e4 >= ja.start_execution_date -- the time the job last started )) SELECT COUNT(*) FROM cteJobStep c INNER JOIN msdb.dbo.sysjobhistory AS h ON h.job_id = c.Job_ID AND h.run_date = c.run_date AND h.run_time = c.run_time AND h.step_id = 0 AND h.run_status = 1; RETURN; GO
The Tests
I have a template for tests that does an outline of what I need. I've written some articles on tSQLt, and the basic one shows a simple test with the Assemble, Act, Asset pattern. That's what I'll use here, with a simple test. The code is below for one test. This starts with a new test class and then the test.
EXEC tsqlt.NewTestClass @ClassName = N'AdminChecks' -- nvarchar(max) GO CREATE PROCEDURE [AdminChecks].[test JobFailureProc with No Failures] AS BEGIN -------------------------------------------- ----- Assemble -------------------------------------------- DECLARE @expected INT , @actual INT SELECT @expected = 0 CREATE TABLE #ProcResult (actual INT) -------------------------------------------- ----- Act -------------------------------------------- -- Clear jobs EXEC msdb.dbo.sp_start_job @job_name = 'Second Job with Two Errors', @step_name='Fourth Step' WAITFOR DELAY '00:00:02' EXEC msdb.dbo.sp_start_job @job_name = 'A Job with a Step Failure', @step_name='Export Data' WAITFOR DELAY '00:00:02' EXEC msdb.dbo.sp_start_job @job_name = 'Job that always fails', @step_name='Success Step' WAITFOR DELAY '00:00:02' -- run the job EXEC msdb.dbo.sp_start_job @job_name = 'A Job that Always Succeeds' WAITFOR DELAY '00:00:02' -- GET results INSERT #ProcResult EXEC CheckForFailedJobSteps SELECT @actual = actual FROM #ProcResult AS pr -------------------------------------------- ----- Assert -------------------------------------------- EXEC tsqlt.AssertEquals @Expected = @expected, @Actual = @actual, @Message = N'Incorrect result' END GO
I start in the assemble area by creating a table for the results and then a variable with my expected result. For this test, there are no failures with this job, so I expect a zero for the number of failed job steps.
In the Act part, I first need to be sure we don't have any failures. I could clear history, but since I know the state of this system. I could run the purge job, but that requires sysadmin, and potentially that's a problem here. In a team environment, this would be a complex system to set up. In a dev area where we might use containers, this is simpler. I'd likely leave a note in the proc itself that there need to be no job failures before running tests. I introduce a delay, to be sure the job completes before running the next line.
In any case, this is a minor flaw in the repeatable testing process. Hopefully someone finding failing tests would be able to determine a problem is the result of either bad code or previously existing failed steps. Since this requires some jobs to exist, this already is a more complex testing scenario than I'd expect.
For me, I run all my test jobs to ensure that the latest execution of all jobs has no failures. This means I'd return a zero by default. Next, I run the job I want to run, which is the "always succeeds" job for this test. I take the result and store it in the @actual variable.
In the asset area, I check if the expected and actual values are the same.
The other tests look just like this, expect the @expected value is set to a different number, depending on what is required. I also call a different job in the "act" section. I've attached the script to this article for the tests if you want to duplicate this. I end up with tests for these cases:
- No job failure steps
- A single job failure step
- A single job with two failure steps
- A job that has a failure step, but then runs successfully
To execute the tests, you can use the tsqlt.run or tsqlt.runtestclass commands with the test name or the test class. For me, I can run the tests in SQL Test, which I do when I'm doing things over and over. I like that I can start this and then still check queries in my query window. All four of my tests succeeded (eventually).
The command behind the scenes would be:
EXEC tsqlt.RunTestClass @TestClassName = N'AdminChecks'
This gives me these results:
Summary
One of the edge cases in monitoring jobs is the places where a job step might fail, but the job succeeds. There are reasons why a particular job might be set up in this manner, but often we still want to know if a job step failed. This article presents a way to get that data for the last run only. I chose that method, as often failures are transient and I don't want to be querying for failures that might have resolved themselves.
We also set up a testing framework to automate testing of our code, which allows us to add new cases or ensure regressions do not occur as we might enhance our code in the future.