SQL Agent job intermittently fails to execute SSIS package

  • I need some suggestions for troubleshooting this issue.  I have the following setup:
     
    Application T-SQL
    I have SQL code in an application that executes a SQL Agent Job.   (exec @rtn = msdb.dbo.sp_start_job 'Job Name')
    If the SQL Agent Job returns a 0, I report a successful launch on the UI.
    If the SQL Agent Job returns a 1, I report a failure to launch on the UI.
     
    SQL Agent Job
    Step 1 – Execute SSIS Package
    On Success – End Job Successfully
    On Failure – Go To Next Step
    Step 2 – Send Custom Failure Email
    On Success – End Job Successfully
    On Failure – End Job Unsuccessfully
     *In the event of unsuccessful SQL Agent job run, a default SQL Agent failure email is sent.
     
    SSIS Package
    Runs a bunch of steps to import a spreadsheet into a table.
    If any step fails, it generates a custom failure email particular to the step that failed.
    If whole package succeeds, it generates a custom success email.
     
    Here is the behavior:
      
    Every two weeks, the user manually initiates the T-SQL from within the application.  The first time the user runs it that day, they get a message from the application saying the job started successfully, but then they get the custom failure email from SQL Agent under Step 2.  Based on the state of the data, the package clearly has not run successfully.  None of the package’s custom emails are received.  
      
    Next the user re-runs the job; they do nothing else but re-run it.  Now, the entire thing runs successfully.  And it runs successfully no matter how many times it is run again that day.   That is until two weeks later (after two weeks of not running it) and then the cycle repeats.
     
    My troubleshooting steps:

    I’ve manually run the SSIS package multiple times by itself.  I’ve update the file that it imports the same way the user does.  I can’t get the SSIS package to fail.

    I’ve run the SQL Agent manually multiple times.   Each time it kicks off the SSIS package just fine.  But I’ve done this after I’ve received the initial failure email.  I’ve yet to be able to reproduce the error.  But I have yet to try this after a week of downtime.  I plan to test that next.
     
    My thoughts:
      
    SQL Agent is telling the T-SQL execution that it successfully started, which is corroborated by the fact that SQL Agent is saying it failed to execute the SSIS package via the custom email. And the SSIS package is not saying anything, which suggests it isn’t running at all.   I'm thinking this is an issue between SQL Agent and SSIS.  
      
    I have several other jobs set up the exact same way.  None of them are having any issues.   The only difference is that these other jobs are scheduled to run nightly and are only occasionally kicked off manually.   The one that is failing is the only one that sits dormant for days/weeks at a time.  That is the only thing different that I can point to.

    The SQL server and the SSIS packages are on the same physical server.  So, it’s unlikely to be a connection issue from a network perspective.  It’s possible that it’s a Microsoft issue since SQL Agent is using a proxy to access the SSIS package based on standard configuration.  But I can’t find anything on the tech support sites where someone has reported the same symptoms.   Everything I’ve researched on that topic discusses things that would stop it from working each and every time.  I’ve found nothing that would cause it to fail once and then start working… as if something were asleep and being woken up.

  • I assume that you are using sp_start_job to start the Agent job?

    If so, the fact that it returns 'success' means only that the job was started successfully. It does not mean that the job completed successfully.

    sp_start_job starts a separate async process to execute the job.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Yes, I'm using sp_start_job.    Correct.  I'm only telling the user in the UI that the job started successfully.   They then wait for an email to tell them whether it completes successfully or not.   Since they get a failure email from SQL Agent and not from the SSIS package, I'm assuming that for some reason SQL Agent is unable to initiate the package at all.  But I have no idea why it would have trouble on the first attempt, but then work on all the subsequent attempts.

    *I'll edit my original post to be clearer.

  • Michael Megley - Thursday, March 23, 2017 8:14 AM

    Yes, I'm using sp_start_job.    Correct.  I'm only telling the user in the UI that the job started successfully.   They then wait for an email to tell them whether it completes successfully or not.   Since they get a failure email from SQL Agent and not from the SSIS package, I'm assuming that for some reason SQL Agent is unable to initiate the package at all.  But I have no idea why it would have trouble on the first attempt, but then work on all the subsequent attempts.

    *I'll edit my original post to be clearer.

    Are you able to look at the Agent job history and get any more detail about the failures?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • So, that was helpful.  I am finding error messages from SSIS.  (Clearly, I need to revisit my SSIS package to get the custom failure emails to execute properly.)
    Below is the error message from SQL Agent History.  I did several different searches based on these error codes.  The threads discussed issues with security permissions and 32 bit vs. 64 bit issues.
    In all cases they caused permanent issues, not intermittent issues.   I've already dealt with the 32 bit issue when I created these jobs. 
    I have the 32 bit driver setup on the server, with each SSIS package set to not run 64 bit run time in debug mode.  And I have the SQL agent job set to run the package in 32 bit mode.
    If this were the issue, I would expect it to fail every time it runs.

    03/23/2017 06:32:31,PTO Balance Import,Error,1,SERVERNAME,PTO Balance Import,ExecutePackage,,Executed as user: SERVERNAME\appPROXY. Microsoft (R) SQL Server Execute Package Utility  Version 12.0.5000.0 for 32-bit  Copyright (C) Microsoft Corporation. All rights reserved.    Started:  6:32:31 AM  Error: 2017-03-23 06:32:32.70     Code: 0xC0202009     Source: PTO Balance Import Connection manager "Excel Connection Manager"     Description: SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80004005.  An OLE DB record is available.  Source: "Microsoft Access Database Engine"  Hresult: 0x80004005  Description: "External table is not in the expected format.".  End Error  Error: 2017-03-23 06:32:32.70     Code: 0xC020801C     Source: Import PTO Balances PTO Balance [14]     Description: SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER.  The AcquireConnection method call to the connection manager "Excel Connection Manager" failed with error code 0xC0202009.  There may be error messages posted before this with more information on why the AcquireConnection method call failed.  End Error  Error: 2017-03-23 06:32:32.70     Code: 0xC0047017     Source: Import PTO Balances SSIS.Pipeline     Description: PTO Balance failed validation and returned error code 0xC020801C.  End Error  Error: 2017-03-23 06:32:32.70     Code: 0xC004700C     Source: Import PTO Balances SSIS.Pipeline     Description: One or more component failed validation.  End Error  Error: 2017-03-23 06:32:32.72     Code: 0xC0024107     Source: Import PTO Balances      Description: There were errors during task validation.  End Error  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  6:32:31 AM  Finished: 6:32:32 AM  Elapsed:  0.797 seconds.  The package execution failed.  The step failed.,00:00:01,0,0,,,,0

  • Looks like others have experienced something similar (see here and here, for example).

    You might have to play around with connection strings & do some detective work on the source files.

    Not sure I can help any further than that.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I looked at those and they don't seem to be related.  I already have an ACE.OLDB connection manager and an Excel 2007-2010 file.  I'm using Visual Studio 2015 with most current .net framework.  The same file located on the shared server folder is being opened by the user and data pasted over the top of the old data.  It's not like a new file is being created in a different format.  And in the scenarios you referenced, their issues didn't just go away on their own.  I'm really baffled how the exact same file can fail on the first run and then succeed on the second run.

  • Do you have any suggested approaches on how I could test the connection string in SSIS or test the source file?   So far, I have been unable to replicate the issue.  Which typically means it's some kind of user error.  But in this case, I've tried replicating all the things the user could have done wrong and still not able to replicate. 

  • You may not get too much more information, but do you have logging turned on for your SSIS package(s)? You can report the On Warning and On Error messages to a file, and then inspect which process is throwing the error. Sometimes, that output is richer than what you get in the History for a SQL Agent job.

    --=Chuck

  • I've tracked down all the error messages and they just are not specific enough to tell me why SSIS is having trouble making the connection to the Excel file.  I can't find anything wrong with the Excel file.  I'm able to upload it just fine.

    So, I sat down with the user and we were able to replicate the issue after two weeks of not running the package.   The new excel file was downloaded from the vendor, manually edited, and then saved over the old excel file in the share drive where it gets uploaded from.  She manually initiated the upload and SSIS threw the error.   She then initiated the upload again right away (no other action taken) and the package succeeded without any issues.   I then personally repeated all of the exact same steps over the next few days, but the package succeeded every time. 

    We then waited another 2 weeks, again not running the package during that time (this is a bi-weekly payroll process).   The user got the error the first time she tried to upload... and then it succeeded every single time after that.

    I am open to further suggestions.   If nothing else works, I'm going to try this shot in the dark:

    1) I'm going to set all of the Delay Validation to True.   Just cause.
    2) We are doing tempdb maintenance, so will have to restart the SQL Server.  

    Either of these may clear up the issue.  We've got another week until we do the normal upload.

  • Does the Excel file's 'date modified' change after the first (failed) load?
    Is the Excel file locked (can you open it for editing) after the first (failed) load?
    If not, we can assume that something is going on with DTExec and any subprocesses it spawns, I would think. 
    So we would expect the process to fail again after the server has been restarted (or are you just restarting the SQL Server service?)

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • The restart is just SQL Server.   And it is a one-time thing, only because we are moving tempdb to its own drive (and adding additional data files).   
    I'll watch the Excel modified by dates next week when we run the upload again. 
    When we first download the report from the vendor, it is locked and we have to tell it to allow editing.  After we unlock it, we edit it and then save it over the top of the old file.  I'll have to double check and see if it's re-locking at that point.
    I don't know what DTExec is, but I'll research it before asking you about it.
    Thanks for these ideas!  We'll see what I learn next week.  I'm suspicious of the locked status now and annoyed I didn't think to check that.  🙂

  • The file is being saved as Read-Only.  It's related to issues I'm now finding posted all over the internet:  saving files across a file share. 

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply