SQL Agent Job runs < hour for a few days, then routinely takes 8 hours until reboot.

  • Hello there!

    We have a SQL Agent Job that runs once a day at 1:15 AM PDT that purges records from a few tables. It's essentially just a maintenance task. This job will complete in less than an hour for a few days (like 3-5 days), and then suddenly it will start taking somewhere between 7 and 8 hours to complete. Once we reboot the server, the job run time returns to less than an hour. I suspect that restarting the SQL Server service would also resolve the problem, but we haven't tested that. Anyone have any idea why a job would run fine like this for a few days and then suddenly take 8 times the amount of time to run?

    Thanks,

    William

  • It could be a huge number of things. No way to tell you based on the information we have.

    I'd suggest monitoring the system in order to understand what's happening with memory, disk i/o and cpu, especially while this process runs. I'd also suggest checking the wait statistics to understand what's causing the server to run slow (you can use sys.dm_os_wait_stats before and after the process runs, or use extended events to capture the wait metrics for the query itself, I'd prefer the latter). I would also check for blocking when the process is running slow. If you don't have information about what's happening, solutions will just be guesses.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • As Grant said, there are many reasons why this can happen. My usual thinking would be locking, if there are other scripts, procs, etc. hitting the table at the same time. I've had problems in the past where some applications aren't committing transactions and are just sleeping which causes locking problems for quite awhile.

  • As mentioned before, this looks like blocking, typical signs are high signal waits with relatively low actual cpu usage but following the same pattern. The reason for it to suddenly go from a relatively short time to much longer is probably caused by hitting a threshold, once the blocked spids start to accumulate it snowballs.

    Code anti-patterns to look for are mixture of DML and DDL in the same modules, such as DELETE and ALTER INDEX.

    😎

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply