Backup log succesful in 28 secs, but job is not complete after 4H40

  • I don't have much to go on. I've had this job running for the last 6 months without any issue. It was logged as successfull run after 4H40 minutes when the server was rebooted. Nothing in the logs except jobs successful.

    Then lately the full backup (another job) failed because the drive ran out of space because someone decided to keep a few extra copies of the full backup. However that job succeeded this morning.

    The job is a simple backup log that overwrites the file if it exists, then moves the file to a shared folder on the san.

    The job was still running after 4H40 even if it took 28 seconds to backup and save the file and the job was reaking havoc on the system where nobody was able to post anything in the system untill the server was rebooted.

    I didn't have time to look for a deadlock because the reboot was done by someone else while he was talking to me over the phone.

    The DBA on site told me that he cleared some space on the backup drive as he realized that was part of the problem, but then the log backup file kept rising by like 1 byte a second without ever completing.

    I've seen plenty of backups fail, but it's usually pretty quick and "PAINLESS".

    Anybody knows what could have caused this?

  • Ninja's_RGR'us (5/12/2010)


    Nothing in the logs except jobs successful.

    what a about sql error log ? or sql agent log ?

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • Nothing in windows event logs nor sql logs nor sql agents...

    Everything is "A-OK" but the job never finished untill we did a forced shutdown... which then the agent stopped te job and recorded success even if the backup succeeded over 280 minutes earlier.

    We didn't test the log backup then but we were able to take another one since after another full backup so I just have no freaking clue of what went wrong there... and it caused almost 7 hours of outage in production hours (the users could barely read the data, but not do any data changes).

    I'd like to think that everythings going to be ok but I'm not sure that explaination is good enough... for anyone :ermm:

  • Ok let's approach from this angle :

    If it happens again what the heck can I do to figure out the problem?

    Right now I'm thinking MS support. But is there anything else I can do?

  • Were other jobs starting & stopping normally? If not then the sql agent was hung and can be restarted in the Control Panel/Services without a server restart.

    I used to have SQL 7 server where some jobs would hang like you describe, never did find out why but rather than rebooting the server I'd go into the server's Task Mgr and kill the process. Sorry I don't have the answer you're looking for but this workaround is less painful than hours of downtime and finally a reboot.

  • That was the only job that hang.

    Also in all logs, there's no notice that the agent was ever stopped / started other than the forced reboots we did on the server.

    Any other ideas?

  • If the backup completed, I don't think it's SQL Server. How is the file moved? Could there be something in there waiting on a response somehow? Some interactive thing?

    Is this multiple steps?

    What I'd do is add job steps in between each step you have, log something so that you can track what's happening, maybe grab disk space there and record it.

  • Steve Jones - Editor (5/13/2010)


    If the backup completed, I don't think it's SQL Server. How is the file moved? Could there be something in there waiting on a response somehow? Some interactive thing?

    Is this multiple steps?

    What I'd do is add job steps in between each step you have, log something so that you can track what's happening, maybe grab disk space there and record it.

    That's a good idea.

    We do backups on the local drive, then move to network on a san (all in a single step).

    That's why I was pointing out to the local DBA that there must be something wrong with the hardware because it doesn't make sense any other way... SQL server's been around for 10 releases, and the backup command in our environement has been there for 3 years, 2 on that drive so it makes no sens for it to be SQL Server (unless it was a new patch which we have not installed in a couple months).

    Now Steve can you think of any reasons why the move of a ±250 mb file on a super fast SAN could lock up all transactions on the whole server? The only answer I can come up with is that either the san is not working or the network is blocked somehow.

    However that could explain SLOW response but not a 99% deadlocks on all transactions... unless the san is dead... which the DBA swore that it was working correctly along with the network.

  • What's the code?

    The move shouldn't lock of the server. That makes me suspect coincidence and there's another issue.

  • Steve Jones - Editor (5/13/2010)


    What's the code?

    The move shouldn't lock of the server. That makes me suspect coincidence and there's another issue.

    Glad to see I'm not insane :w00t:.

    Here's the code

    DECLARE @BkName AS VARCHAR(250)

    SET @BkName = 'D:\PROD-FORDIA-TransactionLogs.bak'

    BACKUP LOG [PROD-FORDIA] TO DISK = @BkName WITH NOFORMAT, INIT, SKIP, NOREWIND, NOUNLOAD, STATS = 10

    EXECUTE xp_cmdshell 'MOVE D:\PROD-FORDIA-TransactionLogs.bak "\\Fordiavcenter\Informatique\Navision backup\"'

    What would you trace besides disk space?... which now I think is either a mere coincidence or a catalyst in a weird chain of events!

  • So the backup log part finishes and it starts the file move? Turn on perfmon and watch and make sure there is disk i/o, network i/o. Your SAN guy should also do the same with the SAN tools he has. You may have an issue with the HBA or with possibly on the LUN. Has any of the SAN's firmware been upgraded lately?

    ---------------------------------------------------------------------
    Use Full Links:
    KB Article from Microsoft on how to ask a question on a Forum

  • Trey has a good point. SysInternals has some good tools for tracking handles on the OS side to see what's being used.

  • Trey Staker (5/13/2010)


    So the backup log part finishes and it starts the file move? Turn on perfmon and watch and make sure there is disk i/o, network i/o. Your SAN guy should also do the same with the SAN tools he has. You may have an issue with the HBA or with possibly on the LUN. Has any of the SAN's firmware been upgraded lately?

    I've never used those. Any articles to guide me on the process?

  • Depends on OS, but System Monitor/Performance Monitor, or Task Manager will show you activity.

    Some other utlities here: http://technet.microsoft.com/en-us/sysinternals/bb545046.aspx

  • Ninja's_RGR'us (5/13/2010)


    Steve Jones - Editor (5/13/2010)


    What's the code?

    The move shouldn't lock of the server. That makes me suspect coincidence and there's another issue.

    Glad to see I'm not insane :w00t:.

    Here's the code

    DECLARE @BkName AS VARCHAR(250)

    SET @BkName = 'D:\PROD-FORDIA-TransactionLogs.bak'

    BACKUP LOG [PROD-FORDIA] TO DISK = @BkName WITH NOFORMAT, INIT, SKIP, NOREWIND, NOUNLOAD, STATS = 10

    EXECUTE xp_cmdshell 'MOVE D:\PROD-FORDIA-TransactionLogs.bak "\\Fordiavcenter\Informatique\Navision backup\"'

    What would you trace besides disk space?... which now I think is either a mere coincidence or a catalyst in a weird chain of events!

    I like Steve's idea of breaking the job down into two parts: the backup and the move. Set up each step to log the activity to an output file. This should help you determine which part of the step is causing the issue.

    You might also want to look into using Robocopy vs. Move. There are some serious enhancements in it vs. the other DOS tools. Check it out at http://technet.microsoft.com/en-us/library/cc733145(WS.10).aspx.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply