March 1, 2011 at 6:42 am
Hello,
Over the weekend, during a 3-step backup job, the cluster service failed because antivirus was reading the quorum drive. I now have a backup job that seems to be orphaned but still running. We use Quest LiteSpeed to back up these databases, but the Quest console shows the job as completed after step 1, with no record of steps 2 and 3.
The Job Activity Monitor shows that the job completed the previous week's run but shows nothing about the recent weekend. If you go into the job history, though, there is an entry for this job with a running symbol. The date column shows the current date and time, there is no step number, no server name and no step name, the runtime is over 2 days, and the message is "In Progress". There is no option to stop the job anywhere.
I have checked all sessions in SQL Server and every session has a login time after the failover, so none date back to when this job started. I'm not sure whether the SPID has been reallocated and is causing this problem, whether there is something I can do when the session does not appear in sp_who2, or whether a server reboot will help now.
Thanks
LilyWhites
March 1, 2011 at 7:36 am
Is there any option in the Quest console to stop the job?
M&M
March 1, 2011 at 7:38 am
No, the Quest console shows the job as "succeeded" although only 1 step has been logged.
I did stop the Quest services on the server but it made no difference.
March 1, 2011 at 8:00 am
My guess is that you will have to restart the instance to clear it, since you stopped the backup agent service and the thread is still open. I had this with another vendor when doing some testing (not your vendor and not Red-Gate), and that is why I didn't use them.
I would also call them to let them know about the issue. Any time a 3rd party product leaves something running that can't be killed without restarting the instance, that is a huge issue in my opinion. If that hits a production box and you have to suffer an outage over it, well, goodbye to that vendor.
I have not had that ever with Red-Gate which is one of the reasons I am as loyal to their backup products as I am.
David
@SQLTentmaker
“He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot
March 1, 2011 at 8:03 am
Thanks David,
I have kind of resigned myself to restarting the instance as soon as possible. It looks like that will be tomorrow night now.
I will update you all to let you know if it works!
Thanks again
lilywhites
March 1, 2011 at 8:23 am
Yeah - agree that is probably what has to be done. Not to sound naggy but I would make sure that the vendor was very clear that you had to reboot in order to clear this.
David
March 1, 2011 at 8:29 am
I will inform them. We just lost our support a couple of weeks ago, so this is perfect timing!! Not sure how they will accept a bug report from an unsupported customer, but I will let them know.
Much appreciated
March 3, 2011 at 3:41 am
Hello,
Just to inform you all that I have solved this issue. The server cluster was rebooted/failed over last night, but that did not clear the orphaned job step, and it was still showing as running this morning.
I checked all the system tables and saw some inconsistencies which I didn't want to change manually, so I thought that if I could get the job to run and complete, it should update all the statuses and resolve the issue. I was correct!!
I altered the job to run only the shortest step and ran it. It completed and returned a successful status. Looking at the history, I can see that the orphaned step from the weekend has been picked up by this job run and is included in the step history for today. It looks a bit weird having a step that started on the 27th, ran for 12 seconds, but completed today, but from what I can tell all the statuses correlate and all looks good!!
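For anyone wanting to see what this kind of inconsistency can look like, here is a minimal sketch. It is only an illustration: it uses an in-memory SQLite database as a stand-in for msdb, the table and column names follow msdb.dbo.sysjobs and msdb.dbo.sysjobactivity, and the job names and rows are made up. It is not confirmed that this is the exact inconsistency seen on this server. The idea is simply to list activity rows that have a start time but no stop time.

```python
import sqlite3

# In-memory stand-in for msdb. Table and column names mirror
# msdb.dbo.sysjobs and msdb.dbo.sysjobactivity; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sysjobs (job_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE sysjobactivity (
    session_id INTEGER,
    job_id TEXT,
    start_execution_date TEXT,
    stop_execution_date TEXT
);
INSERT INTO sysjobs VALUES
    ('A1', 'Weekly backup'),
    ('B2', 'Index maintenance');
INSERT INTO sysjobactivity VALUES
    (7, 'A1', '2011-02-27 01:00:00', NULL),
    (7, 'B2', '2011-02-27 02:00:00', '2011-02-27 02:10:00'),
    (8, 'B2', '2011-03-02 02:00:00', '2011-03-02 02:09:00');
""")

# Activity rows that started but never recorded a stop time.
ORPHAN_QUERY = """
SELECT j.name, a.session_id, a.start_execution_date
FROM sysjobactivity AS a
JOIN sysjobs AS j ON j.job_id = a.job_id
WHERE a.start_execution_date IS NOT NULL
  AND a.stop_execution_date IS NULL
ORDER BY a.start_execution_date
"""

orphans = conn.execute(ORPHAN_QUERY).fetchall()
for name, session_id, started in orphans:
    print(f"{name}: Agent session {session_id}, started {started}, no stop time")
```

On a real instance the same SELECT would need the msdb.dbo prefix on both tables. A job that is genuinely running right now also has no stop time, so a row only looks orphaned if it belongs to an older Agent session than the current one.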
Thanks for your help
March 3, 2011 at 9:02 am
Well, that is definitely odd.... 🙂 Glad that you fixed it though.
Out of curiosity, was there a system process from the backup agent running when the instance was failed over? That would be the only way I could see it coming back up after the instance stop/start. It is odd, though, that it was still associated with that job step.
Regardless, glad it was fixed and thanks for the follow-up information.
David
March 3, 2011 at 9:42 am
Hi David,
Yes, there was a system process. I assume that because the Windows cluster service actually failed rather than processing the failover, the failover was not clean, and that probably caused the backup service to have a bit of a tantrum!!
I will look at this closely while I wait for an answer from Quest as to whether they can replicate the issue.
Thanks for your help
LilyWhites