July 30, 2014 at 9:58 pm
Comments posted to this topic are about the item Monitoring for Non Existent Events
July 31, 2014 at 3:02 am
Seeing as so many of us will have studied state machines in one way or another, it remains astounding that so many of us, so many times, do not set up monitoring to cater for all states. This is rather lax of us and very remiss. It is hardly surprising, though, when not enough consideration is paid to instrumentation in general.
Gaz
-- Stop your grinnin' and drop your linen...they're everywhere!!!
July 31, 2014 at 6:31 am
It takes a little more time, but changing the exception handling process for a job from "wait for users to notify the right person" to "have SQL Server notify the right person" is definitely worthwhile.
I have checks for jobs that haven't run or haven't finished in a reasonable period of time, but we can always use more. It's the irregular processes that run once a month that are hard to pin down.
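A minimal T-SQL sketch of that first kind of check: flag enabled Agent jobs with no successful completion inside an expected window. The 24-hour threshold is an assumption, and msdb.dbo.agent_datetime is the (undocumented) msdb helper that converts the integer run_date/run_time columns.

```sql
-- Sketch: enabled Agent jobs with no successful run in the last 24 hours.
-- The threshold and scope are assumptions; adjust to your own schedules.
SELECT  j.name,
        MAX(msdb.dbo.agent_datetime(h.run_date, h.run_time)) AS last_success
FROM    msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysjobhistory AS h
        ON h.job_id = j.job_id
       AND h.step_id = 0            -- job outcome rows only
       AND h.run_status = 1         -- 1 = succeeded
WHERE   j.enabled = 1
GROUP BY j.name
HAVING  MAX(msdb.dbo.agent_datetime(h.run_date, h.run_time)) IS NULL
     OR MAX(msdb.dbo.agent_datetime(h.run_date, h.run_time)) < DATEADD(HOUR, -24, GETDATE());
```

Scheduled as its own Agent job with a notification (Database Mail, for example), this covers the "have SQL Server notify the right person" part even when the monitored job never starts at all.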
July 31, 2014 at 6:32 am
That's one of the points in my DB corruption presentation. Don't assume everything's OK. Don't assume that the backups are succeeding just because you're not getting backup failure messages.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
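In the same spirit, a sketch that positively verifies backups exist rather than waiting for a failure message. The daily-full assumption and the 24-hour window are placeholders for your own policy.

```sql
-- Sketch: databases with no full backup in the last 24 hours (or none at all).
-- Assumes daily full backups; adjust the window and exclusions to your policy.
SELECT  d.name,
        MAX(b.backup_finish_date) AS last_full_backup
FROM    sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
        ON b.database_name = d.name
       AND b.type = 'D'                     -- 'D' = full database backup
WHERE   d.name <> 'tempdb'
GROUP BY d.name
HAVING  MAX(b.backup_finish_date) IS NULL
     OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());
```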
July 31, 2014 at 7:37 am
Thanks to SQL Sentry we have built onto our job step alerts and can now check for jobs that run too long. But I do see one more hole: jobs calling code that is not well built, which completes without error but does not do what you expected. I have had this come up from time to time. The latest issue was bad data flowing into the database that was not picked up until the weekly maintenance schedule found the data did not match the data type. Another time we had a trigger that got turned off... and not turned back on.
David
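The disabled-trigger case is one of the easier holes to close with a scheduled check. A sketch against sys.triggers; run it in each database you care about, and wire the result into whatever alerting you already have.

```sql
-- Sketch: surface DML triggers that are currently disabled in this database.
SELECT  OBJECT_SCHEMA_NAME(t.parent_id) AS table_schema,
        OBJECT_NAME(t.parent_id)        AS table_name,
        t.name                          AS trigger_name
FROM    sys.triggers AS t
WHERE   t.parent_class = 1                  -- table/view triggers, not database-level
  AND   t.is_disabled = 1;
```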
July 31, 2014 at 9:04 am
"A job that runs long or doesn't run at all can sting just as bad as one that fails."
What's the difference?
There is no difference.
July 31, 2014 at 9:40 am
Developers have in the past cursed bean counters for various reasons. However, there are reasons for counting and tracking. Maybe we need to regularly check and see whether we have the appropriate beans.
Not all gray hairs are Dinosaurs!
July 31, 2014 at 10:42 am
GoofyGuy (7/31/2014)
"A job that runs long or doesn't run at all can sting just as bad as one that fails."What's the difference?
There is no difference.
Sure there is. A long-running job might be stuck, but it's done some work. If you clear the issue, it may finish quickly. Depending on the job, ETL or some check, it might not affect your day-to-day operations.
One that doesn't run is bad because you might not realize the event hasn't occurred. If there is no underlying issue, as with a corruption check, then it might not affect you, but it certainly could in the future. A failure of the same job, on the other hand, would likely be indicative of a problem.
These all can cause problems, but there certainly is a difference in many cases. Not all, but many.
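A sketch of catching the "stuck but still working" case: Agent jobs whose current execution has run past a threshold. The two-hour cutoff is an assumption; per-job thresholds based on each job's own history would be better.

```sql
-- Sketch: Agent jobs still executing after more than two hours.
SELECT  j.name,
        a.start_execution_date,
        DATEDIFF(MINUTE, a.start_execution_date, GETDATE()) AS minutes_running
FROM    msdb.dbo.sysjobactivity AS a
JOIN    msdb.dbo.sysjobs AS j
        ON j.job_id = a.job_id
WHERE   a.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)  -- current Agent session
  AND   a.start_execution_date IS NOT NULL
  AND   a.stop_execution_date IS NULL
  AND   a.start_execution_date < DATEADD(HOUR, -2, GETDATE());
```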
July 31, 2014 at 10:45 am
Steve Jones - SSC Editor (7/31/2014)
GoofyGuy (7/31/2014)
"A job that runs long or doesn't run at all can sting just as bad as one that fails."What's the difference?
There is no difference.
Sure there is. A long-running job might be stuck, but it's done some work. If you clear the issue, it may finish quickly. Depending on the job, ETL or some check, it might not affect your day-to-day operations.
One that doesn't run is bad because you might not realize the event hasn't occurred. If there is no underlying issue, as with a corruption check, then it might not affect you, but it certainly could in the future. A failure of the same job, on the other hand, would likely be indicative of a problem.
These all can cause problems, but there certainly is a difference in many cases. Not all, but many.
All three cases represent a failure to design and test properly. There is no difference, in my mind, from that perspective.
July 31, 2014 at 10:52 am
Miles Neale (7/31/2014)
Developers have in the past cursed bean counters for various reasons. However, there are reasons for counting and tracking. Maybe we need to regularly check and see whether we have the appropriate beans.
Cool beans.
If the bean counter is our check and balance, maybe that is an indicator that we haven't been doing enough testing and verification on our part. :-D
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
July 31, 2014 at 10:53 am
dwilliscp (7/31/2014)
Thanks to SQL Sentry we have built onto our job step alerts and can now check for jobs that run too long. But I do see one more hole: jobs calling code that is not well built, which completes without error but does not do what you expected. I have had this come up from time to time. The latest issue was bad data flowing into the database that was not picked up until the weekly maintenance schedule found the data did not match the data type. Another time we had a trigger that got turned off... and not turned back on.
David
That is such a pain. When that happens, it usually involves implementing an additional process to verify success and alert if there is a smell of failure.
I'd rather add the extra checks and code to ensure fewer headaches down the road. :cool:
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
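One cheap form of the extra check Jason mentions is a post-load validation step that fails loudly when "completed without error" hides bad data, along the lines of dwilliscp's data-type example. In this sketch the table and column names (dbo.StagingOrders, AmountText) are hypothetical, and TRY_CONVERT requires SQL Server 2012 or later.

```sql
-- Sketch: fail the job step if staged text values won't convert to the expected type.
-- dbo.StagingOrders and AmountText are hypothetical names.
DECLARE @bad_rows int;

SELECT @bad_rows = COUNT(*)
FROM   dbo.StagingOrders
WHERE  AmountText IS NOT NULL
  AND  TRY_CONVERT(decimal(18, 2), AmountText) IS NULL;

IF @bad_rows > 0
    RAISERROR('Staging load produced %d rows that fail the numeric check.', 16, 1, @bad_rows);
```

Raised at severity 16 inside a job step, this fails the step, so the existing failure notification does the rest.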
July 31, 2014 at 8:35 pm
GoofyGuy (7/31/2014)
Steve Jones - SSC Editor (7/31/2014)
GoofyGuy (7/31/2014)
"A job that runs long or doesn't run at all can sting just as bad as one that fails."What's the difference?
There is no difference.
Sure there is. A long-running job might be stuck, but it's done some work. If you clear the issue, it may finish quickly. Depending on the job, ETL or some check, it might not affect your day-to-day operations.
One that doesn't run is bad because you might not realize the event hasn't occurred. If there is no underlying issue, as with a corruption check, then it might not affect you, but it certainly could in the future. A failure of the same job, on the other hand, would likely be indicative of a problem.
These all can cause problems, but there certainly is a difference in many cases. Not all, but many.
All three cases represent a failure to design and test properly. There is no difference, in my mind, from that perspective.
Perhaps, but the fallout of a partially completed job can be substantially harder to recover from than, say, a job that didn't run because someone disabled the scheduler.
Also, depending on the type of process you're dealing with, it may not be physically possible to test every single permutation, so yes, in some cases you might not be able to completely dummy-proof or fail-proof some jobs.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right, what was my emergency again?
August 1, 2014 at 8:44 am
The difference between a proactive and a reactive state.
Or perhaps: "Isn't it the users' job to tell the DBA when a job did not finish?"
The more you are prepared, the less you need it.
August 1, 2014 at 9:53 am
Andrew..Peterson (8/1/2014)
The more you are prepared, the less you need it.
Andrew, the last line in your post reminded me of something from years ago. For jobs that failed, we used to use checkpoint/restart features: we kept a reference of state, and when the job was restarted it resumed execution from that state/checkpoint. Using this technique, we were able to save a tremendous amount of time in those old Big Iron days.
Also, in jobs that failed to complete and just ran in loops, the last checkpoint may have been the one that caused the loop, wait, or other anomaly. If I remember rightly, and it has been a number of years, we could determine the state of the last correctly completed function or record, and then fool the checkpoint into restarting just after the last successful process, once the data error or other logic that caused the problem was fixed.
I wonder, after reading this series of posts, whether the old technique of checkpoint/restart has been lost, or forgotten.
Not all gray hairs are Dinosaurs!
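The technique doesn't have to be lost; a crude version is easy to build around a checkpoint table. A sketch follows, with all object names (dbo.JobCheckpoint, dbo.SourceRows, dbo.TargetRows) hypothetical and the batching kept deliberately simple.

```sql
-- Sketch: checkpoint/restart for a batch load. dbo.JobCheckpoint holds one row
-- per job name, seeded with last_processed_id = 0. All names are hypothetical.
DECLARE @last_id int, @batch_end int;

SELECT @last_id = last_processed_id
FROM   dbo.JobCheckpoint
WHERE  job_name = 'NightlyLoad';

WHILE EXISTS (SELECT 1 FROM dbo.SourceRows WHERE id > @last_id)
BEGIN
    SELECT @batch_end = MAX(id)
    FROM  (SELECT TOP (1000) id
           FROM   dbo.SourceRows
           WHERE  id > @last_id
           ORDER BY id) AS batch;

    INSERT dbo.TargetRows (id, payload)
    SELECT id, payload
    FROM   dbo.SourceRows
    WHERE  id > @last_id
      AND  id <= @batch_end;

    UPDATE dbo.JobCheckpoint            -- record progress; a restart resumes here
    SET    last_processed_id = @batch_end
    WHERE  job_name = 'NightlyLoad';

    SET @last_id = @batch_end;
END;
```

If the job dies mid-run, the next execution starts from the last recorded checkpoint instead of reprocessing everything, which is essentially the old Big Iron pattern expressed in T-SQL.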
August 4, 2014 at 9:26 am
Matt Miller wrote:
... depending on the type of process you're dealing with, it may not be physically possible to test every single permutation, so - yes in some cases you might not be able to completely dummy-proof or fail-proof some jobs.
Maybe not, but it's no excuse to bypass developing the appropriate test cases, either.
The time spent actually writing software should be almost vanishingly small compared to the time expended on design up front and testing at the back end.