24/7 To Kill or not To Kill

  • I came acrross a problematic stored porcedure who is taking a LONG time to execute  and the system is bogged down I was asked what to do in that case? Because the shop is 24 / 7 restart the service is not an option is the answer to kill the spid or is there something else that can be done?

     

     


    * Noel

  • Well if a spid is stuck directly related to the SP your option is to kill or restart (you could also throw the users out of the DB with single user mode and options available in SQL 2000) but net effect is still knocking everyone out of the DB. Even then killing the spid and trying to set single user mode on the DB may not work so restart then is your only option. The longer you potentially let a poor process run the more likely you will be forced to reboot anyway.

    Now if it happened with a particular SP I would try to get as much detail as possible about it to correct the situation before further occurrance.

    Also, make sure the current state of the SPID is something that would point to being actively running as sometimes connections remain and you are actually viewing the last batch (process) that occurred on the spid which may not actually be the issue you thought (simply killing the spid then should be fine).

  • Just my 2ct

    first use profiler to determine what's going on.

    Activate those switches that inform which sql is being used within the sp.

    this will put more load on your box, but the docter can only diagnose if there is patient that is feeling ill

    also take extra attention for table/idex-scans because the can consume a bunch of I/O.

    If you kill the spid, keep following it up, because it may take a while for the processes rollback to complete. If it takes way to long according to what you'd expect, stop/start sqlserver can speed it up, but that will cost some complete downtime

    If it is only one user that uses this sp, maybe you can revoke/deny exec permission to avoid this performance-hit during the time you are investigating the issue and the solution is being worked on.

    If you haven't enabled auto-statistics, run sp_updatestats and dbcc updateusage,

    then alter the proc so it runs with recompile, run it and remove the recompile.

    Maybe even clear the procedurecache, so all proc's are being recompiled with the new stats. (but this may even highlight other problems as well) 

     

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • The probelm seems to be that a procedure got slower (a lot)  and due to that I think an index defrag would help. Now, I am NOT allowed to stop the service and I know that rebuilding the index is going to lock the activity on that table (100 million rows!) Is there ANYTHING else that can be done ?

     


    * Noel

  • Yes, you can run DBCC INDEXDEFRAG. Look it up in the Books OnLine. This is a three step process, first it compresses the index, then it defrags/repages it, then it 'releases' the free space back to the operating system (SHRINKDATABASE). This last step increases the TEMPDB so keep an eye on it.

    As for what I do when there is an issue with a script/procedure being run? I check it out in Enterprise Manager, find out the SPID, run DBCC INPUTBUFFER(spid) to get an idea of what the script is. Then I tell the user that I'm KILLing their script because it's causing problems, but that I will check it out for them. Then I KILL the spid. I then set up Profiler and rund the script myself. I check it with Query Analyzer's Estimated Execution Plan first. I do my best to 'fix' the script for the user and then 'return' it to them.

    -SQLBill

  • SQLBill,

    Thank you very much for your input. I would do exactly as you mentioned but I am NOT allowed to change ANY SQL Code in production until tested and validated on QA. That is the reason of my question! What else can be done without modifying the SQL of the underperformant or hanged procedure? 

    PS: Keep in mind the 24x7 part

     I definetly wellcome your comments on this


    * Noel

  • I understand the part about testing/validating and agree with it, but sometimes that is your only option. I run 24x7 Unix databases on DB2 and sometimes we have to bite the bullet and restart if performance or production is affected.

    That being said, a few things mentioned above are great ideas.

    • Run profiler and watch what the SPID is doing
    • Once you kill it, watch the sp_who results to see if it is rolling back
    • If this is SQL 2000, you can defrag online, won't lock the table. Again, you might want to test in QA to get a warm fuzzy.
    • Look at the SQL and start fixing it in test/QA, so it can get rolled to production.
  • Yes, unfortunately you can say 24/7 all you like but ultimately you have to fix the problem somehow so I suggest you bark for a maintainence window. If they still say no then your hands are tied by the powers that be and until the system stalls itself (which may happen) there is nothing you will be allowed to do. If they are adamant about uptime then when do you guys do patching of the OS and SQL, or do you?

    I suggest start researching a fix and let them know there is high importance on a short term (restarting) and long term (code fix, or system redesign with backup service so maitainence can be performed regularly without seemingly much downtime) need here or the system my become too unstable for use anyway. Try to build enough supporting documentation to present and have a backup I told you so if they choose not to listen.

    Sorry but that is your only solution.

  • To clarify what I said....when I have a user who is 'hogging' up the system with a query; I KILL it. But I tell them first and have them send me the query they are trying to run.

    I review the script and run it myself in Query Analyzer, using the Estimated Execution Plan and Profiler, I see what it is doing and how it's hogging the system. I then either make changes myself or direct the user what to change AND I TELL THEM WHY.

    My users are now good enough that they let me know when their query seems to be slow and they immediately send me a copy of the query to look over. They cooperate with me because I help them out. They know I won't just KILL their script for no reason and they know I will do my best to make it run better AND let them know what I find.

    My database is 24/7 everyday of the year. Through baselining my system, I know when the system is getting the least use (there isn't any downtime, ever). When I need to do maintenance, I let all my users know in advance, when and how long it will take. Then I do my best to schedule it during the slowest times possible. But with my system that's almost never. So I will have downtimes when it's not really the best time. My users and management have learned that any time I take the system down it's a good thing.

    -SQLBill

  • THANK YOU Guys for your input!!!!

    When a Patch has to be applied we  apply patches to the pasive, fail over then to the other  apply patch there and we are back to normal. that is tested before rolling out  

    When it is absolutely impossible to get by then the we'll schedule the downtime. All this is comming because Management decided to use a SLA (service level agreement) with some clients and downtime means $$$  so I am trying to familiarize with how people on 24x7 solve this kind of problems that may not happen often but when they do I rather be prepared.

      


    * Noel

  • Well, as the person who asked noelD that question - small world eh? - I would like to chime in on the subject. Hello Noel

    First, a little background. Our shop provides 24/7 online banking sites for both commerical and consumer cardholders, with our biggest competitors being the bank's own internal IT departments. We also host these sites. Along with those 24/7 uptime requirements that Noel mentioned, we also have SLA (Service Level Agreements) with all our clients. Any violation of those, such as an extended outage, which can be as small as an hour a month for the entire month, turns directly into a credit for the customer. Obviously, we have to have monitoring in place that would sound the alarm long before the site goes into an unusable state. However, there are times were an issue jumps up on you, giving you little time to react. In a situation like that, where the customer is already feeling the brunt of a database problem you have to act fast. They, the client, care what the problem is, however, more so, they just want it to go away. Those circumstances dictate that you quickly collect as much information as you can, kill the process, and get started (immediately), offline, on it's resolution. We will owe the client an RCA (Root Cause Analysis) within 24 hours.

    Often the problems are not solved by an index or something as glaring (I won't bore you with my database war stories). If it were, you could always kill the offending process(es), then perform an online defrag of that guilty index.

    The bottom line is there are more than a few "right" answers. It just depends on what is required of your system.

    Antares686 - To answer your question, we have maintenance windows of 4hrs per month. Far to short to properly maintain 1/4 to 1/2 TB websites. However, we do have orchestrated rollouts with the client, which sometimes allow us a small window of additional downtime to get out important fixes. Nothing like being handcuffed by unrealistic contracts.

    Thanks,

    Chris

  • Pay special attention on the db's logfile(s) when you defrag or rebuild the indexes and you are on sql2k with full-recovery mode.

    If you don't have "unlogged"-operations in your normal production environment, you may want to set the db bulk-logged.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • It is really small !

    Chris, I am glad to hear from you!

     The other side of the story is that then 24x7 is really --> 24x6.5

     


    * Noel

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply