An Alert Philosophy

  • Comments posted to this topic are about the item An Alert Philosophy

  • This article reminds me of the alert fatigue that hospital staff are subjected to (and fast food staff, too; it tends to be as much of a cacophony behind the counter these days). If everything is beeping at you all the time, the real alerts don't stand out.

    I especially like the idea of refusing to set an alert for something you don't have to do anything about. That's a clear way to improve the signal/noise ratio.


    Puto me cogitare, ergo puto me esse.
    I think that I think, therefore I think that I am.

  • I developed a monitoring service on Windows that periodically checks a web application: it verifies that the application returns a valid login page and that it can establish a database connection. When an error is detected, it notifies the server hosting team, the DBA team, or the NOC team by email and then switches to a longer polling period; once errors are no longer detected, it switches back to the regular polling period. It also uses a longer polling period after hours and on weekends and holidays.

    I've since used the monitoring service as a template to create other monitoring services for web applications that I support.
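
For readers who want to try the pattern described above, here is a minimal sketch of one polling cycle. It is not the poster's actual service: `CheckResult`, `run_cycle`, the team keys, and the stub checks are all hypothetical names, and which failure goes to which team is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class CheckResult:
    ok: bool
    team: Optional[str] = None  # hypothetical routing key, e.g. "hosting", "dba", "noc"
    detail: str = ""


def run_cycle(
    checks: Iterable[Callable[[], CheckResult]],
    notify: Callable[[str, str], None],
    regular_s: int,
    error_s: int,
) -> int:
    """Run each check once, notify on failures, return seconds until the next cycle."""
    failures = [result for result in (check() for check in checks) if not result.ok]
    for result in failures:
        notify(result.team or "noc", result.detail)  # assumed fallback team
    # Any failure selects the longer period; a clean cycle restores the regular one.
    return error_s if failures else regular_s


# Stub checks standing in for the real login-page and database probes.
def login_page_ok() -> CheckResult:
    return CheckResult(ok=True)


def database_down() -> CheckResult:
    return CheckResult(ok=False, team="dba", detail="cannot open database connection")


sent = []
delay = run_cycle(
    [login_page_ok, database_down],
    lambda team, msg: sent.append((team, msg)),
    regular_s=600,
    error_s=3600,
)
```

Returning the next delay from each cycle keeps the "back off while broken, recover when healthy" decision in one place, so the surrounding loop only has to sleep for whatever it is told.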

  • Changing monitoring intervals, or suppressing additional alerts, is something that needs to happen as well. Too easy to keep sending the same thing over and over. However, that's a complex task. What I'd like is a "this is being worked" that limits alerts until someone notes it's fixed or they want alerts again.
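
One way the "this is being worked" idea could look, as a rough sketch only: an acknowledged alert is held back except for an occasional reminder, until someone marks it resolved. `AlertGate`, its method names, and the reminder interval are assumptions for illustration, not features of any particular monitoring product.

```python
from typing import Dict


class AlertGate:
    """Hold back repeat alerts for problems that someone has acknowledged."""

    def __init__(self, reminder_s: float):
        self.reminder_s = reminder_s
        # Alert key -> time of acknowledgement or of the last reminder sent.
        self._acked: Dict[str, float] = {}

    def acknowledge(self, key: str, now: float) -> None:
        """Someone is working the problem; limit further alerts for this key."""
        self._acked[key] = now

    def resolve(self, key: str) -> None:
        """Problem fixed, or alerts wanted again; go back to normal alerting."""
        self._acked.pop(key, None)

    def should_send(self, key: str, now: float) -> bool:
        last = self._acked.get(key)
        if last is None:
            return True  # not acknowledged, so alert as usual
        if now - last >= self.reminder_s:
            self._acked[key] = now  # send one reminder and restart the clock
            return True
        return False


gate = AlertGate(reminder_s=3600)
gate.acknowledge("db-connection", now=0)
```

Passing `now` in explicitly, instead of reading the clock inside the class, is a design choice that makes the behaviour easy to test; a real service would likely pass `time.time()`.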

     

  • Steve Jones - SSC Editor wrote:

    Changing monitoring intervals, or suppressing additional alerts, is something that needs to happen as well. Too easy to keep sending the same thing over and over. However, that's a complex task. What I'd like is a "this is being worked" that limits alerts until someone notes it's fixed or they want alerts again.  

    I like that idea. Having the option to suppress alerts, or significantly reduce their frequency, until the anticipated correction is in place reduces the tendency to build up a reflex of ignoring them.


    Puto me cogitare, ergo puto me esse.
    I think that I think, therefore I think that I am.

  • I want to keep sending the alerts out as long as the problem exists; otherwise, it may be viewed as a transient problem. Shifting to a longer polling interval helps prevent overloading the responding team with alerts. One web application is used to register volunteers for disaster response, and it is also used during disasters; the state agency where I work is a first responder for disasters, setting up shelters and taking care of health needs. For that service, I added a "Hurricane Mode" that uses the normal working hours polling schedule on weekends and after hours for as long as the state's Emergency Management Division is activated.

    Polling intervals:
    Normal hours: 10 minutes; once an error is detected: 60 minutes.
    After hours, weekends, and holidays: 4 hours; once an error is detected: 8 hours.
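
The schedule above, including "Hurricane Mode", can be sketched as a single lookup function. The interval values come from the post; everything else is an assumption for illustration: the function name `polling_interval`, the 08:00 to 17:00 working day, the holiday set, and the choice to let Hurricane Mode override holidays as well as weekends and after hours.

```python
from datetime import date, datetime, timedelta
from typing import AbstractSet

# (healthy interval, interval while an error is being detected)
NORMAL_HOURS = (timedelta(minutes=10), timedelta(minutes=60))
OFF_HOURS = (timedelta(hours=4), timedelta(hours=8))


def polling_interval(
    now: datetime,
    error_detected: bool,
    hurricane_mode: bool = False,
    holidays: AbstractSet[date] = frozenset(),
    work_start: int = 8,   # assumed start of the working day (hour)
    work_end: int = 17,    # assumed end of the working day (hour)
) -> timedelta:
    """Pick the wait before the next check from the two-tier schedule."""
    off_hours = (
        now.weekday() >= 5                        # Saturday or Sunday
        or now.date() in holidays
        or not (work_start <= now.hour < work_end)
    )
    # Hurricane Mode applies the normal-hours schedule around the clock.
    healthy, failing = NORMAL_HOURS if (hurricane_mode or not off_hours) else OFF_HOURS
    return failing if error_detected else healthy


saturday_morning = datetime(2024, 1, 13, 10, 0)
quiet = polling_interval(saturday_morning, error_detected=False)
activated = polling_interval(saturday_morning, error_detected=False, hurricane_mode=True)
```

Here `quiet` follows the weekend schedule while `activated` follows the working-hours one, which is the difference the mode is meant to make during an activation.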
