About a year ago our unit started processing a new job for a big customer. To meet our commitments on job completion timelines, we introduced several alerts. These alerts monitored various stages of the job’s thousands of files, including the receipt and processing of raw data and the (e)mailing of customer statements.
Alerts indicating successes, holds and failures are emailed automatically from our processing engine to all team members. They are sent at any time of the day or night, sometimes upwards of two hundred per day. After a month or so, monitoring these hundreds of alerts became impractical. Alerts with important information were easily overlooked, and naturally this resulted in production delays. Team meetings were held, and the remedial action taken was to add more alerts on top of the alerts to remind us of the alerts. Another couple of weeks passed, and crucial alerts were still being missed.
This led me to rethink how our alerts are generated, and the analogy of a car alarm came to mind. How often do you hear car alarms going off? The first time you hear your alarm you may run to take a look, checking whether anyone is breaking into your car. The second time a quick glance will suffice. By the third time your alarm goes off, you assume the alarm is broken or something is just messed up, and you disarm it immediately without looking. If it gets to ten false alarms, you are so angry that you want to go outside and destroy the alarm. Although you had good intentions when you installed it, the many false alerts cause you to tune them out or ignore them altogether. The result is that your prized possession is now vulnerable to anyone who wants to take it for a ride.
Unfortunately, the same thing happens in IT operations. When the purpose of each alert and alarm is not thought through, important alerts get ignored. Just visualize the hundreds of alerts occurring every second in IT monitoring. Setting correct thresholds will help to reduce, if not eliminate, false reports of viruses, password problems, intrusions and network failures. Exception alerts also help to maintain vigilant monitoring; these alerts fire when an expected event is missed rather than when an event has happened.
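To make the idea of an exception alert concrete, here is a minimal sketch in Python. It assumes each expected event has a deadline and that we keep a record of which events have already been seen; the stage names, deadlines and the `find_missed_events` helper are all made up for illustration and are not taken from our processing engine.

```python
from datetime import datetime, timedelta


def find_missed_events(expected, seen, now):
    """Return the names of events whose deadline has passed without being seen.

    expected: dict mapping event name -> deadline (datetime)
    seen: set of event names that have already occurred
    """
    return sorted(
        name
        for name, deadline in expected.items()
        if deadline <= now and name not in seen
    )


# Hypothetical job stages and deadlines, for illustration only.
day = datetime(2024, 1, 1)
expected = {
    "raw_data_received": day + timedelta(hours=2),
    "processing_complete": day + timedelta(hours=6),
    "statements_mailed": day + timedelta(hours=9),
}
seen = {"raw_data_received"}

# Seven hours in, only the overdue and unseen stage is reported.
missed = find_missed_events(expected, seen, now=day + timedelta(hours=7))
for name in missed:
    print(f"EXCEPTION: {name} has not happened by its deadline")
```

The point of the design is that nothing is sent while stages arrive on time; the team only hears about what failed to happen.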
Since recognising this problem we have revisited our system of alerts to:
- Use relevant subject lines with relevant descriptions
- Consolidate into summary alerts
- Create exception alerts
- Include as much relevant information as possible in the alert
- Regularly review the effectiveness of the alerts.
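As a rough illustration of the first two points, the sketch below collapses many per-file alerts into a single summary with an informative subject line. The alert tuples, the job name and the `build_summary` function are assumptions made for this example, not the format our processing engine actually uses.

```python
from collections import Counter


def build_summary(job_name, alerts):
    """Collapse many per-file alerts into one subject line and one body.

    alerts: list of (file_name, status, detail) tuples, where status is
    "success", "hold" or "failure".
    """
    counts = Counter(status for _, status, _ in alerts)
    subject = (
        f"[{job_name}] {counts['failure']} failed, "
        f"{counts['hold']} on hold, {counts['success']} succeeded"
    )
    # Only the items that need attention are listed individually.
    lines = [
        f"{status.upper()}: {name} - {detail}"
        for name, status, detail in alerts
        if status != "success"
    ]
    return subject, "\n".join(lines)


# Hypothetical alerts, for illustration only.
alerts = [
    ("file_001.dat", "success", "processed"),
    ("file_002.dat", "failure", "missing customer ID"),
    ("file_003.dat", "hold", "awaiting approval"),
    ("file_004.dat", "success", "processed"),
]
subject, body = build_summary("statement-run", alerts)
print(subject)
print(body)
```

One email with the counts in the subject line replaces four separate ones, and the body names only the files that need someone to act.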
What are your views and opinions on alert fatigue?