Actionable Alerts: Reducing False Positives & Making On-Call Suck Less

fen (historic)

01 Mar 2016

This was originally a guest post written for the VictorOps Blog.

In the world of IT Operations, there is no escaping the Dreaded On-Call. Someone has to keep a lookout at night to make sure the business continues to run and we're not all going to get a nasty surprise come the morning.

Fortunately, we don't have to manually keep an eye on everything, as we have applications and daemons that can watch for problems and send us notifications when something is wrong. The problem is that these programs are only as discerning as they're programmed to be. This causes the major stressor of on-call: False Positives.

Nothing is worse than having your phone wake you up multiple times at night just to tell you something that either doesn't matter or can't be fixed until morning anyway. The way I've found to be most effective in reducing these alerts is to reduce what I'm alerting about.

Before creating an alert condition, I mentally cycle through a number of questions to see whether the alert will be meaningful and useful…

Does it matter?

First and foremost, do I even care about this information? If a server alert comes in and it makes no difference to you, you shouldn't be receiving it. For example, we had a default set of alerts that we applied to every new server, but a couple of our test servers weren't running anything in production and were already being watched manually for performance. An alert at midnight saying disk I/O had been high for 5 minutes really didn't matter to us: we were going to look at those machines the next day anyway, and we just don't care about performance issues on them. That kind of alert should not be waking up your on-call tech.

Can I do anything about it?

Even if I do care about the problem, is there anything I can do about it? This has more to do with who is getting an alert than whether or not it should be sent. If Jeff The Help Desk Tech can't fix Super Important Business Server when its processes lock up, he should not be the one getting the alerts for Super Important Business Server. All that does is make him cranky. Make sure the right techs are getting the right alerts.
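One way to make "the right techs get the right alerts" concrete is a small ownership lookup that sits in front of whatever sends the notification. This is only a sketch: the server names, group names, and the `recipients_for` function are made up for illustration and are not tied to any particular alerting product.

```python
# Hypothetical ownership map: which group can actually fix each server.
OWNERS = {
    "super-important-business-server": "app-oncall",
    "office-printer": "help-desk",
}


def recipients_for(server: str, default: str = "unrouted-review") -> str:
    """Return the group that can act on alerts for this server.

    Servers nobody has claimed fall back to a review queue instead of
    paging someone who cannot fix them.
    """
    return OWNERS.get(server, default)
```

With a map like this, Jeff only hears about Super Important Business Server if someone deliberately lists the help desk as its owner.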

Can this wait? Is this just informational?

Even if this is an issue that matters and I can do something about it, do I need to know about it in the middle of the night and act on it immediately?

There are two types of alerts that really should not be getting to your On-Call technicians: Informational and Low Priority.

If this is something you just “want to know” even if you can't fix it, this is Informational. This should not be waking anyone up or going through your on-call alerting system. There should be another alert path that just logs this information to be read at a later date.

If this is something that can “wait 'til later”, it's a Low Priority alert. These alerts should notify your standard help desk through normal methods rather than ping your On-Call alerting system. You'll find out about the problem during regular business hours, but you won't be woken in a panic because your hard drive will be full in a week. I don't need to fix that now; I need to fix it within the week. For now, I can sleep and think about it more clearly in the morning.
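The questions so far boil down to a routing decision, which can be sketched in a few lines. Everything here is an assumption made for the example: the `Alert` fields, the `Route` names, and the idea that someone has already answered the three questions for each alert condition.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DROP = "drop"      # nobody cares: don't send it at all
    LOG = "log"        # informational: record it to read later
    TICKET = "ticket"  # low priority: help desk, business hours
    PAGE = "page"      # wake up the on-call tech


@dataclass
class Alert:
    name: str
    matters: bool  # Does it matter?
    fixable: bool  # Can the recipient do anything about it?
    urgent: bool   # Does it need action before business hours?


def route(alert: Alert) -> Route:
    """Apply the questions in order and return where the alert should go."""
    if not alert.matters:
        return Route.DROP
    if not alert.fixable:
        return Route.LOG
    if not alert.urgent:
        return Route.TICKET
    return Route.PAGE
```

For example, `route(Alert("disk full in a week", matters=True, fixable=True, urgent=False))` comes back as `Route.TICKET`, so it lands with the help desk instead of on someone's phone at 3 a.m.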

Can this be automated?

Finally, can you turn the alert into something self-healing? If, for instance, every High Memory Usage alert sends you off to check a known leaky application and restart its daemon, you may be able to create a process that is kicked off automatically whenever the alert would fire. You may still want to know that this happened, but it is now informational and can be logged instead of acted on. Your DevOps team can continue to get rest and work on fixing the bug rather than spending all their time and energy babysitting a misbehaving process.
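A rough sketch of that idea: the handler below tries the known fix first and only pages a human if the fix fails. The alert name `high_memory_usage` and the `restart` and `page` callables are placeholders for this example; they stand in for whatever your monitoring and paging tools actually provide.

```python
import logging
from typing import Callable

log = logging.getLogger("self_healing")


def handle_alert(
    alert_name: str,
    restart: Callable[[], bool],
    page: Callable[[str], None],
) -> str:
    """Try the known fix for a known problem; page only if that fails.

    `restart` should return True when the leaky service came back up.
    `page` is whatever notifies the on-call tech.
    """
    if alert_name != "high_memory_usage":
        # No automated fix is known for this alert, so a human is needed.
        page(alert_name)
        return "paged"
    if restart():
        # Healed on its own: now informational, so log it and let people sleep.
        log.info("restarted leaky service after %s", alert_name)
        return "healed"
    page(f"{alert_name}: automatic restart failed")
    return "paged"
```

On a systemd host, `restart` could be a thin wrapper around `subprocess.run(["systemctl", "restart", "leaky-app.service"])` that checks the return code, where `leaky-app.service` is a made-up unit name. Passing the restart and paging steps in as functions keeps the decision logic easy to test without touching a real service.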

Anything else?

There are probably a dozen other really good questions you could ask before determining whether or not an alert is a good alert, but the rule of thumb is: If the on-call tech receives the alert, can he take action?

If the answer is “No” then he doesn't need to receive the alert. It sucks to have to go to work late at night to fix problems, but it sucks way more to have to wake up and do nothing except acknowledge an alert.