Skeptics in the Church of Data, Pt. 1: Ein Minuten Bitte
This will be the first in a series of blog entries from me about changing how you think about monitoring in your tech business. All of this comes from the talk I gave at GalaxZ'17 this year in Austin, TX, and was originally published at cagedata.com/blog, but as with other things, I want to collect it on my personal site as well.
I'll set up the general monitoring situation as it is and present my thoughts about what needs to change. In further parts of this series I'll get a little deeper into how to make these changes. And now keep reading for Part 1, What sucks and why:
I come from an operations background. I've worked jobs where I've had to carry the pager and, in general, it sucks. There's probably not a way to make this ever not suck in some capacity. But that hasn't stopped me from trying to make it suck less.
We all have incidents. They're unavoidable and they're no fun. They happen at all hours. They wake us up and interrupt dinner. Incidents affect our lives. Worse, though, is a total disaster: when an incident is big enough, it becomes a real, potentially business-ending problem, and that sucks even more. You're going to deal with the wrath of customers, executives, and investors. So we work to catch these as early as possible by monitoring our systems. And we have monitored religiously.
And this religion has grown. When it comes to tracking systems, there's probably no clearer trinity than Processor, Memory, and Disk. That we don't give servers a water baptism is probably only because they're moisture averse. Otherwise, I mean, anything to avoid downtime, right?
So we continue to monitor: make sure the system is up, the processor is OK, no errant processes are running, memory is available, and the disks are neither full nor overused. When a system confesses its aberrant ways, we alert and snap to action.
Our religious monitoring even has its sacraments: we baptize our systems with monitoring like Zenoss, have them confess through VictorOps alerts, and help them repent and continue on the Straight-and-Narrow Path towards 100% uptime, our version of Nirvana. And sometimes this even serves the business.
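To make the trinity concrete, a traditional health check often boils down to something like the sketch below. This is illustrative only, using the psutil library with made-up thresholds; agents like Zenoss do this same kind of polling with much more ceremony.

```python
# A minimal, illustrative "system health" check: the classic trinity of
# processor, memory, and disk. Thresholds here are arbitrary examples.
import psutil

THRESHOLDS = {"cpu": 90.0, "memory": 90.0, "disk": 85.0}

def check_vitals(disk_path="/"):
    """Return the (metric, value) pairs that exceed their threshold."""
    vitals = {
        "cpu": psutil.cpu_percent(interval=1),          # % CPU over 1 second
        "memory": psutil.virtual_memory().percent,      # % RAM in use
        "disk": psutil.disk_usage(disk_path).percent,   # % of disk used
    }
    return [(name, value) for name, value in vitals.items()
            if value >= THRESHOLDS[name]]

if __name__ == "__main__":
    for name, value in check_vitals():
        # In the "religious" model, each of these becomes a page at 3am.
        print(f"ALERT: {name} at {value:.1f}%")
```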
"Ein Minuten Bitte"
But here is where I begin as Martin Luther. Except I don't have 95 theses, and nailing them to a church door seems both low-tech for this conversation and like a prosecutable offense. I'll settle, instead, for a blog post on the internet. Monitoring our system health is no longer relevant in a modern technical infrastructure (and it was only coincidentally relevant in classic IT infrastructure).
Monitoring System Health is only relevant if we have systems with health to be monitored. When we're monitoring servers in a rack, there's at least some logic to checking every system's basic vitals and reporting on them. I want to know if a box is failing because it will affect my applications, and hopefully I can catch it before it fails and save some downtime. But even then, we've gotten to monitoring for monitoring's sake and not monitoring to help the business.
Furthermore, we increasingly no longer have physical boxes: virtual servers, and more recently containers, are being pushed out to run our critical applications, and they can elastically draw resources from a massive pool. We've got systems now that can auto-size and auto-cluster to ensure we've got the right resources when we need them. Monitoring single-system health becomes irrelevant when your infrastructure is immutable and automated. Something is sick? You scale, or cluster a new instance and kill the old one, without any real interruption. And it can be automated so no one even has to look at it.
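Here's a hedged sketch of that "kill it and replace it" pattern, assuming an AWS Auto Scaling group and a hypothetical /healthz endpoint (the instance ID and URL are placeholders). Marking a sick instance unhealthy is enough: the group replaces it, and nobody gets paged.

```python
# Illustrative self-healing loop, not a production implementation.
import boto3
import requests

autoscaling = boto3.client("autoscaling")

def heal_if_sick(instance_id: str, health_url: str) -> bool:
    """Mark an unhealthy instance so the Auto Scaling group replaces it."""
    try:
        healthy = requests.get(health_url, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False

    if not healthy:
        # No page, no 3am wake-up: the group terminates this instance
        # and launches a fresh one from the same image.
        autoscaling.set_instance_health(
            InstanceId=instance_id,
            HealthStatus="Unhealthy",
        )
    return healthy

# Example: heal_if_sick("i-0123456789abcdef0", "http://10.0.0.12/healthz")
```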
There's no real problem with just collecting this data; you can't automate something unless you know the triggers. It's also good to have for troubleshooting and fixing problems, but it's no longer the most important data we can collect. We need to care about something more in our increasingly complex environments.
Alert fatigue has to be the top contributor to terrible On-Call experiences. VictorOps puts out a "State of On-Call" report at the end of each year to see how the industry is moving and improving, similar to Puppet Labs' "State of DevOps", but laser-focused on the On-Call experience. VictorOps's report for 2016, published at the beginning of this year, shows that 61% of respondents state that "alert fatigue is an issue at their organization" and that this is "on par for the past 2 years." For three years running, then, almost two-thirds of participants report that they're receiving enough alerts to cause fatigue: enough notifications that you start to get tired of them. If you receive this much noise from your monitors, you begin to ignore your monitors.
In college, I lived in a dorm that had construction going on in the building. At some point the crews were working on or near the fire alarm systems and consequently set them off periodically throughout the day for about a week. The first time the alarm goes off, we all evacuate properly. After a short while abandoned on the streets of Boston in late fall, we're let back in and told all is well: false alarm, construction, etc. When class got interrupted, it wasn't so bad. Eventually, though, we evacuate a bit more slowly, because it's New England, it's cold, and anyway, it's probably not a real fire.
I am now experiencing literal alert fatigue
Then came the time the alarm went off while I was in my dorm at night. I was on the sixth floor and about halfway down the stairs when the alarm stopped. So we began going back up, only for the alarm to start again once we reached the sixth floor. I am now experiencing literal alert fatigue. It's 1am, I don't need to be seeing how many flights of stairs I can climb and descend before daylight; screw it, I'm just going back to my room.
By then we'd had so many alarms that people began ignoring them. Our RAs had to make the announcement: "You must evacuate in the event of a fire alarm; we will be checking rooms." Your alerts should never reach this point. (For those who need closure, it got sorted within a week and our lives were better for it. Also safer.)
So how can we avoid this same phenomenon with our systems? I could carry on writing about how to improve alerting systems and make alerts meaningful, but I want to get more fundamental than that. What we monitor is what we alert on, so we need to make sure our monitoring is as meaningful as possible.
We've set up the conversation: the way we're doing things sucks. It's not great for businesses and it's causing fatigue in the IT industry, so how can we change it? In part two of this series I'll begin to wrap the conversation around where we can start making changes in our organizations to improve our monitoring systems.