Skeptics in the Church of Data, pt.2: Empathetic Monitoring
This is part two in a series about monitoring, stemming from a talk that I gave in Austin, TX at GalaxZ'17 and published on Cage Data's Blog. In part one I laid out what I think is terrible in the current world of monitoring and why I think it should change. Here we'll look more closely at one way to do that. Today we talk about...
Empathetic Monitoring
Where we're at right now, we're tracking our System Metrics. Hopefully many of us are monitoring more than just the actual box health and are tracking things like User Latency, Database Performance and page load times, but it's still a fight to get our businesses to see the value of the data. We need to start monitoring Business Metrics. When you tell an organization that you want to start monitoring the vital signs they care about, it's a much easier sell.
A friend of mine, Leon Fayer (OmniTI), has a great story about this. He's got a client who at some point comes to him and says, "Hey! What's wrong? We're losing revenue!" And how do you really even respond to that, besides, "OK, um… did you sell enough? Maybe your sales team isn't doing their job?" which is really not the way to handle it. Leon would rather at least look at the data before starting fights. So he pulls out his monitoring tools and sees, sure enough, there's a revenue drop. OK. Great. Let's see what we can correlate. Let's look at some system health metrics like customer latency… no, everything looks good there. Database performance? Looks fine.
The systems are working fine, everything looks normal systemically. Finally Leon is able to pull up Credit Card Decline Rate Percentage. And once we get here there's a clear correlation between declined credit card transactions and a decrease in revenue.
Leon brings these findings back to the client and the client says, "Oh yeah, we changed processors for American Express, we don't use that processor anymore." Communication issues aside, they found the problem and corrected it, bringing things back to normal. They solved a major problem because they were monitoring things that were important to the business, not just important to their technology operations. Being able to correlate a drop in revenue with an increase in credit card declines makes you a business insight superhero, not just that guy in IT.
"Awesome!" you say, "I want to be a superhero! But where do I start? What do I even monitor?"
"That's a great question," I reply, "You ask great questions that I just happen to be prepared to answer."
This is going to sound like a revolutionary idea, but bear with me. Have you tried talking to your business leaders? I mean, besides when they show up on a tirade, complaining about lost revenue and nothing ever working, have you tried sitting down with leaders in the business and asking what they care about? I think it'll surprise you to learn how far off your monitoring system is from what matters to the business.
Real talk: do you think a business leader, someone driving the company day to day, cares about uptime? A server with 5-nines is no better than a server with 0, assuming there's no change in revenue. Or as the band Cake put it in Italian Leather Sofa, "She doesn't care whether or not [the data center is literally on fire], just as long as [the] ship's coming in." (lightly interpreted)
So do you, as IT professionals, know what a good day looks like to your marketing team? How about what a bad day looks like? What about to one of your finance folks? Or someone in sales? If you're an MSP, do you know what is actually happening for your clients when they say they're having a good/bad month? This is exactly where you can start the conversation.
Invite people from your other business segments out to lunch or coffee and have a conversation. Spoiler alert: about 90% of what you will be doing in this conversation is listening. If you have no idea what to say, that's fine; start with the following:
- What does your job look like on an average day?
- What does a good day look like to you?
- What does a bad day look like to you?
Then, really hit the meaningful question here:
- If you could track anything to help you do your work better, what would it be?
If you can get these answers, especially from business leaders and department heads, you are well on your way to turning your monitoring system into the most indispensable piece of technology in your company.
None of this is to say that you should stop monitoring the technical things. It's important to know when a decrease in revenue was caused by a non-performant DB. It's good to know about failing hardware before it becomes a Real Problem™. Heading off incidents before they become Real Problems is still important; just understand that it's all about context. Being able to correlate business outcomes with that technical data is what's relevant. That is what makes you valuable to the company and ultimately gets you paid for your work.
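To make "correlate business outcomes with technical data" a little more concrete, here is a minimal sketch in Python. It is not what Leon's tooling did; the metric names and every number are invented for illustration, loosely modeled on the revenue story above. It simply ranks a few candidate metrics by how strongly each one moves with revenue, using a hand-rolled Pearson correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_by_correlation(target, candidates):
    """Return (name, r) pairs sorted by |r| against target, strongest first."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical hourly samples: revenue drops halfway through the window.
revenue = [1000, 980, 1010, 700, 650, 620]

# Hypothetical candidate metrics sampled over the same hours.
metrics = {
    "customer_latency_ms": [120, 118, 121, 119, 122, 120],
    "db_query_ms": [15, 16, 15, 15, 16, 15],
    "card_decline_rate_pct": [2.0, 2.1, 1.9, 9.5, 11.0, 12.2],
}

ranked = rank_by_correlation(revenue, metrics)
for name, r in ranked:
    print(f"{name:>24}: r = {r:+.2f}")
```

With this made-up data the decline rate comes out on top with a strong negative correlation, while latency and database timings barely register. A strong correlation is only a lead, not a diagnosis; it tells you which conversation to go have next.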
When you're not just monitoring system health, but real business metrics, and actually providing this information back to the people who need it, you're no longer just keeping the lights on. You're providing real business value. Your monitoring system isn't just an expenditure now; it's an investment in a tool that helps increase revenue by providing feedback to the business as a whole.
Once other teams see this value, you begin to gain converts, members of your company who really believe in monitoring as a Good Thing™. And ideally they'll start to share that: "Oh, you should see what our Technology team did! They got me some great information so now I know how effective our procedures are or if they're just wasting time. You should ask them for X…" If you're really lucky, your converts, in turn, become your loudest and most passionate evangelists for your cause. They believe in the value you add to the company and they're spreading the good word. All it took was listening to their problems and applying some of your technical know-how to make their lives a little better.
At this point we're no longer just keeping the lights on and making sure the server response times are good, we're actually gaining understanding and providing insight into how the business functions. But where should we start looking if we want to begin making this impact? This is exactly the area I'll discuss in greater detail in part three, so check back for that update.