This is the fourth part in a series on shifting our thoughts about Monitoring. All of this came from a talk I gave at GalaxZ'17 in Austin, TX earlier this year and was initially published on [Cage Data](https://www.cagedata.com/2017/09/26/skeptics-in-the-church-of-data-pt-4-reformation/). You can find part 1, "Ein Minuten Bitte", part 2, "Empathetic Monitoring", and part 3, "New Partners" from their respective links. But if you're ready to read on, we'll talk about... Reformation.
That's a heavy word. But I want to be clear about how much work goes into this process. There's risk involved here, too. Change introduced the wrong way can cause schism instead of reform. You'll notice that in the 16th century the Protestant Reformation didn't create a new and improved Roman Catholic Church, but rather a whole assortment of new Protestant sects, from Lutheran to Baptist to Pentecostal.
Change, implemented poorly, led to the Thirty Years' War, one of the longest and most destructive European conflicts in history. The major cause was an attempt to force the change from the top down and expect everyone to just fall in line. File this one away under how to get an entire nation to collapse into civil war, or how to get the most talented people in one company into jobs at another.
Especially if you're in a company that traditionally doesn't share data, shifting that culture is hard and risky. All of us here have an understanding of technology. Conferences by their nature bring together like-minded people to share insular experiences and to have a period of deep-thought learning about what's to come next in a specific industry. The goal is to create a safe environment where honest sharing can occur and growth can begin. So how do we go about changing an organization that isn't already on the same page?
First, we have to agree to accept the following about Change:
- It's harder than we think
- It's going to take longer than we want
- We can't change anyone but ourselves
Then we can do the one thing we have available: start with a small step and start with ourselves.
If you're familiar with DevOps you may have seen the Three Ways of DevOps. If you aren't, open Amazon. Type in The Phoenix Project and just click "order" right now. (You're welcome, Gene Kim.) But seriously, this book is about driving change in your organization. Even this core DevOps book doesn't talk about what tools you use; it talks about how you manage your work and your teams.
I'm skipping the basics and diving into tight feedback loops. Any time you implement change or recover from disaster, you need to review it and get feedback as soon as possible. This is essential for measuring the success of your change and making sure it has business value.
The most obvious form of feedback is the postmortem or retrospective after an incident or change.
Hopefully you have conducted a postmortem in the past for at least this reason.
If you're in IT Operations, though, have you invited your programming team into your postmortems?
If you're in Programming, have you invited your operations team into your postmortems?
Have you invited anyone from your executive team into your postmortems? Customer Support? Your marketing team? Physical Operations? Sales? Finance?
What about your customers?
And have you conducted postmortems after implementing an expected positive change, not just after a failure or an incident?
So if you're supporting your business, how can you expect to get positive business change if you don't invite your business to give feedback when you make expectedly positive changes? Or put another way, how do you know your changes are good?
In case you missed it: You need to at least invite your entire business to your learning reviews and retrospectives. That doesn't necessarily mean they'll show up, but opening your retrospective process allows for greater learning and understanding across the organization.
The company Chef famously posts all of their incident postmortems publicly. On YouTube. When they make mistakes, they review them on Google Hangouts and post them live to the internet.
Does this sound frightening? That's probably because when you think about a postmortem you think about trying to find a cause of death, trying to find out who or what to blame for the failure. Blame carries a weight of correction and a fear of punishment. You must remove blame from the postmortem in order to make it effective.
Reframe the goal: We're not trying to find out what or who broke, we want to find out how to improve our systems and practices.
The US Forest Service asserts that if people believe they are going to be punished for being honest about what has happened, they quickly become subject matter experts in information suppression. Improving systems requires honesty, and honesty requires safety. We have to assume that all people generally want to do good and have made the best decisions they could at the time given the information and tools that they had.
Before any postmortem, invoke safety in the space. What I mean is, you need to declare the space safe and blameless and frame the conversation before it begins. You call forth safety into the room and create that break from day-to-day operations. It's as simple as making that declarative statement up front and having a moderator who ensures that everyone follows it:
"Before we begin, I just want to point out that Postmortems are Blameless. We don't focus on past events as they pertain to 'could have' or 'should have' because we're here to learn and gain new information going forward." (Adapted from Chef's Postmortem Template)
When someone asserts or accepts blame or begins to discuss what should have happened, it's important to call that out and realign the conversation. We want short-term, actionable takeaways that amount to more than "do better". "Do better" isn't concrete and is hard to measure, so how do I know when I have Done Better? It may feel overplayed and like corporate jargon, but SMART goals really are a good framework for building measurable and actionable outcomes from a postmortem or retrospective.
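If it helps to make that concrete, here is a minimal sketch of what a measurable action item could look like if you recorded it in code. Everything in it is an assumption for illustration: the `ActionItem` fields, the `VAGUE_PHRASES` list, and the `problems` helper are hypothetical names, not part of Chef's template or any SMART standard.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical list of vague phrases; tune it to your own team's habits.
VAGUE_PHRASES = ("do better", "be more careful", "try harder")


@dataclass
class ActionItem:
    description: str  # what exactly will change
    owner: str        # one named person or team
    done_when: str    # how we will know it is finished
    due: date         # when we will check on it


def problems(item: ActionItem, today: date) -> list[str]:
    """Return the reasons an action item is not yet concrete enough."""
    found = []
    text = item.description.lower()
    if any(phrase in text for phrase in VAGUE_PHRASES):
        found.append("description is vague")
    if not item.owner.strip():
        found.append("no owner")
    if not item.done_when.strip():
        found.append("no measurable completion check")
    if item.due <= today:
        found.append("due date is not in the future")
    return found
```

An item like "do better at communicating" with no owner and no completion check comes back with a list of problems, while an item with a named owner, a stated finish line, and a future due date comes back with an empty list. The point is not the code itself but that every takeaway should be able to answer those four questions.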
Blameless Retrospectives also provide a great deal more than technical feedback. Our company recently had a failure that we wanted to review, and for this incident the Timeline Retrospective seemed the most valuable.
All said and done, there was very little to be improved technically. Honestly, most of the recovery went right, and what went wrong was out of our control and rare enough that there isn't much of a change to be made. What we really gained came after our team created the "emotional scatter plot" under the timeline.
In that bottom row there was a point where our emotions diverged, where some people were feeling overall positive and others overall negative. We had a communications breakdown, and it caused a serious issue in our team that our recovery efforts didn't clear up.
I was able to say to my team, "I felt scared for our customer because I didn't know what was happening and I know how important X technology stack is to them," but more importantly, because we had made sure the space was safe and because we were insistent on that safety, my team was able to say, "I was angry when you called in. I felt like you didn't trust me and were trying to micromanage the incident". We gained an immense amount of insight into what we all individually felt during the outage and how our non-technical actions were having an effect on our environment.
From this we ended up with 3 action items to be completed within the next week. One was a technical write-up (some knowledge transfer needed to take place). The other two were process improvements focused on communication and keeping everyone in the know about what was happening in major and potentially major issues, to alleviate fears and build trust. These were processes we just didn't think to put in place because we hadn't had any problems until then. All of this was only possible because we were able to be honest and vulnerable in front of our team without fear of getting that turned against us.
I hope you're still with me here, because I think that improving our monitoring techniques is a perception and process issue more than it is an "am I using the right technology" issue. Whether you're using a product like Zenoss, Datadog, Nagios or New Relic to monitor everything or you've rolled your own monitoring solution, the tool matters far less than how you're integrating it with your company.
We need to first question what we think is important. Stop thinking about our Systems' health and begin thinking in terms of Business health to make sure we're capturing the right metrics.
Once we reframe our thought process we need to partner with other data-driven departments in our organizations to understand what business metrics are important to them. We need to get more information so we can start to see the elephant.
And we need to solicit feedback to make sure the data and context we can provide to our business is useful. We need to lead by example by building safe environments where everyone feels OK to be vulnerable and get to the root of real learning.
Now that you've got it, go out and change the world. Or maybe just start with your monitoring.