How PagerDuty Developed Its Postmortem Best Practices

Feb 11th, 2019 10:12am by Rachael Byrne

PagerDuty sponsored this post, the second in a series about disseminating incident response knowledge.

Rachael Byrne
Rachael is an agilist who has helped cross-functional product development, mobile, QA, customer support and business operations teams iteratively deliver value and collaborate effectively. She is PagerDuty’s first non-technical Incident Commander.

As part of my training to become an Incident Commander, I studied PagerDuty’s incident response documentation. As a former scrum master, I was particularly interested in our postmortem process because it appeared to be the mechanism for continuous learning and iterative improvement in incident response.

But when I got to the postmortem process section of our incident response documentation, I was surprised by how light the instruction was.

Here’s a simplified version of the first few steps for the postmortem owner:

  1. Schedule the postmortem meeting;
  2. Create a timeline;
  3. Analyze the incident.

That’s where I got stuck. The documentation went on to say that analysis involves capturing the impact and underlying cause, then moved on to Step Four: “Write the external message.” I was left wondering, “Wait, but how?” What activities and lines of inquiry should I follow to identify the underlying (or root) cause?

We thus decided to write a comprehensive guide on how to perform postmortems. No other resource (that we’ve found) covers the nuances of culture change, the details of how to perform an in-depth analysis and the unique skills required to facilitate a calm and engaging conversation about failure. Our goal with this new documentation is to go beyond just outlining the steps of the process and sharing a few tips and tricks. Instead, we explain why these concepts are important, describe the challenges associated with implementing them, and offer actionable instruction to conduct blameless postmortems.

The inherent complexity of software failures means identifying the underlying causes is easier said than done. In his paper, “How Complex Systems Fail,” Dr. Richard Cook wrote that because complex systems like software are heavily defended against failure, it takes a unique combination of apparently innocuous faults, joining together, to produce a catastrophic failure.

Furthermore, because overt failure requires multiple faults, attributing a “root cause” is fundamentally wrong. There is no single root cause of major failure in complex systems; rather, it’s a combination of contributing factors that together lead to failure. There rarely will be a single, straightforward answer to what caused an incident — and if that’s the result of your incident analysis, you probably haven’t investigated enough.

When I began researching how to perform an in-depth incident analysis, I asked a few engineers about the steps they typically take. I learned they also found it difficult to put their finger on exactly what analysis involves. They just investigate. Through a series of interviews and some cross-disciplinary reading, I was able to identify concrete steps for a major incident postmortem analysis — check out the full postmortem documentation to read about them.

I’ve learned the key to performing a deep analysis is asking the right questions. You may be familiar with the “5 Whys” methodology, but I find that strategy to be arbitrarily limiting. I recommend asking a series of questions to explore multiple angles of what contributed to the incident.

Another way to start an analysis is by looking at your monitoring for the affected services. Search for irregularities, like sudden spikes or flatlining, in the period leading up to and during the incident. Include any commands or queries you use to look up data, graph images, or links from your monitoring tooling alongside this analysis so others can see how the data was gathered. If you don’t have monitoring for this service or behavior, make building that monitoring an action item for this postmortem.
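To make that concrete, here is a minimal Python sketch of such a scan. It assumes you have already exported the affected service’s metric as a list of (timestamp, value) samples from your monitoring tooling; the spike and flatline thresholds are illustrative assumptions, not values from PagerDuty’s guide.

# Minimal sketch: flag sudden spikes and flatlined stretches in a metric series.
# Assumes `samples` is a list of (timestamp, value) pairs already exported from
# your monitoring tooling; the thresholds below are illustrative, not prescriptive.
from typing import List, Tuple


def find_irregularities(
    samples: List[Tuple[str, float]],
    spike_ratio: float = 3.0,   # flag when a value jumps to 3x the previous reading
    flatline_len: int = 5,      # flag when the same value repeats this many times
) -> List[str]:
    findings = []
    run = 1
    for i in range(1, len(samples)):
        _, prev_val = samples[i - 1]
        ts, val = samples[i]
        # Sudden spike: the value jumps well beyond the previous reading.
        if prev_val > 0 and val / prev_val >= spike_ratio:
            findings.append(f"spike at {ts}: {prev_val} -> {val}")
        # Flatline: the same value repeats for several consecutive readings.
        run = run + 1 if val == prev_val else 1
        if run == flatline_len:
            findings.append(f"flatline ending at {ts}: {val} repeated {run} times")
    return findings


if __name__ == "__main__":
    series = [("10:00", 120), ("10:01", 118), ("10:02", 410),   # spike
              ("10:03", 0), ("10:04", 0), ("10:05", 0),
              ("10:06", 0), ("10:07", 0)]                       # flatline
    for finding in find_irregularities(series):
        print(finding)

However you run the real check, keep the query or script alongside the postmortem, just like the graph images and links mentioned above, so others can see how the data was gathered.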

Another helpful strategy for targeting what caused an incident is reproducing it in a non-production environment. Experiment by modifying variables to isolate the phenomenon: if you modify or remove some input, does the incident still occur?
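As an illustration of that experiment, the sketch below replays a stand-in code path while toggling one input at a time. The process_order function, the payload fields and the failure condition are all hypothetical; the point is the pattern of changing a single variable per run to see whether the failure still reproduces.

# Minimal sketch: vary one input at a time to isolate what triggers the failure.
# `process_order`, the payload fields and the failure condition are hypothetical
# stand-ins for whatever code path you are reproducing outside production.
import copy

BASELINE_PAYLOAD = {
    "items": 250,        # unusually large batch observed during the incident
    "currency": "JPY",   # non-default currency noted in the incident timeline
    "retry": True,
}


def process_order(payload):
    # Stand-in for the real code path; fails under the suspect combination.
    if payload["items"] > 100 and payload["currency"] != "USD":
        raise ValueError("rounding overflow")


def reproduce(payload):
    try:
        process_order(payload)
        return "no failure"
    except Exception as exc:
        return f"failure: {exc}"


if __name__ == "__main__":
    print("baseline:", reproduce(BASELINE_PAYLOAD))
    # Modify or remove one variable per run and see if the incident still occurs.
    for field, safe_value in [("items", 10), ("currency", "USD"), ("retry", False)]:
        variant = copy.deepcopy(BASELINE_PAYLOAD)
        variant[field] = safe_value
        print(f"with {field}={safe_value!r}:", reproduce(variant))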

Additionally, inspired by Gary Klein’s debriefing questions in Sidney Dekker’s “The Field Guide to Understanding Human Error,” we compiled a non-exhaustive list of questions to help encourage deep analysis.

Cues
● What were you focusing on?
● What was not noticed?
● What differed from what was expected?

Previous knowledge/experience
● Was this an anticipated class of problem, or did it uncover a class of issue that was not architecturally anticipated?
● What expectations did participants have about how things were going to develop?
● Were there similar incidents in the past?
● Is it an isolated incident or part of a trend?

Goals
● What goals governed your actions at the time?
● How did time pressure or other limitations influence choices?
● Was there work the team chose not to do in the past that could have prevented or mitigated this incident?

Assessment
● What mistakes (for example, in interpretation) were likely?
● How did you view the health of the services involved prior to the incident?
● Did this incident teach you something that should change views about this service’s health?
● Will this class of issue get worse/more likely as you continue to grow and scale the use of the service?
● What actions appeared to be available options?

Taking Action
● How did you determine the best action to take at the time?
● How did other influences (operational or organizational) help determine how you interpreted the situation and how you acted?

Help
● Did you ask anyone for help?
● What signal brought you to ask for support?
● Were you able to contact the people you needed?

Process
● Did the way that people collaborate, communicate and/or review work contribute to the incident?
● What worked well in your incident response process, and what did not work well?

Collaboratively discussing the incident in the postmortem meeting leads to even deeper insight. For a successful postmortem meeting, it’s helpful to have a skilled facilitator who isn’t trying to contribute their own ideas to the discussion. They remain focused on creating an environment where all attendees feel comfortable speaking. By encouraging attendees to ask any and all questions, the facilitator helps the group get on the same page and consider new perspectives.

To learn more about all the steps involved in conducting effective postmortems and tips for facilitating the postmortem meeting, check out PagerDuty’s Postmortem Guide.

Curious to learn more? Check out the training for yourself.

Feature image via Pixabay.
