Get hands-on! Learn from incidents.
While game days are great for staying prepared for potential incidents, what about incidents that did occur? These are ideal opportunities for learning. Retrospectives that examine the root causes of an incident are key to fixing problems and processes and ensuring that the same failures do not happen again. Blameless postmortems are a great tool for actively learning from incidents.
Remember how painful the last incident was? When everything is going well and running smoothly, it's easy to forget the pain and avoid digging into the root causes of failure. After all, the fast-moving cloud-native development environment is designed for speed of development and shipping of new features and functionality. It's easy to overlook the fact that a highly distributed system may in fact be more prone to failures than traditional software.
Using blameless postmortems is a way to avoid repeating the trauma and build resilience and efficiency into processes.
A postmortem is a discussion or analysis of an incident or event, held after the incident ends. It allows for a thorough understanding of the incident and should provide insight that can be applied to future incident management by answering what went wrong and why.
The team affected by the incident gets together and does a number of things:
With the increased velocity of cloud-native development, incidents are a fact of life, and it's easy to point fingers when an incident occurs. The blameless postmortem approach prioritizes discovering and fixing root causes. The blameless aspect of the postmortem is key because, as often as technology businesses claim that failure represents an opportunity to learn and innovate, the propensity to blame and shame still pervades. Pointing the finger at any one employee or team isn’t conducive to learning, and it discourages team members from coming forward with issues and from communicating openly more generally.
Why should a team, or a company more broadly, do blameless postmortems? Aside from the fact that successful companies, such as Atlassian and Netflix, rely on them, they constitute an opportunity to:
Not every issue requires a postmortem. Postmortems make sense for larger and systemic issues, but not necessarily for ongoing minor issues or maintenance matters unless those kinds of issues end up leading to major incidents. Appropriate issues to address in blameless postmortem processes include:
In a blameless postmortem process, the answers focus on the objective facts of what happened and on discovering the root cause of an issue, not on opinions about where one team or another failed to do its job.
These questions remain the same whether or not the aim is blamelessness. It’s the answers that change. Determining how to avoid an undesirable outcome in the future relies on looking forward and identifying actionable items and owners for those actions.
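One way to keep a postmortem forward-looking is to record it in a structure that makes missing owners obvious. The following is only a minimal sketch, not a standard format: the `Postmortem` and `ActionItem` classes, their field names, and the `is_ready_to_close` check are all hypothetical names chosen for illustration. Note that the owner field holds whoever is responsible for the follow-up work, not who is "at fault" for the incident.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionItem:
    """A follow-up task from a postmortem (hypothetical structure)."""
    description: str
    owner: Optional[str] = None  # who will carry out the fix, not who to blame


@dataclass
class Postmortem:
    """A blameless postmortem record (hypothetical structure)."""
    title: str
    what_happened: str  # objective facts and timeline
    root_cause: str     # why it happened, stated without naming individuals
    action_items: List[ActionItem] = field(default_factory=list)

    def unowned_actions(self) -> List[ActionItem]:
        """Return action items that nobody has taken ownership of yet."""
        return [item for item in self.action_items if not item.owner]


def is_ready_to_close(pm: Postmortem) -> bool:
    """Assumed rule: at least one action item exists and every item has an owner."""
    return bool(pm.action_items) and not pm.unowned_actions()


# Example usage with made-up incident data.
pm = Postmortem(
    title="Checkout latency spike",
    what_happened="Requests to the checkout service timed out for 20 minutes.",
    root_cause="A connection pool limit was reached under peak load.",
    action_items=[
        ActionItem("Add an alert on connection pool saturation", owner="platform team"),
        ActionItem("Document the pool limit in the runbook"),
    ],
)
for item in pm.unowned_actions():
    print(f"Needs an owner: {item.description}")
```

A check like this could run before a postmortem document is marked complete, so that every identified action has someone accountable for following through.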
While a good part of a postmortem is technical in nature, that is, identifying what went wrong, another part of a successful postmortem is cultural. Accepting the need to examine what went wrong is key to creating a more robust engineering culture. Aspects of a successful blameless postmortem include: