Incidents are a natural by-product of running a software product in production. The team puts safeguards in place from day one to limit their impact, but incidents are unavoidable.
However, each incident is an opportunity for the team not only to prevent the issue from occurring again but also to improve the product. The key is to document it in detail in a postmortem and ensure that the client stakeholders are informed promptly.
The postmortem is blame-free. This means it focuses on the learnings acquired during the incident and the actions needed to prevent the issue in the future. In other words, the team talks about what happened without pointing fingers at individuals.
Creating a Postmortem
Depending on the type of incident, either the Product Manager or the Engineering Lead takes the lead on writing the postmortem. The main difference lies in the technical depth required to document the incident.
Notion must be used as a documentation tool so that the postmortem can be stored along with other project documentation.
A postmortem must contain the following sections:
The incident is presented briefly, along with the resolution path taken. It acts as a one-paragraph summary of the incident.
The incident is presented in detail. The goal is to provide client stakeholders with a deep understanding of what the issue was.
The resolution process is presented in detail. The goal is to describe how the team decided on a resolution path and, ultimately, how the team implemented the fix.
Whether the incident is specific to an area (e.g., authentication) or a list of users, causes a global outage, or has no impact (e.g., a security breach patched before any harm was done), its side effects must be clearly identified and presented in detail.
The timeline is a table-based section that lists the events leading up to and during the incident. It serves as a high-level view of the critical events.
Below is an example of what a timeline looks like:
| Event name | Date | Channel |
| --- | --- | --- |
| The PM informs the squad to prepare for the 1.2.0 release | Sep 10, 2021, 13:12 | Slack |
| The squad makes the release on Play Store | Sep 10, 2021, 13:45 | Play Store |
| The EL notices the application was launched without permission | Sep 10, 2021, 14:15 | Play Store |
| The EL un-publishes the application | Sep 10, 2021, 14:16 | Play Store |
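For squads that prefer to collect events as structured data before pasting them into the postmortem, the timeline can be modeled roughly as follows. This is only a minimal sketch: the `TimelineEvent` class and the `render_timeline` function are hypothetical names introduced for illustration, not part of any tool mentioned here. It shows each event carrying the same three columns as the example (event name, date, channel) and being sorted chronologically before rendering.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One row of the timeline (hypothetical structure, for illustration)."""
    name: str
    occurred_at: datetime
    channel: str


def render_timeline(events):
    """Return the timeline as text rows, sorted chronologically."""
    rows = ["Event name | Date | Channel"]
    for event in sorted(events, key=lambda e: e.occurred_at):
        # Same date style as the example timeline, e.g. "Sep 10, 2021, 13:12"
        date = event.occurred_at.strftime("%b %d, %Y, %H:%M")
        rows.append(f"{event.name} | {date} | {event.channel}")
    return rows


# Events can be recorded in any order; rendering sorts them by time.
events = [
    TimelineEvent("The EL un-publishes the application",
                  datetime(2021, 9, 10, 14, 16), "Play Store"),
    TimelineEvent("The PM informs the squad to prepare for the 1.2.0 release",
                  datetime(2021, 9, 10, 13, 12), "Slack"),
]
timeline_rows = render_timeline(events)
print("\n".join(timeline_rows))
```

Sorting at render time means events can be added as they are discovered during the investigation, without worrying about their order.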
How Have We Done?
This section comprises two sub-sections, “What Went Well” and “What Did Not Go So Well”. Similar to a retrospective, it aims to provide an unbiased and comprehensive perspective on the incident’s significant learnings. To learn from incidents, the squad must understand what worked and what must be improved.
The final section of the postmortem lists the actions that must be taken following the incident to prevent it from recurring. When different parties are involved, a Directly Responsible Individual (DRI) must be defined for each action item.
This section can be omitted if there are no additional steps required.
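The DRI rule for action items can be expressed as a simple check. The sketch below is illustrative only: `ActionItem`, `missing_dri`, and the sample action items are hypothetical and not taken from a real postmortem. It flags every action item that involves more than one party but has no DRI assigned.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    """One follow-up action (hypothetical structure, for illustration)."""
    description: str
    parties: List[str]          # teams or people involved in the action
    dri: Optional[str] = None   # Directly Responsible Individual, if assigned


def missing_dri(items):
    """Return action items that involve several parties but have no DRI."""
    return [item for item in items if len(item.parties) > 1 and not item.dri]


# Sample data, invented for this example.
items = [
    ActionItem("Review store permissions before each release",
               ["Squad", "Client"]),
    ActionItem("Add a release checklist to the project documentation",
               ["Squad"]),
    ActionItem("Agree on a release approval process",
               ["Squad", "Client"], dri="Engineering Lead"),
]
unassigned = missing_dri(items)
for item in unassigned:
    print(f"No DRI assigned: {item.description}")
```

An empty result means every multi-party action item has an owner; an empty `items` list corresponds to the case where the section is omitted.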
As the main point of contact between the team and the client, the Product Manager is in charge of communicating the postmortem to the client stakeholders.
The postmortem document must be shared in its entirety by email.
Depending on the severity of the incident, a follow-up meeting with the stakeholders can be held to review the postmortem.