SRE (Ch. 15 - Postmortem Culture: Learning from Failure)

- Common postmortem triggers
  - User-visible downtime or degradation beyond a certain threshold
  - Data loss of any kind
  - On-call engineer intervention (release rollback, rerouting of traffic, etc.)
  - A resolution time above some threshold
  - A monitoring failure (which usually implies manual incident discovery)
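
The trigger conditions listed above could be encoded as a simple check, so that the decision to write a postmortem is made before an incident rather than during one. The following is only a minimal sketch: the `Incident` fields, the function name `postmortem_triggers`, and both threshold values are hypothetical and not taken from the source, which leaves the thresholds to each team.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical thresholds; the source says only "a certain threshold",
# so each team would choose its own values.
MAX_USER_VISIBLE_DOWNTIME = timedelta(minutes=5)
MAX_RESOLUTION_TIME = timedelta(hours=1)


@dataclass
class Incident:
    """Illustrative incident record; field names are assumptions."""
    user_visible_downtime: timedelta = timedelta(0)
    data_lost: bool = False
    on_call_intervened: bool = False
    resolution_time: timedelta = timedelta(0)
    detected_by_monitoring: bool = True


def postmortem_triggers(incident: Incident) -> list[str]:
    """Return the reasons, if any, why this incident warrants a postmortem."""
    reasons = []
    if incident.user_visible_downtime > MAX_USER_VISIBLE_DOWNTIME:
        reasons.append("user-visible downtime above threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call engineer intervention")
    if incident.resolution_time > MAX_RESOLUTION_TIME:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        reasons.append("monitoring failure (manual discovery)")
    return reasons


# Usage: any non-empty list means a postmortem is expected.
incident = Incident(on_call_intervened=True, detected_by_monitoring=False)
needs_postmortem = bool(postmortem_triggers(incident))
```

Returning the list of reasons instead of a bare boolean keeps a record of which criterion fired, which can be copied into the postmortem itself.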
- Postmortem review criteria
  - Was key incident data collected for posterity?
  - Are the impact assessments complete?
  - Was the root cause analysis sufficiently deep?
  - Is the action plan appropriate, and are the resulting bug fixes at appropriate priority?
  - Did we share the outcome with relevant stakeholders?