SRE (Ch. 14 - Managing Incidents)
Motivation (1)
"Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. If you haven’t gamed out your response to potential incidents in advance, principled incident management can go out the window in real-life situations."
The Anatomy of an Unmanaged Incident (2)
Sharp Focus on the Technical Problem
Mary wasn’t in a position to think about the bigger picture of how to mitigate the problem because the technical task at hand was overwhelming.
Poor Communication
For the same reason, Mary was far too busy to communicate clearly. Nobody knew what actions their coworkers were taking.
Business leaders were angry, customers were frustrated, and other engineers who could have lent a hand in debugging or fixing the issue weren’t used effectively.
Freelancing
Malcolm was making changes to the system with the best of intentions.
However, he didn’t coordinate with his coworkers—not even Mary, who was technically in charge of troubleshooting.
His changes made a bad situation far worse.
Elements of Incident Management Process (3)
Recursive Separation of Responsibilities
It’s important to make sure that everybody involved in the incident knows their role and doesn’t stray onto someone else’s turf.
Somewhat counterintuitively, a clear separation of responsibilities allows individuals more autonomy than they might otherwise have, since they need not second-guess their colleagues.
Incident Command
The incident commander holds the high-level state about the incident.
They structure the incident response task force, assigning responsibilities according to need and priority.
Operational Work
The Ops lead works with the incident commander to respond to the incident by applying operational tools to the task at hand.
The operations team should be the only group modifying the system during an incident.
Communication
This person is the public face of the incident response task force. Their duties most definitely include issuing periodic updates to the incident response team and stakeholders (usually via email), and may extend to tasks such as keeping the incident document accurate and up to date.
Planning
The planning role supports Ops by dealing with longer-term issues, such as filing bugs, ordering dinner, arranging handoffs, and tracking how the system has diverged from the norm so it can be reverted once the incident is resolved.
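The role separation above can be sketched in code. This is a minimal illustration, not a tool described in the source: the names `Role`, `Incident`, `assign`, and `apply_change` are all hypothetical. It models the four roles and enforces the rule that only operations modifies the system, while keeping a log of changes that the planning role could later use to revert the system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """The four roles named in the outline above."""
    INCIDENT_COMMAND = "incident command"
    OPERATIONS = "operations"
    COMMUNICATION = "communication"
    PLANNING = "planning"


@dataclass
class Incident:
    """Hypothetical incident record: who holds which role, and what was changed."""
    assignments: dict = field(default_factory=dict)  # person -> Role
    change_log: list = field(default_factory=list)   # (person, description) pairs

    def assign(self, person: str, role: Role) -> None:
        # In the process above, the incident commander makes these assignments.
        self.assignments[person] = role

    def apply_change(self, person: str, description: str) -> None:
        # Only operations may modify the system during an incident.
        if self.assignments.get(person) is not Role.OPERATIONS:
            raise PermissionError(
                f"{person} is not assigned to operations; coordinate with the Ops lead"
            )
        # Recording every change lets planning track divergence from the norm.
        self.change_log.append((person, description))
```

With this sketch, a well-intentioned change from someone outside operations is rejected instead of silently applied, which is the failure mode described under "Freelancing".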
When to Declare an Incident (4)
Do you need to involve a second team in fixing the problem?
Is the outage visible to customers?
Is the issue unsolved even after an hour’s concentrated analysis?
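The three questions above can be read as a simple checklist. The sketch below assumes that a "yes" to any one of them is enough to declare an incident, and that "an hour" means 60 minutes of concentrated analysis; the function name and parameters are hypothetical, chosen only for illustration.

```python
def should_declare_incident(
    needs_second_team: bool,
    customer_visible: bool,
    analysis_minutes: float,
) -> bool:
    """Return True if any of the three declaration criteria is met.

    Assumption: one satisfied criterion is sufficient, and the time
    threshold is 60 minutes of concentrated analysis without a fix.
    """
    unsolved_after_an_hour = analysis_minutes >= 60
    return needs_second_team or customer_visible or unsolved_after_an_hour
```

For example, `should_declare_incident(False, True, 5)` returns `True`: a customer-visible outage qualifies even a few minutes in.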
Conclusion (5)
"We’ve found that by formulating an incident management strategy in advance, structuring this plan to scale smoothly, and regularly putting the plan to use, we were able to reduce our mean time to recovery and provide staff a less stressful way to work on emergent problems."
Best Practices for Incident Management (6)
Prioritize
Stop the bleeding, restore service, and preserve the evidence for root-causing.
Prepare
Develop and document your incident management procedures in advance, in consultation with incident participants.
Trust
Give full autonomy within the assigned role to all incident participants.
Introspect
Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
Consider alternatives
Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.
Practice
Use the process routinely so it becomes second nature.
Change it around
Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.