Best Practices for Incident Response

When things go wrong, money and reputation are on the line. Minutes matter. Incident response is stressful and high-stakes. Nonetheless, there is a right and a wrong way to go about incident response. Some organizations and some individuals seem to excel at incident response, and others struggle.

In my experience, organizations that do incident response well have a combination of the right tools, the right culture and the right approach. Here are the best practices that allow organizations to respond quickly and efficiently when incidents happen.

Communicate

If there’s more than one person involved in the incident, communication is essential. When multiple people are working at the same time to fix the incident, everyone needs to know what the others are doing. Without communication, team members will likely duplicate efforts or, even worse, work at odds with each other.

Clearly define roles

Particularly in major incidents, clearly defined roles help teams collaborate. Usually one person takes the lead and assigns responsibilities. This prevents overlapping work and keeps steps (like communicating with stakeholders) from being missed because of misunderstandings about who is responsible. Having a command structure also helps facilitate communication among everyone working on the incident.
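As a rough illustration of catching a missed responsibility, here is a minimal Python sketch that reports which roles nobody owns yet. The role names and the `unassigned_roles` helper are assumptions made for this example, not a standard role set.

```python
from typing import Dict, List

# Hypothetical role names chosen for illustration only.
REQUIRED_ROLES = ["incident lead", "stakeholder communications", "investigation"]


def unassigned_roles(assignments: Dict[str, str]) -> List[str]:
    """Return the required roles that nobody owns yet."""
    return [role for role in REQUIRED_ROLES if not assignments.get(role)]


# Example: the lead has assigned two roles but forgotten stakeholder updates.
assignments = {
    "incident lead": "alice",
    "investigation": "bob",
}

print(unassigned_roles(assignments))
```

A check like this could run when an incident is declared, so the lead sees gaps before work starts.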

Be prepared

The success or failure of an incident response is determined at least as much by what the team does before the incident as by what it does during the outage. Here’s how to be as prepared as possible for an incident:

  • Run Chaos Game Days. The best way to anticipate what might go wrong and practice how to fix it is to break things on purpose, in a controlled manner, during Chaos Game Days. Use them to prepare both for the ways the application itself could fail and for the ways more routine events, like launches or regular deploys, could go wrong.
  • Set up visibility tools. During a truly novel incident, almost no one feels like there is enough visibility. But getting the right monitoring tools in place is something that has to be done beforehand and is often the difference between quick resolution and a drawn-out outage.
  • Write and update the run book. During an incident, an accurate, updated run book provides all the information a team needs to work methodically to diagnose and fix the incident rather than attempting to fix the problem through trial-and-error.

Follow the run book

Assuming you have an updated run book, follow it. Run books lessen the cognitive load during an incident and guide teams to address the most likely causes of the problem first. This helps the team diagnose and fix the problem as quickly as possible.
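One way to picture "most likely causes first" is a run book expressed as an ordered list of checks. The Python sketch below is illustrative only: the `RunbookStep` structure, the likelihood numbers, and the stubbed checks are all assumptions made for this example, and real checks would query your own monitoring or deployment tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunbookStep:
    """One diagnostic step: a suspected cause, how to check it, and what to do."""
    cause: str
    likelihood: float          # rough estimate from past incidents (assumed)
    check: Callable[[], bool]  # returns True when this cause is confirmed
    remedy: str                # human-readable instruction for the responder


def first_confirmed_step(steps: List[RunbookStep]) -> Optional[RunbookStep]:
    """Walk the steps from most to least likely and return the first confirmed one."""
    for step in sorted(steps, key=lambda s: s.likelihood, reverse=True):
        if step.check():
            return step
    return None


# Hypothetical run book for an "API returns 5xx" alert, with stubbed checks.
runbook = [
    RunbookStep("database connection pool exhausted", 0.2, lambda: True,
                "restart the pool and raise the connection limit"),
    RunbookStep("bad deploy in the last hour", 0.6, lambda: False,
                "roll back to the previous release"),
    RunbookStep("expired TLS certificate", 0.1, lambda: False,
                "renew the certificate"),
]

step = first_confirmed_step(runbook)
if step is not None:
    print(f"Confirmed cause: {step.cause}. Next action: {step.remedy}")
else:
    print("No run book step matched; escalate and ask for help.")
```

The deploy check runs first because it has the highest assumed likelihood. When no step matches, the sketch falls through to escalation instead of trial-and-error.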

Ask for help

It can be tempting for engineers to try too hard to fix the problem on their own, often delaying notifying others. Especially if it’s a large, customer-impacting incident, the best practice is to bring in more people early, or perhaps immediately. This also helps engineers get past the shell-shocked, overwhelmed feeling that can take over when facing a particularly gnarly incident.

Fix the cause, not the symptom

This last best practice is a balancing act. If the symptom can be found and remedied quickly but determining the underlying cause will take much longer, it makes sense to focus on getting the system back online and leave the more complex fix for later. However, teams also shouldn’t always jump to fix the obvious symptom. Even during the incident, it’s better to address the presumed cause whenever that doesn’t significantly delay recovery, and doing so as much as possible will help build a more resilient application in the long run.

What to avoid

No best practice post is complete without a couple of traps organizations can easily fall into. When it comes to incident response, the most common mistakes I’ve seen are:

Failing to alert customers. Incident response is not just the process of fixing a technical bug, but rather an effort to remedy a poor user experience. One of the first steps, especially in a major bug or downtime, is to notify stakeholders about what’s going on. While this won’t make the technical problem go away any sooner, it will improve customers’ perception of the response.

Relying on too few people for incident response. Some organizations rely on a small cast of ‘heroes’ to fix all incidents. This works well in the short run, because these engineers get very good at understanding the system and making things work again. In the longer term, however, this makes it hard to scale and creates business risk because there’s too much reliance on a few people. It also leads to burnout, increasing the likelihood that the company will lose the very engineers it’s depending on to fix all incidents.

Incident response will never be relaxing, but it doesn’t have to be frantic, either. With the right processes, tools and preparation, teams can approach incidents with a clear strategy for finding and remedying the problem as quickly as possible. Not only will doing so improve engineers’ morale and customers’ experience, but effective incident response that’s focused on methodically finding the underlying cause of the problem will also strengthen the application over time.

Incidents will never completely go away, but responding to them well can decrease their impact on customers and improve the company’s reputation. When incident response also focuses on fixing underlying causes, it will lead to progressively fewer incidents as the fix for each outage improves the application’s resiliency.

William Li

just a regular guy. software engineer @ effx. previously @airbnb, @banjo.
SF Bay Area