What's Different about All-Remote Incident Response?

At first glance, incident response seems to be the one thing relatively unaffected by the transition to all-remote. Even in a small company, only around 20 to 30% of incidents are handled ‘in person,’ with a team of people all huddled around the same conference table. In my experience, at larger companies even those incidents that occur during work hours are managed ‘remotely.’ When an incident requires more than one person to be involved, those people often don’t work in the same building and won’t walk two blocks to another building just to work on the incident at the same table.

But effective incident response is more than a list of best practices to follow during the incident. Building a strong incident response process requires both preparation and post-mortems, and post-mortems take two forms: the formal procedure and the informal water-cooler conversations that often happen immediately after an incident. Working all-remote has implications for how teams share knowledge, prepare for on-call rotations, onboard new team members and discuss what went wrong following an incident. All of those things influence how successful individuals, teams and organizations are at responding to incidents.

The Unofficial Post-Mortem

If there’s a late-night incident, it’s usually the first topic in the office in the morning. That’s just human nature, but discussing the incident in the morning is also important for a few reasons:

  • Everyone knows that the people involved directly in the incident response might be feeling grumpy and unproductive. No one takes it too personally when those people act accordingly;
  • Everyone realizes that the incident happened, even if it wasn’t an all-hands scenario. Even incidents that aren’t major site failures are important for everyone to be aware of but can slip through the cracks of formal reporting procedures;
  • It’s a learning experience for everyone. Talking about real incidents that happen in an informal context is a part of getting new people ready to join the on-call rotation and preparing the person who’s next in the on-call rotation. It also makes people realize the same incident could happen to them, encouraging them to think about how it could be avoided or what they would do if a similar incident happens during their on-call rotation.

The unofficial post-mortem doesn’t replace the thorough investigation of a true post-mortem, but it might be just as important to building a strong organizational culture and building an increasingly resilient application. And not all incidents warrant a formal post-mortem — those are usually reserved for more serious incidents because they take time.

Probably the biggest risk from losing the informal post-mortem is that minor incidents will slip under the radar: teams won’t be aware of how often they are occurring and will end up fixing the same problems repeatedly. Not only is this a waste of time (and most people feel even grumpier about a late-night incident when it keeps recurring), but failing to address mid-severity bugs is detrimental to the application’s long-term health.

Codifying the ‘unofficial’

One of the key challenges for all-remote teams is making informal processes of all kinds official. Finding formal ways to replace having coffee with the team or a lunchtime chat with your supervisor isn’t always easy, but doing so is what enables companies to work remotely without losing the team building and information sharing that happens in informal settings.

The best way to make up for the lost post-incident chat is to create an official process for all team members to learn about incidents that happen when they aren’t on call. A hand-off document should ideally be part of this official process, and should cover things like:

  • How many times were you alerted?
  • How many alerts required action?
  • What notable incidents happened?

The last thing a hand-off document should include is perhaps the most important: What lingering worries or thoughts does the person going off the rotation have? What comments or notes do they want to share? It’s important to communicate these intuitions about what might go wrong in the near future or what should be approached with particular care. This is the most challenging thing to include in a hand-off document, because it can’t be automated and requires engineers to put into writing suspicions that they might not have hard evidence to support. Nonetheless, this is the type of information that people naturally share in person and is an important part of learning from and preparing for incidents.
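To make the shape of such a document concrete, here is a minimal sketch in Python. The class name, field names and rendered layout are all illustrative assumptions rather than a prescribed format; the only things taken from the list above are the three questions and the free-text section for lingering worries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffDocument:
    """One on-call shift's hand-off notes (all names here are hypothetical)."""
    engineer: str
    alerts_received: int
    alerts_actionable: int
    notable_incidents: List[str] = field(default_factory=list)
    # Free text written by a person; this is the part that cannot be automated.
    lingering_worries: str = ""

    def __post_init__(self) -> None:
        # Actionable alerts are a subset of all alerts received.
        if self.alerts_actionable > self.alerts_received:
            raise ValueError("actionable alerts cannot exceed alerts received")

    def render(self) -> str:
        """Return the hand-off as plain text, one question per line."""
        lines = [
            f"On-call hand-off from {self.engineer}",
            f"How many times were you alerted? {self.alerts_received}",
            f"How many alerts required action? {self.alerts_actionable}",
            "What notable incidents happened?",
        ]
        if self.notable_incidents:
            lines += [f"  - {item}" for item in self.notable_incidents]
        else:
            lines.append("  - none")
        lines.append("Lingering worries or notes:")
        lines.append(f"  {self.lingering_worries or '(nothing written yet)'}")
        return "\n".join(lines)
```

The two counts could be filled in by tooling, while `lingering_worries` is deliberately left as an empty string for the engineer to write by hand.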

Automated hand-offs

As I mentioned above, the hand-offs that help teams replace informal discussion of incidents can’t be fully automated, because they require sharing unfounded suspicions and, in some cases, emotions. An automated system won’t explicitly tell you that something definitely should have been fixed months ago, but that’s a sentiment you might hear from someone who was up all night handling an incident that has happened repeatedly before.

But automating the parts of the hand-off process that can be automated is a step towards formalizing informal incident discussions. With effx, there’s an automatic weekly report that shows how many alerts fired each week and how many of those alerts were actionable. This gives each engineer an idea of what the on-call rotation should be like and also makes it easier to spot trends or problem areas that should be investigated more thoroughly. Automating the parts of the on-call hand-off document that can realistically be automated also frees engineers to spend more time writing down the observations and recommendations that an automated system won’t capture.
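As a rough illustration of the kind of aggregation behind a weekly alert report, the sketch below counts total and actionable alerts per ISO week. It is not effx's implementation or API; the `Alert` record and the `weekly_alert_summary` function are hypothetical names, and it assumes each alert has already been marked actionable or not.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Alert:
    """A single alert record (hypothetical shape, for illustration only)."""
    fired_on: date
    actionable: bool


def weekly_alert_summary(
    alerts: Iterable[Alert],
) -> Dict[Tuple[int, int], Dict[str, int]]:
    """Group alerts by (ISO year, ISO week) and count total and actionable."""
    summary = defaultdict(lambda: {"total": 0, "actionable": 0})
    for alert in alerts:
        iso = alert.fired_on.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        summary[key]["total"] += 1
        if alert.actionable:
            summary[key]["actionable"] += 1
    return dict(summary)
```

Comparing the ratio of actionable to total alerts from week to week is one simple way to spot a noisy alert or a recurring problem that deserves a closer look.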

Curious how effx can help you formalize your informal processes? Try it out now.

William Li

just a regular guy. software engineer @ effx. previously @airbnb, @banjo.
SF Bay Area