Why Timelines Matter during an Incident

The first step during an incident is finding out what is broken and how it broke. In most cases, a specific action was taken that likely caused the incident. Ideally, that action can be undone and the problem easily solved. Even when the change is not so simple to undo, perhaps requiring a manual rollback or a second code change, identifying the offending change is the first step to resolving the incident.

The faster you can determine a potential cause of an incident, the faster you can fix it. Timelines are one of the most important tools for determining an incident's cause: understanding when changes happened in relation to when the errors started is a basic troubleshooting tactic, and a timeline is the most direct way to get that information.

We generally talk about building a timeline as part of the post-mortem process, because without the right tools in place it's a time-consuming exercise. But engineers need the information a timeline provides during the incident, too.

Looking for data

Most software incidents are the result of human actions, so at a minimum, knowing which actions preceded an error report is critical to understanding what is going on now.

Unfortunately, in most circumstances that information is not readily available in one place. In the absence of a tidy, correlated timeline of events for both an individual service and its dependencies, engineers need to run through a mental list of which tool is most likely to have the relevant data. Here's what the on-call engineer would likely do in the search for the information a timeline would provide (a rough sketch of this manual sweep follows the list):

  • Check the deployment histories in Spinnaker, Argo CD or another continuous deployment tool.
  • Check feature flag histories in LaunchDarkly or similar feature flag management tools.
  • Check the Kubernetes cluster's event and rollout histories.
  • Check for changes in the underlying infrastructure, in the databases or in the network.
  • Check any external providers to see if they have reported changes.
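
To make that concrete, here is a minimal sketch of what the per-service sweep might look like from the command line, using Python's standard library to shell out to a few common CLIs. The service and namespace names are placeholders, and the exact commands depend on your stack; treat it as an illustration of the manual process, not a prescribed set of checks.

```python
# A minimal sketch of the manual sweep described above. The service,
# namespace, and Argo CD application names are placeholder assumptions.
import subprocess

SERVICE = "checkout"       # hypothetical service under investigation
NAMESPACE = "production"   # hypothetical Kubernetes namespace

CHECKS = {
    "Argo CD deploy history": ["argocd", "app", "history", SERVICE],
    "Kubernetes rollout history": [
        "kubectl", "rollout", "history", f"deployment/{SERVICE}", "-n", NAMESPACE,
    ],
    "Recent cluster events": [
        "kubectl", "get", "events", "-n", NAMESPACE, "--sort-by=.lastTimestamp",
    ],
}

for label, cmd in CHECKS.items():
    print(f"=== {label} ===")
    try:
        # Each tool prints its own, differently formatted history.
        print(subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"(could not run {cmd[0]}: {exc})")
```

Each command returns its own, differently formatted history, so the engineer is still left to line up timestamps by hand, and to repeat the whole loop for every dependency.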

At every step, engineers have to scan through a dashboard, then run through a mental checklist of what to do next. Even when one dashboard has potentially useful data, the engineer should be thinking about which other tools might have relevant data as well. If none of the dashboards show anything out of the ordinary, the next step is to repeat the same process for all of the service's dependencies. Even within the Spinnaker history, each service's deploy history lives on a different page, so the engineer has to toggle between them, and also has to remember which dependencies to check.

When no single tool can be trusted to have all the relevant timeline data, there is also a greater risk that an engineer will find incomplete information in one tool and follow it down a dead end. The need to toggle between dashboards and run through a mental checklist of every possible tool also adds to the cognitive load during an incident, which can lead engineers to make bad decisions or even become overwhelmed and freeze up.

Keeping it in one place

An engineer could easily spend at least five minutes on each of the steps above trying to find the right information, and in a worst-case scenario could have seven or eight dashboards to go through. That adds up to thirty-five or forty minutes wasted before finding the right information, and it doesn't account for the fact that relevant information might be spread across multiple dashboards. Without a huge monitor or a second one, that means toggling between dashboards to see how the information lines up and whether or not it points to the incident's cause.

With a timeline tool like effx, all of the information is in one place, whether it's about an individual service or that service plus all of its dependencies. When recent deploys, feature flag changes, cluster events and changes to the databases or other infrastructure all appear in the same view, it's much easier to correlate events and figure out which one led to the incident.
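
Conceptually, the idea is to normalize every change, whatever tool it came from, into a common event shape and merge everything into one chronologically ordered stream. The sketch below illustrates that idea; the Event fields and the sample data are illustrative assumptions, not effx's actual data model or API.

```python
# Illustrative sketch of correlating change events into a single timeline.
# The Event fields and sample events are assumptions for illustration only.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Event:
    timestamp: datetime   # when the change happened
    source: str           # e.g. "spinnaker", "launchdarkly", "kubernetes"
    service: str          # which service (or dependency) changed
    description: str      # human-readable summary of the change

def build_timeline(*event_streams: list[Event]) -> list[Event]:
    """Merge per-tool event streams into one chronologically ordered list."""
    merged = [event for stream in event_streams for event in stream]
    return sorted(merged, key=lambda e: e.timestamp)

# Hypothetical events pulled from three different tools:
deploys = [Event(datetime(2021, 3, 4, 14, 2, tzinfo=timezone.utc),
                 "spinnaker", "checkout", "Deployed build #812")]
flags = [Event(datetime(2021, 3, 4, 14, 9, tzinfo=timezone.utc),
               "launchdarkly", "checkout", "Enabled flag new-pricing for 100% of users")]
cluster = [Event(datetime(2021, 3, 4, 14, 11, tzinfo=timezone.utc),
                 "kubernetes", "checkout", "Pod restarts spiked (CrashLoopBackOff)")]

for event in build_timeline(deploys, flags, cluster):
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.service}: {event.description}")
```

Reading one merged stream like this, it's immediately obvious in the hypothetical output that the flag change at 14:09 landed two minutes before the pod restarts began, which is exactly the kind of correlation that's hard to spot across seven dashboards.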

A timeline tool also helps teams move away from a "gut feel" approach to deciding which dashboard to check first. Especially in cases where initial intuitions are wrong, it dramatically reduces the time it takes for engineers to understand what happened, which means they can start fixing it sooner.

Would you like to see how different the incident response process is with a timeline tool like effx? Schedule a demo.

William Li

just a regular guy. software engineer @ effx. previously @airbnb, @banjo.
SF Bay Area