Incident response from monolith to microservices

Moving to a microservices architecture is a major change for any engineering organization. It impacts everything: how teams collaborate, the tools they use, how their systems are deployed, and even how incidents are handled in production. Operating a monolith requires expertise across the whole system, so incident response often falls on the technical lead and a few operations staff.

As you shift to microservices, the operational burden becomes too great for a handful of individuals to bear. Each new service you add is another component that can fail, and another item for your operations team to manage. This is not meant to imply that microservices are brittle. In fact, many adopt them for the promise of reliability. Like with any change, you must learn and adapt to the world around you to be successful.

Know the key differences

When compared to a monolithic system, there are some key differences to be aware of. First, the failure modes are very different. The journey to a microservices world is one of distributed computing. As you decouple functionality from a monolith, a given operation may require several calls to other services to complete. Each additional service call adds latency to the request and increases the risk of a call failing. Teams need to be aware of this when setting timeouts for calls and establishing service level objectives for their project. Configuring either of these incorrectly can have a ripple effect across your ecosystem.
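To make the timeout concern concrete, here's a minimal Go sketch (the service URLs and time budgets are hypothetical) of giving an operation a deadline aligned with its service level objective and letting each downstream call spend part of that budget, so a slow dependency fails fast instead of hanging and rippling back up the call chain:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// callService issues a GET against a downstream service, bounded by the
// caller's context so every hop shares the request's overall deadline.
func callService(ctx context.Context, url string) error {
	// Give each individual hop its own, tighter timeout as well.
	ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func handleCheckout(w http.ResponseWriter, r *http.Request) {
	// The overall operation gets a budget aligned with its SLO, e.g. one second.
	ctx, cancel := context.WithTimeout(r.Context(), time.Second)
	defer cancel()

	// Each downstream hop spends part of that budget. If inventory is slow,
	// less time remains for pricing, and the request fails fast rather than
	// dragging every caller above it past its own deadline.
	for _, url := range []string{
		"http://inventory.internal/reserve", // hypothetical service URLs
		"http://pricing.internal/quote",
	} {
		if err := callService(ctx, url); err != nil {
			http.Error(w, err.Error(), http.StatusGatewayTimeout)
			return
		}
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/checkout", handleCheckout)
	http.ListenAndServe(":8080", nil)
}
```

The exact numbers matter less than the relationship: per-call timeouts must fit inside the operation-level budget, and both should be derived from the service level objective rather than picked arbitrarily.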

Since the failure modes are different, risk profiles for applications change as well. A system that was once complicated to deploy can be broken down into several more manageable chunks, each requiring a fraction of the expertise that the larger monolith once demanded. While you may still have a few components that call for deep expertise, the majority can be handed off to other team members.

Establish developer accountability

Developers need to own their software end to end. The performance of their system is heavily influenced by the code they write and deploy, from local machines all the way through to production. When developers aren’t accountable for their systems, others pay the price. Maybe your operations team got paged and needed to roll back your change. Or maybe you pushed an update that wound up costing the company a good amount of money. When developers are accountable for their systems, they’re more aware of the impact of their changes. And while mistakes still happen, they tend to happen far less frequently and with less severity.

One way you can make your developers accountable for their systems is by establishing a daily first responder system. A daily first responder (DFR) is a designated member of the team who is responsible for handling all production alerts during office hours. This gives individual contributors hands-on experience troubleshooting issues in production, with the support and guidance of others around them. It’s a great way for newcomers to learn about the system and how production is set up. DFRs are rotated on a weekly basis to prevent burnout and fatigue.
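If you want the hand-off to be automatic, the week's responder can be derived from the roster. Here's a rough sketch in Go; the names and start date are placeholders:

```go
package main

import (
	"fmt"
	"time"
)

// currentDFR returns which roster member is the daily first responder for the
// week containing now, advancing one position per elapsed week since start.
func currentDFR(roster []string, start, now time.Time) string {
	weeks := int(now.Sub(start).Hours() / (24 * 7))
	if weeks < 0 {
		weeks = 0
	}
	return roster[weeks%len(roster)]
}

func main() {
	roster := []string{"alice", "bob", "carol"}                     // placeholder team members
	start := time.Date(2020, time.January, 6, 0, 0, 0, 0, time.UTC) // placeholder rotation start (a Monday)
	fmt.Println("This week's DFR:", currentDFR(roster, start, time.Now()))
}
```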

Another way you can make developers accountable is by having them on call outside of office hours. :audible_gasp: I know, but just hear me out. There’s an important feedback loop that a team develops when they’re the last line of defense at night. They learn about the gaps in their runbooks and changes the team needs to make to improve the reliability of their system.

Many developers seek roles that do not require them to be on call outside of business hours. Some have families they dedicate time to, others just want a clean break from the workday. Even so, I’ve found that many individuals volunteer to help with nightly on-call rotations. A big benefit to participating is that it helps you grow as an individual contributor. You learn about the types of failures that happen and how to design or adjust a system to handle them. Some lessons are much harder to learn secondhand.

My suggestion? Make nightly on-call rotations voluntary. Allow those who are willing to participate to do so. If no one volunteers, then you’re back to your technical lead and operations team. Using a combination of DFR and voluntary on-call can drastically improve your incident response experience when moving to microservices.

Enable teams with visibility and access

As developers address more issues through DFR and on-call rotations, you’ll likely find yourself bottlenecked by your operations staff. Remember, misconfiguration is one of the most common factors in production events. Operations is often the gatekeeper to production configuration, and checking and verifying values requires elevated permissions. To address this bottleneck, some companies try to hire more operations staff. Others build out site reliability teams to bridge the gap between traditional ops and development. Both suffer from the same problem: the number of people actively debugging and troubleshooting issues is greater than the number of people who can resolve them.

Instead, I want to offer you a third option: enable and empower your existing team. That’s right. Find ways to give your developers access to production. Meet with your compliance and security teams early on to ensure their needs are met. Some may say “no developer should have access to production” is a requirement. Beware of solution statements like this. They attempt to avoid talking about the problem by pushing a foregone conclusion. When gathering requirements for a software system, your objective is to understand the problem domain, not the solution space. By discussing the problem domain, we’re able to understand what types of controls developers need to be effective while balancing the requirements of compliance and security.

This is often the hardest part of the process. It often requires letting go of how your company has typically done things. It requires investing in your infrastructure earlier rather than later. This can be stressful, but very rewarding. Remember, while you may be going through this at your company, many others out there have already gone through it with theirs. Learn from their experience and the lessons they chose to share. While many modern security tools have made it easier to grant access to production, there’s no substitute for a fresh perspective and the hands-on experience of others.

By choosing to empower your developers instead of the alternatives, you’ll notice a few things. For one, you’ve made a deeper commitment to compliance and security than you would have pursuing the traditional route. Had you chosen the former options, operations would only continue to be over-privileged, making it harder to lock down access later on. In addition, you’ll also find yourself with finer grained access to systems. Teams may only be able to see configuration keys that are relevant to projects they work on. Secret values may be shown if approved by a manager. Finally, with this great power comes great responsibility. Privileges can be granted to developers based on a combination of factors (such as their role, function, and level). This can help strengthen both your management and individual contributor career tracks.
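As one illustration of what that finer-grained access might look like, here is a minimal Go sketch of a policy check that combines team ownership, secrecy, manager approval, and level. The roles, levels, and rules are hypothetical, not a prescription:

```go
package main

import "fmt"

// Principal describes who is requesting access.
type Principal struct {
	Name  string
	Role  string // e.g. "developer", "manager", "sre"
	Team  string // e.g. "payments"
	Level int    // seniority level
}

// Request describes the configuration key they want to see.
type Request struct {
	OwningTeam      string // team that owns the configuration key
	IsSecret        bool   // secret values need extra approval
	ManagerApproved bool   // whether a manager has signed off
}

// canViewConfig grants access when the requester's team owns the key,
// requires a manager's approval for secret values, and lets senior SREs
// see everything.
func canViewConfig(p Principal, r Request) bool {
	if p.Role == "sre" && p.Level >= 3 {
		return true
	}
	if p.Team != r.OwningTeam {
		return false
	}
	if r.IsSecret {
		return r.ManagerApproved
	}
	return true
}

func main() {
	dev := Principal{Name: "sam", Role: "developer", Team: "payments", Level: 2}
	fmt.Println(canViewConfig(dev, Request{OwningTeam: "payments"}))                 // true: own team, not secret
	fmt.Println(canViewConfig(dev, Request{OwningTeam: "payments", IsSecret: true})) // false: secret, no approval
	fmt.Println(canViewConfig(dev, Request{OwningTeam: "search"}))                   // false: another team's key
}
```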

Invest in your SRE team and practices

Like many roles in the tech industry, Site Reliability Engineer means something different depending on which company you talk to. I’ve seen some companies consciously develop their SRE team. I’ve also heard of others whose operations staff came into work the next day with a new title. As a result, how an SRE team is built matters.

An SRE team is typically composed of individuals from both operations and development backgrounds. It’s this hybridization that makes them effective. Each brings their unique experiences, with a shared focus on hardening production systems. Without one, the other is weakened. Keep the composition of your team in mind as well as their areas of expertise.

No one likes being interrupted. When developers have the access and visibility they need, they’re able to address and resolve issues without the aid of SRE. When developers are on call during the day, SRE is able to focus on hardening your production systems. When an SRE is interrupted to address developers’ requests because they lack access, both lose time and focus.

SREs are essential to reducing the burden of nightly on-call rotations. While there may be a moment here or there where the engineering team gets brought back into the mix, the majority of off-hours calls should be handled by SRE. When SREs are unable to address the issue, the development team MUST get involved.

Parting thoughts

Even with all the best practices in the world, there isn’t a one-size-fits-all approach to this. Best practices help ensure that systems run smoothly; that incidents are tracked, monitored, and responded to quickly; that stakeholders communicate; and that the business continues to function. The companies who are successful often take a proactive approach, rather than a reactive one.

In practice, this means developing runbooks for applications and keeping them up to date as the system evolves. It means building paths for teams to share and discover knowledge about their systems beyond informal conversations and trial-and-error. Keep in mind that each company is different, and even within a company, teams and technologies can vary drastically. Each may have different requirements around the availability and performance of a system. If you're looking for something to help you adapt and remain proactive in your approach, take effx for a spin and try it here for free.

Mya Pitzeruse

software engineer @ effx 👩‍💻 previously @ Indeed.com