Resilience can be defined as “the capacity to recover quickly from difficulties; toughness.” When I was starting my career in IT, racing to get features and automations out the door, I was occasionally guilty of assuming the resilience of my code (and of the larger application and IT systems). To check whether you are guilty of the same, we first need to do some introspection.
As a business, we want to ensure that our systems are available and that we provide a great customer experience. To do this, we usually rely on several disciplines of testing, the most ubiquitous being unit and integration testing. While unit testing takes care of the quality of individual components, integration testing addresses issues between components or even between different services.
Typically, though, we don’t test for a service’s complete failure or for high latency in a service’s response. Couple this with the fact that almost all modern IT systems are highly distributed, and we get further issues, such as cascading failures, that are very hard to foresee from a typical Sprint team’s perspective.
In this blog post, I am going to touch upon:
- the discipline of “Chaos Engineering”
- how to go about incorporating it into your company’s delivery workflows, and
- state-of-the-art tools to consider.
Chaos Engineering is defined (from PrinciplesofChaos.org) as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” The goal is to expose weaknesses in your systems before they manifest themselves as some end-user service being down. By doing this on purpose, you and your systems become better at handling unforeseen failures.
You might be wondering “How does one go about this?”
A simple way to begin is to look at recent Production issues and ask whether you could have caught any of them by experimenting earlier. Many traditional shops have a Problem Management group that can spearhead this discussion, or you can check with your DevOps/Service team(s).
In any case, a good place to start is the Test environment. Introduce system degradation or graceful restarts using something like Chaos Monkey.
Planning your approach to chaos experiments
The initial stages of your foray into introducing failures/chaos into your organization play a vital role in ensuring the success of your roll-out. The following pointers might come in handy as you begin your journey:
- know your application’s architecture and steady state metrics
- work with non-critical services that have a good steady state defined
- apply either an opt-out or opt-in (less aggressive) model for the delivery teams
- provide ways to evangelize your experiments with teams in QA environments
- have the necessary fallbacks in place (circuit breakers, for example) and verify that they triggered as expected
- after the experiment, measure and compare against the known ‘steady state’ to confirm you are improving (for example, aiming for a lower MTTR, Mean Time to Recover), then run the tests again and measure once more.
Your goal is to slowly move towards Automated Chaos on the service in question.
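To make the last pointer concrete, here is a minimal sketch of how you might compare two runs of the same experiment. The Recovery record, the 5% tolerance and all the numbers are assumptions for illustration; in practice these values would come from your monitoring system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Recovery:
    """One injected failure: when it was detected and when service recovered."""
    detected_at: float   # seconds since the experiment started
    recovered_at: float  # seconds since the experiment started


def mean_time_to_recover(recoveries):
    """Average number of seconds between detection and recovery."""
    return mean(r.recovered_at - r.detected_at for r in recoveries)


def within_steady_state(observed, baseline, tolerance=0.05):
    """True if a metric is within +/- tolerance (as a fraction) of its baseline."""
    return abs(observed - baseline) <= tolerance * baseline


# Hypothetical measurements from two runs of the same experiment.
first_run = [Recovery(10.0, 130.0), Recovery(300.0, 390.0)]
second_run = [Recovery(12.0, 72.0), Recovery(310.0, 360.0)]

mttr_first = mean_time_to_recover(first_run)
mttr_second = mean_time_to_recover(second_run)
improved = mttr_second < mttr_first
```

Tracking the same two numbers after every run gives you a simple trend line to share with the delivery teams.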
From here, you can move into more specific experiments. For example, if you are doing failovers, create experiments where a specific business-critical platform comes back up with a key piece missing. Consider a situation where a messaging/streaming platform fails over but with a topic missing, or with just half its intended capacity. Determine whether the system can handle this or whether it fails.
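A post-failover check for that scenario could be sketched as below. The function name, the 90% capacity threshold and the topic names are all hypothetical; how you list topics and measure capacity depends on your messaging platform.

```python
def failover_findings(expected_topics, actual_topics,
                      expected_capacity, actual_capacity,
                      min_capacity_ratio=0.9):
    """Return a list of problems found after a failover (empty means healthy)."""
    findings = []
    # Topics that existed before the failover but are absent afterwards.
    missing = sorted(set(expected_topics) - set(actual_topics))
    if missing:
        findings.append("missing topics: " + ", ".join(missing))
    # Capacity below the agreed fraction of what was there before.
    if actual_capacity < expected_capacity * min_capacity_ratio:
        ratio = actual_capacity / expected_capacity
        findings.append(f"capacity at {ratio:.0%} of expected")
    return findings


# Hypothetical state after a failover: one topic lost, half the brokers back.
problems = failover_findings(
    expected_topics={"orders", "payments"},
    actual_topics={"orders"},
    expected_capacity=6,
    actual_capacity=3,
)
```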
You can take this one step further by looking for any cascading impacts your failure might have. In the messaging example, maybe this fails your loan application intake process, your payment processing or your checkout process. None of this can be clearly predicted until the experimentation phase. One key thing to remember: to be successful in testing for cascading failures and addressing them in QA, you should evangelize the experiments and have representatives from the affected service teams participate in them.
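Before the experiment, it can help to list which services might be hit so you know whom to invite. The sketch below walks a dependency map breadth-first; the map itself is a made-up example based on the services named above, and a real one would come from your architecture documentation or service registry. It only tells you what could be affected, not what will be, which is what the experiment is for.

```python
from collections import deque

# Hypothetical map: service -> services that depend on it directly.
DEPENDENTS = {
    "messaging": ["loan-intake", "payments"],
    "payments": ["checkout"],
    "loan-intake": [],
    "checkout": [],
}


def downstream_impact(failed_service, dependents):
    """List every service reachable from the failed one, nearest first."""
    seen = {failed_service}
    impacted = []
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


teams_to_invite = downstream_impact("messaging", DEPENDENTS)
```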
Approaching chaos experiments from a Service owner’s perspective
For a service owner, there are some neat ways you can inject failure/latency into your own services and experiments. Using Fault Injection libraries like the Chaos Toolkit or the FIT platform (built within Netflix) can help here. The Chaos Toolkit provides ways to perform “probes” and “actions” on a service that help you to conduct experiments and perform rollbacks. For stateful applications running on Kubernetes containers, you can use a tool like Litmus.
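If you want a feel for what fault injection looks like inside your own code before adopting one of these tools, here is a small sketch. It is not the Chaos Toolkit or FIT API; inject_faults, InjectedFault and fetch_balance are hypothetical names, and the enabled switch stands in for whatever feature flag you use to turn an experiment on and off.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def inject_faults(failure_rate=0.0, latency_seconds=0.0,
                  enabled=lambda: True, rng=random.random, sleep=time.sleep):
    """Decorator that adds latency and/or failures to a call while enabled."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if latency_seconds > 0:
                    sleep(latency_seconds)          # simulate a slow dependency
                if rng() < failure_rate:
                    raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call: 10% of calls fail, every call is 200 ms slower.
@inject_faults(failure_rate=0.1, latency_seconds=0.2)
def fetch_balance(account_id):
    return {"account": account_id, "balance": 100}
```

Passing rng and sleep as parameters keeps the wrapper easy to test, and flipping enabled to return False acts as the rollback.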
Remember that our goal is not to cause problems, but to reveal them. We have to be careful not to overlook the type and amount of traffic being impacted. Tools like the Chaos Automation Platform (ChAP), built within Netflix, provide ways to route a percentage of the traffic to the experiment, thereby helping ‘to increase the safety, cadence, and breadth of experimentation.’
By keeping a close eye on the real, control and experiment traffic, we can better monitor business metrics and, based on thresholds set in advance, call off the experiment before there is any impact to the business. The ability to cut an experiment short while the service owner or engineers debug on the side or work on a solution is a great way to keep strengthening weak spots without fighting fires in production or impacting end customers.
You can take your resilience to the next level by building controlled outages into your DevOps process, i.e. testing, deployment or environmental disruptions. Introducing security experiments is another way to bring insufficient or overlooked security controls to the forefront.
While Chaos experiments are very useful, one current limitation is the amount of upfront time involved in meeting and planning with different teams and in finding good use cases and faults to inject into services. The industry and its best practices are maturing as new algorithms are tested to automate the identification of the right services to run experiments on. This can help reduce or even eliminate the upfront meeting time and automate the discovery of critical flaws early on, before they surface as a production issue or customer complaint.
Werner Vogels, Amazon’s CTO, is well known for his quote “everything fails all the time”. This is even more true in the elastic cloud environment with applications architected on immutable infrastructure. So, the culture of asking “What happens if this fails?” needs to shift to “What happens when this fails?”. At Sungard AS, we take pride in striving to provide state-of-the-art resilience to our customers and look at these experiments as ‘continuous limited scope disaster recovery’. This helps us identify and address issues before they become news.
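The two ideas here, limiting the share of traffic and calling off the experiment on a threshold, can be sketched in a few lines. This is not how ChAP is implemented; the group names, the hash-based bucketing and the 2% error-rate margin are all assumptions chosen for illustration.

```python
import hashlib


def assign_group(request_id, experiment_percent):
    """Deterministically place a request in 'experiment', 'control' or 'unaffected'.

    The control group is sized the same as the experiment group so the two
    can be compared like for like.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 100.0  # 0.00 to 99.99
    if bucket < experiment_percent:
        return "experiment"
    if bucket < 2 * experiment_percent:
        return "control"
    return "unaffected"


def should_abort(control_errors, control_total,
                 experiment_errors, experiment_total, max_delta=0.02):
    """True if the experiment's error rate exceeds control by more than max_delta."""
    if control_total == 0 or experiment_total == 0:
        return False  # not enough data yet to decide
    control_rate = control_errors / control_total
    experiment_rate = experiment_errors / experiment_total
    return experiment_rate - control_rate > max_delta
```

Hashing the request ID means the same request (or user, if you hash a user ID instead) always lands in the same group, which keeps the comparison stable for the duration of the experiment.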
I hope the discipline, frameworks and maturity around Chaos Engineering that we touched upon come in handy as you embark on a journey towards true resiliency.