As a relatively new practice, chaos engineering is surrounded by myths, from the belief that it means randomly shutting down production systems to the assumption that it requires huge investments of time and money. There’s a lot of confusion over its purpose, value and practice. This is a problem for DevOps teams, especially since more than half of the teams surveyed in the 2019 Gartner DevOps Survey listed improving system reliability and release quality among their top five DevOps objectives.
In this article, we’ll clear up some of this confusion by presenting seven important truths to help you make an informed decision about chaos engineering and how it can help your engineering organization.
Truth 1: Chaos Engineering Is Not Chaotic
What comes to mind when you think of chaos engineering as a practice? If the answer is causing random production outages, you’re not alone. Tools such as Chaos Monkey popularized this idea, but for most teams practicing chaos engineering, it is a well-planned, controlled process that aims to mitigate chaos rather than cause it.
The goal of chaos engineering isn’t to add chaos, but to mitigate chaos.
It’s true that chaos engineering involves creating potentially harmful conditions on otherwise healthy systems. For example, we might test our application’s ability to handle load by increasing CPU usage on our servers. With enterprise solutions, we have full control over which systems the test affects (known as the blast radius), how much CPU we consume (known as the magnitude) and how long the test runs. We can also stop the test immediately and roll back its impact if it has unexpected consequences.
Because we know exactly which conditions we’re introducing into our systems, we can revert them at any time. Yes, we’re still causing harm, but we’re doing so in a way that helps us learn about our systems and reduces the risk of failures, both real-world and induced.
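To make the three controls concrete, here is a minimal sketch of a CPU experiment in Python. It is not the API of any particular chaos engineering tool: the `CpuExperiment` class, its field names and the host names are all hypothetical, and the busy loop only loads a single core of the machine it runs on. The point is simply to show blast radius, magnitude, duration and a halt switch as explicit, bounded parameters.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class CpuExperiment:
    """Hypothetical description of a CPU attack (all names are illustrative)."""
    blast_radius: List[str]   # hosts the experiment is allowed to touch
    magnitude: float          # fraction of one core to consume, in (0, 1]
    duration_s: float         # how long the attack runs unless halted
    halt: threading.Event = field(default_factory=threading.Event)

    def __post_init__(self):
        # Reject unbounded experiments before anything runs.
        if not 0.0 < self.magnitude <= 1.0:
            raise ValueError("magnitude must be in (0, 1]")
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")


def burn_cpu(exp: CpuExperiment, slice_s: float = 0.05) -> float:
    """Spin for `magnitude` of each time slice; return the elapsed seconds."""
    start = time.monotonic()
    deadline = start + exp.duration_s
    while time.monotonic() < deadline and not exp.halt.is_set():
        busy_until = time.monotonic() + slice_s * exp.magnitude
        while time.monotonic() < busy_until:
            pass  # busy-wait: this is the induced load
        # Idle for the rest of the slice; wakes early if the halt is set.
        exp.halt.wait(slice_s * (1.0 - exp.magnitude))
    return time.monotonic() - start


def run_on_host(exp: CpuExperiment, hostname: str) -> bool:
    """Run the attack only if `hostname` is inside the blast radius."""
    if hostname not in exp.blast_radius:
        return False  # outside the blast radius: leave the host untouched
    burn_cpu(exp)
    return True


if __name__ == "__main__":
    exp = CpuExperiment(blast_radius=["web-1"], magnitude=0.5, duration_s=2.0)
    # Simulate an operator pressing the halt button after 0.2 seconds.
    threading.Timer(0.2, exp.halt.set).start()
    print("ran on web-1:", run_on_host(exp, "web-1"))
    print("ran on db-1:", run_on_host(exp, "db-1"))
```

Setting `exp.halt` from any thread ends the load within one time slice, which stands in for the "stop and roll back" control described above. A real tool would also need to undo any state it changed, which a pure CPU burn does not have.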
Truth 2: Developers Care About Reliability
Developers aren’t only interested in building new features. While development teams often prioritize feature development, this is likely because business-driven initiatives result in rapid release schedules. Developers, especially those who have responded to incidents or worked through bug reports, understand the value of resilient software. They just don’t have time to build it.
The problem is that this lack of time creates a reliability gap in our applications. Rapidly building features without adequately testing their resilience leaves failure modes undiscovered, leading to problems that developers need to go back and fix at the expense of new projects. The longer we go before finding these failure modes, the more likely we are to experience unexpected behaviors and outages in our applications, and the more expensive the fix will be.