What Is Chaos Engineering?

Tech is everywhere. Depending on how high stakes your industry is, failure of a tech product or system can fall anywhere from entirely negligible to the end of life as you know it.

Hospital mainframes? Kind of important. The resiliency of the Candy Crush app on your cell phone? Probably a bit lower on the overall list of priorities.

In a distributed system of networks, failure is inevitable. Preventing catastrophe begins with a solid, watertight security design. Beyond that, though, what else can be done?

What Is Netflix Chaos Engineering?

September 20th, 2015.

All is quiet on the Western front when, suddenly, several important Amazon Web Services servers go down without a word.

Many huge companies were unable to provide for their customers for several hours. Netflix, however, was back on its feet in a matter of minutes. How? The internal company culture of Netflix had evolved to include many "failure-inducing" practices implemented in real-time to prepare both systems and engineers alike for when disaster strikes.

The company's leadership purposefully conducted simulated server outages in contained parts of the system to study and prepare for events such as these. This helped them identify holes in the system and build redundancies that allowed service to continue uninterrupted, even in the event of a major malfunction like the one mentioned previously.

These deliberate "chaos engineering" exercises gave their engineers enough of a competitive edge to see themselves through the fiasco, thanks in part to the preventative infrastructure that they'd built with this sort of doomsday event in mind.

Nobody else was ready when the big wave hit. The Netflix system was strong enough to fend for itself. Conclusion? These chaotic masterminds might be on to something here.

Intentionally Annihilating Those Who Love You

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

Principles of Chaos Manifesto

This is the heart of chaos engineering: in essence, a "fire drill" imposed on the system during working hours, when there are eyes and hands available to address the challenge presented. A given system's ability to tolerate failure is put to the test, and any vulnerabilities are exposed.

In its original 2011 context, chaos engineering concerned Netflix's IT department. Their leadership wanted to test how resilient the team's efforts were when one or more of their computers were intentionally disabled. These setbacks allowed the IT team to identify key weaknesses before they became system-wide issues and could be exploited from the outside.

Real failure? It can be costly as hell, and that goes beyond the monetary implications. Even periods of downtime, with no real lapse in security, will likely result in plenty of missed opportunities to earn revenue. Why wait for an emergency to blindside you?

The Monkeys Behind the Madness

Red teams in chaos engineering have internal black-hat hackers who cause a ruckus for everybody else.

Some companies will adopt a "red team" model that pits teams of developers against their brethren across departmental lines. The classic example that Netflix instituted, however, makes use of a "Simian Army". These bots do the dirty work for them, fairly and totally at random.

Insane? To the layman, perhaps. In the words of "Chaos Monkeys" author Antonio Garcia Martinez:

"Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables and destroys devices. The challenge is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy."

A colorful analogy. Not all of the Simians are cruel, however: Doctor Monkey monitors the performance of the system, for example. When Chaos Kong stops by for a visit, though, all bets are off; this character will take down an entire AWS region.
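To make the monkey in that analogy a little more concrete, here is a minimal Python sketch of the "pick something at random and break it" idea. It is not Netflix's actual tool: the names pick_victim and unleash_monkey, the instance IDs, and the terminate callback are all hypothetical, and the rule that skips groups with only one instance is a safety assumption added for this example.

```python
import random


def pick_victim(instances_by_group, rng=random):
    """Choose one instance at random from one randomly chosen group.

    Assumption for this sketch: groups with a single instance are skipped,
    so the monkey never removes the last copy of a service.
    """
    eligible = {g: ids for g, ids in instances_by_group.items() if len(ids) > 1}
    if not eligible:
        return None
    group = rng.choice(sorted(eligible))
    return group, rng.choice(eligible[group])


def unleash_monkey(instances_by_group, terminate, rng=random):
    """Terminate one random instance via the caller-supplied `terminate` hook."""
    victim = pick_victim(instances_by_group, rng)
    if victim is not None:
        group, instance_id = victim
        terminate(instance_id)  # hypothetical hook; a real one would call your platform
        instances_by_group[group].remove(instance_id)
    return victim


if __name__ == "__main__":
    fleet = {"api": ["i-1", "i-2", "i-3"], "db": ["i-9"]}
    print(unleash_monkey(fleet, terminate=lambda i: print("terminating", i)))
```

In a real setup, the terminate hook is where you would plug in whatever your infrastructure uses to stop a machine; here it only prints, so the sketch is safe to run anywhere.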

Chaos Engineering and the Scientific Method

Chaos engineering helps your team fight the good fight.

Chaos engineering serves as a valuable source of systemic insight for those conducting the experiments. It is not only the developers who are being put to the test here; it is the system as it exists autonomously, as well.

Before dumping the barrel of monkeys out onto the table, chaos engineering requires a bit of groundwork to be laid.

You first need to identify what you consider to be a "steady", healthy, functional state for your system. This will be the "control" that you measure any tangible outcomes against.
Begin to think about how this state will be set off-balance by the intrusion of orchestrated failure. Plan your injected fault to only affect a contained, controllable area of your system.
Introduce the "intruder" and allow the system to respond.
Observe and interpret any differences between the system as it exists now and how it was behaving before, while in homeostasis. Increase your "blast radius" of impact until you either detect a vulnerability or reach full scale, whichever comes first.
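The four steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: SimulatedCluster is a hypothetical stand-in for a real service, the success-rate metric and the min_healthy capacity floor are invented for the example, and a real experiment would measure live traffic instead.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimulatedCluster:
    """Hypothetical service: `replicas` copies behind a load balancer."""
    replicas: int = 10
    min_healthy: int = 4  # assumed capacity floor; below it, requests start failing
    down: set = field(default_factory=set)

    def inject_failure(self, count, rng):
        """Disable `count` randomly chosen healthy replicas (the "intruder")."""
        healthy = [i for i in range(self.replicas) if i not in self.down]
        self.down.update(rng.sample(healthy, min(count, len(healthy))))

    def restore(self):
        """Bring every replica back before the next round."""
        self.down.clear()

    def success_rate(self):
        """Steady-state metric: 1.0 while enough replicas survive, lower otherwise."""
        healthy = self.replicas - len(self.down)
        return 1.0 if healthy >= self.min_healthy else healthy / self.min_healthy


def run_experiment(cluster, tolerance=0.01, seed=0):
    """Grow the blast radius until the steady state breaks or full scale is reached."""
    rng = random.Random(seed)
    baseline = cluster.success_rate()                    # step 1: the control
    for blast_radius in range(1, cluster.replicas + 1):  # step 2: contained, then wider
        cluster.inject_failure(blast_radius, rng)        # step 3: introduce the intruder
        observed = cluster.success_rate()                # step 4: observe and compare
        cluster.restore()
        if baseline - observed > tolerance:
            return blast_radius  # smallest failure count that broke the steady state
    return None                  # reached full scale without a deviation


if __name__ == "__main__":
    print(run_experiment(SimulatedCluster()))
```

With the default values, the simulated cluster shrugs off up to six lost replicas and first deviates from its baseline when seven are taken down, which is exactly the kind of threshold an experiment like this is meant to surface.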

The idea is that the more difficult it is to disrupt a functional system, the more confidence you can have in its resiliency to change and bombardment. This approach shows how different aspects of the system will compensate for each other's failures in the event of an outage.

"Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system."

The Netflix Blog
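One common way to let an individual component fail without taking the whole system with it is simple failover across redundant copies. The sketch below is a rough illustration of that idea, not Netflix's architecture: call_with_failover and the replica functions are hypothetical names, and treating ConnectionError as the only failure signal is an assumption made to keep the example short.

```python
def call_with_failover(replicas, request):
    """Try each replica in turn and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:  # assumed failure signal for this sketch
            errors.append(exc)
    # Only reached when every redundant copy has failed.
    raise RuntimeError(f"all {len(replicas)} replicas failed") from (
        errors[-1] if errors else None
    )


if __name__ == "__main__":
    def broken(request):
        raise ConnectionError("replica is down")

    def healthy(request):
        return f"ok:{request}"

    # The caller still gets an answer even though the first replica is dead.
    print(call_with_failover([broken, healthy], "play"))
```

The caller never learns that the first replica was down, which is the point: the failure is absorbed inside the system instead of surfacing to the user.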

Sometimes, toying with the system in this way doesn't even come close to impacting the customer experience. Other times, severe security flaws will be brought to light. At Netflix especially, contingencies meant to mask system failure at the user level are now built into the foundation of the system.


Is Chaos Engineering Worth It?

Critics will say that no back-end game is worth impacting a customer's experience, even if only briefly and by accident. Those in favor of chaos engineering, however, will counter that these "planned outages" are meant to be much smaller than what AWS experienced in 2015. If a small, planned problem puts you in a position to prevent a much larger problem from ever presenting itself, planning the initial incident may be the best way to prepare. Fewer users will be affected in total. The math works out.

On the human side, the thinking is that engineers who have watched a server crash in front of them and dealt with it competently will be both more alert in the future and better equipped to handle whatever comes their way. The stronger system that results, in many cases, speaks for itself.

Silicon Valley: Where Dreams Go to Die

They say that if you want to make it big, you've got to be willing to kill your darlings, or, in this case, to be willing to let others kill them for you. When security is at the forefront from the very beginning of development, your team is much more likely to end up with something impenetrable and safe for customers to use freely.

Gamifying the workplace experience makes the prospect of success in this domain exciting; when the end result is one of quality, everybody gets to level up. My Netflix runs just fine, and we have only the madmen behind the chaos to thank for it.
