/shrug is not a strategy.

The Importance of Chaos Engineering in Production: Why You Should Test Your System's Limits

If you can’t see the need for Chaos Engineering, or you think it’s just a bit of fun and not to be taken seriously, or you think it’s one of those crazy ideas like Extreme Programming which you would never let loose on production - then allow me to put you right on a few things.


There are a few facets that I think are important in terms of the benefits of Chaos Engineering:

- It allows you to break things

- It allows you to simulate breaking things in production

- It allows you to simulate scenarios which infrequently occur in production

- It allows you to define a hypothesis so that it is clear what is being tested and what you expect the results to be

- (Bonus) It forces the team to write tests for their infrastructure

- It also allows the team to quickly answer questions which begin with “have you thought about…?” in the affirmative.

The original implementation of Chaos Engineering came from Netflix and is called Chaos Monkey. It randomly switches off AWS instances, so engineers have to make sure that their systems and services can cope when this happens. Since then, it has evolved into an entire Simian Army at Netflix and into several sets of tools which allow greater flexibility in the types of scenarios which can be tested, and which put a framework around running these tests and analysing the results. So the initial Chaos Monkey remit of ‘what happens if my EC2 instance dies?’ is the beginning rather than the end of the kinds of things which can be tested.
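To make the "define a hypothesis, then break something" idea concrete, here is a minimal sketch of a Chaos-Monkey-style experiment. It runs against a toy in-memory service rather than real AWS instances, and every name in it (`Service`, `Instance`, `kill_random_instance`, `run_experiment`) is made up for illustration, not taken from any real tool.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class Service:
    """Toy service: it can serve traffic while at least one instance is healthy."""
    instances: list[Instance] = field(default_factory=list)

    def handle_request(self) -> bool:
        return any(i.healthy for i in self.instances)


def kill_random_instance(service: Service, rng: random.Random) -> Instance:
    """The fault injection: switch off one healthy instance chosen at random."""
    victim = rng.choice([i for i in service.instances if i.healthy])
    victim.healthy = False
    return victim


def run_experiment(service: Service, rng: random.Random) -> dict:
    """Hypothesis: requests still succeed after losing any single instance."""
    steady_before = service.handle_request()
    victim = kill_random_instance(service, rng)
    steady_after = service.handle_request()
    return {
        "victim": victim.name,
        "steady_before": steady_before,
        "hypothesis_held": steady_before and steady_after,
    }


redundant = Service([Instance("web-1"), Instance("web-2"), Instance("web-3")])
single = Service([Instance("web-1")])

redundant_result = run_experiment(redundant, random.Random(42))
single_result = run_experiment(single, random.Random(42))
print(redundant_result["hypothesis_held"])  # True
print(single_result["hypothesis_held"])     # False
```

The useful part is less the killing and more the shape: a steady state checked before, one injected fault, and a stated expectation checked after. A real experiment would swap the toy service for health checks against your actual system.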

I can think of a few examples where I didn’t have Chaos Testing and wished I did:

Every. Single. Year. The clocks go back an hour. In the UK, that means 2 am arrives twice. “Have you thought about Daylight Saving Time?” asks your friendly SRE, as they will be the one sitting in a hotel room for over an hour as _something weird_ starts happening at 2 am and doesn’t finish until, erm, 2 am. And what about the impact on transactions during those times?
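The repeated hour is easy to reproduce without waiting for October. This sketch uses Python's standard `zoneinfo` module and the 2023 UK changeover as a worked example. It assumes the IANA time zone database is available on the machine (on systems without it, the `tzdata` package provides it). The helper `is_ambiguous` is my own name, not a library function.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")

# On 29 October 2023 UK clocks went back from 02:00 BST to 01:00 GMT,
# so the wall-clock time 01:30 happened twice that night.
first = datetime(2023, 10, 29, 1, 30, tzinfo=LONDON, fold=0)   # BST, UTC+1
second = datetime(2023, 10, 29, 1, 30, tzinfo=LONDON, fold=1)  # GMT, UTC+0


def is_ambiguous(naive: datetime, tz: ZoneInfo) -> bool:
    """True when a wall-clock time maps to two different UTC offsets."""
    return (naive.replace(tzinfo=tz, fold=0).utcoffset()
            != naive.replace(tzinfo=tz, fold=1).utcoffset())


# Subtracting two datetimes that share a tzinfo compares wall-clock values
# only, so the two occurrences look like the same moment.
wall_clock_gap = second - first

# Converting to UTC first shows the hour that really passed between them.
real_gap = second.astimezone(timezone.utc) - first.astimezone(timezone.utc)

print(wall_clock_gap)  # 0:00:00
print(real_gap)        # 1:00:00
print(is_ambiguous(datetime(2023, 10, 29, 1, 30), LONDON))  # True
print(is_ambiguous(datetime(2023, 10, 29, 12, 0), LONDON))  # False
```

Anything that orders or deduplicates transactions by local wall-clock time will treat those two moments as one, which is exactly the kind of _something weird_ worth provoking deliberately in a test.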

Eyjafjallajökull - I didn’t fall on my keyboard to do some rudimentary random chaos testing - that’s the name of the volcano that left a massive trail of ash across most of Northern Europe in 2010.


This then meant that no planes were taking off from the airport I was working with, which meant that no planes were on the runway, which meant that no flight plans or radar data were coming through. Which meant I had several days of being paged constantly. Chaos engineering would have allowed us both to test what happens when no data is coming through and to identify in advance the difference between what happens when a feed goes down and what happens when a massive volcano erupts inconveniently.
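One way to tell "the feed is down" apart from "the feed is fine but there is nothing to report" is sketched below. It assumes the feed sends some kind of heartbeat independently of its data, which is my assumption and not something the system I worked on necessarily had. The names (`FeedState`, `classify_feed`) and the timeout values are hypothetical.

```python
from datetime import datetime, timedelta
from enum import Enum


class FeedState(Enum):
    HEALTHY = "healthy"  # data is arriving
    QUIET = "quiet"      # link is alive, but there is nothing to report
    DOWN = "down"        # the link itself has gone silent


def classify_feed(
    now: datetime,
    last_data: datetime,
    last_heartbeat: datetime,
    data_timeout: timedelta = timedelta(minutes=5),
    heartbeat_timeout: timedelta = timedelta(minutes=1),
) -> FeedState:
    """Check the heartbeat first: a dead link matters more than a quiet one."""
    if now - last_heartbeat > heartbeat_timeout:
        return FeedState.DOWN
    if now - last_data > data_timeout:
        return FeedState.QUIET
    return FeedState.HEALTHY


now = datetime(2010, 4, 15, 12, 0)

# Volcano scenario: heartbeats keep coming, but no flight data for hours.
volcano = classify_feed(now, now - timedelta(hours=6), now - timedelta(seconds=10))

# Outage scenario: neither data nor heartbeats for hours.
outage = classify_feed(now, now - timedelta(hours=6), now - timedelta(hours=6))

print(volcano)  # FeedState.QUIET
print(outage)   # FeedState.DOWN
```

A chaos test would then inject each scenario on purpose and check that only the outage pages somebody.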

Spent the last three months writing reams and reams of Terraform? The thing about code is that it needs to be tested. The thing about infrastructure as code is that it needs to be tested. It turns out there’s no defined way, having set up an ALB on AWS, to check that your auto-scaling group scales when you hope it will. Chaos engineering can be used to simulate an increase in traffic to your website, for example, and verify that it copes and scales as expected.
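The shape of such a scaling test can be sketched with a toy model. To be clear about the assumptions: `ToyScalingGroup` is a stand-in I made up, and its sizing rule is deliberately simplistic rather than AWS's actual scaling algorithm. Against a real group you would drive traffic with a load generator and read the group's capacity back from your cloud provider instead.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyScalingGroup:
    """Stand-in for an auto-scaling group; not AWS's actual scaling algorithm."""
    min_size: int
    max_size: int
    target_rps_per_instance: float
    size: int = 0

    def __post_init__(self) -> None:
        self.size = max(self.size, self.min_size)

    def evaluate(self, total_rps: float) -> int:
        """Resize so each instance sees at most the target load, within bounds."""
        wanted = math.ceil(total_rps / self.target_rps_per_instance)
        self.size = min(self.max_size, max(self.min_size, wanted))
        return self.size


def survives_spike(group: ToyScalingGroup, traffic_steps: list[float]) -> bool:
    """Hypothesis: per-instance load never exceeds the target during the spike."""
    for rps in traffic_steps:
        size = group.evaluate(rps)
        if rps / size > group.target_rps_per_instance:
            return False
    return True


spike = [100, 400, 1600, 400]  # requests per second over time

well_sized = ToyScalingGroup(min_size=2, max_size=20, target_rps_per_instance=100)
capped = ToyScalingGroup(min_size=2, max_size=10, target_rps_per_instance=100)

print(survives_spike(well_sized, spike))  # True
print(survives_spike(capped, spike))      # False
```

The second group fails because its `max_size` is too low for the spike, the sort of quiet misconfiguration that sits in a Terraform file unnoticed until the traffic actually arrives.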

These are just a few examples that hopefully show the kind of thing which chaos engineering can help with.

Chaos Engineering checks that your system’s behaviour matches what you expect. A side benefit is that you will need to make sure that your system can be tested and that you know how it works. One final use of Chaos Engineering is documenting your infrastructure as code as you go, in the same way that TDD tests document other types of code. So if you find a little quirk somewhere in your system, then you have a definite source of truth which you can go back to in the shape of your Chaos Engineering tests.
