Chaos engineering is still considered to be esoteric, obscure or downright dangerous. It's not, and it's in use by large corporates, startups, and unicorns (https://www.gremlin.com/customers/). It has gone from being very specifically about destroying EC2 instances (to see if that stops people watching Squid Game) to a fully-fledged suite of tools that can be used however works for your team. Customers of Gremlin, one of the main chaos tools, use it for all sorts of reasons.
If you give your devs a blueprint for your systems it becomes trivially easy to add compliance and chaos testing. Rahul Arya at JP Morgan Chase discusses this in “Embedding chaos into your pipeline”. JP Morgan Chase is a huge bank that looks after an enormous amount of assets, both the bank's own and their customers' funds: they have $27T in assets under custody and deal with 465PB of data.
One of the main challenges they have had to deal with is the manual compliance testing required when provisioning services. Their end goal is to be able to deploy everything within an hour.
The application developers themselves use the Gremlin console directly and have access to the Gremlin API, so they can check their own work to make sure there are no security or compliance issues. It's easy for developers to use because everything is preinstalled into their architecture. Since the architecture for each application follows a template, the chaos testing can also follow a set template. This simplifies things and lets teams get used to integrating chaos testing into the rest of their workflow.
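To make that concrete, here is a rough sketch of what a templated chaos check driven through the Gremlin API might look like. The endpoint, payload shape, tag convention and attack arguments are my assumptions based on Gremlin's public API docs, not anything from the JP Morgan Chase talk, so treat it as an illustration rather than a recipe.

```python
# Rough sketch only: launch a templated Gremlin attack for one application.
# The endpoint, payload shape, tag convention and attack args are assumptions
# based on Gremlin's public API docs, not JP Morgan Chase's actual pipeline.
import os
import requests

GREMLIN_API = "https://api.gremlin.com/v1"
API_KEY = os.environ["GREMLIN_API_KEY"]  # per-team key, kept out of source control


def run_cpu_attack(app_name: str, length_s: int = 120) -> str:
    """Launch a CPU attack against hosts tagged with the given application name."""
    payload = {
        "command": {"type": "cpu", "args": ["-l", str(length_s), "-c", "1"]},
        "target": {"type": "Random", "tags": {"application": app_name}},
    }
    resp = requests.post(
        f"{GREMLIN_API}/attacks/new",  # a teamId query parameter may also be required
        json=payload,
        headers={"Authorization": f"Key {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text  # the attack id, handy for halting the attack or reporting on it


if __name__ == "__main__":
    print(run_cpu_attack("payments-service"))  # hypothetical application tag
```

Because every application follows the same template, the same small script (with the tag swapped out) works for every team, which is what makes the whole thing feel routine rather than scary.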
As well as testing their own code, it's also important to keep an eye on external feeds and ensure the system can deal with outages to them.
While working as an SRE I sometimes wished things were as simple as a clean outage; half the time it's more of a brownout, where messages from an external feed start to come in slowly. This type of issue can also be simulated using Gremlin.
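Gremlin handles this with its network latency attacks, but the idea is easy enough to prototype by hand. The sketch below (class and method names like SlowFeed and fetch_message are made up for illustration) wraps a feed client and injects a creeping delay, so you can see whether your consumer times out gracefully or just quietly backs up.

```python
# Hand-rolled brownout sketch: wrap an external feed and make it progressively
# slower, then watch how the consumer copes. Gremlin would do this at the
# network layer; this is the same idea in miniature.
import random
import time


class SlowFeed:
    """Wrapper around a real feed client that adds a creeping delay."""

    def __init__(self, feed, initial_delay_s=0.05, growth=1.5, max_delay_s=5.0):
        self.feed = feed
        self.delay = initial_delay_s
        self.growth = growth
        self.max_delay = max_delay_s

    def fetch_message(self):
        time.sleep(self.delay)  # the "brownout": not down, just slow
        self.delay = min(self.delay * self.growth, self.max_delay)
        return self.feed.fetch_message()


class FakeFeed:
    """Stand-in for the real external feed, for the demo run below."""

    def fetch_message(self):
        return {"price": round(random.uniform(90, 110), 2)}


if __name__ == "__main__":
    feed = SlowFeed(FakeFeed())
    for _ in range(8):
        start = time.monotonic()
        msg = feed.fetch_message()
        print(f"got {msg} after {time.monotonic() - start:.2f}s")
```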
Chaos can also be used to test in-region and out-of-region outages. Devs at JP Morgan Chase have a checklist they can go through to ensure reliability is baked into their service. This sounds like a PITA, but offloading these common issues and areas of concern to a checklist lets developers focus on their main dev work while keeping the checklist in their back pocket, so that once they've reached a stable point in the dev cycle they can ask themselves “have we thought about this? What about that?”
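As a rough idea of how a checklist like that can be made machine-readable, here's a tiny sketch; the items and field names are my own invented examples, not the real JP Morgan Chase list.

```python
# Sketch of a machine-readable reliability checklist (items are illustrative).
# A CI step can fail the build until every item has been answered, so the
# questions get asked at a stable point in the dev cycle, not after an outage.
CHECKLIST = {
    "in_region_failover_tested": None,  # None = not yet answered
    "out_of_region_failover_tested": None,
    "external_feed_brownout_tested": None,
    "alerting_reviewed": None,
}


def unanswered(checklist: dict) -> list[str]:
    """Return the checklist items that nobody has signed off yet."""
    return [item for item, answer in checklist.items() if answer is None]


if __name__ == "__main__":
    missing = unanswered(CHECKLIST)
    if missing:
        raise SystemExit(f"Reliability checklist incomplete: {', '.join(missing)}")
```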
It seems that there is a need to go through a standardisation phase to ensure you really understand the application. This points to one way to ease into chaos engineering, as the building blocks of each app can become smaller and more detailed over time.
As any dev will tell you, a proof of concept quickly turns into production code, and sometimes you have to retrofit testing to ensure that the system you have ended up with meets your requirements for a production application in terms of availability and stability. It's also possible that your services are so tightly coupled that the whole system is affected by a single outage.
These things are natural in a fast-changing environment, though this type of technical debt leads to fear of making changes to the system. It also means that an outage often has a wide impact, and repeated outages lead to increased callouts, which can take a massive toll on the on-call team. Sometimes your team doesn't have the time or budget to invest in system improvements or other technologies when you are busy dealing with issues.
Having introduced SRE practices and built up post-mortem information, it is possible to identify areas where you can improve. This means you can use data to decide what to focus on.
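As a toy example of what "using data to decide" can look like, here's a sketch that totals downtime by contributing cause across post-mortems; the records and field names are invented for illustration, and the biggest buckets become the first candidates for chaos experiments.

```python
# Toy sketch: rank areas for investment by total downtime per contributing
# cause, pulled from post-mortem records. All data here is invented.
from collections import Counter

postmortems = [
    {"cause": "external feed outage", "minutes_down": 42},
    {"cause": "config change", "minutes_down": 15},
    {"cause": "external feed outage", "minutes_down": 63},
    {"cause": "capacity", "minutes_down": 20},
]

impact = Counter()
for pm in postmortems:
    impact[pm["cause"]] += pm["minutes_down"]

# The biggest buckets are the first candidates for chaos experiments.
for cause, minutes in impact.most_common():
    print(f"{cause}: {minutes} minutes of downtime")
```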
Many teams reinvent monoliths as microservices, which is often helpful when done for the right reasons but brings with it more reliance on effective inter-service comms. To deal with that additional reliance on comms between services as part of moving towards microservices, it is possible to add an extra layer of caching between the system itself and the integrations it depends upon. That caching layer can then be tested without affecting the internal system.
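A minimal sketch of that pattern, with invented class names: a proxy that caches the last good response from an external integration and serves it when the upstream fails, so the internal system degrades gracefully. The proxy itself then becomes a nice, isolated target for chaos tests: kill or slow the upstream and check that callers still get (slightly stale) answers.

```python
# Sketch of a caching layer in front of an external integration: serve the
# last known good response when the upstream call fails. Names are illustrative.
import time


class CachingFeedProxy:
    def __init__(self, upstream, max_stale_s=300):
        self.upstream = upstream
        self.max_stale_s = max_stale_s
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        try:
            self._value = self.upstream.get()
            self._fetched_at = time.monotonic()
            return self._value
        except Exception:
            # Upstream is down or browning out: fall back to cached data if it
            # is still fresh enough, otherwise surface the failure.
            if self._value is not None and time.monotonic() - self._fetched_at < self.max_stale_s:
                return self._value
            raise
```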
Performance testing and load testing go hand in hand with chaos engineering: a fault injected into an idle system tells you far less than one injected while the system is under realistic load.
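As a sketch of what that pairing can look like in practice, here's a minimal load generator you could run alongside an experiment to see whether an injected fault actually moves user-facing latency; the URL and concurrency settings are placeholders, not anything from the talk.

```python
# Minimal load generator to run alongside a chaos experiment: fire requests at
# an endpoint and report median and p95 latency. URL and thread count are
# placeholders for your own environment.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://staging.example.com/health"  # assumed endpoint


def timed_request(_):
    start = time.monotonic()
    try:
        requests.get(URL, timeout=5)
    except requests.RequestException:
        pass  # failed requests still count toward the timing picture
    return time.monotonic() - start


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=10) as pool:
        latencies = sorted(pool.map(timed_request, range(200)))
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"median={statistics.median(latencies):.3f}s p95={p95:.3f}s")
```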
Having established this way of working, it is also possible to work with suppliers to ensure that they support chaos testing. Developers and other technical users can become confident running both the service and the tests in production.
Red/blue practice means it is possible to identify tests which you wouldn't think of yourself, e.g. working with platform or SRE teams and having them try to break your systems or some of their components.
“Chaos engineering reduces the need for heroes”
It is worth investing in upskilling your team. Then you should be able to get the people who wrote the original version to rewrite it and move it to a microservices architecture (for example).
As Texas retailer HEB says, “don't focus on the lifeboats”.
Having brought in the chaos engineering capability, teams seem to want to spread it throughout the rest of the organisation.
As anyone who has read Accelerate will know, it is possible to have both an increase in uptime and an improvement in time to market. In fact, the two are very highly correlated. Not only does everyone want both, they enable each other, or perhaps they are both enabled by the same underlying practices.
Having identified problem areas, it is then possible to follow a test-first approach and look at which other tests need to be developed. This is one of the benefits of putting a chaos engineering framework like Gremlin in place: you can flip it back around, and once you have established a toolset you can look around for other things that can be tested. It's not that you have a hammer and everything looks like a nail; you have a toolkit and you look around for where else those tools can be applied.
This is another reason to get gamedays in place: once they are running, people understand the tool and can then look for somewhere else to use it.
There is a big push at the moment towards providing self-service architecture, and chaos engineering is no exception. That often means not only automation but also making those tools usable. Self-service has the added advantage that people are not bottlenecked on the DevOps team. Even though more of the choices about when to run testing are given to the devs, it is still important to ensure that devs cannot break all the things.
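As a sketch of what such a guardrail might look like, here's a small validation function a platform team could put in front of the chaos tooling; the thresholds and field names are invented for illustration.

```python
# Sketch of a self-service guardrail: devs can launch their own experiments,
# but not take out everything at once. Thresholds and field names are invented.
MAX_BLAST_RADIUS_PCT = 25
ALLOWED_ENVIRONMENTS = {"dev", "staging"}  # prod runs stay gated behind review


def validate_experiment(request: dict) -> None:
    env = request["environment"]
    pct = request["target_percentage"]
    if env not in ALLOWED_ENVIRONMENTS and not request.get("approved_by_sre"):
        raise ValueError(f"{env} experiments need SRE approval before self-service launch")
    if pct > MAX_BLAST_RADIUS_PCT:
        raise ValueError(f"blast radius {pct}% exceeds the {MAX_BLAST_RADIUS_PCT}% limit")


if __name__ == "__main__":
    validate_experiment({"environment": "staging", "target_percentage": 10})  # fine
    try:
        validate_experiment({"environment": "prod", "target_percentage": 50})
    except ValueError as err:
        print(f"blocked: {err}")
```

A check like this keeps the self-service promise (no waiting on the DevOps team for routine experiments) while still putting a ceiling on how much one team can break in an afternoon.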