Brief Introduction to Resiliency Testing Using Chaos Engineering Concepts:

Testing the reliability of a system is one of the major practices in today's DevOps world, and resiliency testing is a great way to do it. In this blog, I will briefly discuss the resiliency testing approach and how we can use Chaos Engineering concepts for it.

Resiliency testing is mainly done in non-production environments. As part of this testing, we should run tests that regularly inject failures into pre-production environments. Hypothesize how your workload will react to a failure, then compare your hypothesis with the test results and iterate if they don't match. In the cloud, you can test how your workload fails and validate your recovery procedures. You can use automation to simulate different failures or to re-create scenarios that led to failures before. This exposes failure pathways that you can test and fix before a real-life failure occurs, thus reducing risk.
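
The hypothesize, inject, compare loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: call_service is a hypothetical stand-in for a request to your workload, the retry wrapper is the hypothetical recovery procedure being validated, and the failure rates and thresholds are made-up numbers, not recommendations.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    hypothesis: str
    expected_min_success_rate: float
    observed_success_rate: float

    @property
    def hypothesis_holds(self) -> bool:
        # Compare the hypothesis with what the test actually observed.
        return self.observed_success_rate >= self.expected_min_success_rate


def call_service(failure_rate: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for one request to the workload under test.

    Returns True on success; fails with probability `failure_rate`.
    """
    return rng.random() >= failure_rate


def call_with_retry(failure_rate: float, rng: random.Random, attempts: int = 3) -> bool:
    """Hypothetical recovery procedure: retry a failed call up to `attempts` times."""
    return any(call_service(failure_rate, rng) for _ in range(attempts))


def run_experiment(
    injected_failure_rate: float,
    expected_min_success_rate: float,
    requests: int = 1000,
    seed: int = 42,
) -> ExperimentResult:
    """Inject failures at the given rate and measure the observed success rate."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    successes = sum(call_with_retry(injected_failure_rate, rng) for _ in range(requests))
    return ExperimentResult(
        hypothesis=(
            f"with {injected_failure_rate:.0%} injected failures, at least "
            f"{expected_min_success_rate:.0%} of requests still succeed"
        ),
        expected_min_success_rate=expected_min_success_rate,
        observed_success_rate=successes / requests,
    )


if __name__ == "__main__":
    result = run_experiment(injected_failure_rate=0.3, expected_min_success_rate=0.95)
    print(result.hypothesis)
    print("observed:", result.observed_success_rate, "holds:", result.hypothesis_holds)
```

If the hypothesis does not hold, that is the signal to iterate: either fix the recovery procedure or revise the hypothesis, then run the experiment again.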

Now, coming back to Chaos Engineering: it is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. In pre-production and testing environments, chaos engineering should be done regularly and be part of your CI/CD cycle. In production, teams must take care not to disrupt availability and should use Game Days as a way to control the risk of chaos engineering in production.

Principles of Chaos Engineering:

  • Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.

  • We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.

  • An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. Please refer to 'Principles of Chaos' for more details.

Failure Injection - Tools and Methods:

A variety of open-source tools exist to assist in the practice of Chaos Engineering in your organization. Foremost among these is the Simian Army, which was developed by Netflix to test the reliability and security of its AWS infrastructure. The Simian Army includes Chaos Monkey, which can be used to find services in production and randomly disable their instances, as well as Chaos Gorilla, which disables an entire availability zone. Finally, Chaos Kong disables an entire AWS region.

Other tools include Pumba, Blockade and Tugbot, three options for chaos testing in Docker; Chaos Dingo for Microsoft Azure; Monkey-Ops for OpenShift; Chaos Lemur for BOSH-managed environments; Chaos HTTP Proxy for introducing failures into HTTP requests via a proxy server; and Chaos Lambda, which randomly terminates instances in AWS Auto Scaling groups.

Netflix has set some great examples of how resiliency testing can be done using Chaos Engineering concepts. Even though all of the Netflix services are hosted in AWS, the company realized that the large scale of its operations makes failures unavoidable. To prepare for these failures, Netflix developed its own tool to create random disruptions in the system and test it for resilience. The tool was designed to simulate "unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables" and was aptly called Chaos Monkey. By identifying weaknesses in its systems, Netflix can then build automated recovery mechanisms to deal with them should they occur again in the future.

The tool is run while Netflix continues to operate its services, although in a controlled environment and during ideal time frames. By only running Chaos Monkey during US business hours on weekdays, the company ensures that engineers have the maximum capacity for dealing with the disruptions and that server loads are minimal compared to peak customer usage times.
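
A schedule guard of this kind is easy to put in front of your own failure injection jobs. The sketch below is not how Chaos Monkey implements it; chaos_allowed and the 09:00 to 15:00 window are hypothetical choices for illustration, and the function assumes the datetime it receives is already expressed in the on-call team's local time zone.

```python
from datetime import datetime, time

# Assumed window: adjust to the hours when your engineers are available.
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)


def chaos_allowed(now: datetime) -> bool:
    """Return True only on weekdays inside the agreed business-hours window.

    `now` is assumed to be in the team's local time zone.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_window = WINDOW_START <= now.time() < WINDOW_END
    return is_weekday and in_window


if __name__ == "__main__":
    if chaos_allowed(datetime.now()):
        print("Inside the window: failure injection may run.")
    else:
        print("Outside the window: skipping failure injection.")
```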

After this early success, Netflix quickly developed additional tools to test other kinds of failures and conditions. Among these tools were Latency Monkey, Conformity Monkey, Doctor Monkey and others, collectively known as the Netflix Simian Army.

Resiliency testing with the Simian Army has since become a popular approach for many companies, and in 2016 Netflix released Chaos Monkey 2.0 with improved UX and integration with Spinnaker.

Some of the common methods to inject failures into your AWS-hosted applications:

  • Inject failures at the resource level by exhausting CPU, memory, I/O, disk space and hard drives.

  • Inject failures at the network level by introducing latency and packet loss, blocking APIs and tampering with critical configuration files.

  • Inject failures at the application, process and service level by killing processes, crashing databases, etc.

  • Inject failures in the serverless world. For example, inject failures into Lambda functions by introducing delays, throttling execution via concurrency limits, etc.

  • Inject failures at the infrastructure level by randomly stopping EC2 instances, randomly making an availability zone unavailable, etc.
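
To make a couple of these methods concrete, here is a minimal sketch of application-level failure injection in Python. Everything in it is an assumption made for illustration: inject_faults, InjectedFailure, pick_victim, fetch_profile and the instance IDs are hypothetical names, not part of any tool mentioned above, and no cloud API is called. The decorator adds latency and random errors to a function call, and pick_victim only chooses which instance a real experiment might stop.

```python
import functools
import random
import time


class InjectedFailure(RuntimeError):
    """Raised when the wrapper deliberately fails a call."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Decorator that delays each call and fails a fraction of calls.

    `rng` and `sleep` are injectable so that tests can be deterministic and fast.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated network or service latency
            if rng.random() < error_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def pick_victim(instance_ids, rng=None):
    """Choose one instance at random; stopping it is left to your own tooling."""
    rng = rng or random.Random()
    return rng.choice(list(instance_ids))


@inject_faults(latency_s=0.05, error_rate=0.2)
def fetch_profile(user_id):
    """Hypothetical application call wrapped with fault injection."""
    return {"user_id": user_id}


if __name__ == "__main__":
    for attempt in range(5):
        try:
            print(fetch_profile(attempt))
        except InjectedFailure as exc:
            print("caller must handle:", exc)
    print("would stop:", pick_victim(["i-hypothetical-1", "i-hypothetical-2"]))
```

Wrapping calls this way keeps the experiment inside your own code, which makes it a reasonable first step in a test environment before moving on to infrastructure-level failures.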


I hope this blog gives you some idea of the Chaos Engineering concept. It will help you greatly to uncover the unknowns in your system and fix them before they cause problems in production. It's all about running experiments in a controlled environment and applying what you learn to improve your applications and build confidence.

Please do share your feedback by commenting on the home page of this site or by sending a message. Thanks for reading :).
