Chaos Engineering As A Service with AWS Fault Injection Simulator

In 2011, Netflix began migrating its infrastructure from a private cloud to AWS. During this transition, Netflix rethought its approach to infrastructure design, moving from a development model that assumed no outages to one that expects outages and designs systems to survive them.

This shift in mindset necessitated rigorous infrastructure testing to ensure the resilience of their designs. To achieve this, Netflix developed tools such as "Chaos Monkey" and popularized the concept of "chaos engineering."

Chaos engineering involves deliberately introducing failures and disruptions into a system to identify vulnerabilities and enhance its resilience. By subjecting their infrastructure to controlled chaos, Netflix aimed to proactively identify weaknesses and address them before they could lead to significant outages or service disruptions.

Through chaos engineering, Netflix sought to create a culture of resilience where system failures are anticipated and mitigated, ultimately improving the overall reliability and availability of their services.

By embracing this innovative approach, Netflix demonstrated their commitment to building robust, fault-tolerant systems that could withstand unexpected challenges in a dynamic cloud environment.

Chaos engineering concepts

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production." (Principles of Chaos Engineering)

Multiple application layers can be tested with chaos engineering tools:

  • Infrastructure (hypervisor)
  • Network (latency, outage)
  • Application itself (missing modules, unsupported runtime)
  • Parameters (missing parameter, wrong format)

All of these are common concepts that can be applied in the AWS world.
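To make the idea concrete, here is a minimal sketch in Python of a chaos experiment at the application layer. It is not taken from Chaos Monkey or any AWS tool: `FaultInjector`, `fetch_price`, `resilient_fetch` and `run_experiment` are hypothetical names invented for illustration. The sketch injects random failures (and optional latency) into a dependency call, then measures whether a simple retry strategy keeps the success rate close to its steady state.

```python
import random
import time


class FaultInjector:
    """Illustrative helper: adds latency and random failures around a call."""

    def __init__(self, failure_rate=0.0, latency_s=0.0, seed=None):
        self.failure_rate = failure_rate  # probability of an injected error
        self.latency_s = latency_s        # artificial delay before each call
        self._rng = random.Random(seed)   # seeded so experiments are repeatable

    def call(self, func, *args, **kwargs):
        if self.latency_s > 0:
            time.sleep(self.latency_s)    # simulate network latency
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")  # simulate an outage
        return func(*args, **kwargs)


def fetch_price(item):
    """Hypothetical dependency standing in for a remote service."""
    return {"apple": 1.2}[item]


def resilient_fetch(injector, item, retries=3, fallback=None):
    """Retry the call a few times, then fall back instead of crashing."""
    for _ in range(retries):
        try:
            return injector.call(fetch_price, item)
        except ConnectionError:
            continue
    return fallback


def run_experiment(failure_rate, trials=1000, seed=42):
    """Return the fraction of requests that still succeed under injected faults."""
    injector = FaultInjector(failure_rate=failure_rate, seed=seed)
    ok = sum(resilient_fetch(injector, "apple") is not None for _ in range(trials))
    return ok / trials


if __name__ == "__main__":
    # Hypothesis: with 30% injected failures, retries keep most requests working.
    print(f"success rate under faults: {run_experiment(0.3):.3f}")
```

The structure mirrors a typical experiment: define a steady state (the success rate without faults), inject a fault, and compare. If the measured rate drops further than expected, the experiment has exposed a weakness to fix.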

Chaos Engineering in AWS

In the previous section, we saw why chaos engineering emerged and which problems it addresses. In this section, we will see how to apply these concepts and tools to an infrastructure hosted on AWS.
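As a first taste of what this looks like on AWS, the sketch below builds the body of a Fault Injection Simulator experiment template as a plain Python dictionary. It assumes a simple scenario that is not part of this article: stopping one EC2 instance selected by tag and restarting it after five minutes. The function name, the tag, and the ARNs are hypothetical placeholders, and no call to AWS is made.

```python
def build_stop_instance_template(role_arn, tag_key, tag_value, alarm_arn=None):
    """Build an FIS experiment template body (illustrative sketch only).

    role_arn:  IAM role that FIS assumes to run the experiment (placeholder).
    alarm_arn: optional CloudWatch alarm that halts the experiment when it fires.
    """
    if alarm_arn:
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]  # no automatic safety stop

    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "roleArn": role_arn,
        "stopConditions": stop_conditions,
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


if __name__ == "__main__":
    # Placeholder values, assumed for illustration only.
    template = build_stop_instance_template(
        role_arn="arn:aws:iam::123456789012:role/example-fis-role",
        tag_key="Environment",
        tag_value="preprod",
    )
    print(sorted(template))
```

With boto3, a dictionary like this could likely be passed as keyword arguments to the `create_experiment_template` method of the `fis` client. Providing a CloudWatch alarm as a stop condition is the safer choice for anything beyond a sandbox, since it lets the experiment halt automatically when the workload degrades.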

AWS Well-Architected Framework

In 2018, AWS published its own framework for building performant, secure, resilient, and cost-effective infrastructure. This framework is based on five pillars:

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization

The Reliability pillar is the one we will focus on for a chaos engineering approach. Here is the introduction to this pillar from the official documentation:

The Reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.

To ensure compliance with industry frameworks and best practices, it is essential to thoroughly test our infrastructure throughout its entire lifecycle, from development to production. Considering that your production environment is hosted on AWS, it follows that your pre-production environment should also adhere to the same standards.

By conducting comprehensive testing at each stage of the infrastructure's lifecycle, we can verify its stability, security, and performance. This approach allows us to identify and address any potential issues or vulnerabilities before they impact the live production environment.

Testing the infrastructure in both pre-production and production environments provides valuable insights into its behavior and resilience under real-world conditions. It allows us to validate that the infrastructure performs as intended and remains compliant with the necessary standards and best practices.

Adhering to this approach ensures that your infrastructure is well-prepared to handle the demands of production and maintain a high level of reliability, security, and performance throughout its lifecycle.

In the next section, we will see how to execute performance testing in the AWS world.

Why choose FIS


Jeremy
