How to Implement Chaos Engineering for Organizational Resilience

By Devoxx · 2024-02-24

Chaos engineering plays a crucial role in organizational resilience, especially in a demanding business landscape. It involves simulating unexpected events and failures to test how systems and teams respond and to improve readiness. In this post, we explore how chaos engineering can be implemented to enhance operational outcomes and reduce risks.

Chaos Engineering for People: A Story of Innovation

The speaker, Chris, gave a brief background, mentioning his work with AWS in Dublin, Ireland, and with XM, an online broker with a global client base of over 5 million.

He described the company's use of more than 60 applications on Kubernetes and its extensive use of AWS services such as Lambda, MSK, and Glue. The company also uses third-party tools such as Kpow, Axonius, and Fluorescent Management.

Chris highlighted the company's adherence to cloud standards and the establishment of a resilient system for services and applications, emphasizing the importance of supporting their people through innovative practices.

He described the process of 'making popcorn for chaos' and the six key steps the team added to its chaos methodology: steady state hypothesis, actions, review, evidence, logging, and takeaway points.
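
The six steps can be captured as a simple record so that every experiment is written up the same way. The following is a minimal sketch, not the team's actual tooling: the class name, field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChaosExperimentRecord:
    """One experiment, organised around the six steps named in the talk.

    The class and field names are hypothetical; only the step names come
    from the talk.
    """
    steady_state_hypothesis: str                          # what "normal" looks like
    actions: List[str]                                    # faults injected during the run
    review: str = ""                                      # what the team observed
    evidence: List[str] = field(default_factory=list)     # graphs, tickets, screenshots
    logging: List[str] = field(default_factory=list)      # where the logs were captured
    takeaways: List[str] = field(default_factory=list)    # follow-up items

    def is_complete(self) -> bool:
        """A record is complete once every one of the six steps has content."""
        return all([
            self.steady_state_hypothesis, self.actions, self.review,
            self.evidence, self.logging, self.takeaways,
        ])


# Hypothetical example values, for illustration only.
record = ChaosExperimentRecord(
    steady_state_hypothesis="The service answers health checks during the fault",
    actions=["Reboot one EC2 instance"],
)
print(record.is_complete())  # False: review, evidence, logging, takeaways are still empty
```

A completeness check like this is one way to make sure the review and takeaway steps are not skipped when workloads are heavy.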

Incorporating Life Cycles into Major Processes for Improved Efficiency

Incorporating life cycles into major processes is essential to prevent mistakes from being repeated and to ensure nothing slips through the cracks, especially in the face of overwhelming workloads.

The team applies life cycles to solution design, cost performance reviews, and even application reviews, which shows how broadly the approach can be used.

Simulating scenarios such as AZ failure, pod deletion, and request swarming has proven effective for testing the automation's response time and the system's ability to adapt to unforeseen events.

The service desk lab is a crucial component. It gives first responders specialized training, onboarding, and simulations that prepare them to respond effectively to real incidents.

Overall, integrating life cycles into major processes and conducting simulations contributes to organizational resilience, agility, and efficiency, ultimately leading to enhanced operational outcomes and reduced risks.

Implementing Chaos Engineering Solution with AWS FIS

The service desk manager asked the team to simulate alerts without the knowledge of engineers, agents, or responders, and that request led to the solution described here.

They used an API client such as Postman or Node-RED to send a request to an API Gateway endpoint. The endpoint invokes a Lambda function, which in turn triggers AWS FIS (Fault Injection Simulator), the AWS service designed for chaos engineering.
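
The Lambda step in this chain can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's code: the scenario names, the template ID placeholder, and the request body shape are hypothetical. The only AWS call used is boto3's FIS `start_experiment`, and the client is passed in so the logic can be exercised without AWS credentials.

```python
import json
import uuid


def make_handler(fis_client, allowed_templates):
    """Build a Lambda handler that starts an FIS experiment from an API call.

    fis_client: any object exposing boto3's FIS `start_experiment` method.
    allowed_templates: hypothetical mapping of scenario name -> FIS
        experiment template ID; anything not listed is rejected.
    """
    def handler(event, context):
        # API Gateway proxy integrations deliver the request body as a string.
        body = json.loads(event.get("body") or "{}")
        template_id = allowed_templates.get(body.get("scenario"))
        if template_id is None:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "unknown scenario"})}
        response = fis_client.start_experiment(
            experimentTemplateId=template_id,
            clientToken=str(uuid.uuid4()),  # idempotency token
        )
        return {"statusCode": 202,
                "body": json.dumps({"experimentId": response["experiment"]["id"]})}
    return handler


def lambda_handler(event, context):
    """Lambda entry point. boto3 is imported here so the sketch loads offline."""
    import boto3
    # "TEMPLATE_ID_PLACEHOLDER" is a stand-in, not a real template ID.
    templates = {"reboot-ec2": "TEMPLATE_ID_PLACEHOLDER"}
    return make_handler(boto3.client("fis"), templates)(event, context)
```

Restricting the handler to a fixed map of scenarios means a caller can only start experiments that were designed and reviewed in advance.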

FIS can call a custom script through SSM, which makes it possible to restart EC2 instances, fail over databases, or modify security groups directly via an API call. Meanwhile, Zabbix, New Relic, or both monitor the environment and raise incident alerts.
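
To make the SSM step concrete, here is a hedged sketch of an FIS experiment template that runs a custom SSM document on tagged EC2 instances. The function name, target and action labels, tag values, and every ARN in the usage example are placeholders invented for illustration. The dictionary is shaped as keyword arguments for boto3's FIS `create_experiment_template`, and `aws:ssm:send-command` is the FIS action that runs an SSM document.

```python
import json


def build_ssm_experiment_template(role_arn, document_arn, tag_key, tag_value, alarm_arn):
    """Return kwargs for boto3's FIS `create_experiment_template` call.

    All arguments are supplied by the caller; nothing here is taken from
    the team's real environment.
    """
    return {
        "description": "Run a custom SSM document on one tagged EC2 instance",
        "roleArn": role_arn,  # role that FIS assumes to run the experiment
        # Stop the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "lab-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "run-custom-script": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": document_arn,
                    "documentParameters": json.dumps({}),  # JSON string, empty here
                    "duration": "PT5M",  # ISO 8601 duration: five minutes
                },
                "targets": {"Instances": "lab-instances"},
            }
        },
    }


# Placeholder ARNs, for illustration only.
template = build_ssm_experiment_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",
    document_arn="arn:aws:ssm:eu-west-1:123456789012:document/example-script",
    tag_key="chaos-lab",
    tag_value="true",
    alarm_arn="arn:aws:cloudwatch:eu-west-1:123456789012:alarm:example-alarm",
)
print(sorted(template["actions"]))
```

FIS also ships built-in actions such as `aws:ec2:reboot-instances` and `aws:rds:failover-db-cluster`, so a custom script is only needed for faults those actions do not cover.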

Upon receiving the monitoring alert, the incident response team follows predefined runbooks to address the situation, with specific alerts from the service desk lab requiring confirmation from level two or the manager before taking action.
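
The confirmation rule for lab alerts can be expressed as a small routing function. This is only a sketch of the rule as described: the `Alert` type, the source labels, and the returned strings are assumptions, and a real setup would likely live in the alerting or ticketing tool rather than in standalone code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str                 # runbook the alert maps to (hypothetical naming)
    source: str               # e.g. "production" or "service-desk-lab" (hypothetical labels)
    confirmed: bool = False   # set once level two or the manager has confirmed


def next_step(alert: Alert) -> str:
    """Decide what the responder should do next for a given alert."""
    # Alerts raised by the service desk lab need confirmation before any action.
    if alert.source == "service-desk-lab" and not alert.confirmed:
        return "escalate: request confirmation from level two or the manager"
    return "execute runbook: " + alert.name


print(next_step(Alert("db-failover", "service-desk-lab")))
print(next_step(Alert("db-failover", "service-desk-lab", confirmed=True)))
```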

After completing the solution design lifecycle, the team deployed the solution and handed over the operation to the relevant team.

Improving Cloud Operations with Onboarding and Security Measures

The team demonstrated how to use an API calling program and saved the configuration as a preset, allowing flexibility in usage.

The team shifted security left, ensuring that access to the API Gateway is controlled by resource policies for authentication and authorization.
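
As an illustration of what such a resource policy can look like, the sketch below builds one that allows invocation only from a list of source IP ranges. The talk does not say which conditions the team used, so the IP allowlist is an assumption, and the CIDR range shown is a documentation placeholder.

```python
import json


def build_resource_policy(allowed_cidrs):
    """Build an API Gateway resource policy that allows invocation only
    from the given source IP ranges (an assumed restriction, for illustration)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": "execute-api:/*",  # shorthand for this API's stages and methods
                "Condition": {"IpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
            }
        ],
    }


# 203.0.113.0/24 is a reserved documentation range, not a real office network.
policy = build_resource_policy(["203.0.113.0/24"])
print(json.dumps(policy, indent=2))
```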

Access for the Lambda function was controlled as well: it performs only the necessary actions and runs with restricted permissions.
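
One way to restrict the function is an IAM policy that permits nothing except starting a specific FIS experiment template. The sketch below is an assumption about how that could be done, not the team's actual policy. The region, account ID, and template ID are placeholders.

```python
import json


def build_lambda_fis_policy(region, account_id, template_id):
    """Build a least-privilege IAM policy for the Lambda execution role:
    it may start experiments from one template and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "fis:StartExperiment",
                "Resource": [
                    # The template the function is allowed to start.
                    f"arn:aws:fis:{region}:{account_id}:experiment-template/{template_id}",
                    # The experiment resource that FIS creates when it starts.
                    f"arn:aws:fis:{region}:{account_id}:experiment/*",
                ],
            }
        ],
    }


# Placeholder identifiers, for illustration only.
lambda_policy = build_lambda_fis_policy("eu-west-1", "123456789012", "TEMPLATE_ID_PLACEHOLDER")
print(json.dumps(lambda_policy, indent=2))
```

With a policy like this, the function cannot create or modify templates, so the set of possible faults stays under the control of whoever owns the templates.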

The team used SSM to create the scripts and CloudWatch to log metrics from the Fault Injection Simulator, which improved monitoring.

Onboarding for engineers became faster and gave new engineers confidence and familiarity with cloud operations.

The exercises also exposed gaps in runbooks and configuration automation, which led to plans for improvements.

Improving Resiliency and Readiness in the Face of Demanding Business

The company emphasizes the importance of testing resiliency and readiness in the face of a demanding business landscape, particularly in the Forex industry.

They have implemented a chaos testing approach to simulate network failures and ensure that IT teams are prepared to handle unexpected challenges.

By creating multi-AZ and active-active databases, the company aims to understand how applications respond to network failures and automate chaos events to further enhance their preparedness.

Through regular drills and exercises, the company has observed significant improvements in response times. For example, backup restoration time dropped from 2 hours to just 12 minutes, a tenfold reduction.

These efforts have proven beneficial not only in response to recent attacks but also in high-traffic situations, organizational turnover, and knowledge transfer among team members.

Additionally, the company plans to showcase its solutions and job opportunities at an upcoming event, enticing potential candidates with prizes such as a PlayStation 5, an Oculus VR headset, and a Roborock robot vacuum.

Conclusion

Incorporating chaos engineering into your organizational processes can significantly improve resilience and preparedness. By simulating unexpected events and failures, you can enhance operational outcomes and reduce risks, ultimately leading to a more agile and efficient organization.