CHAOS ENGINEERING: Companies, People, Tools & Practices

Experiments In Production

End-User Companies

People

Tools

Training Experiments

Experiment Tools & Framework

Disaster Recovery Testing

Tools

Waterbear
“application resilience” as a service

Experiment

Latency Monkey
Introduces delays at the communication layer to test how well a system tolerates degraded performance from an external component it depends on, up to simulating a complete outage (an infinite delay), without having to ask the partner concerned to cut its service.
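The delay-injection idea can be sketched as a wrapper around a call to a dependency. This is a minimal illustration, not the actual tool: the names `with_injected_latency` and `DependencyTimeout`, and the choice to model an "infinite delay" as a wait that ends at the caller's timeout, are all assumptions.

```python
import time
from typing import Any, Callable, Optional


class DependencyTimeout(Exception):
    """Raised when the injected delay reaches the caller's timeout (hypothetical)."""


def with_injected_latency(
    call: Callable[..., Any],
    delay_s: Optional[float],
    timeout_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap `call` so every invocation is delayed by `delay_s` seconds.

    `delay_s=None` models the "infinite delay" case: the caller waits for
    its full timeout and then sees a failure, as if the partner were down.
    `sleep` is injectable so the wrapper can be exercised without waiting.
    """

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if delay_s is None or delay_s >= timeout_s:
            # The dependency never answers in time: simulate a complete cut.
            sleep(timeout_s)
            raise DependencyTimeout(f"no response within {timeout_s}s")
        sleep(delay_s)
        return call(*args, **kwargs)

    return wrapped
```

Wrapping the client call rather than touching the partner's service reflects the point made above: the experiment runs entirely on the caller's side.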

LinkedOut
Framework and tooling to test how user experience will degrade in different failure scenarios associated with downstream calls. It provides a seamless way to simulate failures across our application stack with minimal effort.

FireDrill
Provides an automated, systematic way to trigger/simulate infrastructure failure in production, with the goal of helping build applications that are resistant to these failures.

People

Jay Parikh
Vice president and head of engineering and infrastructure

People

Bhaskaran Devaraj
Senior Director, Site Reliability Engineering at LinkedIn

Hosting and Cloud Companies

Tools

Tools

Gremlin
Framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks.
Chaos Engineering: the history, principles, and practice

People

Tools

Pumba
Chaos testing and network emulation for Docker containers (and clusters)

People

Tools

Chaos Toolkit
Free, open source project that enables you to create and apply Chaos Experiments to various types of infrastructure, platforms and applications.

ChaosIQ
Platform for your teams to apply Chaos Engineering to their rapidly evolving, business critical Cloud Native microservices and platforms so they can build confidence that those systems won't fail your users.

Russ Miles
Chaos Engineering Officer (CEO) of ChaosIQ.io
https://fr.slideshare.net/russmiles/chaos-engineering-101-by

Gameday

Experiment

People

Jesse Robbins
Former Amazon "Master of Disaster"
OrionLabs Founder and CEO
Creator of Gameday AWS
https://fr.slideshare.net/jesserobbins/ameday-creating-resiliency-through-destruction

Former fireman

Gameday AWS
Interactive, six-part series to get hands-on cloud computing experience
https://fr.slideshare.net/AmazonWebServices/game-days-crash-test-your-application-and-your-team

Days of Chaos
Inspired by AWS GameDays, teams volunteer their applications for a Day of Chaos to test their resilience. Every 30 minutes, operators simulate failures in pre-production, and teams earn points for detecting, diagnosing and resolving them. This type of gamified event helps introduce development teams to the concept of resilience.
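The scoring could be tallied along the following lines. This is only a sketch: the point values, the event names and the `leaderboard` function are assumptions made for illustration, since the map does not say how points were actually awarded.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical point values; the real event's scale is not given in the source.
POINTS = {"detection": 1, "diagnosis": 2, "resolution": 3}


def leaderboard(events: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Sum points per team from (team, achievement) events, highest score first."""
    totals: Counter = Counter()
    for team, achievement in events:
        totals[team] += POINTS[achievement]
    return totals.most_common()
```

Weighting resolution above detection is one possible design choice; it rewards teams that carry an incident through to the end rather than only noticing it.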


Chaos Monkey
The first tool developed by Netflix. It randomly selects instances in the production environment and deliberately takes them out of service.
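The random-selection step can be sketched as follows. This is a simplified illustration, not Netflix's implementation: the grouping of instances, the one-victim-per-group rule, the `terminate` callback and the `dry_run` default are all assumptions.

```python
import random
from typing import Callable, Dict, List, Optional


def pick_victims(groups: Dict[str, List[str]], rng: random.Random) -> Dict[str, str]:
    """Choose one instance at random from every non-empty group."""
    return {name: rng.choice(members) for name, members in groups.items() if members}


def run_chaos_round(
    groups: Dict[str, List[str]],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
    dry_run: bool = True,
) -> Dict[str, str]:
    """Pick victims and, unless `dry_run` is set, hand each one to `terminate`.

    `terminate` is a hypothetical callback standing in for whatever the
    platform uses to take an instance out of service.
    """
    rng = rng or random.Random()
    victims = pick_victims(groups, rng)
    if not dry_run:
        for instance in victims.values():
            terminate(instance)
    return victims
```

Defaulting to a dry run means a first invocation only reports which instances would be affected.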

Experiment

Ariel Tseitlin
Investor, entrepreneur, and accomplished technology executive
Former Cloud Director at Netflix
https://fr.slideshare.net/atseitlin/aws-reinvent-2012-chaos-monkey-the-netflix-simian-army

Nora Jones
Senior Chaos Engineer at Netflix, formerly at Jet.
Co-author of Chaos Engineering (O'Reilly, 2017)
https://fr.slideshare.net/InfoQ/choose-your-own-adventure-chaos-engineering

Tools

Search Chaos Monkey
Search Chaos Monkey has been instrumental in providing a deterministic framework for finding exceptional failures and driving them to resolution as low-impact errors with planned, automated solutions.

ChAP: Chaos Automation Platform
ChAP enables engineering teams to run Chaos Engineering experiments on live traffic in production in order to build confidence that their service will degrade gracefully when non-critical downstream services fail.
https://arxiv.org/pdf/1702.05849.pdf

Casey Rosenthal
Philosopher. Traffic and Chaos Engineering Manager

Hailstorm drives integration tests and simulates peak load during off-peak times

uDestroy intentionally breaks things so we can get better at handling unexpected failures

Experiment

Disaster Recovery Testing (DiRT)
Google runs an annual, company-wide, multi-day Disaster Recovery Testing event—DiRT—the objective of which is to ensure that Google's services and internal business operations continue to run following a disaster.

People

Kripa Krishnan
Director, Cloud Ops & Site Reliability Engineering
Google's Queen of Chaos

Aaron P Blohowiak
Co-Author of O'Reilly's "Chaos Engineering". Work on distributed system reliability and design @ Netflix.
O'Reilly Velocity San Jose 2017: Precision Chaos

Lorin Hochstein
Putting the engineering in computer science
and the science in software eng. Academic refugee.
Chaos engineer, Netflix

People

James Hamilton
AWS VP, Ex-Microsoft Research
About testing in production, 2007

FIT: Failure Injection Testing
Platform that simplifies creation of failure within our ecosystem with a greater degree of precision for what we fail and who we will impact. FIT also allows us to propagate our failures across the entirety of Netflix in a consistent and controlled manner.

Experiment

People

Tammy Butow
Site Reliability Engineering Manager
Now at Gremlin Inc.

Tools

People

Chaos Lemur
Cousin to Chaos Monkey, but built for Pivotal Cloud Foundry

Paul Harris
Staff Software Engineer

Tools

Chaos Gopher
Chaos testing/engineering in Go

People

Matthew Campbell
Ex-General Purpose Go Hacker at DigitalOcean
Cofounder at Loom Network
https://www.slideshare.net/MatthewCampbell7/presentationchaosmonkey

Tools

Gremlin Fault Injection Tool

Simoorg
Open Source Failure Induction Framework

Kube-monkey
An implementation of Netflix's Chaos Monkey for Kubernetes clusters

Chaos Kong
King of the gorillas: drops a full Amazon Region

Matt Fornaciari
CTO - avid practitioner of #chaosengineering
Formerly at Salesforce and Amazon

Map created by:

with the help of the Chaos Engineering Slack team and the Chaos Community

Ali Basiri
Senior Software Engineer
Wreaking Havoc

People

Bruce M. Wong
Stitch Fix Eng - keeper of chaos, breaker of systems :: formerly practiced at Twilio, Netflix, Adobe
https://fr.slideshare.net/BruceWong3/the-journey-of-chaos-engineering-begins-with-a-single-step

Greg Orzell
Cloud Distributed Systems Architecture Consulting, at Crispy Mountain GmbH
Founded the Simian Army

James Burns
Software Architect at Stitch Fix
Former Tech Lead at Twilio

Sergiu Bodiu
Passionate IT craftsmanship #blitzscaling, avid student of life, autodidact, #cloudnative evangelist.
https://fr.slideshare.net/sbodiu/from-resilient-to-antifragile-chaos-engineering-primer-devseccon

Luke Kosewski
Senior Software Engineer and a founding member of the Traffic & Chaos team at Netflix
https://fr.slideshare.net/InfoQ/chaos-kong-endowing-netflix-with-antifragility

People

Tools

People

Nemesis
Simulate error conditions using "disruptors"

Shay Holmes
Sr. Director, Engineering Services

Suresh Visvanathan
Nemesis Architect & Lead

Pavlos Ratis
Graduate Software Engineering MSc student at the University of Glasgow, Open Source Developer

Storm
To prepare for the loss of a datacenter, Facebook regularly tests the resistance of its infrastructures to extreme events. Known as the Storm Project, the program simulates massive data center failures.

Experiment

People

David Halsey
VP, Performance Engineering, Fidelity Investments

Kyle Parrish
Innovative, multi-dimensional leader focused on Technology Risk and Information Security in Financial Services

Something or someone missing? Don't want to be on the map? Please send me your feedback.

The ultimate resources to prepare your Gameday, by DiUS

Thomissa Comellas
SRE causing chaos at Dropbox,
previously at StanfordEng, TeslaMotors.


Tools

Chaos Monkey
Randomly selects instances in the production environment and deliberately takes them out of service.

Processkiller Monkey
Cousin of Chaos Monkey, a little more definitive...

Latency Monkey
Introduces delays at the communication layer to test how well a system tolerates degraded performance from an external component it depends on, up to simulating a complete outage (an infinite delay), without having to ask the partner concerned to cut its service.

Fulldisk Monkey
Fills a disk to test the resilience of an application, especially its logging.
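A bounded version of the disk-filling idea might look like the sketch below. It is an illustration only: the function names, the filler file name and the explicit byte budget are assumptions, and the real tool's behaviour is not described in the map. Capping the size keeps the experiment from taking down the host it runs on.

```python
import os
import tempfile


def fill_disk(directory: str, total_bytes: int, chunk_size: int = 1024 * 1024) -> str:
    """Write a filler file of exactly `total_bytes` bytes into `directory`.

    Returns the path of the filler file so the experiment can be undone.
    """
    path = os.path.join(directory, "fulldisk_monkey.fill")  # hypothetical name
    chunk = b"\0" * chunk_size
    written = 0
    with open(path, "wb") as filler:
        while written < total_bytes:
            size = min(chunk_size, total_bytes - written)
            filler.write(chunk[:size])
            written += size
    return path


def release(path: str) -> None:
    """End the experiment by removing the filler file."""
    if os.path.exists(path):
        os.remove(path)


if __name__ == "__main__":
    # Demonstration in a throwaway directory rather than on a real log volume.
    with tempfile.TemporaryDirectory() as scratch:
        filler_path = fill_disk(scratch, 5 * 1024 * 1024)
        print(os.path.getsize(filler_path))
        release(filler_path)
```

In a real experiment the target directory would sit on the same volume as the application's logs, so that logging is what hits the limit first.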

Properties Monkey
Modifies an application's properties to test its resilience.

Monké Go
Not a monkey, but an automation platform to run monkeys during integration testing

People

Christophe Rochefolle
Experienced IT executive providing technology & organization to improve quality & agility of IT systems, Chaos Engineering fan
https://fr.slideshare.net/madrockriss/paris-chaos-engineering-meetup-1


Experiment