CHAOS ENGINEERING: Companies, People, Tools & Practices
Experiments In Production
End-User Companies
People
Tools
Training Experiments
Experiment Tools & Framework
Disaster Recovery Testing
Tools
Waterbear
“application resilience” as a service
Experiment
Latency Monkey
A tool that introduces delays at the communication layer to test how well a system tolerates degraded performance in an external component it depends on, up to simulating a complete outage (an infinite delay), without having to ask the partner concerned to cut its service.
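A minimal sketch of the idea, not the actual tool: a wrapper adds an artificial delay in front of a dependency call, and a delay of `None` stands in for the infinite delay (complete cut). The names `call_with_injected_latency` and `DependencyTimeout` are hypothetical, chosen for illustration only.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class DependencyTimeout(Exception):
    """Raised when the injected delay exceeds the caller's timeout."""


def call_with_injected_latency(
    call: Callable[[], T],
    delay_s: Optional[float],
    timeout_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` after an artificial delay (hypothetical helper).

    delay_s=None models the "infinite delay" case: the caller waits for
    its whole timeout and never receives an answer.
    """
    if delay_s is None or delay_s >= timeout_s:
        sleep(timeout_s)  # the caller gives up once its timeout elapses
        raise DependencyTimeout(f"no response within {timeout_s}s")
    sleep(delay_s)  # degraded, but still within the caller's tolerance
    return call()
```

The `sleep` parameter is injectable so that an experiment can be rehearsed in tests without actually waiting.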
LinkedOut
Framework and tooling to test how user experience will degrade in different failure scenarios associated with downstream calls. It provides a seamless way to simulate failures across our application stack with minimal effort.
FireDrill
Provides an automated, systematic way to trigger/simulate infrastructure failure in production, with the goal of helping build applications that are resistant to these failures.
People
Jay Parikh
Vice president and head of engineering and infrastructure
People
Bhaskaran Devaraj
Senior Director, Site Reliability Engineering at LinkedIn
Hosting and Cloud Companies
Tools
Tools
Gremlin
Framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks.
Chaos Engineering: the history, principles, and practice
People
Tools
Pumba
Chaos testing and network emulation for Docker containers (and clusters)
People
Tools
Chaos Toolkit
Free, open source project that enables you to create and apply Chaos Experiments to various types of infrastructure, platforms and applications.
ChaosIQ
Platform for your teams to apply Chaos Engineering to their rapidly evolving, business critical Cloud Native microservices and platforms so they can build confidence that those systems won't fail your users.
Sylvain Hellegouarch
Engineering and Learning Chaos; ChaosIQ founder & CTO
https://fr.slideshare.net/SylvainHellegouarch/mucon-2017-build-confidence-in-your-system-with-chaos-engineering
Russ Miles
Chaos Engineering Officer (CEO) of ChaosIQ.io https://fr.slideshare.net/russmiles/chaos-engineering-101-by
Gameday
Experiment
People
Jesse Robbins
Former Amazon "Master of Disaster"
OrionLabs Founder and CEO
Creator of Gameday AWS
https://fr.slideshare.net/jesserobbins/ameday-creating-resiliency-through-destruction
Former fireman
Gameday AWS: interactive, six-part series to get hands-on cloud computing experience https://fr.slideshare.net/AmazonWebServices/game-days-crash-test-your-application-and-your-team
Days of Chaos
Inspired by AWS GameDays and designed to test the resilience of applications: teams volunteer their applications for a Day of Chaos. Every 30 minutes, operators simulate failures in pre-production. Teams earn points based on detections, diagnoses, and resolutions. This type of gamified event helps introduce development teams to the concept of resilience.
Chaos Monkey
The first tool developed by Netflix. It randomly selects instances in the production environment and deliberately takes them out of service.
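The random-selection step can be sketched as follows. This is an illustration under stated assumptions, not Netflix's implementation: instances are plain string IDs, and `pick_victim`, `run_once`, and the `terminate` callback are hypothetical names.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(
    instances: Sequence[str], probability: float, rng: random.Random
) -> Optional[str]:
    """Return one randomly chosen instance ID, or None if no kill this round."""
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(list(instances))


def run_once(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick at most one victim and hand it to `terminate` (assumed callback)."""
    rng = rng or random.Random()
    victim = pick_victim(instances, probability, rng)
    if victim is not None:
        terminate(victim)  # in a real setup this would call the cloud API
    return victim
```

Passing a seeded `random.Random` makes a run reproducible, which helps when replaying an experiment.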
Experiment
Yury Izrailevsky
VP, Cloud Computing and Platform Engineering
https://fr.slideshare.net/AmazonWebServices/ent101-embracing-the-cloud-final
Ariel Tseitlin
Investor, entrepreneur, and accomplished technology executive
Former Cloud Director at Netflix
https://fr.slideshare.net/atseitlin/aws-reinvent-2012-chaos-monkey-the-netflix-simian-army
Nora Jones
Senior Chaos Engineer at Netflix, formerly at Jet.
Co-author Chaos Engineering (O'Reilly 2017)
https://fr.slideshare.net/InfoQ/choose-your-own-adventure-chaos-engineering
Tools
Search Chaos Monkey
Search Chaos Monkey has been instrumental in providing a deterministic framework for finding exceptional failures and driving them to resolution as low-impact errors with planned, automated solutions.
Map based mainly on: https://github.com/dastergon/awesome-chaos-engineering
ChAP : Chaos Automation Platform
ChAP enables engineering teams to run Chaos Engineering experiments on live traffic in production in order to build confidence that their service will degrade gracefully when non-critical downstream services fail.
https://arxiv.org/pdf/1702.05849.pdf
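The property such experiments check, graceful degradation when a non-critical downstream service fails, can be sketched as below. This does not show how ChAP itself works. The page, the two services, and the function names are all hypothetical.

```python
from typing import Callable, Dict, List


def render_home_page(
    fetch_catalog: Callable[[], List[str]],
    fetch_recommendations: Callable[[], List[str]],
) -> Dict[str, object]:
    """Build a page from one critical and one non-critical dependency."""
    catalog = fetch_catalog()  # critical: errors are allowed to propagate
    try:
        recommendations = fetch_recommendations()
        degraded = False
    except Exception:
        # Non-critical: fall back to an empty list instead of failing the page.
        recommendations = []
        degraded = True
    return {
        "catalog": catalog,
        "recommendations": recommendations,
        "degraded": degraded,
    }


def injected_failure() -> List[str]:
    """Stand-in for a downstream call that a chaos experiment has broken."""
    raise ConnectionError("injected failure")
```

An experiment would substitute `injected_failure` for the non-critical call and verify that the page still renders, with `degraded` set to `True`.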
Casey Rosenthal
Philosopher. Traffic and Chaos Engineering Manager
Hailstorm drives integration tests and simulates peak load during off-peak times
uDestroy intentionally breaks things so we can get better at handling unexpected failures
Experiment
Disaster Recovery Program (DiRT)
Google runs an annual, company-wide, multi-day Disaster Recovery Testing event—DiRT—the objective of which is to ensure that Google's services and internal business operations continue to run following a disaster.
People
Kripa Krishnan
Director, Cloud Ops & Site Reliability Engineering
Google's Queen of Chaos
Aaron P Blohowiak
Co-author of O'Reilly's "Chaos Engineering". Works on distributed system reliability and design @ Netflix.
O'Reilly Velocity San Jose 2017: Precision Chaos
Lorin Hochstein
Putting the engineering in computer science
and the science in software eng. Academic refugee.
Chaos engineer, Netflix
People
Charles Torre
Chaos Engineering, Programming, Technical Leadership
https://msdevshow.com/2016/11/chaos-engineering-with-charles-torre/
James Hamilton
AWS VP, Ex-Microsoft Research
About testing in production, 2007
FIT: Failure Injection Testing
Platform that simplifies creation of failure within our ecosystem with a greater degree of precision for what we fail and who we will impact. FIT also allows us to propagate our failures across the entirety of Netflix in a consistent and controlled manner.
Heather Nakama
Software Engineer at Microsoft
https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
Experiment
People
Tammy Butow
Site Reliability Engineering Manager
Now at Gremlin Inc.
Tools
People
Chaos Lemur
Cousin to Chaos Monkey, but built for Pivotal Cloud Foundry
Paul Harris
Staff Software Engineer
Tools
Chaos Gopher
Chaos testing/engineering in Go
People
Matthew Campbell
Ex-General Purpose Go Hacker at DigitalOcean
Cofounder at Loom Network https://www.slideshare.net/MatthewCampbell7/presentationchaosmonkey
Tools
Gremlin Fault Injection Tool
Simoorg
Open Source Failure Induction Framework
Kube-monkey
An implementation of Netflix's Chaos Monkey for Kubernetes clusters
Chaos Kong
King of the Gorillas: drops a full Amazon Region
Matt Fornaciari
CTO - avid practitioner of #chaosengineering
Formerly at Salesforce and Amazon
Matt Jacobs
Engineering
Previously at Netflix https://fr.slideshare.net/MattJacobs11/using-hystrix-to-build-resilient-distributed-systems-58836753
Map created by :
with the help of Chaos Engineering Slack team and Chaos Community
Ali Basiri
Senior Software Engineer
Wreaking Havoc
People
Bruce M. Wong
Stitch Fix Eng - keeper of chaos, breaker of systems :: formerly practiced at Twilio, Netflix, Adobe
https://fr.slideshare.net/BruceWong3/the-journey-of-chaos-engineering-begins-with-a-single-step
Greg Orzell
Cloud Distributed Systems Architecture Consulting, at Crispy Mountain GmbH
Founded the Simian Army
James Burns
Software Architect at Stitch Fix
Former Tech Lead at Twilio
Sergiu Bodiu
Passionate IT craftsmanship #blitzscaling, avid student of life, autodidact, #cloudnative evangelist.
https://fr.slideshare.net/sbodiu/from-resilient-to-antifragile-chaos-engineering-primer-devseccon
Luke Kosewski
Senior Software Engineer and a founding member of the Traffic & Chaos team at Netflix https://fr.slideshare.net/InfoQ/chaos-kong-endowing-netflix-with-antifragility
People
Alexei Ledenev
Chief Research Officer at Codefresh
https://fr.slideshare.net/alexLM/chaos-engineering-for-docker
Tools
People
Nemesis
Simulate error conditions using "disruptors"
Shay Holmes
Sr. Director, Engineering Services
Suresh Visvanathan
Nemesis Architect & Lead
Pavlos Ratis
Graduate Software Engineering MSc student at the University of Glasgow, Open Source Developer
Storm
To prepare for the loss of a datacenter, Facebook regularly tests the resistance of its infrastructures to extreme events. Known as the Storm Project, the program simulates massive data center failures.
Experiment
Too big to test: Breaking a production brokerage platform without causing financial devastation https://cdn.oreillystatic.com/en/assets/1/event/124/Too%20big%20to%20test_%20Breaking%20a%20production%20brokerage%20platform%20without%20causing%20financial%20devastation%20Presentation%202.pdf
People
David Halsey
VP, Performance Engineering, Fidelity Investments
Kyle Parrish
Innovative, multi-dimensional leader focused on Technology Risk and Information Security in Financial Services
Something or someone missing? Don't want to be on the map? Please send me your feedback.
The Ultimate Resources to prepare your Gameday by DiUS
Thomissa Comellas
SRE causing chaos at Dropbox,
previously at StanfordEng, TeslaMotors.
Tools
Chaos Monkey
Randomly selects instances in the production environment and deliberately takes them out of service.
Processkiller Monkey
Cousin of Chaos Monkey, a little more definitive...
Latency Monkey
A tool that introduces delays at the communication layer to test how well a system tolerates degraded performance in an external component it depends on, up to simulating a complete outage (an infinite delay), without having to ask the partner concerned to cut its service.
Fulldisk Monkey
Fills a disk to test the resilience of an application, especially its logging.
Properties Monkey
Modifies the properties of an application to test its resilience.
Monké Go
Not a monkey, but an automation platform to run monkeys during integration testing
People
Christophe Rochefolle
Experienced IT executive providing technology & organization to improve quality & agility of IT systems, Chaos Engineering fan
https://fr.slideshare.net/madrockriss/paris-chaos-engineering-meetup-1
Benjamin Gakic
SRE Architect
IT & #ChaosEngineering
https://fr.slideshare.net/madrockriss/paris-chaos-engineering-meetup-1
Experiment