Putnam CE2 failure testing, https://en.wikipedia.org/wiki/Chaos…
Putnam CE2
failure testing
technique
chaos engineering
Netflix-style chaos testing in prod
seems unlikely
Maybe we do it in dev? Or dev DR?
no need for ops team help
we are the only consumers
dedicated time window to run the test
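A minimal sketch of guarding a run so it only starts inside the dedicated window. The function name and the window times are hypothetical; the real window still has to be agreed on.

```python
from datetime import datetime, time


def in_test_window(now: datetime, start: time, end: time) -> bool:
    """Return True if `now` falls inside the daily window [start, end)."""
    t = now.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight, e.g. 22:00 to 02:00.
    return t >= start or t < end


# Hypothetical window: 14:00 to 16:00 local time.
WINDOW_START = time(14, 0)
WINDOW_END = time(16, 0)
```

A test runner could call `in_test_window(datetime.now(), WINDOW_START, WINDOW_END)` and refuse to inject anything when it returns False.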
tools
Litmus?
open source
Apache 2.0 licensed
seems active
ChaosMonkey?
may be expensive
(site says "contact us" :D)
completely manual?
create a playbook
expected results: may not be known
on the first iteration
"we will know it when we see it"
error should reflect failure
POC
Leverage the DR and network diagram
to determine system dependencies
create a playbook
for just one service
General service?
Learn from it for other services
Contains
what goes up/down when
what tests we run
how we run tests and capture data
what data to capture
what error is expected
people and roles
chaos coordinator
tester
data capturer
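One possible way to write the playbook down as data, so the "Contains" items and the roles live in one place. This is only a sketch: the class names, field names, and the example service and steps are all hypothetical. An expected error of `None` stands for "not known yet", which is likely on the first iteration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    action: str                            # what goes up/down, and when
    tests: list[str]                       # what tests we run
    capture: list[str]                     # what data to capture
    expected_error: Optional[str] = None   # None = not known yet


@dataclass
class Playbook:
    service: str
    roles: dict[str, str]                  # role -> person
    steps: list[Step] = field(default_factory=list)

    def unknown_expectations(self) -> list[str]:
        """Actions whose expected error still has to be discovered."""
        return [s.action for s in self.steps if s.expected_error is None]


# Hypothetical example for a single service.
example = Playbook(
    service="example-service",
    roles={"chaos coordinator": "TBD", "tester": "TBD", "data capturer": "TBD"},
    steps=[
        Step("stop main db", ["smoke test"], ["app logs"]),
        Step("stop audit log db", ["smoke test"], ["app logs"], "write rejected"),
    ],
)
```

After the first run, the steps returned by `example.unknown_expectations()` are the ones to fill in from what was actually observed.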
Test features
shutdowns
servers down
Temporary blips
how to create these?
Longer duration
AWS components down
DB
audit log
main DB
DynamoDB
SQS? SNS?
API / ALB / DNS
pod shutdowns / pod healthy but app down
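A rough sketch of the "pod healthy but app down" case, using only the Python standard library: the health endpoint keeps returning 200 while every other path returns 503. The `/healthz` path, the `APP_DOWN` flag, and the port are assumptions for illustration, not what our services actually expose.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

APP_DOWN = True  # flip to False to restore normal behaviour


def status_for(path: str, app_down: bool) -> int:
    """Health check always passes; the app itself fails while app_down is set."""
    if path == "/healthz":
        return 200
    return 503 if app_down else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        code = status_for(self.path, APP_DOWN)
        self.send_response(code)
        self.end_headers()
        self.wfile.write(b"ok\n" if code == 200 else b"unavailable\n")


def serve(port: int = 8080) -> None:
    """Run the stand-in service until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` in a dev environment gives a stand-in service whose health probe passes while real requests fail, which is the situation a pod-level liveness check alone would miss.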
Latency injection
how do we do this?
proxies? SSH tunnels?
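If we go the proxy route, a small TCP proxy that sleeps before forwarding each chunk might be enough for a first try. The sketch below uses Python's asyncio streams; the function names and the delay value are hypothetical, and it makes no attempt to handle half-closed connections cleanly.

```python
import asyncio


async def pipe(reader, writer, delay):
    """Copy bytes from reader to writer, sleeping `delay` seconds per chunk."""
    try:
        while data := await reader.read(65536):
            await asyncio.sleep(delay)  # the injected latency
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def start_latency_proxy(listen_port, target_host, target_port, delay):
    """Listen on localhost and forward to the target, delaying both directions."""

    async def handle(client_reader, client_writer):
        up_reader, up_writer = await asyncio.open_connection(target_host, target_port)
        await asyncio.gather(
            pipe(client_reader, up_writer, delay),
            pipe(up_reader, client_writer, delay),
            return_exceptions=True,  # a reset on one side should not crash the handler
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

The service under test would then be pointed at the proxy's port instead of the real dependency, so each request and each response picks up roughly `delay` seconds.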
Resource exhaustion
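A deliberately bounded sketch of memory pressure: allocate a fixed amount, hold it for a while, then release it. The function name and the safety cap are hypothetical; CPU, disk, and connection exhaustion would each need their own approach.

```python
import time

MB = 1024 * 1024


def hold_memory(megabytes: int, seconds: float, max_megabytes: int = 512) -> int:
    """Allocate `megabytes` of memory, hold it for `seconds`, return bytes held."""
    if megabytes > max_megabytes:
        raise ValueError("refusing to allocate more than the safety cap")
    # Fill with non-zero bytes so the pages are actually written to.
    chunks = [bytearray(b"\xff" * MB) for _ in range(megabytes)]
    time.sleep(seconds)
    return sum(len(c) for c in chunks)
```

The memory is freed when the function returns, so a run stays limited to the given duration.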
Prioritize what we want to test first
https://en.wikipedia.org/wiki/Chaos_engineering