SRE (Ch. 22 - Addressing Cascading Failures)
A cascading failure is a failure that grows over time as a result of positive feedback. It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail.
For example, a single replica for a service can fail due to overload, increasing load on remaining replicas and increasing their probability of failing, causing a domino effect that takes down all the replicas for a service.
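The domino effect can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not in the text: load is spread evenly across live replicas, and a replica crashes the moment its share exceeds a fixed capacity. The function and parameter names are hypothetical.

```python
def surviving_replicas(total_qps, replicas, capacity_qps, initial_failures=1):
    """Return how many replicas remain once the system stops losing replicas.

    Assumes even load spreading and an instant crash above capacity_qps
    (both are illustrative simplifications, not a real failure model).
    """
    alive = replicas - initial_failures
    # Each crash pushes the same total load onto fewer replicas.
    while alive > 0 and total_qps / alive > capacity_qps:
        alive -= 1
    return alive


# 10 replicas at 1000 QPS each, running close to the limit: losing one
# replica overloads the other nine, and the failure takes down all of them.
print(surviving_replicas(9500, 10, 1000))  # -> 0
# With more headroom, the remaining nine absorb the extra load.
print(surviving_replicas(8000, 10, 1000))  # -> 9
```

The difference between the two runs is headroom: in the first case the service had no spare capacity to absorb the loss of even a single replica.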
Causes of Cascading Failures and Designing to Avoid Them (1)
The most common cause of cascading failures is overload.
Most cascading failures described here are either directly due to server overload, or due to extensions or variations of this scenario.
Running out of a resource can result in higher latency, elevated error rates, or the substitution of lower-quality results.
These are in fact desired effects of running out of resources: something eventually needs to give as the load increases beyond what a server can handle.
Increased number of in-flight requests.
Because requests take longer to handle, more requests are handled concurrently (up to a possible maximum capacity at which queuing may occur). This affects almost all resources, including memory, number of active threads (in a thread-per-request server model), number of file descriptors, and backend resources.
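The link between latency and concurrency can be estimated with Little's law (average in-flight requests = arrival rate × average latency). The sketch below applies that standard result; the function name and the example numbers are assumptions chosen for illustration.

```python
def in_flight_requests(arrival_qps, latency_seconds):
    """Estimate average concurrent requests using Little's law: L = lambda * W."""
    return arrival_qps * latency_seconds


# Hypothetical server receiving 1000 requests per second.
healthy = in_flight_requests(1000, 0.1)   # roughly 100 requests in flight
degraded = in_flight_requests(1000, 1.0)  # roughly 1000 requests in flight
print(healthy, degraded)
```

At a constant request rate, a tenfold increase in latency means roughly ten times as many threads, buffers, and file descriptors are held at once.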
Excessively long queue lengths.
If there is insufficient capacity to handle all the requests at steady state, the server will saturate its queues. This means that latency increases (the requests are queued for longer amounts of time) and the queue uses more memory.
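A minimal sketch of how a queue grows when arrivals exceed capacity, assuming constant arrival and service rates and an unbounded queue (all names and numbers are hypothetical):

```python
def queue_backlog(arrival_qps, service_qps, seconds):
    """Return the queue length at the end of each second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # The queue grows by the shortfall each second and never goes negative.
        backlog = max(0, backlog + arrival_qps - service_qps)
        history.append(backlog)
    return history


print(queue_backlog(120, 100, 3))  # -> [20, 40, 60]
print(queue_backlog(80, 100, 3))   # -> [0, 0, 0]
```

In the first case the backlog, and with it the queueing delay and memory use, keeps growing for as long as the overload lasts.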
CPU or request starvation
Missed RPC deadlines.
As a server becomes overloaded, its responses to RPCs from its clients arrive later, which may exceed any deadlines those clients set. The work the server did to respond is then wasted, and clients may retry the RPCs, leading to even more overload.
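One common way to limit this wasted work is to check the deadline before starting to process a request. The sketch below assumes the client's deadline reaches the server as an absolute monotonic timestamp; `DeadlineExceeded` and `handle_with_deadline` are hypothetical names, not part of any RPC library.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a request's deadline passed before work began."""


def handle_with_deadline(deadline, do_work, now=time.monotonic):
    """Run do_work() only if the deadline has not already passed.

    `deadline` is an absolute timestamp on the same clock as `now`.
    """
    if now() >= deadline:
        # The client has already given up, so any work done here is wasted.
        raise DeadlineExceeded("deadline passed before processing started")
    return do_work()
```

This check only avoids work on requests that are already expired when they are picked up; a request whose deadline passes mid-processing would need additional checks at later stages.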
Reduced CPU caching benefits.
As more CPU is used, the chance of spilling on to more cores increases, resulting in decreased usage of local caches and decreased CPU efficiency.
For example, a task might be evicted by the container manager (VM or otherwise) for exceeding available resource limits, or application-specific crashes may cause tasks to die.
Increased rate of garbage collection (GC) in Java, resulting in increased CPU usage.
A vicious cycle can occur in this scenario: less CPU is available, resulting in slower requests, resulting in increased RAM usage, resulting in more GC, resulting in even lower availability of CPU. This is known colloquially as the “GC death spiral.”
Reduction in cache hit rates.
Reduction in available RAM can reduce application-level cache hit rates, resulting in more RPCs to the backends, which can possibly cause the backends to become overloaded.
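The effect on the backends can be approximated with simple arithmetic, assuming every cache miss results in exactly one backend RPC (an illustrative simplification; the function name and numbers are hypothetical):

```python
def backend_qps(frontend_qps, cache_hit_rate):
    """Estimate backend load as the share of requests that miss the cache."""
    return frontend_qps * (1.0 - cache_hit_rate)


# A drop in hit rate from 90% to 80% roughly doubles backend traffic.
before = backend_qps(10_000, 0.90)  # about 1000 QPS
after = backend_qps(10_000, 0.80)   # about 2000 QPS
print(before, after)
```

A seemingly small drop in hit rate can therefore be a large relative increase in backend load when the hit rate was high to begin with.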
Thread starvation can directly cause errors or lead to health check failures.
If the server adds threads as needed, thread overhead can use too much RAM. In extreme cases, thread starvation can also cause you to run out of process IDs.
Running out of file descriptors can lead to the inability to initialize network connections, which in turn can cause health checks to fail.
Dependencies among resources
Note that many of these resource exhaustion scenarios feed from one another: a service experiencing overload often has a host of secondary symptoms that can look like the root cause, making debugging difficult.
Resource exhaustion can lead to servers crashing; for example, servers might crash when too much RAM is allocated to a container.
Once a couple of servers crash on overload, the load on the remaining servers can increase, causing them to crash as well. The problem tends to snowball and soon all servers begin to crash-loop. It’s often difficult to escape this scenario because as soon as servers come back online they’re bombarded with an extremely high rate of requests and fail almost immediately.
Preventing Server Overload
Load test the server’s capacity limits, and test the failure mode for overload
This is the most important exercise you should conduct in order to prevent server overload.
Unless you test in a realistic environment, it’s very hard to predict exactly which resource will be exhausted and how that resource exhaustion will manifest.
Serve degraded results
Serve lower-quality, cheaper-to-compute results to the user.
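A minimal sketch of degraded serving, assuming the server tracks its number of in-flight requests and has both an expensive and a cheap way to answer. The threshold and the `full_search` and `cheap_search` callables are hypothetical placeholders.

```python
def search(query, in_flight, degrade_threshold, full_search, cheap_search):
    """Fall back to the cheaper code path when the server is busy."""
    if in_flight > degrade_threshold:
        # Lower-quality but cheaper result, e.g. from a smaller or cached corpus.
        return cheap_search(query)
    return full_search(query)


result = search(
    "sre",
    in_flight=120,
    degrade_threshold=100,
    full_search=lambda q: f"full results for {q}",
    cheap_search=lambda q: f"cached results for {q}",
)
print(result)  # -> cached results for sre
```

Which signal triggers degradation (in-flight count, CPU, queue length) and what the cheaper path returns are service-specific choices.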
Instrument the server to reject requests when overloaded
Servers should protect themselves from becoming overloaded and crashing.
When overloaded at either the frontend or backend layers, fail early and cheaply.
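One simple way to fail early and cheaply is to cap the number of concurrent requests and reject anything beyond the cap before doing any work. The sketch below is an assumption-laden illustration: `LoadShedder`, `handle`, and the 503 status convention are hypothetical, not taken from the text.

```python
import threading


class LoadShedder:
    """Tracks in-flight requests and refuses new ones above a fixed limit."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Reject immediately when at capacity; otherwise run the request."""
    if not shedder.try_acquire():
        return 503, "overloaded"  # cheap rejection, no work performed
    try:
        return 200, work()
    finally:
        shedder.release()
```

The appropriate limit would normally come from load testing rather than guesswork.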
Perform capacity planning
Good capacity planning can reduce the probability that a cascading failure will occur.
Capacity planning should be coupled with performance testing to determine the load at which the service will fail.
Instrument higher-level systems to reject requests, rather than overloading servers
Note that because rate limiting often doesn’t take overall service health into account, it may not be able to stop a failure that has already begun. Simple rate-limiting implementations are also likely to leave capacity unused.
Rate limiting can be implemented in a number of places: at the reverse proxies, at the load balancers, and at individual tasks.
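The text does not prescribe a particular algorithm. As one illustration, a token bucket is a common way to implement a simple rate limiter at any of these layers. The class below is a single-threaded sketch with hypothetical names; the clock is injectable so the behavior can be checked deterministically.

```python
import time


class TokenBucket:
    """Allows up to `burst` requests at once, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, burst, now=time.monotonic):
        self._rate = rate_per_second
        self._capacity = burst
        self._tokens = float(burst)
        self._now = now
        self._last = now()

    def allow(self):
        """Return True and consume a token if the request may proceed."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill for the elapsed time, never exceeding the burst capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

Like the simple implementations mentioned above, this limiter uses a fixed rate and knows nothing about the health of the service behind it, so it can leave capacity unused or fail to stop an overload already in progress.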