SRE (Ch. 21 - Handling Overload)
Avoiding overload is a goal of load balancing policies.
But no matter how efficient your load balancing policy, eventually some part of your system will become overloaded.
Gracefully handling overload conditions is fundamental to running a reliable serving system.
The Pitfalls of "Queries per Second"(1)
Different queries can have vastly different resource requirements. A query's cost can vary based on arbitrary factors
such as the code in the client that issues it (for services that have many different clients) or even the time of day (e.g., home users versus work users; or interactive end-user traffic versus batch traffic).
We learned this lesson the hard way:
modeling capacity as "queries per second" or using static features of the requests that are believed to be a proxy for the resources they consume (e.g., "how many keys are the requests reading") often makes for a poor metric.
Even if these metrics perform adequately at one point in time, the ratios can change.
A better solution is to measure capacity directly in available resources.
For example, you may have a total of 500 CPU cores and 1 TB of memory reserved for a given service in a given datacenter. Naturally, it works much better to use those numbers directly to model a datacenter's capacity.
In a majority of cases (although certainly not in all), we've found that simply using CPU consumption as the signal for provisioning works well, for the following reasons:
1) In platforms with garbage collection, memory pressure naturally translates into increased CPU consumption.
2) In other platforms, it's possible to provision the remaining resources in such a way that they're very unlikely to run out before CPU runs out.
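As a minimal sketch of modeling capacity directly in CPU cores rather than in queries per second (the reserved-core count, usage numbers, and the cpu_utilization helper are invented for illustration):

    # Hypothetical sketch: provision against CPU consumption, not request counts.
    RESERVED_CORES = 500.0  # cores reserved for the service in this datacenter

    def cpu_utilization(used_cores: float) -> float:
        """Fraction of the reserved CPU currently consumed."""
        return used_cores / RESERVED_CORES

    # A provisioning check driven by CPU consumption rather than QPS.
    if cpu_utilization(used_cores=430.0) > 0.8:
        print("datacenter nearing capacity; shift traffic or add tasks")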
Per-Customer Limits(2)
We aggregate global usage information in real time from all backend tasks, and use that data to push effective limits to individual backend tasks.
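A rough, hypothetical sketch of that idea (the limit values, customer names, and admit/finish helpers are invented): each backend task enforces whatever per-customer limit the aggregation system last pushed to it, and rejects requests from customers that exceed it.

    # Hypothetical per-customer limits pushed to this backend task by the global aggregator.
    per_task_limits = {"gmail": 300, "calendar": 120}
    in_flight = {}  # requests currently being served, per customer

    def admit(customer: str) -> bool:
        """Accept the request only if the customer is under its pushed limit."""
        if in_flight.get(customer, 0) >= per_task_limits.get(customer, 0):
            return False  # reject with an "out of quota" error instead of serving it
        in_flight[customer] = in_flight.get(customer, 0) + 1
        return True

    def finish(customer: str) -> None:
        """Release the slot when the request completes."""
        in_flight[customer] -= 1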
Criticality(3)
Criticality is another notion that we've found very useful in the context of global quotas and throttling.
A request made to a backend is associated with one of four possible criticality values:
CRITICAL_PLUS, CRITICAL, SHEDDABLE_PLUS, SHEDDABLE
We've made criticality a first-class notion of our RPC system, and we've worked hard to integrate it into many of our control mechanisms so it can be taken into account when reacting to overload situations.
When a customer runs out of global quota, a backend task will only reject requests of a given criticality if it's already rejecting all requests of all lower criticalities.
When a task is itself overloaded, it will reject requests of lower criticalities sooner.
The adaptive throttling system also keeps separate stats for each criticality.
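A toy sketch of rejecting lower criticalities sooner; the ordering below matches the four values above, but the load thresholds are assumptions made purely for illustration, not the actual policy.

    # Criticality values, ordered from most sheddable to most critical.
    CRITICALITIES = ["SHEDDABLE", "SHEDDABLE_PLUS", "CRITICAL", "CRITICAL_PLUS"]

    # Hypothetical load thresholds: shed SHEDDABLE traffic first, CRITICAL_PLUS last.
    REJECT_ABOVE = {"SHEDDABLE": 0.80, "SHEDDABLE_PLUS": 0.90,
                    "CRITICAL": 0.97, "CRITICAL_PLUS": 1.00}

    def should_reject(criticality: str, load: float) -> bool:
        """Reject a request of the given criticality once load crosses its threshold."""
        return load >= REJECT_ABOVE[criticality]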
We've also significantly extended our RPC system to propagate criticality automatically.
If a backend receives request A and, as part of executing that request, issues outgoing request B and request C to other backends, request B and request C will use the same criticality as request A by default.
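A minimal sketch of that default propagation; the send_rpc helper and the request shape are invented, and the only point is that requests B and C inherit A's criticality unless the code sets one explicitly.

    # Hypothetical handler showing default criticality propagation.
    def send_rpc(target: str, payload: str, criticality: str) -> None:
        print(f"-> {target}: {payload} [{criticality}]")

    def handle_request_a(request_a: dict) -> None:
        # Outgoing requests B and C inherit request A's criticality by default.
        send_rpc("backend_b", "request B", criticality=request_a["criticality"])
        send_rpc("backend_c", "request C", criticality=request_a["criticality"])

    handle_request_a({"criticality": "CRITICAL"})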
Handling Overload Errors(4)
A large subset of backend tasks in the datacenter are overloaded.
If the cross-datacenter load balancing system is working perfectly (i.e., it can propagate state and react instantaneously to shifts in traffic), this condition will not occur.
A small subset of backend tasks in the datacenter are overloaded.
This situation is typically caused by imperfections in the load balancing inside the datacenter. For example, a task may have very recently received a very expensive request. In this case, it is very likely that the datacenter has remaining capacity in other tasks to handle the request.
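A minimal sketch under that assumption (only a small subset of tasks is overloaded, so another task in the same datacenter likely has spare capacity); the task list, error value, and retry cap are invented for illustration.

    import random

    # Hypothetical retry of an overloaded response against another task in the same datacenter.
    def call_with_retry(tasks, request, max_attempts=3):
        for _ in range(max_attempts):
            task = random.choice(tasks)   # pick a backend task at random
            reply = task(request)
            if reply != "OVERLOADED":
                return reply
        return "OVERLOADED"               # give up and surface the overload error upward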
Deciding to Retry
BOOK!
It's a common mistake to assume that an overloaded backend should turn down and stop accepting all traffic. However, this assumption runs counter to the goal of robust load balancing. We actually want the backend to continue accepting as much traffic as possible, but only as capacity frees up. A well-behaved backend, supported by robust load balancing policies, should accept only the requests that it can process and reject the rest gracefully.
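A small sketch of that behavior, with an invented in-flight capacity check: the server keeps serving, but for requests beyond what it can currently process it returns an overload error rather than turning itself down.

    # Hypothetical backend that accepts work only up to its current capacity.
    MAX_IN_FLIGHT = 100  # invented capacity figure
    in_flight = 0

    def handle(request):
        global in_flight
        if in_flight >= MAX_IN_FLIGHT:
            return "OVERLOADED"       # graceful rejection, not a shutdown
        in_flight += 1
        try:
            return process(request)   # placeholder for the real work
        finally:
            in_flight -= 1

    def process(request):
        return f"ok: {request}"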