Please enable JavaScript.
Coggle requires JavaScript to display documents.
Raft (algorithm (Safety (new leader is elected in case the current one is…
Raft
algorithm
Leader Election
-
-
steps
- servers start up as Follower state and remains as long as it receives RPCs from a leader or candidate
- Leader sends periodic heartbeats to all Followers to keep its leader state
- once a follower receives no RPC for a period of time (election timeout), it begins an election by
1) increment its current term
2) transits to Candidate state
3) votes itself and issues a RequestVote RPCs to all other servers in parallel
- stays in Candidate state until
a) it wins election (received votes from majority) and becomes Leader, then send heartbeat to all servers to notify and prevent new elections
b) received AppendEntries RPC with term >= current term of itself (another server has become Leader), then remember the new Leader and go back to Follower state
c) timed out with no winner (tied split vote), then the Candidate start a new election by incrementing its term and initiating another round of RequestVote RPCs
( :!: could repeat forever, use randomized election timeouts to avoid )
-
Log Replication
the leader
accepts log entries from clients,
replicates the log entries on other servers, and
tell servers when it is safe to apply log entries to their state machine
steps
- the Leader appends the command from client to its log as a new entry
- the Leader issues AppendEntries RPCs in parallel to all other servers to replicate the log entry
- when safely replicated, the Leader applies the log entry to its state machine and returns the result to the client
- if any Followers failed to receive, the Leader retry indefinitely until all Followers eventually store all log entries
-
-
Safety
-
when the Leader reckons the log entry is safe to apply to state machine (replicated to majority of all servers), the entry is called committed,
which is durable, and will be eventually applied to all available servers' state machines
Also commits all preceding entries in the Leader's log, including entries created by the previous Leaders
the Leader keeps track of highest index it knows to be committed, and includes that index in future AppendEntries RPCs (including heartbeats),
so that other servers will notice what log entries has been committed and apply them to its local state machine
if two entries in different logs have the same index and term, then they store the same command, and the logs are identical in all preceding entries
in AppendEntries RPCs, the Leader includes the index and term of the entry in its log that immediately precedes the new entries.
If a Follower does NOT find an entry in its log with the same index and term, it refuses the new entries
-
to handle inconsistency, the Leader force-override the Followers' logs to its own. the Leader find the last index that two logs match, and send the tailing logs after that point to the out-of-sync Followers (like detecting fork in two blockchains)
Leader maintains a nextIndex[] for each Follower, which is the index of next log entry the Leader will send to that Follower
Follower/Candidate failure, retry indefinitely with the same RPC, it's idempotent so no harm
-
safety timing requirement - broadcastTime << electionTimeout << MTBF(avg. time between failures for a single server)
0.5ms < broadcastTime < 20 ms
10ms < electionTime < 500ms
a few months < MTBF
Cluster Config Change
2-Phase approach
first switches to a transitional config, "joint consensus", which combines both old and new configs
once the joint consensus is committed, the cluster transitions to the new config
- Leader receive request to change config
- the Leader stores the config for joint consensus as a log entry and replicates that entry via AppendEntries RPCs
- Once a server receives and adds the new config to log, use that newest config for future decisions
- Leader create log entry for new config and replicate across cluster
-
joint consensus
-
-
Agreement for election and entry commitment requires separate majorities from both old and new configs
allows individual servers to transition between configurations at different times without compromising safety
-
-
Terminology
term
Raft divides time into terms (logical clock) of arbitrary length, numbered with consecutive integers
each term begins with an election, where candidates attempt to become leader, then serves as leader for the rest of the term
if election ends with split vote, end the term with no leader, and start a new term with new election
-
in some cases a server may not observe an election or even entire terms.
Each server stores current term number, which allows servers to detect stale leaders
current terms are exchanged whenever servers communicate. if one server has smaller current term comared to others', it updates to the larger value
if a candidate or leader find its term is out of date, itself immediately reverts to Follower state
-
-
Log Compaction
Snapshotting
-
each snapshot contains last included index, last included term and config the last included index has
-
-
-