Consensus Algorithm - Raft
All contents are adapted and cited from the paper “In Search of an Understandable Consensus Algorithm (Extended Version)” by Diego Ongaro and John Ousterhout at Stanford University. The paper is available on the Raft consensus algorithm website. For a Chinese translation, see 译 - In Search of an Understandable Consensus Algorithm (Extended Version) | 咕咕 (bugwz.com).
These notes cover only parts of the paper; for the full version, check the paper directly!
Intro
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but with a more understandable structure than Paxos.
Raft is similar in many ways to existing consensus algorithms, but it has several novel features:
- Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers, which simplifies the management of the replicated log.
- Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats (implemented in PA1) that any consensus algorithm already needs, while resolving conflicts simply and rapidly.
- Membership changes: Raft uses a new joint consensus approach for changing the set of servers in the cluster, in which the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.
Replicated state machines
Replicated state machines are used to solve a variety of fault-tolerance problems in distributed systems: a collection of servers compute identical copies of the same state, so if some servers go down, the remaining servers can take over and the system can continue operating.
How is the state machine replicated?
The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.
As shown in the figure above, replicated state machines are typically implemented using a replicated log:
- Each server stores a log containing a series of commands, which its state machine executes in order.
- Each log contains the same commands in the same order, so each state machine processes the same sequence of commands.
- Since the state machines are deterministic, each computes the same state and the same sequence of outputs.
Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log, then communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.
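The following is a minimal sketch (not from the paper) in Go of a deterministic key-value state machine applying commands in log order; the Command, LogEntry, and StateMachine names are hypothetical, chosen only for illustration.

```go
package main

import "fmt"

// Command is a hypothetical state machine command: set a key to a value.
type Command struct {
	Key   string
	Value string
}

// LogEntry pairs a command with the term in which the leader received it.
type LogEntry struct {
	Term    int
	Command Command
}

// StateMachine is a deterministic key-value store: applying the same
// sequence of commands always yields the same state and outputs.
type StateMachine struct {
	state map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: make(map[string]string)}
}

// Apply executes one committed log entry and returns its result.
func (sm *StateMachine) Apply(e LogEntry) string {
	sm.state[e.Command.Key] = e.Command.Value
	return "OK"
}

func main() {
	// Every replica applies the same committed log in the same order,
	// so every replica ends up with the same state.
	log := []LogEntry{
		{Term: 1, Command: Command{Key: "x", Value: "3"}},
		{Term: 1, Command: Command{Key: "y", Value: "9"}},
		{Term: 2, Command: Command{Key: "x", Value: "12"}},
	}
	sm := NewStateMachine()
	for _, e := range log {
		sm.Apply(e)
	}
	fmt.Println(sm.state) // map[x:12 y:9] on every replica
}
```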
Properties
Consensus algorithms for practical systems typically have the following properties:
- They ensure safety under all non-Byzantine conditions (never returning an incorrect result), including network delays, partitions, and packet loss, duplication, and reordering.
- They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients.
- Thus, a typical cluster of 5 servers can tolerate the failure of any 2 servers.
- Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
- They do not depend on timing to ensure the consistency of the logs.
- Faulty clocks and extreme message delays can, at worst, cause availability problems.
- In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls. (A minority of slow servers need not impact overall system performance.)
What’s wrong with Paxos?
Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.
However, Paxos has two significant drawbacks:
Paxos is difficult to understand
Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. It is divided into two stages that do not have simple intuitive explanations and cannot be understood independently, which makes it difficult to develop intuitions about why the single decree protocol works.
Poor for building practical systems
Paxos does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos: Lamport’s descriptions only sketch possible approaches to multi-Paxos, and many details are missing.
Lamport’s descriptions are mostly about single-decree Paxos, and the single-decree decomposition also makes the Paxos architecture a poor one for building practical systems.
- Paxos chooses a collection of log entries independently and then melds them into a sequential log, which just adds complexity. (It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order.)
- Paxos uses a symmetric peer-to-peer approach at its core (although it suggests a weak form of leadership as a performance optimization). This only makes sense in a simplified world where only one decision will be made. (It is simpler and faster to first elect a leader, then have the leader coordinate the decisions.)
As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos but ends up with a significantly different architecture because of the difficulties in implementing Paxos. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value.
There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system… the final system will be based on an unproven protocol. – Chubby implementers
A typical comment on the difficulty of building practical systems from Paxos.
Raft: designing for understandability
There are several goals in designing Raft:
- It must provide a complete and practical foundation for system building.
- It must be safe under all conditions and available under typical operating conditions.
- It must be efficient for common operations.
- It must be possible for a large audience to understand the algorithm comfortably. (It should give the audience intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.)
Raft uses two techniques for understandability that are generally applicable.
Problem decomposition
Raft divides problems into separate pieces that can be solved, explained, and understood relatively independently.
For example, Raft separates leader election, log replication, safety, and membership changes.
Simplify the state space
Raft simplifies the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible.
In Raft, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases the goal is to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. (In particular, randomized approaches introduce nondeterminism, since “it doesn’t matter which one is chosen.”)
For example, Raft uses randomization to simplify the leader election algorithm.
The Raft consensus algorithm
Raft is an algorithm for managing a replicated log of the form described in the Replicated state machines section.
Summary of the Raft
A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction) is shown below:
The server behavior in the upper-left box is described as a set of rules that trigger independently and repeatedly. (Section numbers such as §5.2 refer to the paper and are not meaningful for these notes.) A formal specification describes the algorithm more precisely.
Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines which simplifies the management of the replicated log. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.
Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems:
- Leader election: a new leader must be chosen when an existing leader fails.
- Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own.
- Safety: the key safety property for Raft is the State Machine Safety Property in the figure below: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.
Raft guarantees that each of the properties shown above is true at all times. (Section numbers such as §5.2 refer to the paper and are not meaningful for these notes.)
Raft basics
A Raft cluster contains several servers. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers.
Leader
The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader).
Follower
Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates.
Candidate
The candidate state is used to elect a new leader, as described below.
Raft Server States
The figure above shows the states and their transitions. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.
Raft Terms
Raft divides time into terms of arbitrary length, as shown below:
Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader. In some situations an election will result in a split vote, in this case the term will end with no leader, and a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.
Raft communication
Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. (A third RPC is introduced later for transferring snapshots.)
- RequestVote RPCs: initiated by candidates during elections.
- AppendEntries RPCs: initiated by leaders to replicate log entries and to provide a form of heartbeat.
- InstallSnapshot RPCs: used for transferring snapshots between servers.
Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
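For reference, here is a hedged sketch of the fields these RPCs carry, following the paper’s summary figure; the Go struct names are illustrative, and LogEntry is the type from the earlier state machine sketch.

```go
// RequestVoteArgs is sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // current term, so the candidate can update itself
	VoteGranted bool // true means the candidate received this server's vote
}

// AppendEntriesArgs is sent by the leader to replicate log entries;
// an empty Entries slice serves as a heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commit index
}

type AppendEntriesReply struct {
	Term    int  // current term, so the leader can update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}
```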
Leader election
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
To begin an election, a follower increments its current term (e.g., Term 3 ➡️ Term 4) and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three conditions is met:
- It wins the election.
- Another server wins the election (and establishes itself as leader).
- A period of time goes by with no winner.
Each server votes for at most one candidate in a given term, on a first-come-first-served basis. A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.
While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. There are two possible outcomes:
- Accept: If the claimed leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state.
- Decline: If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.
Split votes
The third possible outcome is that a candidate neither wins nor loses the election. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.
Raft uses randomized election timeouts to ensure that split votes are rare and are resolved quickly when they do happen. Election timeouts are chosen randomly from a fixed interval (e.g., 150-300 ms); this spreads out the servers so that in most cases only a single server times out. That server wins the election and sends heartbeats before any other servers time out.
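Below is a hedged sketch of how a follower might pick a randomized election timeout from the 150-300 ms range mentioned above and reset it on each heartbeat; the channel plumbing is a simplification for illustration, not Raft’s required structure.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout from a fixed interval
// (150-300 ms, the example range in the text) so that servers rarely
// time out at the same moment.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(150))*time.Millisecond
}

func main() {
	timer := time.NewTimer(randomElectionTimeout())
	heartbeats := make(chan struct{}) // signaled on each valid AppendEntries heartbeat

	for {
		select {
		case <-heartbeats:
			// Heard from a leader (or candidate): reset with a new random timeout.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(randomElectionTimeout())
		case <-timer.C:
			// Election timeout elapsed with no leader contact:
			// increment the term, become a candidate, and request votes.
			fmt.Println("election timeout: starting election")
			return
		}
	}
}
```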
Log replication
Once a leader has been elected, it begins serving client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
Logs are organized as shown in the figure above. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs. Each log entry also has an integer index identifying its position in the log.
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers; this also commits all preceding entries in the leader’s log, including entries created by previous leaders. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
In general, Raft maintains the following properties:
- If two entries in different logs have the same index and term, then they store the same command.
- If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries.
Consistency check
When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries; if the follower does not find an entry in its own log with the same index and term, it refuses the new entries. This consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries. During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes.
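The sketch below shows one hedged way a follower could implement this check, reusing the AppendEntriesArgs/Reply and LogEntry types from the earlier sketches and assuming the log is a slice whose position 0 holds a dummy entry, so slice position equals log index. It always truncates at the match point, which is a simplification of the rule described above.

```go
// Follower holds the minimal state needed for this consistency check sketch.
type Follower struct {
	currentTerm int
	commitIndex int
	log         []LogEntry // log[0] is a dummy entry; real entries start at index 1
}

func (f *Follower) HandleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	if args.Term < f.currentTerm {
		// The sender is a stale leader: reply false.
		return AppendEntriesReply{Term: f.currentTerm, Success: false}
	}
	// Consistency check: the follower must have an entry at PrevLogIndex
	// whose term matches PrevLogTerm, otherwise it refuses the new entries.
	if args.PrevLogIndex >= len(f.log) || f.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return AppendEntriesReply{Term: f.currentTerm, Success: false}
	}
	// Remove any conflicting entries after the match point, then append.
	f.log = append(f.log[:args.PrevLogIndex+1], args.Entries...)
	// Advance commitIndex as far as the leader allows (min is the Go 1.21+ builtin).
	if args.LeaderCommit > f.commitIndex {
		f.commitIndex = min(args.LeaderCommit, len(f.log)-1)
	}
	return AppendEntriesReply{Term: f.currentTerm, Success: true}
}
```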
The image above shows the ways in which a follower’s log may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms (e.g., follower (f) in the figure). Scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log.
To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs.
The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (log index 11 in the figure above). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
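A hedged sketch of the leader side of this retry loop follows; Leader, nextIndex, matchIndex, and sendAppendEntries are hypothetical names, and the log uses the same dummy-entry-at-index-0 convention as the earlier sketches.

```go
// Leader holds the per-follower bookkeeping used in this sketch.
type Leader struct {
	id          int
	currentTerm int
	commitIndex int
	log         []LogEntry
	nextIndex   map[int]int // next log index to send to each follower
	matchIndex  map[int]int // highest log index known to be replicated on each follower
}

// sendAppendEntries stands in for the real RPC transport (hypothetical helper).
func (l *Leader) sendAppendEntries(follower int, args AppendEntriesArgs) AppendEntriesReply {
	// ... network call omitted in this sketch ...
	return AppendEntriesReply{Term: l.currentTerm, Success: true}
}

// ReplicateTo backs nextIndex up one entry per rejection until the
// consistency check succeeds, then records the follower as matched.
func (l *Leader) ReplicateTo(follower int) {
	for {
		prev := l.nextIndex[follower] - 1
		args := AppendEntriesArgs{
			Term:         l.currentTerm,
			LeaderID:     l.id,
			PrevLogIndex: prev,
			PrevLogTerm:  l.log[prev].Term,
			Entries:      l.log[l.nextIndex[follower]:],
			LeaderCommit: l.commitIndex,
		}
		reply := l.sendAppendEntries(follower, args)
		if reply.Term > l.currentTerm {
			return // a newer term exists; this server should step down
		}
		if reply.Success {
			// The follower's log now matches the leader's through these entries.
			l.nextIndex[follower] = len(l.log)
			l.matchIndex[follower] = len(l.log) - 1
			return
		}
		// Consistency check failed: back up one entry and try again.
		l.nextIndex[follower]--
	}
}
```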
Safety
This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms. Given the election restriction, the rules for commitment can be made more precise. The paper also presents a proof sketch for the Leader Completeness Property and shows how it leads to correct behavior of the replicated state machine.
Election restriction
In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. Some algorithms elect a leader that may lack committed entries and then identify and transfer those entries to it, which results in considerable additional mechanism and complexity. Raft uses a simpler approach: it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.
A time sequence showing why a leader cannot determine commitment using log entries from older terms.
- In time (a): S1 is leader and partially replicates the log entry at index 2.
- In time (b): S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2.
- In time (c): S5 crashes; S1 restarts, is elected leader again, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed.
- In scenario (d): If S1 crashes again, S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3.
However, if S1 replicates an entry from its current term on a majority of the servers before crashing, a different scenario results:
- In scenario (e): The entry from term 4 is committed, so S5 cannot win an election (term 3 < term 4). At this point all preceding entries in the log are committed as well.
Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.
How is “up-to-date” defined?
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
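A small sketch of this comparison, as a voter might apply it before granting its vote (in addition to the one-vote-per-term rule sketched earlier); the function name is illustrative.

```go
// candidateLogIsUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: compare the terms of the last entries first,
// then, if the terms are equal, compare the last indexes (log lengths).
func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // the later last term is more up-to-date
	}
	return candLastIndex >= myLastIndex // same last term: the longer log wins
}
```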
Committing entries from previous terms
As discussed in the previous section, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. The figure in the Election restriction section illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.
If S1 (leader for term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.
To eliminate problems like the one in the figure in the Election restriction section, Raft never commits log entries from previous terms by counting replicas; only log entries from the leader’s current term are committed by counting replicas. Once an entry from the current term has been committed in this way, all prior entries are committed indirectly because of the Log Matching Property.
Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs.
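The sketch below shows how a leader might advance its commit index under this rule, reusing the hypothetical Leader type from the log replication sketch: it only counts replicas for entries whose term equals its current term.

```go
// MaybeAdvanceCommitIndex finds the highest index N stored on a majority of
// the cluster, but commits it by counting replicas only if log[N].Term equals
// the leader's current term; earlier entries then commit indirectly.
func (l *Leader) MaybeAdvanceCommitIndex(clusterSize int) {
	for n := len(l.log) - 1; n > l.commitIndex; n-- {
		if l.log[n].Term != l.currentTerm {
			break // never count replicas for entries from previous terms
		}
		count := 1 // the leader itself stores the entry
		for _, m := range l.matchIndex {
			if m >= n {
				count++
			}
		}
		if count*2 > clusterSize {
			// Committing index n also commits all preceding entries
			// (Log Matching Property).
			l.commitIndex = n
			return
		}
	}
}
```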
Follower and candidate crashes
Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts; Raft RPCs are idempotent, so this causes no harm.
Timing and availability
One of the requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing.
Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:
- broadcastTime << electionTimeout << MTBF (Mean Time Between Failures)
In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in the Leader election section; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout, which should represent only a small fraction of overall time.
The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5 ms to 20 ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10 ms and 500 ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.
Cluster membership changes
Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration. Thus, a mechanism is needed for automating configuration changes and incorporating them into the Raft consensus algorithm.
For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure below).
Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five (green means old, blue means new). Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration (Cold) and another with a majority of the new configuration (Cnew). (At the problematic moment in the figure, Server1 can win an election with votes from itself and Server2, a 2/3 majority of the old configuration, while at the same time Server5 can win with votes from itself, Server3, and Server4, a 3/5 majority of the new configuration, so two leaders exist at once.)
The joint consensus
In order to ensure safety, configuration changes must use a two-phase approach. In Raft the cluster first switches to a transitional configuration called joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:
- Log entries are replicated to all servers in both configurations.
- Any server from either configuration may serve as leader.
- Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.
Joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.
The figure above is the timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the Cold,new configuration entry in its log and commits it to Cold,new (a majority of Cold and a majority of Cnew). Then it creates the Cnew entry and commits it to a majority of Cnew. There is no point in time in which Cold and Cnew can both make decisions independently.
When the leader receives a request to change the configuration from Cold to Cnew, it stores the configuration for joint consensus (Cold,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. **Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed).** This means that the leader will use the rules of Cold,new to determine when the log entry for Cold,new is committed. If the leader crashes, a new leader may be chosen under either Cold or Cold,new, depending on whether the winning candidate has received Cold,new. In any case, Cnew cannot make unilateral decisions during this period.
Once Cold,new has been committed, neither Cold nor Cnew can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the Cold,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing Cnew and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of Cnew, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in the figure above, there is no time when Cold and Cnew can both make unilateral decisions; this guarantees safety.
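As a small illustration of the joint agreement rule (hypothetical helper names, configurations represented as slices of server IDs), a decision during Cold,new must gather separate majorities in both configurations:

```go
// hasMajority reports whether the servers in agreed form a majority of config.
func hasMajority(agreed map[int]bool, config []int) bool {
	count := 0
	for _, id := range config {
		if agreed[id] {
			count++
		}
	}
	return count*2 > len(config)
}

// jointAgreement is the Cold,new rule: a vote or a commitment succeeds only
// with a majority of the old configuration AND a majority of the new one.
func jointAgreement(agreed map[int]bool, oldConfig, newConfig []int) bool {
	return hasMajority(agreed, oldConfig) && hasMajority(agreed, newConfig)
}
```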
Issues to address for reconfiguration
There are three more issues to address for reconfiguration:
- New servers may not initially store any log entries
If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.
- The cluster leader may not be part of the new configuration
In this case, the leader steps down (returns to follower state) once it has committed the Cnew log entry. **This means that there will be a period of time (while it is committing Cnew) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities.** The leader transition occurs when Cnew is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from Cnew). Before this point, it may be the case that only a server from Cold can be elected leader.
- Removed servers (those not in Cnew) can disrupt the cluster
These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.
To prevent this problem, **servers disregard RequestVote RPCs when they believe a current leader exists.** Specifically, **if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote.** This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.
Log compaction
Raft’s log grows during normal operation to incorporate more client requests, but in a practical system, it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.
Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded.
Incremental approaches to compaction
Incremental approaches, such as log cleaning and log-structured merge (LSM) trees, operate on a fraction of the data at once, so they spread the load of compaction more evenly over time. They first select a region of data that has accumulated many deleted and overwritten objects, then they rewrite the live objects from that region more compactly and free the region. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. While log cleaning would require modifications to Raft, state machines can implement LSM trees using the same interface as snapshotting.
Snapshots
Figure description: A server replaces the committed entries in its log (indexes 1 through 5) with a new snapshot, which stores just the current state (variables x and y in this example). The snapshot’s last included index and term serve to position the snapshot in the log preceding entry 6.
The figure above shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot:
- last included index: the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied).
- last included term: the term of the entry described above.
These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that entry needs a previous log index and term. To enable cluster membership changes (see the previous section), the snapshot also includes the latest configuration in the log as of the last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.
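A hedged sketch of the snapshot metadata and the log truncation that can follow a completed snapshot; the field and function names are illustrative, and the log is assumed to be uncompacted so far, so slice position equals log index (with a dummy entry at position 0 as in the earlier sketches).

```go
// Snapshot carries the metadata described above plus the serialized state.
type Snapshot struct {
	LastIncludedIndex int    // index of the last log entry the snapshot replaces
	LastIncludedTerm  int    // term of that entry
	Config            []int  // latest cluster configuration as of LastIncludedIndex
	State             []byte // serialized state machine state (e.g., the x and y variables)
}

// compactLog discards all entries up through snap.LastIncludedIndex and keeps
// a dummy entry carrying the snapshot's term, so the AppendEntries consistency
// check still has a predecessor for the first entry after the snapshot.
func compactLog(log []LogEntry, snap Snapshot) []LogEntry {
	newLog := []LogEntry{{Term: snap.LastIncludedTerm}}
	if snap.LastIncludedIndex+1 < len(log) {
		newLog = append(newLog, log[snap.LastIncludedIndex+1:]...)
	}
	return newLog
}
```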
Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (as described in the Cluster membership changes section) would not. The way to bring such a follower up-to-date is for the leader to send it a snapshot over the network.
The figure above is a summary of the InstallSnapshot RPC. Snapshots are split into chunks for transmission; this gives the follower a sign of life with each chunk, so it can reset its election timer.
The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the recipient’s log; in this case, the follower discards its entire log (it is all superseded by the snapshot and may have uncommitted entries that conflict with the snapshot). If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then log entries covered by the snapshot are deleted but entries following the snapshot are still valid and must be retained.
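A hedged sketch of the two cases a follower could distinguish when applying an incoming snapshot, reusing the Snapshot, compactLog, and Follower types from the earlier sketches (and the same slice-position-equals-index assumption):

```go
// ApplySnapshot reconciles an incoming snapshot with the follower's log.
func (f *Follower) ApplySnapshot(snap Snapshot) {
	last := snap.LastIncludedIndex
	if last < len(f.log) && f.log[last].Term == snap.LastIncludedTerm {
		// The snapshot describes a prefix of the log: drop the covered
		// entries but retain the entries that follow the snapshot.
		f.log = compactLog(f.log, snap)
	} else {
		// The snapshot contains new information: discard the entire log;
		// it is all superseded by the snapshot.
		f.log = []LogEntry{{Term: snap.LastIncludedTerm}}
	}
	// The snapshot covers committed state (max is the Go 1.21+ builtin).
	f.commitIndex = max(f.commitIndex, last)
	// The state machine would then be reset from snap.State.
}
```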
Issues that impact snapshotting performance
There are two more issues that impact snapshotting performance.
- Servers must decide when to snapshot
If a server snapshots too often, it wastes disk bandwidth and energy; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.
- Writing a snapshot can take a significant amount of time
The solution is to use copy-on-write (CoW) techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system’s copy-on-write support (e.g., fork on Linux) can be used to create an in-memory snapshot of the entire state machine (the paper’s implementation uses this approach).
Client interaction
This section describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics.
Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly chosen server. If the client’s first choice is not the leader, that server will reject the client’s request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then try again with randomly chosen servers.
Linearizable semantics
The goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time.
The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
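A minimal sketch of this duplicate detection, layered over the state machine from the first sketch; ClientCommand and the map field names are hypothetical, chosen for illustration.

```go
// ClientCommand wraps a state machine command with the client's identity
// and a per-client serial number.
type ClientCommand struct {
	ClientID string
	Serial   int64
	Op       Command
}

// DedupStateMachine remembers the latest serial number and response per
// client so retried commands are answered without being re-executed.
type DedupStateMachine struct {
	sm         *StateMachine
	lastSerial map[string]int64
	lastReply  map[string]string
}

func NewDedupStateMachine() *DedupStateMachine {
	return &DedupStateMachine{
		sm:         NewStateMachine(),
		lastSerial: make(map[string]int64),
		lastReply:  make(map[string]string),
	}
}

func (d *DedupStateMachine) Apply(cmd ClientCommand) string {
	if s, ok := d.lastSerial[cmd.ClientID]; ok && s >= cmd.Serial {
		// Already executed: return the cached response without re-executing.
		return d.lastReply[cmd.ClientID]
	}
	reply := d.sm.Apply(LogEntry{Command: cmd.Op})
	d.lastSerial[cmd.ClientID] = cmd.Serial
	d.lastReply[cmd.ClientID] = reply
	return reply
}
```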
Precautions for returning stale data
Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log.
- A leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term.
- A leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease, but this would rely on timing for safety (it assumes bounded clock skew).