All contents are adapted and cited from the paper “In Search of an Understandable Consensus Algorithm (Extended Version)” by Diego Ongaro and John Ousterhout at Stanford University. It can be found on the Raft consensus algorithm website. For a Chinese translated version, see 译 - In Search of an Understandable Consensus Algorithm (Extended Version) | 咕咕 (bugwz.com).

These notes only cover parts of the paper; for the full version, read the paper directly!

Intro

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but with a more understandable structure than Paxos.

Raft is similar in many ways to existing consensus algorithms, but it has several novel features:

  • Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. Ex. It simplifies the management of the replicated log by having log entries flow only from the leader to other servers.
  • Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats (implemented in PA1) already required by any consensus algorithm, while resolving conflicts simply and rapidly.
  • Membership changes: Raft uses a new joint consensus approach for changing the set of servers in the cluster. The majorities of two different configurations overlap during transitions, which allows the cluster to continue operating normally during configuration changes.

Replicated state machines

Replicated state machines are used to solve a variety of fault-tolerance problems in distributed systems: a collection of servers computes identical copies of the same state, so if some servers go down, the remaining ones can take over and the system can continue operating.

How is the state replicated?

The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.

Replicated state machines structure

As shown in figure above, replicated state machines are typically implemented using a replicated log:

  1. Each server stores a log containing a series of commands, which its state machine executes in order.
  2. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands.
  3. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log, then communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.
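
To make this structure concrete, here is a minimal Go sketch (not from the paper; the `Entry` type and channel wiring are illustrative) of a state machine consuming committed log entries in order, as a consensus module would deliver them:

```go
package main

import "fmt"

// Entry is an illustrative committed log entry handed to the state machine.
type Entry struct {
	Index   int
	Command string
}

func main() {
	committed := make(chan Entry, 16)

	// In a real server, the consensus module would feed this channel as
	// entries become committed; here we hand-feed two entries.
	go func() {
		committed <- Entry{Index: 1, Command: "x=3"}
		committed <- Entry{Index: 2, Command: "y=1"}
		close(committed)
	}()

	// The state machine applies entries strictly in log order, so every
	// replica that sees the same log produces the same state and outputs.
	for e := range committed {
		fmt.Printf("apply index=%d command=%q\n", e.Index, e.Command)
	}
}
```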

Properties

Consensus algorithms for practical systems typically have the following properties:

  • They ensure safety under all non-Byzantine conditions (never returning an incorrect result), including network delays, partitions, and packet loss, duplication, and reordering.
  • They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients.
    • Thus, a typical cluster of 5 servers can tolerate the failure of any 2 servers.
    • Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
  • They do not depend on timing to ensure the consistency of the logs.
    • Faulty clocks and extreme message delays can, at worst, cause availability problems.
  • In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls. (A minority of slow servers need not impact overall system performance.)

What’s wrong with Paxos?

Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

However, Paxos has two significant drawbacks:

Paxos is difficult to understand

Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. It is divided into two stages that do not have simple intuitive explanations and cannot be understood independently, which makes it difficult to develop intuitions about why the single decree protocol works.

Poor for building practical systems

Paxos does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos: Lamport’s descriptions only sketch possible approaches to multi-Paxos, and many details are missing.

Lamport’s descriptions are mostly about single-decree Paxos; this single-decree decomposition also makes the Paxos architecture a poor one for building practical systems.

  1. Paxos chooses a collection of log entries independently and then melds them into a sequential log, which just adds complexity. (It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order.)
  2. Paxos uses a symmetric peer-to-peer approach at its core (although it suggests a weak form of leadership as a performance optimization). This only makes sense in a simplified world where only one decision will be made. (It is simpler and faster to first elect a leader, then have the leader coordinate the decisions.)

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and ends up with a significantly different architecture. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value.

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system… the final system will be based on an unproven protocol. – Chubby implementers

A typical comment about Paxos being a poor fit for practical system implementations.

Designing Raft for understandability

There are several goals in designing Raft:

  1. It must provide a complete and practical foundation for system building.
  2. It must be safe under all conditions and available under typical operating conditions.
  3. It must be efficient for common operations.
  4. It must be possible for a large audience to understand the algorithm comfortably. (It should give the audience good intuitions about the algorithm, so that the extensions inevitable in real-world implementations can be made.)

Raft’s design uses two techniques that are generally applicable:

Problem decomposition

Raft divides the problem into separate pieces that can be solved, explained, and understood relatively independently.

Ex. Raft separates leader election, log replication, safety, and membership changes.

Simplify the state space

Raft simplifies the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible.

In Raft, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases the goal is to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. (In particular, randomized approaches introduce nondeterminism, but they simplify the design because “it doesn’t matter which one is chosen.”)

Ex. Raft uses randomization to simplify the leader election algorithm.

The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section Replicated State Machine.

Summary of the Raft

A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction) is shown below:

Summary of the Raft

The server behavior in the upper-left box is described as a set of rules that trigger independently and repeatedly. (Section numbers such as §5.2 refer to the paper and are not meaningful for these notes.) A formal specification describes the algorithm more precisely.


Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines which simplifies the management of the replicated log. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems:

  • Leader election: a new leader must be chosen when an existing leader fails.
  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own.
  • Safety: the key safety property for Raft is the State Machine Safety Property in the figure below: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.

Raft guarantees

Raft guarantees that each of the properties shown above is true at all times. (Section numbers such as §5.2 refer to the paper and are not meaningful for these notes.)

Raft basics

A Raft cluster contains several servers. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers.

Leader

The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader.)

Follower

Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates.

Candidate

The candidate state is used to elect a new leader, as described in the Leader election section below.

Raft Server States

Raft Server States

The figure above shows the states and their transitions. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.
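
A minimal sketch of the three states and the transitions in the figure above; the function names and the use of a Go enum are illustrative, not part of the paper:

```go
package main

import "fmt"

// State is one of the three Raft server states.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Transitions from the state diagram: a follower (or candidate) whose
// election timeout fires starts an election as a candidate; a candidate
// that gathers a majority of votes becomes leader; any server that
// discovers a higher term reverts to follower.
func onElectionTimeout(s State) State {
	if s == Follower || s == Candidate {
		return Candidate
	}
	return s
}

func onMajorityVotes(s State) State {
	if s == Candidate {
		return Leader
	}
	return s
}

func onHigherTerm(State) State { return Follower }

func main() {
	s := Follower
	s = onElectionTimeout(s) // follower -> candidate
	s = onMajorityVotes(s)   // candidate -> leader
	fmt.Println(s == Leader)
	fmt.Println(onHigherTerm(s) == Follower)
}
```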

Raft Terms

Raft divides time into terms of arbitrary length, as shown below:

Raft Terms

Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader. In some situations an election will result in a split vote; in this case the term ends with no leader, and a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Raft communication

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. (The third RPC is introduced later, for transferring snapshots.)

  • RequestVote RPCs: They are initiated by candidates during elections.
  • AppendEntries RPCs: They are initiated by leaders to replicate log entries and provide a form of heartbeat.
  • InstallSnapshot RPCs: Used for transferring snapshots between servers.

Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
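
For reference, the fields below follow the paper’s RPC summary; the Go struct layout itself is just an illustrative sketch:

```go
package raft

// LogEntry holds a state machine command plus the term in which the leader
// received it.
type LogEntry struct {
	Term    int
	Command interface{}
}

// RequestVote RPC, initiated by candidates during elections.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of candidate's last log entry
	LastLogTerm  int // term of candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // current term, for the candidate to update itself
	VoteGranted bool // true means the candidate received this vote
}

// AppendEntries RPC, initiated by leaders to replicate entries and as heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that entry
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // current term, for the leader to update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}
```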

Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

To begin an election, a follower increments its current term (Ex. Term 3 ➡️ Term 4) and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens:

  1. It wins the election
  2. Another server wins the election (establishes itself as leader)
  3. A period of time goes by with no winner

Each server votes for at most one candidate in a given term, on a first-come-first-served basis. A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. There are two possible outcomes:

  • Accept: If leader‘s term (included in its RPC) is at least as large as the candidate‘s current term, then the candidate recognizes the leader as legitimate and returns to follower state.
  • Decline: If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

Split votes

The third possible outcome is that a candidate neither wins nor loses the election. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes are rare and are resolved quickly when they do occur. Election timeouts are chosen randomly from a fixed interval (e.g., 150–300 ms); this spreads out the servers so that in most cases only a single server will time out. That server wins the election and sends heartbeats before any other servers time out.
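
A tiny sketch of this policy in Go, using the example 150–300 ms range (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout draws a fresh timeout from the example range
// (150-300 ms) each time a follower resets its election timer, so that in
// most cases a single server times out first, wins, and sends heartbeats
// before anyone else's timer fires.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout())
	}
}
```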

Log replication

Once a leader has been elected, it begins serving client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Raft Logs

Logs are organized as shown in figure above. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs. Each log entry also has an integer index identifying its position in the log.

The leader decides when it is safe to apply a log entry to the state machines (such an entry is called committed). Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers; this also commits all preceding entries in the leader’s log, including entries created by previous leaders. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
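
A sketch of the follower side of this rule, assuming 1-based log indices as in the paper’s figures (the function and parameter names are illustrative):

```go
package raft

// applyCommitted sketches how a follower advances its apply point once it
// learns the leader's commit index (carried in AppendEntries): apply every
// newly committed entry to the state machine, strictly in log order.
// lastLogIndex is the index of the follower's last entry; indices are
// 1-based as in the paper's figures.
func applyCommitted(lastLogIndex, leaderCommit int, lastApplied *int,
	applyEntryAt func(index int)) {

	commitIndex := leaderCommit
	if commitIndex > lastLogIndex {
		commitIndex = lastLogIndex // cannot commit past the entries we hold
	}
	for *lastApplied < commitIndex {
		*lastApplied++
		applyEntryAt(*lastApplied)
	}
}
```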

In general, Raft maintains the following properties:

  • If two entries in different logs have the same index and term, then they store the same command.
  • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries.

Consistency check

The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries. During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes.
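
The consistency check itself is small; here is an illustrative follower-side version, representing the log as a slice of term numbers (an assumption made to keep the sketch self-contained):

```go
package raft

// consistencyCheck is the follower-side test performed by AppendEntries:
// accept new entries only if the log contains an entry at prevLogIndex
// whose term equals prevLogTerm. logTerms[i] holds the term of the
// (1-based) entry i+1; prevLogIndex 0 (the empty prefix) always matches.
func consistencyCheck(logTerms []int, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true
	}
	if prevLogIndex > len(logTerms) {
		return false // the follower is missing entries
	}
	return logTerms[prevLogIndex-1] == prevLogTerm
}
```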

Different Logs

The image above shows the ways in which a follower’s log may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms. (Ex. the leader and follower (f)) Scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log.

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs.

AppendEntries consistency check

The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (log index 11 in Figure above). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
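
A condensed sketch of this retry loop for a single follower; `sendAppendEntries` stands in for the real RPC, and representing the log as a slice of term numbers is an assumption for brevity:

```go
package raft

// syncFollower sketches the leader's repair loop for one follower: start
// with nextIndex just past the leader's last entry, decrement on every
// rejection of the consistency check, and once an AppendEntries succeeds
// the follower's log matches the leader's through the entries sent.
// leaderLogTerms[i] is the term of 1-based entry i+1.
func syncFollower(leaderLogTerms []int,
	sendAppendEntries func(prevLogIndex, prevLogTerm int, entries []int) bool) {

	nextIndex := len(leaderLogTerms) + 1 // just after the leader's last entry
	for {
		prevLogIndex := nextIndex - 1
		prevLogTerm := 0
		if prevLogIndex > 0 {
			prevLogTerm = leaderLogTerms[prevLogIndex-1]
		}
		if sendAppendEntries(prevLogIndex, prevLogTerm, leaderLogTerms[prevLogIndex:]) {
			return // logs now match for the rest of the term
		}
		if nextIndex > 1 {
			nextIndex-- // back up one entry and retry
		}
	}
}
```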

Safety

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms. Given the election restriction, the rules for commitment can then be made more precise. The authors also present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

Election restriction

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. Some algorithms allow a leader to be elected without all committed entries and then transfer the missing entries to it, which results in considerable additional mechanism and complexity. Raft uses a simpler approach: it guarantees that all committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

Time Sequence

A time sequence showing why a leader cannot determine commitment using log entries from older terms.

  • In time (a): S1 is leader and partially replicates the log entry at index 2.
  • In time (b): S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself.
  • In time (c): S5 crashes; S1 restarts, is elected leader, and continues replication.

At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed.

  • In scenario (d): If S1 crashes again, S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3.

However, if S1 replicates an entry from its current term on a majority of the servers before crashing, a different scenario arises:

  • In scenario (e): The entry from term 4 is committed, so S5 cannot win an election (its last log term 3 < term 4). At this point all preceding entries in the log are committed as well.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

How is “up-to-date” defined?

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
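
This comparison translates directly into code; the function below is an illustrative sketch that a voter could apply before granting its vote:

```go
package raft

// atLeastAsUpToDate implements the paper's comparison: the candidate's log
// (lastIndexA, lastTermA) is at least as up-to-date as the voter's log
// (lastIndexB, lastTermB) if its last term is later, or the terms are equal
// and its log is at least as long. A voter grants its vote only if this holds.
func atLeastAsUpToDate(lastIndexA, lastTermA, lastIndexB, lastTermB int) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB // later term is more up-to-date
	}
	return lastIndexA >= lastIndexB // same term: longer log is more up-to-date
}
```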

Committing entries from previous terms

As discussed in the previous section, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. The figure in section Election restriction illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

Overlay Log

If S1 (leader from term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.

To eliminate problems like the one in the figure in section Election restriction, Raft never commits log entries from previous terms by counting replicas; only log entries from the leader’s current term are committed by counting replicas. Once an entry from the current term has been committed in this way, all prior entries are committed indirectly because of the Log Matching Property.
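
An illustrative sketch of the resulting commit rule on the leader (1-based indices, log represented as a slice of term numbers; not the paper’s code):

```go
package raft

// advanceCommitIndex sketches the leader's commit rule: an index N may be
// committed only if a majority of the cluster stores entry N *and* that
// entry is from the leader's current term. Earlier entries then become
// committed indirectly via the Log Matching Property.
// logTerms[i] is the term of 1-based entry i+1; matchIndex holds, for each
// follower, the highest index known to be replicated on it.
func advanceCommitIndex(logTerms []int, matchIndex []int,
	currentTerm, commitIndex int) int {

	clusterSize := len(matchIndex) + 1 // followers plus the leader itself
	for n := len(logTerms); n > commitIndex; n-- {
		if logTerms[n-1] != currentTerm {
			continue // never commit old-term entries by counting replicas
		}
		count := 1 // the leader always stores its own entry
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count*2 > clusterSize {
			return n // majority reached for a current-term entry
		}
	}
	return commitIndex
}
```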

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader re-replicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs.

Follower and candidate crashes

Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm.

Timing and availability

One of Raft’s requirements is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing.

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

  • broadcastTime << electionTimeout << MTBF (Mean Time Between Failures)

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section Leader election; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; this should represent only a small fraction of overall time.

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

Cluster membership changes

Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration. Thus, a mechanism for automating configuration changes and incorporating them into the Raft consensus algorithm is needed.

For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure below).

Cluster Membership Changes

Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five (green means old, blue means new). Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration (Cold) and another with a majority of the new configuration (Cnew). (At the problematic moment in the figure, Server 1 can win an election with 2 of the 3 votes in the old configuration, its own and Server 2’s, while Server 5 can simultaneously win with 3 of the 5 votes in the new configuration, its own plus Server 3’s and Server 4’s, leaving two leaders at once.)

The joint consensus

In order to ensure safety, configuration changes must use a two-phase approach. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:

  • Log entries are replicated to all servers in both configurations
  • Any server from either configuration may serve as leader
  • Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations

The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.
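
The double-majority rule is the heart of joint consensus; here is an illustrative sketch of the quorum test, using integer server IDs and Go map-based sets purely for demonstration:

```go
package raft

// jointQuorum sketches agreement under Cold,new: a set of servers that
// responded (voted, or acknowledged an entry) forms a quorum only if it
// contains a majority of the old configuration AND a majority of the new
// one. Servers are identified by integer IDs here purely for illustration.
func jointQuorum(responded, oldConfig, newConfig map[int]bool) bool {
	return hasMajority(responded, oldConfig) && hasMajority(responded, newConfig)
}

func hasMajority(responded, config map[int]bool) bool {
	count := 0
	for id := range config {
		if responded[id] {
			count++
		}
	}
	return count*2 > len(config)
}
```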

Timeline For A Configuration Change

The figure above is the timeline for a configuration change. Dashed lines (…) show configuration entries that have been created but not committed, and solid lines (__) show the latest committed configuration entry. The leader first creates the Cold,new configuration entry in its log and commits it to Cold,new (a majority of Cold and a majority of Cnew). Then it creates the Cnew entry and commits it to a majority of Cnew. There is no point in time in which Cold and Cnew can both make decisions independently.

When the leader receives a request to change the configuration from Cold to Cnew, it stores the configuration for joint consensus (Cold,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. **Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed).** This means that the leader will use the rules of Cold,new to determine when the log entry for Cold,new is committed. If the leader crashes, a new leader may be chosen under either Cold or Cold,new, depending on whether the winning candidate has received Cold,new. In any case, Cnew cannot make unilateral decisions during this period.

Once Cold,new has been committed, neither Cold nor Cnew can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the Cold,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing Cnew and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of Cnew the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure above, there is no time when Cold and Cnew can both make unilateral decisions; this guarantees safety.

Issues to address for reconfiguration

There are three more issues to address for reconfiguration:

  1. New servers may not initially store any log entries

If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.

  2. The cluster leader may not be part of the new configuration

In this case, the leader steps down (returns to follower state) once it has committed the Cnew log entry. **This means that there will be a period of time (while it is committing Cnew) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities.** The leader transition occurs when Cnew is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from Cnew). Before this point, it may be the case that only a server from Cold can be elected leader.

  3. Removed servers (those not in Cnew) can disrupt the cluster

These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.

To prevent this problem, **servers disregard RequestVote RPCs when they believe a current leader exists.** Specifically, **if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote.** This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

Log compaction

Raft’s log grows during normal operation to incorporate more client requests, but in a practical system, it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.

Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded.

Incremental approaches to compaction

These operate on a fraction of the data at once, so they spread the load of compaction more evenly over time. They first select a region of data that has accumulated many deleted and overwritten objects, then they rewrite the live objects from that region more compactly and free the region. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. While log cleaning would require modifications to Raft, state machines can implement LSM trees using the same interface as snapshotting.

Snapshots

Log Compression

Figure description: A server replaces the committed entries in its log (indexes 1 through 5) with a new snapshot, which stores just the current state (variables x and y in this example). The snapshot’s last included index and term serve to position the snapshot in the log preceding entry 6.

The figure above shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot:

  • last included index: It is the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied)
  • last included term: the term of the entry described above

These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that entry needs a previous log index and term. To enable cluster membership changes (Last Section), the snapshot also includes the latest configuration in the log as of last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.
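
An illustrative sketch of what this snapshot metadata might look like as a Go struct (field types are assumptions, not the paper’s definitions):

```go
package raft

// Snapshot bundles the serialized state machine state with the metadata
// described above. The struct layout and field types are illustrative.
type Snapshot struct {
	LastIncludedIndex int    // index of the last log entry the snapshot replaces
	LastIncludedTerm  int    // term of that entry
	Configuration     []int  // latest cluster configuration as of LastIncludedIndex
	Data              []byte // serialized state machine state (x and y in the figure)
}
```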

Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (as described in Section Cluster membership changes) would not. The way to bring such a follower up-to-date is for the leader to send it a snapshot over the network.

A summary of the InstallSnapshot RPC

The figure above is a summary of the InstallSnapshot RPC. Snapshots are split into chunks for transmission; this gives the follower a sign of life with each chunk, so it can reset its election timer.

The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the recipient’s log. In this case, the follower discards its entire log (it is all superseded by the snapshot and may possibly have uncommitted entries that conflict with the snapshot). If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then log entries covered by the snapshot are deleted but entries following the snapshot are still valid and must be retained.
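
A sketch of that decision on the follower, again representing the log as a slice of term numbers for brevity (names are illustrative):

```go
package raft

// retainAfterSnapshot sketches the follower's log handling after installing
// a snapshot: if the snapshot only covers a prefix of the existing log, keep
// the suffix; otherwise discard the whole log. logTerms[i] is the term of
// 1-based entry i+1.
func retainAfterSnapshot(logTerms []int, lastIncludedIndex, lastIncludedTerm int) []int {
	if lastIncludedIndex >= 1 && lastIncludedIndex <= len(logTerms) &&
		logTerms[lastIncludedIndex-1] == lastIncludedTerm {
		return logTerms[lastIncludedIndex:] // entries after the snapshot remain valid
	}
	return nil // the snapshot supersedes the entire log
}
```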

Issues that impact snapshotting performance

There are two more issues that impact snapshotting performance.

  1. Servers must decide when to snapshot

If a server snapshots too often, it wastes disk bandwidth and energy; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.

  2. Writing a snapshot can take a significant amount of time

The solution is to use copy-on-write (COW) techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system’s copy-on-write support (e.g., fork on Linux) can be used to create an in-memory snapshot of the entire state machine (the paper’s implementation uses this approach).

Client interaction

This section describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics.

Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly chosen server. If the client’s first choice is not the leader, that server will reject the client’s request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then try again with randomly-chosen servers.

Linearizable semantics

The goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time.

The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
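
An illustrative sketch of this bookkeeping inside the state machine, assuming commands carry a (clientID, serial) pair as described:

```go
package raft

// Session records, per client, the latest serial number applied and the
// response it produced. This sketch assumes every client command carries a
// (clientID, serial) pair, as the text describes.
type Session struct {
	LastSerial int
	Response   string
}

// applyClientCommand executes a command at most once: a retried command
// (serial already seen) gets the cached response without re-execution.
func applyClientCommand(sessions map[int]*Session, clientID, serial int,
	execute func() string) string {

	if s, ok := sessions[clientID]; ok && serial <= s.LastSerial {
		return s.Response
	}
	resp := execute()
	sessions[clientID] = &Session{LastSerial: serial, Response: resp}
	return resp
}
```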

Precautions for returning stale data

Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log.

  1. A leader must have the latest information on which entries are committed

The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term.

  2. A leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected).

Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease, but this would rely on timing for safety (it assumes bounded clock skew).