BusyJay

Extending Raft algorithm: Learner, Part 1

· Jay

Raft is one of the well known and implemented consensus algorithms. It is very easy to understand and implement due to its explicit leader role, election and builtin configuration change. The original algorithm is sufficient in most cases, but we still need to extend the algorithm to meet complicated requirements. For example, when adding a new member to a raft group with 3 members, the new node may not have enough logs to start taking part in helping leader commit logs. If there is a follower crashing, the group may be unavailable for a long time. Another case is if we want non-leader replicas handle read workload and leader handle write workload. To scale read handling, we may add more replicas to the group. However this may hurt write performance as more node means larger quorum and take longer time to reach consensus. And in some HTAP databases, read-write replicas may store data in row format, and read-only replicas store data in column format.

Clearly we want a new type of member in Raft that won’t take part in log decisions and elections (so it won’t affect write or even choose different data format), but still have the whole data in the end (so can read). This is the subject of this blog: Learner. The concept learner comes from Paxos. As its name tells, it keeps learning from the consensus group and handles read queries. In Raft, we make leader replicate logs to regular members and learners at the same time. But committing logs and election only count regular members. The Raft paper call these learner members as non-voting members. We call them learners for simplicity and call those regular members as voters. Though Raft paper mentioned learners (non-voting members), the primary purpose is to solve the first problem described above: unavailability caused by slow log replication. So allow learners transform to voters is enough. But other problems require learners can be transformed to voters and vice versa.

log replication Graph 1: Example of log replication, blue circles are voters, green one is learner. Straight line means log replication result will be considered in committing logs.

The transformation between voters and learners is a change to the property of members. Raft handles such changes using joint consensus. More specifically, when changing from one membership configuration to another, both configurations will be serialized as a log entry and be appended to leader’s logs. Every members that replicate the logs will enter joint stage immediately, in which both membership configurations are taking effect. Logs need to be replicated to the quorum of both configurations before being committed. When the joint log is committed, a leave joint log will be appended and replicated. Similarly, every member leaves joint stage once replicate the log and only the new configuration will take effect. You may notice that the configuration logs take effect immediately once replicated, they are different from the regular logs, which take effect only when they are committed. In the industry, some implementations, Etcd for example, adapts the algorithm to make configuration logs also take effect after they are committed. This blog will focus on the original Raft and learners, updated Raft and learners will be left to next blog.

joint consensus Graph 2: Joint consensus (source: In Search of an Understandable Consensus Algorithm)

Apparently, adding or removing learners have no change to the Raft algorithm except adding or removing log replication targets, so it should be OK to apply the changes without using joint consensus. The transformation between voters and learners will change quorum, so we need to use joint consensus to apply the changes. According to joint consensus, both new and old membership configurations take effect, and voter needs to participate in election and committing logs, so its priority is higher than learner. So as long as a member is voter in either old or new configuration, it behaves like voter. If it’s learner in both configurations, it behaves like learner.

Consider 3 AZs(available zone), ABC, and 4 nodes, abcd. a is in AZ A, b is in AZ B and both c and d are in AZ C. abc are voters, d is learner. We use joint consensus to transform c to learner, d to voter. Let a be the leader, and it crash immediately right after only replicating the joint log to b. According to the election timeout rules, both voter b and c will start campaign. c has fewer logs than b, so it won’t get b’s approval, so election has to fail. b will get c’s vote, but it enters joint mode already, so it also ask d for vote. However d considers itself as learner, according to previous rules, learner should not participate in election, so d won’t vote for b. b doesn’t get enough votes from the new configuration, its election also fails. So the group becomes unavailable. This breaks the high availability of Raft. The key here is d should vote for b. If we allow learners become voters, learners can’t stay away from election anymore. So we need to update the definitions: learners are members that only replicates logs and vote for others, they shouldn’t be considered in committing logs and should not start campaign actively.

vote a Graph 3: if learner d doesn’t vote for voter b, voter b won’t get enough votes from both configurations.

After allowing learner to vote, b becomes leader and continue the joint consensus process. After joint log is committed, b append and replicates the exit joint log. Supposing a is restarted already, and b crashes immediately right after only replicating the exit joint log to c. In this situation, both a and d are still in joint stage, c leaves the joint stage and behaves as a learner. Both a and d won’t get vote from c because they have fewer logs. And c, as a learner, won’t start campaign according to rules described above. So this raft group becomes unavailable again. One solution is to even allow learners to start campaign. It will lead to a very strange group though: the leader of a raft group is a learner. This in fact is partially allowed in Raft algorithm that a removed leader can still lead the raft group for some time. In Paxos, the roles proposer, acceptor and learner are allowed to mixed arbitrarily. But practically, allow a learner become leader can lead to chaos. For example, leader may assume it’s always part of quorum in some implementations; from performance perspective, if a leader is not in a quorum, it’s the same as increasing the member number in quorum and can hurt write performance; from deploy perspective, learners are meant to be read-only that they might be deployed to some nodes that are not OK to become leader. So it’s better to disallow learners to become leader.

vote b Graph 4: when learner c has more logs, voter a and d can’t win the election

Transform voters to learners, has the same effect as removing members on the original algorithm. When a member is removed from the configuration, leader will not send any logs to the member according to Raft, aka the removed member will never learn it’s removed from the group in Raft. So when the leader leaves joint stage and demote a voter to learner, it should not send the leave joint log to the learner. However, this defeats the purpose of learners: it should have the whole logs in the end. The key here is quorum rules is conflict with the replication, if learners are only allowed to learn committed logs, then learners must have fewer logs than quorum, so it won’t affect the election. The rule can be relaxed further as if a log is a leave joint log, it should be committed before being replicated to learners.

For example, during joint consensus, if a member is voter in the old configuration and learners in the new configuration, then it behaves as voter and possibly win an election to become leader. According to the above rule, leave joint log should not be appended to its own logs, so it should stay in joint stage. To leave joint stage, it should transfer leader to other voters and let the new leader to propose leaving joint stage.

In practice, one common mistake is consider learner as the same role as leader, follower and candidate. As discussed above, learner is the property of a member, it’s in the same category as voter. Leader, follower and candidate are roles in Raft algorithm. In most cases, learners are followers. During joint consensus, a member can be voter and learner at the same time, it behaves as voter, so it can be either leader, follower and candidate. Learners are orthogonal to these roles.

To summarize, extending Raft algorithm with learners requires:

  1. Learners only replicate logs and vote for others, they shouldn’t be considered in committing logs and should not start campaign actively;
  2. During joint consensus only a member is learner in both configurations should behave as learner;
  3. Uncommitted leave joint logs should not be replicated to learners.

Are these 3 rules enough? Rule 2 make learners behaves as if deleted members in the original algorithm, so learners won’t affect correctness. And the high availability of Raft means whenever there is a live quorum, these will be consensus. If learner can break the guarantee, then learner has to be part of the quorum. Rule 1 guarantee learners can vote, rule 2 and rule 3 guarantee if a learner is part of the quorum, there must be at least one member in the quorum that has more logs than the learner, so there must be other members that will be elected as leader eventually. So availability is also the same as the original algorithm. I just reason about these two properties roughly here, and will prove it formally once I find a bigger paper :).