How to implement a reliable and efficient leader lease
I have been designing a timestamp service recently. During a discussion, I learned that the previous system may have had a clock drift problem due to an incorrect lease implementation. This is actually a very typical problem, so I'm writing this article to discuss how to implement a reliable leader lease.
What is a leader lease?
In a distributed system, consensus algorithms like Raft and Paxos are used to provide consistency. These algorithms persist proposals in a quorum to ensure a committed proposal will not be lost. In classical Paxos, any proposer can propose a proposal, so proposals are very likely to conflict with each other. In Multi-Paxos and its related variants, a leader is elected to propose proposals; in Raft, the leader role is part of the algorithm. The leader can handle both reads and writes by persisting logs in a quorum, which can lead to high latency. In practice, a delay is added to elections so that when a node becomes leader, all nodes that have voted for it promise not to grant their vote to others for a period of time. That promised period is the lease. If everything goes well, there is a single leader within a valid lease. As the leader handles all proposals, it has the latest data. During a valid lease, it can handle all requests that cause no side effects, such as reads, without committing to the quorum.
But how can we tell whether a lease is valid? The Raft paper suggests broadcasting a round of heartbeats when the leader is about to handle a read request. If it can still receive responses from a quorum, then the quorum hasn't voted for anyone else yet, so the lease is valid. Heartbeats are cheaper than going through the whole consensus process, but they still need a network round trip to finish. If the system is deployed across data centers, such overhead can't be ignored. Latency-sensitive systems may choose to check the lease locally: because followers have promised not to vote for another node during the lease, checking whether the lease is valid is simply checking whether the current time is still within the lease.
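To make the local check concrete, here is a minimal sketch in C, assuming the leader records a monotonic timestamp when it received the quorum of votes. The names lease_start_ns, LEASE_NS and monotonic_now_ns are illustrative only, and the choice of clock is discussed later in the article.

```c
/* A minimal sketch of a local lease check, assuming the leader records a
 * monotonic timestamp when it received the quorum of votes. The names
 * lease_start_ns, LEASE_NS and monotonic_now_ns are illustrative only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define LEASE_NS (9ULL * 1000 * 1000 * 1000) /* e.g. a 9-second lease */

static uint64_t monotonic_now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); /* clock choice is discussed below */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Returns true if the leader may still serve reads without contacting quorum. */
static bool lease_is_valid(uint64_t lease_start_ns) {
    return monotonic_now_ns() < lease_start_ns + LEASE_NS;
}

int main(void) {
    /* Pretend the quorum confirmation has just been received. */
    uint64_t lease_start_ns = monotonic_now_ns();
    printf("lease valid: %s\n", lease_is_valid(lease_start_ns) ? "yes" : "no");
    return 0;
}
```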
In a distributed system, getting the current time is not a trivial task. Every node may have clock drift for various reasons. Using a wrong local time to check the lease may produce false positives and lead to two leaders within one lease. Although the consensus algorithm can still ensure data consistency in such a case, read requests may read stale data. For example, if a client writes a key-value pair to the new leader that updates (k0, v0) to (k0, v1), and then reads from the old leader, it may still read v0. For a timestamp service, the old leader may have triggered a write request to ensure it can allocate timestamps in [t0, t0 + d), while the new leader may allocate in [t0 + d, t0 + 2d). As both leaders think they are in valid leases, a client may get t0 + d + 1 first and then get t0 + 1, hence clock drift.
Time synchronization
To make local time correct, distributed systems may require deploying a time synchronization service like NTP (Network Time Protocol). The NTP service queries some time authority and adjusts the local clock to match the remote one; the adjustment may move the time forward or backward. A time synchronization service greatly reduces the probability of clock drift, but does not prevent it completely. NTP can usually keep the time error within milliseconds inside the same data center and within hundreds of milliseconds across data centers. In extreme cases, like a network partition, a node may still completely lose synchronization. In any case, distributed systems that rely on a time synchronization service will set a global time offset tolerance. If the detected error on a node is about to reach that tolerance, the process may kill itself to avoid breaking correctness.
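To make that "kill itself" guard concrete, here is a minimal sketch, assuming the process can periodically obtain an estimated clock offset (for example, queried from the local NTP daemon); MAX_OFFSET_MS and check_clock_offset are hypothetical names, not from any real system.

```c
/* A minimal sketch of the offset-tolerance guard described above. It assumes
 * the process can periodically obtain an estimated clock offset, e.g. queried
 * from the local NTP daemon; how that estimate is obtained is out of scope.
 * MAX_OFFSET_MS and check_clock_offset are hypothetical names. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_OFFSET_MS 500.0 /* global time offset tolerance, for example */

void check_clock_offset(double estimated_offset_ms) {
    if (estimated_offset_ms >= MAX_OFFSET_MS ||
        estimated_offset_ms <= -MAX_OFFSET_MS) {
        /* Better to crash than to keep serving with a possibly expired lease. */
        fprintf(stderr, "clock offset %.1f ms reaches the tolerance, aborting\n",
                estimated_offset_ms);
        abort();
    }
}
```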
Some other distributed databases use HLC (Hybrid Logical Clock) to synchronize time. Of course, HLC is introduced mainly to give transactions a reliable timestamp source, but it happens to be a good fit for lease implementations. HLC uses both a physical clock and a logical clock to keep time increasing monotonically. The physical clock in HLC still needs NTP to stay in sync with real time. Different from NTP, HLC also synchronizes time during communications between nodes in a cluster, so that even if some nodes are out of sync with physical time, HLC can still keep the clock close to the known physical time on a best-effort basis. Generally HLC has smaller errors than NTP, but they are still not zero; the upper bound of the error is the same as NTP's. And just like systems that rely on NTP alone, systems that rely on HLC also have a time offset tolerance setting and may kill themselves when it is about to be reached.
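For reference, here is a minimal sketch of the HLC update rules from the paper by Kulkarni et al. The wall clock below is just CLOCK_REALTIME in milliseconds, standing in for an NTP-synced physical clock; everything is for illustration only.

```c
/* A minimal sketch of the HLC update rules (Kulkarni et al.). The wall clock
 * here is CLOCK_REALTIME in milliseconds as a stand-in for an NTP-synced
 * physical clock; illustration only. */
#include <stdint.h>
#include <time.h>

typedef struct {
    uint64_t l; /* largest physical time seen so far */
    uint64_t c; /* counter to break ties when l does not advance */
} hlc_t;

static uint64_t physical_now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000ULL + (uint64_t)ts.tv_nsec / 1000000ULL;
}

static uint64_t max3(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t m = a > b ? a : b;
    return m > c ? m : c;
}

/* Called for a local event or before sending a message. */
void hlc_tick(hlc_t *hlc) {
    uint64_t pt = physical_now_ms();
    if (pt > hlc->l) {
        hlc->l = pt;
        hlc->c = 0;
    } else {
        hlc->c++;
    }
}

/* Called when receiving a message that carries the sender's (l, c). */
void hlc_recv(hlc_t *hlc, uint64_t msg_l, uint64_t msg_c) {
    uint64_t pt = physical_now_ms();
    uint64_t new_l = max3(hlc->l, msg_l, pt);
    if (new_l == hlc->l && new_l == msg_l)
        hlc->c = (hlc->c > msg_c ? hlc->c : msg_c) + 1;
    else if (new_l == hlc->l)
        hlc->c++;
    else if (new_l == msg_l)
        hlc->c = msg_c + 1;
    else
        hlc->c = 0;
    hlc->l = new_l;
}
```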
Physical clock
Since software-based clocks are not so reliable, some distributed systems introduce more accurate hardware like atomic clocks and GPS. The well-known distributed database Spanner utilizes such clocks and develops the so-called TrueTime. You can tell from the name that Google is quite confident that it's the real physical time. Different from a normal time API, TrueTime returns a range when queried for the current time, which means the current time must be within that range. The size of the range depends on the error. Supposing the leader proposes a lease of length d at time [t0, t1] and receives quorum confirmation at [t2, t3], then its lease is [t3, t0 + d]: it cannot start serving before the latest possible confirmation time, and it must stop before the earliest possible propose time plus d.
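The interval arithmetic above can be written down directly. Below is a minimal sketch, assuming a TrueTime-like API that returns an [earliest, latest] pair; tt_interval_t and tt_now are illustrative names, not Spanner's actual interface, and the stub simply pads CLOCK_REALTIME with an assumed 5 ms error bound.

```c
/* A minimal sketch of the lease computation above, assuming a TrueTime-like
 * API that returns an [earliest, latest] interval. The names and the assumed
 * 5 ms uncertainty are illustrative only. Times are in nanoseconds. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    uint64_t earliest;
    uint64_t latest;
} tt_interval_t;

#define MAX_ERR_NS (5ULL * 1000 * 1000) /* assumed uncertainty bound */

static tt_interval_t tt_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    uint64_t now = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    tt_interval_t iv = { now - MAX_ERR_NS, now + MAX_ERR_NS };
    return iv;
}

typedef struct {
    uint64_t start;  /* cannot serve before this point (t3) */
    uint64_t expire; /* must stop serving at this point (t0 + d) */
} lease_t;

/* propose: interval [t0, t1] when the lease of length d was proposed.
 * confirm: interval [t2, t3] when quorum confirmation was received. */
lease_t lease_from_intervals(tt_interval_t propose, tt_interval_t confirm,
                             uint64_t d) {
    lease_t lease = { confirm.latest, propose.earliest + d };
    return lease;
}

bool tt_lease_is_valid(const lease_t *lease) {
    tt_interval_t now = tt_now();
    /* Conservative on both ends: even the latest possible current time must
     * be before expiry, and even the earliest must be past the start. */
    return now.earliest >= lease->start && now.latest < lease->expire;
}
```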
Unlike the time synchronization approaches mentioned above, TrueTime doesn't depend on error assumptions; it nearly guarantees the accuracy of time and makes the lease check very reliable. The only problem is that physical clocks are expensive and not generally available. But in recent years, some cloud providers like AWS have also deployed such hardware in their data centers. For example, you can use clock-bound to implement TrueTime-like intervals on AWS.
Relative time
We can see that getting an accurate current time is not easy. But implementing a lease doesn't require physical time: as long as every node pauses voting for almost the same length of time, we can still get a reliable lease. So instead of depending on the physical wall clock, we can depend on a steady clock. Take Linux as an example, there are four different clocks: CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW and CLOCK_BOOTTIME. CLOCK_REALTIME is the wall clock, and the others are relative clocks that start from some unspecified time point. CLOCK_MONOTONIC is the most commonly used clock; it's expected to increase monotonically, but it can still be affected by NTP time adjustment, and it doesn't include time spent suspended, so it may not be as steady as it seems. CLOCK_MONOTONIC_RAW uses the raw hardware clock source and is not subject to NTP adjustments. CLOCK_BOOTTIME is the same as CLOCK_MONOTONIC but also includes the time the system is suspended. There are distributed systems utilizing each of these three clocks. I prefer to use CLOCK_MONOTONIC_RAW as it's steadier. Note that TiKV used CLOCK_MONOTONIC in very early versions but found clock drift when deployed in VMs; hence it switched to CLOCK_MONOTONIC_RAW in the end.
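For reference, here is a small sketch that reads the relative clocks discussed above with clock_gettime(2); CLOCK_MONOTONIC_RAW and CLOCK_BOOTTIME are Linux-specific.

```c
/* A small sketch reading the relative clocks discussed above with
 * clock_gettime(2). CLOCK_MONOTONIC_RAW and CLOCK_BOOTTIME are Linux-specific. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

static double read_clock_s(clockid_t id) {
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

int main(void) {
    /* All three start from an unspecified point, so only differences matter. */
    printf("CLOCK_MONOTONIC:     %.6f s\n", read_clock_s(CLOCK_MONOTONIC));
    printf("CLOCK_MONOTONIC_RAW: %.6f s\n", read_clock_s(CLOCK_MONOTONIC_RAW));
    printf("CLOCK_BOOTTIME:      %.6f s\n", read_clock_s(CLOCK_BOOTTIME));
    return 0;
}
```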
Does a lease become reliable when using CLOCK_MONOTONIC_RAW? Not always. Whether CLOCK_MONOTONIC_RAW is steady depends on the TSC of the CPU. On the x86 architecture, the TSC is driven by the CPU's frequency, which means it may not be steady due to frequency changes caused by things like power saving. Luckily, recent x86 CPUs support constant TSC, which means the TSC ticks at the processor's nominal frequency regardless of the actual core frequency. In that case the clock is steady enough for a short lease. You can check whether /proc/cpuinfo has the flag constant_tsc to tell whether the CPU supports this feature.
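Here is a small sketch of that check, scanning /proc/cpuinfo for the constant_tsc flag; it's an illustration only, equivalent to grepping the file by hand.

```c
/* A small sketch of the check mentioned above: scan /proc/cpuinfo for the
 * constant_tsc flag. Linux-specific, illustration only. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool cpu_has_constant_tsc(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return false;
    char line[8192]; /* the flags line can be long on modern CPUs */
    bool found = false;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "flags", 5) == 0 &&
            strstr(line, "constant_tsc") != NULL) {
            found = true;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void) {
    printf("constant_tsc supported: %s\n",
           cpu_has_constant_tsc() ? "yes" : "no");
    return 0;
}
```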
Conclusion
A lease is a common and important mechanism in distributed systems, and an efficient and reliable lease is tightly coupled with accurate time. This article introduces several lease solutions used in the industry, including time synchronization, physical clocks and relative clocks. No solution is perfect; be wise and choose the best fit for your system.
As for the timestamp service I'm designing, I may choose to use CLOCK_MONOTONIC_RAW as the clock source. It's steady and clean, good enough for my case.