Mark Callaghan pointed me to a paper for my comments: Strong and Efficient Consistency with Consistency-Aware Durability by Ganesan, Alagappan and Arpaci-Dusseau ^2. It won Best Paper award at the Usenix Fast '20 conference. The paper presents a new consistency level for distributed databases where reads are causally consistent with other reads but not (necessarily) with writes. My comments are mostly on section 2 of the paper, which describes current state of the art and a motivation for their work. In Table 1 the authors correctly point out that making database transactions durable by way of replicating them is usually faster than syncing to disk. The last line of the table also makes the point that an even faster option is to neither replicate nor sync to disk synchronously (for example, writeConcern=1 in MongoDB). This is of course correct and obvious, but uninteresting for this blog post and I will ignore this last configuration for the rest of this post. The choice between systems that replicate AND fsync to disk synchronously vs one that merely replicates synchronously is a bit of a religious debate. Many database users and designers feel that data can only be considered durable once it has been written to disk. A more modern school of practitioners claim that replicating to a majority of nodes in the cluster is sufficient. The tagline for this approach is the "writing to remote memory is faster than local disk". More precisely, the latter approach guarantees durability against failures where a majority of nodes are still healthy and can talk to each other. The former approach can be robust against more catastrophic failures than that, but the performance cost is significant. The paper then makes this claim about such majority-committing systems:
Given the cost of synchronous durability, many systems prefer asynchronous durability in which writes are replicated and persisted lazily. In fact, such asynchronous configurations are the default [32, 44] in widely used systems (e.g., Redis, MongoDB). However, by adopting asynchronous durability, as we discuss next, these systems settle for weaker consistency. Most systems use two kinds of asynchronous durability configurations. In the first kind, the system synchronously replicates, but persists data lazily (e.g., ZooKeeper with forceSync disabled). In the second, the system performs both replication and persistence asynchronously (e.g., default Redis, which buffers updates only on the leader’s memory). With asynchronous persistence, the system can lose data, leading to poor consistency. Surprisingly, such cases can occur although data is replicated in memory of many nodes and when just one node crashes. Consider ZooKeeper with asynchronous persistence as shown in Figure 1(i). At first, a majority of nodes (S1, S2, and S3) have committed an item b, buffering it in memory; two nodes (S4 and S5) are operating slowly and so have not seen b. When a node in the majority (S3) crashes and recovers, it loses b. S3 then forms a majority with nodes that have not seen b yet and gets elected the leader†. The system has thus silently lost the committed item b and so a client that previously read a state containing items a and b may now notice an older state containing only a, exposing non-monotonic reads. The intact copies on S1 and S2 are also replaced by the new leader. Similar cases arise with fully asynchronous systems too as shown in Figure 1(ii).
Wait what? This goes against over a decade of experience actually using real world distributed databases that commit via majority replication and defer disk writes to later. What's going on here? (Note again that the last sentence is obvious and uninteresting: asynchronously replicated systems can lose writes by design.) The above example is based on ZooKeeper and if I'm correctly informed, ZooKeeper replication is not based on Raft, rather a protocol called ZAB. That said, as readers of this blog may be most familiar with Raft, I will use Raft + one modification as a well defined algorithm to explain how the above is maybe correct in some abstract academic sense but not in the real world. If you haven't read the Raft thesis, you really should do it anyway. The literal Raft algorithm describes a system where all nodes persist operations to disk, and then report back to a leader. When a majority of nodes have reported back to the leader, the leader returns information to the client that the commit was successful. During failovers an election protocol requires a majority of nodes to agree on a new leader, which guarantees that at least one healthy node has all the majority-committed writes and therefore no writes are ever lost. In the above quoted section Ganesan et.al. assume a system that is like Raft, but the nodes don't immediately write to disk. My claim is that such a system is still robust against any failure of a minority of nodes in the cluster. Ganesan et.al. claim that even just a single node restarting can cause properly majority-committed writes to be lost. Who is right and who is wrong? On a literal reading, the paper is actually correct! A modified Raft as I described above would exhibit this type of data loss. The only crux is that real world databases I'm familiar with do not lose data like that. (Maybe ZooKeeper does, in which case hopefully they can read this post and fix their issue.) Real world databases that I am aware of would also implement a so called pre-vote algorithm before the actual majority election is allowed to proceed. The Raft thesis already mentions that this is a good idea, but doesn't spell out the steps how such an algorithm should be implemented. I have written a post on how to do that in 2015. A distributed database that implements a pre-vote algorithm is not vulnerable to the data loss described above. What would happen when S3 restarts and calls for an election is that S4 and S5 will respond with: "No thank you, we're fine, S1 is still the leader and you should follow them." Crisis averted! (For example, MongoDB implements a pre-vote step in its failover algorithm and would behave like this.) Updates: In correspondance with Ganesan I've learned that ZAB is similar to Raft as far as the above example is concerned. It is designed to use sync-to-disk, but Zookeeper allows users to turn that off. Ganesan also pointed out that a 2012 upgrade to Viewstamped Replication does not require nodes to fsync replicated transactions. The Recovery phase of VR is indeed similar to above suggestion. See comments for more discussion on this.
But what about...
A reasonable reader might now ask a follow-up question: What if S4 and S5 are also partitioned from S1, so that they no longer have a leader? Would they now accept S3 as a new leader and cause the data loss as described in the paper? Yes! But this still works as designed, because we now had 3 out of 5 nodes failing. 1 node crashed and 2 other were partitioned, so this means a majority of nodes failed at the same time. The guarantee to be robust against failures of a minority of nodes is not violated. But wait, there's more! What if the leader, S1 fails, loses transaction b, restarts and then calls for an election and becomes leader? Since the leader has failed, the pre-vote algorithm will not prevent S1 from calling an election, getting votes from S4 and S5, and becoming the new leader of a cluster that has lost transaction b. Even this scenario is unlikely to ever happen in the real world, because S1 would have to fail and restart faster than the other nodes will hit their election timeout, which is a matter of seconds. But it nevertheless CAN happen, with however small probability. It's a shame that Ganesan et.al. did not think of presenting this example, because that would have been a ground breaking contribution also in the real world, not just in an unrealistic academic scenario. So additional precaution is needed to prevent data loss when the leader fails in a commit-by-replication cluster. What can we do? The answer is that a crashed node must refrain from participating in the next election. It can neither stand for election nor vote in it. It must wait until there is a leader elected from the healthy nodes, one of whom is guaranteed to have all the committed transactions and then it must rejoin the cluster first as a follower. Only after this can it participate in elections again. This is fairly straightforward to implement, nodes just need to persist to the end of their log a note that they did shutdown cleanly. However, there's an interesting corner case for catastrophic failures: If all (or a majority of) nodes crash, the cluster cannot start again because none of the nodes will call for election. Such a cluster needs to add a command to allow the administrator to manually restart the cluster after such a catastrophic crash. By manually calling this command, the administrator is also acknowledging that the catastrophic failure could have lost data. There's yet one more variation on this failure: If S1 and S3 fail together, it's also possible that S3 can call for election and win it because the other nodes no longer have a leader. The above procedure therefore applies both to S1 and S3. That is, both to leader and follower. (This paragraph added 2020-03-30.)
The rest of the paper
Having spent all of my time and energy just on Figure 1, I kind of just skimmed the rest. So I can't vouch for the new consistency level the paper presents, but assuming it is correct... Cross-client monotonic reads is a consistency level between causal consistency (which provides a linear experience for each client separately) and full (read-write) linearizability. Not being able to read your own writes may at first sound like a weird consistency level to strive for, but it can be useful for at least one obvious class of use cases: applications where some threads only write to and other threads only read from the database. The paper mentions an increasing counter (of likes, say) as one example. Also various monitoring and other IoT use cases would fit this definition. While this post was focused on synchronous replication with asynchronous disk writes, it is worth emphasizing that the new consistency level is still valid and a useful contribution to applications that want to use asynchronous replication with asynchronous disk writes. Many IoT and monitoring applications can tolerate a small (and sometimes large) amount of data loss, and providing a relatively strong consistency for reads is a welcome invention.
Since Raft specifies that nodes must fsync all commits to disk, the alternative where you don't has been less studied. Yet in practice this is a common configuration for real world databases! Ganesan et.al. present a claim that a distributed state machine doing only majority commits without fsync could lose committed transactions due to a mere single node failure. My response is that their example is not something that could actually happen in well designed real world systems. In particular, implementing a pre-vote step will prevent the data loss. However, further analysis shows that data loss is possible if the leader node is one of the failing nodes. To prevent this from happening, distributed databases that only do majority commits need to prevent failed nodes from voting in elections until they have successfully rejoined the cluster. I believe this risk for data loss has not been identified or understood previously and hopefully engineers who implement distributed databases will find this blog post helpful in preventing such data loss. To be clear: fsyncing commits to disk is not necessary to prevent data loss in case of minority of nodes failing.