Human readable resources on Paxos

Credit: Costas78 and Wikimedia

It's time to pick up our series on consistency and replication again. See consistency tag for the previous paper reviews on the topic.


I had thought I could live a happy life as a database engineer, and eventually retire, without ever having to actually read any literature on Paxos.

A couple times I have downloaded the PDF on Paxos, started reading, and... each time it quickly ends with a: "why on earth am I reading this, I would never implement this when there are better alternatives". (See Raft, Galera, 2PC...)

Well I got this far, but last week I finally could no longer avoid it. Cassandra's Light Weight Transactions are based on Paxos. Maybe one day a year from now, nobody will ever use them again, but for now, it's what we have, and I needed to understand how they work.

The Literature

Famously, Leslie Lamport wrote the original Paxos paper as an allegory of a fictional ancient Greek parliament, where the members need to reach consensus on a decision. Choosing such an artful style to explain a computer science algorithm caused the paper to only be published in a journal 8 years later, in 1998. And even then nobody understood it!

So in 2001 Lamport published a follow-up paper title Paxos made simple. The joke is that when you start reading it, it's not simple at all! This has led to a whole industry of articles usually titled "Paxos Made [Even More Simple than the Previous Paper You Tried to Read]".

The problem with Paxos can be broken into 3 parts:

  1. The original papers focus on correctness proofs, more than explaining how the algorithm works and how to implement it. This has led to a whole industry of publications trying to explain Paxos in a simple way. That is, more simple than "Paxos made simple".
  2. The original Paxos paper isn't optimal. It requires 2 round trips to agree a single value. Every other algorithm I've ever heard of can do better than that. (Well, except 2PC, but even for that one you can return early to the clients, so it's really just 1 round trip.) This has led to a series of papers publishing better versions of Paxos, such as Multi Paxos. However, when you try to find the paper on Multi Paxos, there isn't one. Turns out the Paxos made simple paper is the canonical reference for Multi-Paxos. (Yes, so Paxos made simple actually added to the algorithm, making it more complicated...)
    Paxos also isn't optimal from a simplicity perspective: It has a cast of actors called Proposers, Acceptors and Learners. In a real world implementation these are just all performed by the same node.
  3. The original Paxos paper also wasn't complete. It didn't address a whole bunch of issues you'd commonly want to see in database replication, or a replicated state machine. Again, Paxos made simple is your first stop to Lamport commenting on, for example, how to change cluster configuration, and how to implement a system that can process more than one transaction. But that was only the start! Perhaps the most cited "Paxos made X" follow-up papers are:
    Paxos Made Live - An Engineering Perspective describing a real world implementation at Google.
    Paxos Made Moderately Complex which, despite the self deprecating title, seems to be the most useful and complete paper I've seen so far. I could imagine someone reading this and building a good replication system based on it.

To be fair, the fact that the original Paxos paper is woefully incomplete if you wanted to actually build a working distributed database is not a problem unique to Paxos. Two-Phase Commit is one of the most used algorithms in distributed consensus, and I personally have great experiences with it in MySQL NDB Cluster and MongoDB. Yet when I first read Transaction Management in the R* Distributed Database Management System I was shocked to realize, that it doesn't describe anything resembling the modern 21st century systems I had been working with. Basically the original 2PC literature is focused literally on what it says on the tin: committing transactions that are resilient to failures. Remaining operational in the presence such failures, or even to some extent recovering transaction state after the failure, is left as an exercise for the reader.

Easy to read blogs

Ok, so as I indicated in the beginning, I didn't actually read any of those papers. I was on a search for some human readable explanation of how a Paxos implementation might actually work. And I found some!

The Morning Paper of course has covered Paxos in several episodes (Part Time Parliament, Simple, Live). But fundamentally these are still commentaries on the respective papers. I decided to look further.

Tom Cogane has IMO a useful introduction: Understanding Paxos. It does a great job of listing the various extensions a real world system would have to add, such as Negative ACK. In the end it also goes on to provide an overview of the most common variations, like Multi-Paxos.

Speaking of which, Beyond The Lines blog has a great explanation of Multi-Paxos.

As an interesting curiosity, I found a slide deck Implementing Replicated Logs with Paxos, from the authors of Raft and dated one year before they published the first paper on Raft! I suspect I have found the evolutionary missing link between Paxos and Raft?

What did I learn?

Ok, so I can finally say I now know how Paxos works. Was it worth it? No.

The problem is that there are so many variations of Paxos, that saying that a system is "based on Paxos" means absolutely nothing. Such a system may be leader based or leaderless, require 1 or 2 round trips, produce a complete transaction log or not... Basically, with these rules, any system can be described as Paxos based. (For example, given the above slide deck, we could apparently say that Raft is merely "Paxos Made Log Based".)

So, after spending a day reading up on the above blogs, I quickly realized I still know nothing about how Cassandra's Light Weight Transactions work. Bummer.

I ended up doing what I should have done in the first place: ask Benedict and Jake.

Vilho Raatikka (not verified)

Sat, 2021-12-11 16:18

Thanks for sharing the pain. You're not alone, I can tell (if it helps).

Are we going to see their complete answers in the follow-up post?

Ah right! Haha, that's a reasonable question, but no, I don't plan on becoming so expert on LWTs that I should authoritatively blog about it.

 

What I did learn as part of this deep dive:

* LWTs are leaderless / multi-master

* They currently require 4 round trips!! but CEP-14 will reduce to 2 for writes and 1 for read.

* (I already knew this) They are used to provide linearizability where Cassandra otherwise is eventually consistent dynamo style

* There isn't a single commit log but a merge of a majority of commit logs is a complete transaction log

* Also note that all of this is per partition.

I tried to understand the reason for 4 round trips but couldn't get it. It sounds exhaustive but I assume there's something they get in return; security, for example. On the other hand, cost of tackling byzantine problems depends on the number of nodes, I think, so I'm a bit confused. Perhaps I need to read it myself to find out.
Thanks for the pointers!

The 4 round trips fixes some corner case that had correctness/consistency issue. I haven't bothered to look into the details, but the point of CEP-14 is exactly to get back to an implementation that has 2 round trips (for writes) and still works correctly.

Add new comment

The content of this field is kept private and will not be shown publicly. Cookie & Privacy Policy
  • No HTML tags allowed.
  • External and mailto links in content links have an icon.
  • Lines and paragraphs break automatically.
  • Web page addresses and email addresses turn into links automatically.
  • Use [fn]...[/fn] (or <fn>...</fn>) to insert automatically numbered footnotes.
  • Each email address will be obfuscated in a human readable fashion or, if JavaScript is enabled, replaced with a spam resistent clickable link. Email addresses will get the default web form unless specified. If replacement text (a persons name) is required a webform is also required. Separate each part with the "|" pipe symbol. Replace spaces in names with "_".
About the bookAbout this siteAcademicAccordAmazonBeginnersBooksBuildBotBusiness modelsbzrCassandraCloudcloud computingclsCommunitycommunityleadershipsummitConsistencycoodiaryCopyrightCreative CommonscssDatabasesdataminingDatastaxDevOpsDistributed ConsensusDrizzleDrupalEconomyelectronEthicsEurovisionFacebookFrosconFunnyGaleraGISgithubGnomeGovernanceHandlerSocketHigh AvailabilityimpressionistimpressjsInkscapeInternetJavaScriptjsonKDEKubuntuLicensingLinuxMaidanMaker cultureMariaDBmarkdownMEAN stackMepSQLMicrosoftMobileMongoDBMontyProgramMusicMySQLMySQL ClusterNerdsNodeNoSQLodbaOpen ContentOpen SourceOpenSQLCampOracleOSConPAMPPatentsPerconaperformancePersonalPhilosophyPHPPiratesPlanetDrupalPoliticsPostgreSQLPresalespresentationsPress releasesProgrammingRed HatReplicationSeveralninesSillySkySQLSolonStartupsSunSybaseSymbiansysbenchtalksTechnicalTechnologyThe making ofTransactionsTungstenTwitterUbuntuvolcanoWeb2.0WikipediaWork from HomexmlYouTube