Previously in this series: Toward strict consistency in distributed databases.
Apparently, when Snapshot Isolation was invented, it was received with great enthusiasm, since it avoids all the anomalies familiar from the SQL standard. It was only as time went by that people discovered that SI had some completely new anomalies and therefore wasn't equivalent to serializeable isolation. (This I learned from the PostgreSQL paper below.)
A paper referenced by all below papers is: Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions by Atul Adya.1 Note that it is from 1999 - same year as the CAP theorem. So while the world's attention was turning to web scale, a full understanding of isolation levels for good old relational databases was still developing toward what we currently know.
Adya's work helped narrow the understanding of what kind of anomalies can happen in SI that violate Serializeability. Michael Cahill, Uwe Röhm and Alan Fekete used this to invent Serializeable Snapshot Isolation (2008). The algorithm targets a more specific dependency between concurrent transactions than a straightforward "lock all records that were read" read (aka S2PL). This significantly decreases the number of transactions that need to be aborted. Which is good both for scalability and developer happiness.
If you start from Snapshot Isolation, the remaining anomaly that happens is write skew. By way of example, and a bit simplified from Cahill: Imagine a clinic with 2 doctors. You are allowed to check out when the other doctor is on call. The following transaction is used to ensure that one of the two doctors is always on call:
SELECT oncall FROM doctors WHERE name = "Bob"; IF oncall = true THEN UPDATE doctors SET oncall = false WHERE name = "Alice"; COMMIT;
(And swap Alice/Bob for when Bob checks out.)
With Snapshot Isolation, if both Alice and Bob check out at the exact same time, both of their transactions will succeed! Since there is no write conflict, SI is satisfied. The clinic however, is now without a doctor!
The dependency that needs to be prevented is called a read-write dependency or also anti-dependency. More specifically, Cahill's algorithm builds from a proof that any non-serializeable dependency graph has two consecutive rw-antidependencies. This is called a dangerous structure:
T1 -rw-> T2 -rw-> T3
The algorithm is focused on detecting T2, which has both an incoming and outgoing rw-antidependency. Aborting T2 guarantees that the other transactions are serializeable! Note that T1 and T3 can be the same transaction, such as in the example above.
The paper presents an implementation for BerkeleyDB, with benchmarks.
While the SSI algorithm is a significant improvement, the BerkeleyDB implementation was a bit crude. For one thing, it used page level locking instead of record level. To keep the space needed by the algorithm constant, it doesn't account for each record that had a rw-dependency, rather there's just a flag marking the fact that some incoming and some outgoing rw-dependencies were detected. This still causes some false positives in aborted transactions, even if the rate is much less than the previous state of the art used to be.
Others have therefore improved on this, to provide more accurate abort decisions. This presents an interesting "accounting overhead vs false positives" trade off. I very much enjoyed reading Performance of Serializable Snapshot Isolation on Multicore Servers by Hyungsoo Jung & 4 collaborators between Seoul and Sydney, which benchmarked several SSI implementations on InnoDB and Postgres against each other.
Famously, PostgreSQL 9.1 introduced Serializeable isolation which is based on SSI. The paper by Dan Ports and Kevin Grittner describes several optimizations they have invented during implementation: They 1) reorder (delay) commits to try to avoid the need to abort anything in the first place and 2) try to select the transaction to abort such that a retry at least will not cause the same conflict again. They identify safe snapshots where a read-only transaction is guaranteed to not conflict with any current or future transaction. This is important for long running reads like
SELECT COUNT(a) FROM sometable. (Note that a backup could just use Snapshot Isolation.)
All in all, a very impressive piece of work! If you know more about PostgreSQL than I do, it would be interesting to know whether users are choosing Serializeable for their apps now?
SSI for replication
When I started asking around about SSI for distributed databases, Michael was kind enough to point me to Serializable Snapshot Isolation for Replicated Databases in High-Update Scenarios, also by Hyungsoo Jung et.al. It is a well written and easy to follow paper, but to me was rather unsurprising. It is basically a similar architecture like Galera, just upgrading SI to SSI. Under the hood the difference is that where Galera replicates write sets, this paper also replicates read sets (which can potentially be huge!).
Btw, I found the intro and notation in this paper easier to read than Cahill's original paper, so this paper actually helped me in understanding SSI better just in general.
The paper describes a multi-master replication cluster. Since MongoDB uses single-primary replication, implementing replicated SSI is essentially the same as for a single server. (Although: The Postgresql paper above describes an idea to allow read-only transactions on secondaries that are still serializeable with the transactions on the primary!)
SSI for sharding
But is SSI possible for sharded clusters? It seems there's at least nothing published on this. So I have speculated on my own a bit...
To begin, we should observe that the read-write antidependencies are about a specific record: Two transactions that read and write the same record. So the good news is that these dependencies can be locally observed on each shard!
Then what? When it is time to do two phase commit, all shards would have to report back their observations to the transaction coordinator. This creates some new traffic between nodes, but it seems this overhead is capped by the amount of records written. Or more correctly, the intersection of records written and read.
MongoDB cross shard transactions use two phase commit. A decision to abort a transaction due to an SSI dangerous structure would have to be done before or during the prepare phase, since after the prepare phase has succeeded the transaction can no longer be aborted in the commit phase. (This is how 2PC works.) A problem here is that at the start of the prepare phase the transaction does not yet know what it's commit timestamp will be. (Each shard prepares the transaction with whatever timestamp it is at, and the "official" cluster wide timestamp is only known once the commit phase happens.) But for SSI we need to know the start and end timestamps for the transactions, otherwise we can't even know which other transactions are concurrent with this one!
So for SSI to be possible between shards, we would need to know the timestamp earlier. This would require another roundtrip from the TC to all shards: Knowing the actual end - and start - timestamp of the transaction, now check for rw-antidependencies with other concurrent transactions. Maybe this would be called 3PC? I'm not familiar with the cross-shard 2PC code, which is new code, so can't really armchair engineer this post to a more detailed solution than this.
The PostgreSQL paper above also describes how they implemented 2PC support. (I assume that...) Since PostgreSQL is a single server database, they don't suffer from this issue, because the commit state is trivially known at the start of the prepare phase.
However, they did have to solve another issue: In SSI, we need to keep a transaction's locking information (the read sets and write sets) around after it has committed. For 2PC, since successfully prepared transactions are guaranteed to commit, this means that if a server crashed between prepare and commit, the resumed transaction will have to recover its read and write sets before committing. In other words the read and write sets must be persisted, which for MongoDB means they must be replicated. This is bad: Instead of replicating all writes we suddenly have to replicate all reads too!
In practice both the BerkeleyDB and PostgreSQL implementations employ simplifications even in the single node case to avoid accounting overhead, or in the PostgreSQL case, triggered by DDL. The PostgreSQL paper calls these promoting a lock to higher granularity and summarizing. The promotion approach seems compatible with replication: If a transaction read a lot of records from the same collection, one could simply record a collection-level lock instead. Crucially, if a transaction didn't read any records, or only the ones it also wrote, this is good to know and could be recorded too. Using this kind of lower granularity lock information would of course lead to more false positives, but note that this would only impact transactions that are concurrent with a failover on a participating shard.
With my colleague in MySQL times, Bernd Ocklin, we both had a saying: It's software, so anything is possible. But the saying also implies that not every idea is meaningful to actually implement. The above is starting to sound very non-trivial indeed. Adding new stuff that needs to be replicated we'd rather avoid if possible. So as long as Serialization is a topic that real users aren't really asking for a lot, it would be unlikely for anyone to venture into this adventure. On the other hand this could be an interesting academic project for someone, as it is an interesting unsolved problem.
Next in this series: Node failures in Serializeable Snapshot Isolation, in which I argue that replicating the read sets isn't necessary after all.
- 1. No I didn't actually read this 198 page thesis. It just seemed worth listing for completeness.