Previously in this series: Reading about Serializeable Snapshot Isolation.
Last week I took a deep dive into articles on Serializeable Snapshot Isolation. It ended on a sad note, as I learned that to extended SSI to a sharded database using 2PC for distributed transactions1 , there is a need to persist - which means replicate - all read sets in addition to all writes.
This conclusion has been bothering me, so before diving into other papers on distributed serializeable transactions, I wanted to understand better what exactly happens in SSI when a node (shard) fails. This blog doesn't introduce any new papers, just more details. And more speculation.
Single node failure
To get a grip on things, let's first ask ourselves what would happen to a bunch of serializeable transactions - whether they be SSI or S2PL - on a single node database that fails and restarts?
The common implementation of course is that all uncommitted transactions are aborted or lost when the node fails. This leaves us with a transaction history where all transactions either committed before the failure or started after it. Before and after the failure concurrent transactions are enforced to be serializeable by way of the algorithm used (SSI, S2PL...) and the failure itself is inherently serializeable because no transaction could be concurrent with it.
Failure in 2PC sharded transactions
So how is 2PC different? Let's start with the obvious: to complete a write transaction, all participating shards must not fail until the transaction has successfully committed on all of them. (In 2PC this actually means successfully prepared...) This needs no explanation, it is how databases normally work: They persist writes at commit time.
This leaves us with the question of reads. What if a shard failed and a transaction T had only done reads from that particular shard? And what if a serializeable transaction T_ro was purely read-only?
The answer is that also read-only shards need to continue to work until commit. (And by commit I mean prepare, again.) The simple explanation is just that (in SSI) we need to have both the read sets and write sets at the commit of a transaction. Essentially serializeable transactions must also commit their reads! (This is in fact how serializeable works: The reads are only confirmed to have been serializeable once the commit succeeded. And btw, even snapshot isolation in MongoDB has a similar rule, because requiring the commit to succeed allows a speculative read optimization.)
So both write and read transactions must commit successfully to be serializeable. Why then does the PostgreSQL paper say that they had to persist also the read sets to ensure correct serializeable behavior in 2PC? The paper doesn't go into detail, the answer seems to be that that was how PostgreSQL 2PC already worked and when implementing SSI they also added the read set locks to that function.
When I tried to think about this, I realized that maybe the PostgreSQL implementation is that way simply because "the" 2PC paper (from 1986!) requires that? (See section 2.1.) If this is the case, then we can just simplify the situation by extending the case of a single node database to a distributed 2PC failure: If we just abort all transactions concurrent with a failure on any participating shard, then there wouldn't be any transactions that could possibly conflict with these locks anyway. In case of failure, all unprepared transactions would abort, and all prepared transactions will be committed (except of course if the TC had decided to abort the commit) and no new transactions could start before the prepared ones are committed. This is the same as saying that the write sets and read sets that we didn't persist were replaced with the less granular lock that is to lock the entire shard until it has performed necessary recovery operations. I believe this rule would avoid the need to persist read sets.
(So I guess I did introduce a new paper after all. And quite a classic! A relevant summary here of the 1986 R* paper is that it describes a very different 2PC than anyone would want to use today. Network failures were common and a design goal was to allow local transactions during disconnects. In case the TC fails, the system wasn't very self healing, and the locks between prepare and commit could end up being held for days while operators call each other by telephone to agree whether to force the prepared transactions to an abort or commit. Needless to say, that is not how we actually build modern 2PC systems.)
But what about the transaction coordinator?
In the real world case of MongoDB, a TC for each transaction runs on the mongos router serving the transaction. As I sketched in the previous post the TC is responsible for keeping track of all those rw-antidependencies. So assume we find that dangerous structure:
T1 -rw-> T2 -rw-> T3
T1 is already committed. Further, assume that T2 has already committed successfully, but then the mongos crashed and took the TC for T2 with it. Now T3 is ready to commit. On some shard it will find that it has a write that conflicts with a read from T2, which already committed. But now it can't connect with the TC for T2 anymore, so it doesn't know whether T2 also had an incoming rw-antidependency.
The PostgreSQL paper actually answers this question. They don't persist the dependencies either, so in this case T3 must conservatively assume the answer is "yes" and therefore it must abort.
Summarizing, we end up with rules for when a serializeable (SSI) transaction would have to abort in a sharded cluster (like MongoDB).
- If a shard that received a write fails before prepare (obviously)
- If a shard that served a read fails before prepare
- If the TC can't connect to other TCs with which there was a rw-antidependency
This means that SSI transactions would be more affected by node failures (whether shard primary, or mongos) than SI transactions. But that seems reasonable. After all, even in the absence of node failures, serializeable transactions are more likely to be aborted anyway, compared to SI or RR. This is the price to pay for better isolation level.
- 1for example, MongoDB