At work we were discussing whether MongoDB will retry operations in some circumstances or whether the client needs to be prepared to do so. After a while we realized that different participants were talking about different kinds of retries.
So I sat down to get to the bottom of all the retries that can happen in MongoDB, and write a blog post about them. But after googling a bit it turns out someone has already written that blog post, so this will be a short post for me linking to other posts.
Retries by the driver
If you set retryWrites=true in your MongoDB connection string, then the driver will automatically retry some write operations for some types of failures. Ok, can I be more specific? Yes I can...
Retryable Writes is extensively documented in the MongoDB manual. Some highlights:
- As of the drivers released alongside MongoDB 4.2, this feature is on by default.
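- The driver will automatically retry single-document write operations that are not part of a transaction. Multi-document operations such as updateMany and deleteMany are not retryable. Bulk writes are retryable only when they consist of single-document operations, which includes bulk inserts.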
- Write operations inside a transaction are not retryable. However, the commit and abort operations are.
- The types of errors that are automatically retried: network failures and primary failovers. In the latter case the driver waits up to serverSelectionTimeoutMS milliseconds for a new primary to be selected, so that it is likely that a new primary is operational when it retries.
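To make the highlights above a bit more concrete, here is a toy model in plain Python. This is not driver code: NetworkError, run_with_single_retry and FlakyInsert are names I made up for illustration, and the single retry attempt is my reading of how the manual describes the feature.

```python
class NetworkError(Exception):
    """Stand-in for a transient network failure or a primary failover."""


def run_with_single_retry(operation, retryable=(NetworkError,)):
    """Run operation(); on a retryable error, try exactly one more time."""
    try:
        return operation()
    except retryable:
        # A real driver would first select a new primary here (waiting up to
        # serverSelectionTimeoutMS). This toy version retries immediately.
        return operation()


class FlakyInsert:
    """Hypothetical write that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise NetworkError("connection reset")
        return "ok"
```

With FlakyInsert(failures=1) the caller never notices the hiccup. With failures=2 the second error propagates, and the application has to deal with it.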
Automatic retries of writes are possible thanks to MongoDB sessions, which were introduced in the 3.6 release. Write operations in a session have a unique id, which makes them idempotent. If the write was successful the first time, the server recognizes the retry by its id and does not apply it again.
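The idea of "a unique id makes the write idempotent" can be sketched with a few lines of Python. This is only a simulation: ToyServer, session_id and txn_number are names I picked for the example, not MongoDB internals.

```python
class ToyServer:
    """Remembers the result of every (session_id, txn_number) it has applied."""

    def __init__(self):
        self.documents = []
        self.applied = {}  # (session_id, txn_number) -> result returned earlier

    def insert(self, session_id, txn_number, document):
        key = (session_id, txn_number)
        if key in self.applied:
            # This is a retry of a write that already succeeded: do not apply
            # it again, just return the result recorded the first time.
            return self.applied[key]
        self.documents.append(document)
        result = {"inserted": 1}
        self.applied[key] = result
        return result
```

Sending the same insert twice with the same id leaves exactly one document on the server, which is what makes it safe for the driver to retry when it does not know whether the first attempt got through.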
Retries by the storage engine API
The WiredTiger storage engine implements MVCC for concurrency control, which is a form of optimistic concurrency control. This means that write operations do not wait for locks at the start. Rather, writes are just executed, and if two operations try to write to a record at the exact same time, then the second attempt will fail. The idea is that the client should then retry the same write again, and it is likely to succeed the second time. (Or third time, etc...)
For a single statement write operation, the assumption is that the client would just retry the exact same write again. So it is possible for the MongoDB server to just do that internally, without bothering the client at all. And MongoDB does in fact do that. Clients never see these write conflicts, so most users aren't even aware of this.
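Here is a small Python sketch of that internal retry loop. It is a simulation of optimistic concurrency control in general, not of WiredTiger itself: Record, WriteConflict, the before_commit test hook and the max_attempts bound are all assumptions of the example.

```python
class WriteConflict(Exception):
    """Raised when the record changed between our read and our write."""


class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def compare_and_set(self, expected_version, new_value):
        # The write only succeeds if nobody else wrote since we read.
        if self.version != expected_version:
            raise WriteConflict()
        self.value = new_value
        self.version += 1


def write_with_internal_retry(record, update, before_commit=None, max_attempts=10):
    """Apply update(value) to the record, retrying on conflicts.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        seen_version, seen_value = record.version, record.value
        if before_commit is not None:
            before_commit(attempt)  # hook that lets a rival writer sneak in
        try:
            record.compare_and_set(seen_version, update(seen_value))
            return attempt
        except WriteConflict:
            continue  # re-read the record and try the same write again
    raise WriteConflict(f"gave up after {max_attempts} attempts")
```

The caller of write_with_internal_retry only ever sees the final result, which mirrors why most users never notice these conflicts.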
In a multi statement transaction automatic retries are not possible. When a write conflict happens, the entire transaction is aborted, and the client needs to start the entire transaction from scratch, not just the single write operation that failed.
In general it is not safe to assume that the second attempt of the same transaction will be a replay of the first attempt. For example, in the classic case of transferring money from account A to B, let's say you check that there is 100 EUR in A, but then the write to B fails. So you have to retry the transaction. But maybe in the meantime account A's funds were depleted, so there is no money to transfer to B, and the transaction just has to give up after checking A.
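The money transfer example can be played through in a toy Python model. Everything here is made up for the illustration (transfer, run_transaction, the fail_write switch), and a plain dict stands in for the database. The point is only that the retry re-runs the whole transaction body, including the balance check.

```python
class WriteConflict(Exception):
    """Stand-in for a write conflict that aborts the transaction."""


class InsufficientFunds(Exception):
    """Raised when the source account cannot cover the transfer."""


def transfer(accounts, src, dst, amount, fail_write=False):
    """One attempt of the transaction body; commits only if every step works."""
    snapshot = dict(accounts)  # private copy, discarded if we abort
    if snapshot[src] < amount:
        raise InsufficientFunds(src)
    snapshot[src] -= amount
    if fail_write:
        raise WriteConflict()  # simulated conflict on the write to dst
    snapshot[dst] += amount
    accounts.update(snapshot)  # "commit"


def run_transaction(body, max_attempts=3):
    """Re-run the entire body from scratch after each aborted attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return body(attempt)
        except WriteConflict:
            continue
    raise WriteConflict(f"gave up after {max_attempts} attempts")
```

If account A is emptied between the first and second attempt, the second attempt raises InsufficientFunds instead of blindly replaying the earlier writes, which is exactly why the server cannot do this retry on the client's behalf.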
Since MVCC write conflicts are retried inside MongoDB, the user or client application never sees them happening. Maybe for this reason this feature doesn't seem to be documented in the user manual. However, MuraliDBA has a great in-depth blog post on the topic, including how to monitor the retries as they happen.
Retries on upserts
Somewhat confusingly, the user manual page on retries by the client also mentions retries that can happen for upserts. This is really an internal detail about the upsert feature, and never propagates to the client. It documents the behavior after fixing SERVER-14322. This was a favorite bug of mine for a long time, and I'm glad to see it fixed. The bug happened with upsert operations when MongoDB switched from MMAPv1 engine with collection level locking to WiredTiger with record level locking / MVCC.
While this retry is internal to upserts, the fix was indeed to retry the upsert on concurrency conflicts, so it is in scope for this blog post to mention.
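As a rough illustration of what "retry the upsert on concurrency conflicts" means, here is a toy Python model. It is my own simplification and not the server's implementation: ToyCollection, DuplicateKey and the after_find hook are hypothetical names, and a dict plays the role of a collection with a unique key.

```python
class DuplicateKey(Exception):
    """Raised when inserting a key that already exists."""


class ToyCollection:
    def __init__(self):
        self.docs = {}

    def find(self, key):
        return self.docs.get(key)

    def insert(self, key, doc):
        if key in self.docs:
            raise DuplicateKey(key)
        self.docs[key] = doc

    def upsert_increment(self, key, after_find=None, max_attempts=5):
        """Increment the counter for key, creating the document if needed."""
        for _ in range(max_attempts):
            doc = self.find(key)
            if after_find is not None:
                after_find()  # hook that lets a concurrent upsert sneak in
            if doc is not None:
                doc["count"] += 1
                return "updated"
            try:
                self.insert(key, {"count": 1})
                return "inserted"
            except DuplicateKey:
                # Someone else inserted the document after our find().
                # Retry: the next round will find it and update it.
                continue
        raise DuplicateKey(key)
```

Without the retry, the losing upsert would surface a duplicate key error to the client even though an upsert is supposed to work whether or not the document exists.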