The paper stating the RUM conjecture was published by a group of Harvard DASLab researchers in 2016. They have also created a more easily digestible RUM conjecture home page with graphics. Yet in this blog post I try to describe the idea in even simpler terms than that page does.
An engineer I work with asked me for tips on what to read about database benchmarking. I told him I've learned a lot from reading Mark Callaghan's blog. Now that I think about it, articles and conference talks from Baron Schwartz were just as fundamental, or even more so, early on when I was getting started.
When I choose technologies to use, or employers to work for, my approach is to stick with a few things I believe in. Datastax happens to tick quite a few of those boxes:
I overheard - over-read, really - an internet discussion about database storage engines. The discussion was about what functionality is considered part of a storage engine, and what functionality belongs in the common parts of the database server. My first reaction was something like "how isn't this obvious?" Then I realized that for a lot of database functionality it isn't obvious at all, and the answer really is that it could go either way.
My kids watch a lot of youtube. They follow the famous Finnish youtubers every week. At some point my son realized there are many videos on youtube of his father doing conference talks. Some of them have a thousand views. I've never gotten so much adoration and respect from my son as on that day!
I've created a playlist of all my conference talks that have been published on youtube.
Mark Callaghan pointed me to a paper for my comments: Strong and Efficient Consistency with Consistency-Aware Durability by Ganesan, Alagappan and Arpaci-Dusseau. It won the Best Paper award at the USENIX FAST '20 conference. The paper presents a new consistency level for distributed databases, where reads are causally consistent with other reads but not (necessarily) with writes.
My comments are mostly on section 2 of the paper, which describes the current state of the art and the motivation for their work.
A task that I've done many times in my career in databases is to load data into a database as the first step of some benchmark. To do it efficiently you want to use multiple threads. Dividing the work across many threads requires nothing more than third-grade math, yet it can be surprisingly hard to get right.
The typical setup is often like this:
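As a rough illustration of the kind of arithmetic involved (not the original post's setup - the row count, thread count, and function names here are invented), a minimal Python sketch of splitting a contiguous id range across loader threads might look like this. The remainder handling is exactly the third-grade math that is easy to get subtly wrong:

```python
# Hypothetical sketch: split a contiguous id range [1, total_rows] across
# num_threads loader threads. The numbers below are made up for illustration.
import threading

def split_range(total_rows, num_threads):
    """Return inclusive (start, end) id ranges, one per thread.

    Remainder rows are spread over the first few threads so that no rows
    are dropped and no two ranges overlap -- the classic off-by-one traps
    when dividing work like this.
    """
    base, remainder = divmod(total_rows, num_threads)
    ranges = []
    start = 1
    for i in range(num_threads):
        count = base + (1 if i < remainder else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges

def load_rows(start, end):
    # Placeholder for the real work: insert rows with ids start..end
    # into the target table, typically in batches.
    pass

if __name__ == "__main__":
    threads = [threading.Thread(target=load_rows, args=(s, e))
               for s, e in split_range(total_rows=1_000_000, num_threads=8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```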
Here are the slides for my HighLoad++ talk tomorrow:
Previously in this series: Reading about Serializable Snapshot Isolation.
Last week I took a deep dive into articles on Serializable Snapshot Isolation. It ended on a sad note, as I learned that to extend SSI to a sharded database using 2PC for distributed transactions, there is a need to persist - which means replicate - all read sets in addition to all writes.
This conclusion has been bothering me, so before diving into other papers on distributed serializable transactions, I wanted to understand better what exactly happens in SSI when a node (shard) fails. This blog post doesn't introduce any new papers, just more details. And more speculation.