There has been an increase in the discussion about MySQL replication and how to make it happen faster. I don't think Marco Tusa's blog is on Planet MySQL, so let's highlight it first: A dream on MySQL parallel replication. This is a good account of what it looks like out in the field of MySQL databases today - you are increasingly aware of replication, fearing it will be a bottleneck you cannot fix.
For those of us who have worked with MySQL Cluster, there's nothing new here. Ever since MySQL Cluster started using MySQL's row based replication for replication between two separate clusters (such as to different data centers), this has been a bottleneck. Even just a small 2 node in-memory cluster is limited by the replication being single threaded - I remember something like hitting this bottleneck at 25% of the real capacity. With larger clusters 16 nodes, 24 nodes... it of course became completely impossible to use it.
Today, as MySQL scales great on modern multi core hardware (I still haven't fully digested Percona proclaiming that the days of sharding are effectively over (PDF)) we are increasingly limited by this also on normal InnoDB installations.
This topic was my pet peeve while still at Sun. I managed to get the replication team to design a solution and do some prototype code. I haven't heard that anyone would be actively working on this anymore. Too bad if they've really given up.
Percona announced plans to implement a very similar solution within the native MySQL replication.
What I don't like, and what Marco laments about in his dreams, is that both the Tungsten solution and Percona's yet-to-be-implemented solution implement the parallelization by splitting each database (schema) into its own thread. This is great if you are a web hosting company and have hundreds of small databases in one MySQL instance. But me and Marco are not web hosting companies. Typically what we have is one big and busy database.
Update 2010-03-24: Robert explains in below comments and a new blog post that parallelizing based on different schema is just one obvious possibility, but the Tungsten framework allows you to define the sharding in other ways too with new plugins. That makes it immediately more interesting! (For instance, many applications that are bascially key-value'ish could be sharded - at least to a large degree - just by taking a hash on the primary key.)
All of the above implementations have started with worrying about how to keep consistency on the slave? Which transactions can be applied independently and which must be applied in the original order?
Marco offers the idea that the binary log could save the information of which MySQL user committed the transaction and then assume that different users' transactions are independent. That's probably a safe assumption. The problem is that most applications just use one and the same user for all their transactions (as Stewart pointed out today).
What I've been trying to promote all the time is that you guys should stop worrying and just implement something that allows the slave to commit transactions out of order. This should be the simplest to implement (just distribute incoming transactions round robin), and gives the biggest benefit. What everyone seems to be forgetting that the concept of a transaction already is the answer to the question: a transaction is a series of operations that need to be executed atomically together, but is (relatively) independent of other transactions.
I can easily list several use cases which can tolerate such out-of-order replication:
- Batch load of 100 million rows into a table. I totally don't care about the order you put those rows there, as long as they get into the database, asap! Today this can make your replication lag for hours.
- Anything that is a session database, shopping cart, etc. Each user is represented by one row in the table. Each row is independently updated.
- Even more complex scenarios that in theory aren't fully independent transactions, tolerate out of order applying in practice. Imagine you developed Facebook: A user can post a status, and others can post comments to it. Suppose a spammer posts a comment 1 millisecond after you posted your status. Suppose then the comment is replicated faster than its parent status. OMG the database is now in an inconsistent state!!!! Typically the application would probably just ignore the comment as long as the parent status is missing. Once the parent status arrives, also the comment will be found to belong to it.
- In the dead MySQL design I gave even used the classic bank example: If I transfer 100EUR to you, this is one transaction. It doesn't matter to us at what point in time this happens relative to other transactions. Otoh if I donate 100EUR to charity and in the same second you win 100EUR on the lottery, it is completely irrelevant to either of us which one of these really happened first. In all of these cases the total amount of money in the bank is constant, and consistency of each account is guaranteed. (Actually, I can now think of some bad examples that might prove myself wrong - it would depend on how details are implemented.)
- Any application that allows you to write to more than one master (such as the geographically nearest data center) is by design already tolerant of out of order commits. If at the same microsecond we commit transactions T, U, V in the US data center and transactions X, Y, Z in Europe data center, then there is no single truth to which of these are first and which later. It all depends on which continent you live on. So why should the replication be concerned with such questions either?
At last week's Meet Up in Helsinki (Heikki Tuuri was speaking, he develops some open source database software here) I met Seppo Jaakola from Codership. They have apparently implemented out of order parallel replication for Galera.
In addition to parallel replication Galera also has - say it with me - synchronous multi-master replication, global transaction id's, conflict resolution.
I'm now actively looking for an excuse to test Galera as part of some evaluation.