Parallelizing MySQL replication slave - a dream come true?

There has been increasing discussion about MySQL replication and how to make it faster. I don't think Marco Tusa's blog is on Planet MySQL, so let's highlight it first: A dream on MySQL parallel replication. It is a good account of what things look like out in the field of MySQL databases today: you are increasingly aware of replication, fearing it will be a bottleneck you cannot fix.

For those of us who have worked with MySQL Cluster, there's nothing new here. Ever since MySQL Cluster started using MySQL's row-based replication for replication between two separate clusters (such as to different data centers), this has been a bottleneck. Even a small 2-node in-memory cluster is limited by the replication being single threaded - I remember hitting this bottleneck at something like 25% of the real capacity. With larger clusters - 16 nodes, 24 nodes... - it of course became completely impossible to use.

Today, as MySQL scales well on modern multi-core hardware (I still haven't fully digested Percona proclaiming that the days of sharding are effectively over (PDF)), we are increasingly limited by this on normal InnoDB installations too.

This topic was my pet peeve while still at Sun. I managed to get the replication team to design a solution and do some prototype code. I haven't heard of anyone actively working on this anymore. Too bad if they've really given up.

Giuseppe and Robert from Continuent have been informing us about Tungsten providing parallel replication - and rightly so!

Percona announced plans to implement a very similar solution within the native MySQL replication.

What I don't like, and what Marco laments about in his dreams, is that both the Tungsten solution and Percona's yet-to-be-implemented solution implement the parallelization by splitting each database (schema) into its own thread. This is great if you are a web hosting company and have hundreds of small databases in one MySQL instance. But Marco and I are not web hosting companies. Typically what we have is one big and busy database.

Update 2011-03-24: Robert explains in the comments below and in a new blog post that parallelizing based on different schemas is just one obvious possibility; the Tungsten framework allows you to define the sharding in other ways too, with new plugins. That makes it immediately more interesting! (For instance, many applications that are basically key-value'ish could be sharded - at least to a large degree - just by taking a hash of the primary key.)
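To make the primary-key idea concrete, here is a minimal sketch of such a partitioner - plain Python with made-up names (channel_for, NUM_CHANNELS), not Tungsten's actual plugin API:

```python
# Hypothetical sketch: route row changes to applier channels by hashing
# the primary key. Not Tungsten's real Partitioner API.
from zlib import crc32

NUM_CHANNELS = 8

def channel_for(primary_key: bytes) -> int:
    """All changes to the same row deterministically land on the same
    channel, so per-row ordering is preserved even though the channels
    themselves run in parallel."""
    return crc32(primary_key) % NUM_CHANNELS
```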

All of the above implementations have started by worrying about how to keep consistency on the slave: which transactions can be applied independently, and which must be applied in the original order?

Marco offers the idea that the binary log could save the information of which MySQL user committed the transaction, and then assume that different users' transactions are independent. That's probably a safe assumption. The problem is that most applications use one and the same user for all their transactions (as Stewart pointed out today).

What I've been trying to promote all along is that you should stop worrying and just implement something that allows the slave to commit transactions out of order. This should be the simplest to implement (just distribute incoming transactions round robin), and it gives the biggest benefit. What everyone seems to be forgetting is that the concept of a transaction already is the answer to the question: a transaction is a series of operations that need to be executed atomically together, but which is (relatively) independent of other transactions.
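As a rough illustration of how simple the round-robin idea is, here is a sketch in Python - txn.apply() stands in for executing and committing one transaction, and everything here is hypothetical rather than any existing replication applier:

```python
# Sketch: distribute incoming transactions round robin over N applier
# threads; each thread commits independently, hence out of order.
import itertools
import queue
import threading

NUM_APPLIERS = 4
queues = [queue.Queue() for _ in range(NUM_APPLIERS)]

def applier(q: queue.Queue) -> None:
    while True:
        txn = q.get()
        if txn is None:   # shutdown marker
            return
        txn.apply()       # execute and commit immediately, with no
                          # coordination between the applier threads

for q in queues:
    threading.Thread(target=applier, args=(q,), daemon=True).start()

def distribute(binlog_stream) -> None:
    """Feed transactions from the binlog to the appliers round robin."""
    for txn, q in zip(binlog_stream, itertools.cycle(queues)):
        q.put(txn)
```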

I can easily list several use cases which can tolerate such out-of-order replication:

  1. Batch load of 100 million rows into a table. I totally don't care in what order you put those rows there, as long as they get into the database, asap! Today this can make your replication lag for hours.
  2. Anything that is a session database, shopping cart, etc. Each user is represented by one row in the table. Each row is independently updated.
  3. Even more complex scenarios that in theory aren't fully independent transactions tolerate out-of-order applying in practice. Imagine you developed Facebook: a user can post a status, and others can post comments to it. Suppose a spammer posts a comment 1 millisecond after you posted your status, and the comment then replicates faster than its parent status. OMG, the database is now in an inconsistent state!!!! In practice the application would probably just ignore the comment as long as the parent status is missing; once the parent status arrives, the comment will be found to belong to it. (See the sketch after this list.)
  4. In the dead MySQL design I even used the classic bank example: if I transfer 100EUR to you, this is one transaction, and it doesn't matter to us at what point in time it happens relative to other transactions. On the other hand, if I donate 100EUR to charity and in the same second you win 100EUR in the lottery, it is completely irrelevant to either of us which one of these really happened first. In all of these cases the total amount of money in the bank is constant, and the consistency of each account is guaranteed. (Actually, I can now think of some bad examples that might prove me wrong - it would depend on how details are implemented.)
  5. Any application that allows you to write to more than one master (such as the geographically nearest data center) is by design already tolerant of out-of-order commits. If at the same microsecond we commit transactions T, U, V in the US data center and transactions X, Y, Z in the European data center, then there is no single truth as to which of these came first and which later. It all depends on which continent you live on. So why should the replication be concerned with such questions either?
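Here is a sketch of the application-side tolerance described in use case 3 - hypothetical names throughout, just to show that "comment arrives before its parent status" is trivially handled by parking the orphan:

```python
# Sketch: tolerate a comment replicating before its parent status.
statuses: dict[int, dict] = {}           # status_id -> status row
orphan_comments: dict[int, list] = {}    # parent status_id -> parked comments

def on_replicated_status(status: dict) -> None:
    statuses[status["id"]] = status
    # Any comments that raced ahead of their parent become visible now.
    for comment in orphan_comments.pop(status["id"], []):
        attach_comment(status, comment)

def on_replicated_comment(comment: dict) -> None:
    parent = statuses.get(comment["status_id"])
    if parent is None:
        # Parent not replicated yet: hide it from display, keep it parked.
        orphan_comments.setdefault(comment["status_id"], []).append(comment)
    else:
        attach_comment(parent, comment)

def attach_comment(status: dict, comment: dict) -> None:
    status.setdefault("comments", []).append(comment)
```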

At last week's meetup in Helsinki (Heikki Tuuri was speaking - he develops some open source database software here) I met Seppo Jaakola from Codership. They have apparently implemented out-of-order parallel replication for Galera.

In addition to parallel replication, Galera also has - say it with me - synchronous multi-master replication, global transaction IDs and conflict resolution.

I'm now actively looking for an excuse to test Galera as part of some evaluation.

Hi Henrik,

Thanks for the shout-out. We have worked very hard for it. :)

Tungsten can do out-of-order replication - all it requires is a new implementation of the Partitioner interface, which assigns the replicator channel. There are some details around restart, but if your transactions are truly order-independent you can sprinkle them joyously across multiple queues, so long as the algorithm always assigns a given query to the same channel whenever it sees it - this is necessary for restart to work. There are some other little details like that to get to a workable solution.
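As an illustration of the restart detail (a sketch of the idea only, not Tungsten code): because the partitioning is deterministic, each channel can persist its own last-applied sequence number and, after a crash, simply skip anything at or below that position.

```python
# Hypothetical sketch of per-channel restart positions. In a real system
# the position would be stored transactionally alongside the applied data.
import json

POSITIONS_FILE = "channel_positions.json"   # made-up location

def save_position(positions: dict, channel: int, seqno: int) -> None:
    positions[str(channel)] = seqno
    with open(POSITIONS_FILE, "w") as f:
        json.dump(positions, f)

def should_apply(positions: dict, channel: int, seqno: int) -> bool:
    """After restart, each channel resumes from its own position; the
    deterministic partitioner guarantees the same events return to the
    same channel, so nothing is lost or applied twice."""
    return seqno > positions.get(str(channel), -1)
```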

However, the real problem is that true out-of-order use cases seem very rare. Even if it works at the DBMS level, applications often get sick if they see things in an unexpected order.

Cheers, Robert

Well, you are the first ones to ship a solution, so you deserve it! (I'm possibly doing injustice to Galera with that statement; in any case you are the first ones to blog about shipping a solution!)

Good point about the restarts. With graceful shutdowns you can of course always make sure that whatever algorithm you may be using saves its state in a way that lets it correctly resume. Unexpected restarts are harder.

But don't you also have conflict resolution? With row-based replication and a timestamp-based (or any monotonically increasing transaction ID based) conflict resolution function, it doesn't matter if I try to apply the same transaction twice: it will be discarded when the applier realizes that the timestamp/other ID isn't greater than the one that already exists in the database.
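In other words, re-applying becomes idempotent. A toy sketch, assuming a timestamp column on every row (nothing here is a real MySQL or Galera API):

```python
# Toy sketch of timestamp-based conflict resolution: an update only wins
# if it is newer than what the database already has, so applying the same
# transaction twice after a restart is harmless.
def apply_row_update(table: dict, pk, new_row: dict) -> bool:
    current = table.get(pk)
    if current is not None and new_row["ts"] <= current["ts"]:
        return False       # stale or duplicate: discard silently
    table[pk] = new_row    # newer version wins
    return True
```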

I will stubbornly refuse to believe that I'm the only one whose applications don't break with eventually consistent slaves. Historically we've seen people do crazy things for better performance, like using non-crash-safe storage engines; even today people use MongoDB, or relax durability properties for InnoDB... Compared to this, worrying about out-of-order commits seems like a rather small detail to me!

That reminds me, I should have added point 5 to the list above - now done.

Jonas Oreland (not verified)

Tue, 2011-03-22 08:53

Hi,

A comment about ndb: in a sense, the slave applier is already parallel for ndb,
because the slave applier essentially only unpacks the records and sends
them to the data nodes. And the data nodes will "apply" them in parallel.

Since we spent quite a lot of time on making the "send update" part of the ndbapi CPU efficient,
the single thread doing this... is actually less of a bottleneck than one might think.

Sure, in synthetic benchmarks... one can make it a bottleneck...
But I have not yet seen it become a problem with a load generated from an application...

/Jonas

Note: in the sentence above I use the phrases "yet" and "I have not"

But what you say doesn't make sense. If it is true that the single-threaded slave applier is not a bottleneck but just passes work on to the data nodes, then it should be possible to do as much geo-replication as the network link can bear. As long as the receiving cluster has as many nodes as the master cluster, it is of course able to handle the same load. Yet this is not what happens in practice.

As for synthetic benchmarks: to demonstrate this you of course need a write-heavy workload. But since that is what we always recommended Cluster for, such workloads are not unusual. The only reason you don't see this as a problem is that customers immediately have to work around the limitation, for example by giving up on geo-replication, or by sticking with small 2-node clusters and doing application-level sharding - the very thing MySQL Cluster is supposed to do for them!

1) I don't understand what you mean by "not make sense"

2) "just passes on", this is exactly what it does...you can look at the code...it's GPL you know
that said "just" *can* be (made a) bottleneck.

3) The HLR/HSS applications that run on Cluster have a very write-heavy workload (for some definition of "very").
Maybe you never learned about the "slave-allow-batching" switch...

My conclusion is (at least so far) that
- multi-threaded slave (or multi-process) is mainly a "paper-product" when discussing cluster
- single threaded slave applier can keep up with very high write load, as very little work
is performed there...(most is in data-nodes)

Nothing in this blog has made me change my mind...
A real application benchmark, with a realistic setup could...

/Jonas

1) I don't understand what you mean by "not make sense"

We have 2 clusters, A and B. They are identical in hardware and configuration, so they should have equal performance.

On Cluster A we were using 100+ (client) threads to apply transactions consisting of single-row updates. The cluster can handle about 12000 writes/second. (With 2008 hardware and the MySQL Cluster release of that time.)

On Cluster B we have a single (slave SQL) thread that applies the same transactions. As you've explained, it applies the single-row writes in one large transaction which is sent in one batch to the data nodes. It can do 4000 writes/second.

Since all else is equal on the two clusters, it is a logical conclusion that the way the replication SQL thread works is a bottleneck (but see below for more comments). To say that it is not a bottleneck doesn't make sense.

(I may remember numbers wrong as I don't have any notes saved from that engagement, but they shouldn't be too far off.)

2) "just passes on", this is exactly what it does...you can look at the code...it's GPL you know
that said "just" *can* be (made a) bottleneck.

Not looking at the GPL code but trusting what you are saying + my prior knowledge of how this works...

Perhaps you are right and it is not the replication sql thread that is the bottleneck. Well, kind of, but not alone.

So what happens is that a large batch of the original update transactions is batched into one big transaction and received on the slave cluster. The slave SQL thread passes this down to the data node. So now we have the Transaction Coordinator (TC) on one data node working through this batch, while the TC on the other data node (or all others, in a larger cluster) is doing nothing. In contrast, on the master cluster the transactions are evenly handled by 2 TCs on 2 data nodes. If the TC is the bottleneck, this alone could explain a 50% reduction in maximum throughput: 12000 writes/second spread over 2 TCs is 6000 per TC, so a single TC would cap out around 6000 even before counting other overheads.

Then you also have other factors; in particular, on the master all updates are correctly routed to the data node that is the master of the relevant partition, whereas on the slave cluster 50% of the updates need one extra network hop from the one and only TC to the correct data node.

The individual row operations are then done in parallel, but not the coordination work of the TC.

3) The HLR/HSS applications that run on Cluster have a very write-heavy workload (for some definition of "very").

Look, first you say the slave SQL thread is not a bottleneck, then you say that maybe it is, but only for some unimportant use cases. Make up your mind!

It's not just HLR/HSS: most applications I saw in telecom were write intensive. Pre-paid and billing also have a 50-50 r/w ratio and as many tps as you can handle. Various voicemail and SMS/MMS gateways probably too; I don't remember hearing exact numbers. Since these are the very applications Cluster was targeted at, it is a bad excuse to say that they are "unusual" compared to some read-only website workload. Anyway, in my work with Cluster these were the apps we were dealing with. And geo-replication was always a problem.

In that work a 50-50 ratio was the average, not unusual. I once met with a big customer (and by big customer I refer to the person as much as the company - you know who it is) who wanted to use Cluster data nodes for a use case that had 95% writes and <5% reads! Now, this may sound crazy, since it means you write a lot of data to the Cluster that is never read or used!! But it was supposed to function as a backup node for the non-MySQL main database.

And since the point was to offer backup functionality and redundancy, guess what would have been very nice to have? Geo-replication! But in this case the big customer already knew this was not possible, and since he was friendly with us he had silently made sure data center redundancy was not on the list of requirements.

Like I said: the only cases where geo-replication was not a problem were those where it wasn't used.

Maybe you never learned about the "slave-allow-batching" switch...

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-starti…

According to the manual it has existed since July 2007, so I'm sure we used it in 2008. (In fact, I remember Massimo checking this one.)

My conclusion is (at least so far) that
- multi-threaded slave (or multi-process) is mainly a "paper-product" when discussing cluster
- single threaded slave applier can keep up with very high write load, as very little work
is performed there...(most is in data-nodes)

So maybe it is the TC, and the fact that one of them is not doing any work, that is a bottleneck?

Nothing in this blog has made me change my mind...
A real application benchmark, with a realistic setup could...

Is HLR a realistic application?

Too bad I don't work much with Cluster anymore. I'm only interested in this topic because I'm increasingly hearing that the same replication slave SQL thread is now becoming a problem for InnoDB users too.

Jonas,

This is interesting, and somewhat understandable too, as ndb is mostly memory resident and applying is an inherently fast operation.

We have been experimenting with parallel applying for InnoDB for quite some time now (since 2008, iirc), and with InnoDB there seems to be absolutely no gain from adding applier threads in the replication. But this is true only as long as the database size is reasonably small. When the "active data set" no longer fits in the buffer pool and page replacement kicks in, applying gets very slow. Parallel applying is one effective method of speeding up the applying process under heavy InnoDB buffer pool paging. In some cases we have gone up to 16 appliers when tuning for the best possible InnoDB applying speed.

Sebastian Stach (not verified)

Tue, 2011-03-22 12:11

Hi,

Could you write something more about this replication limitation in a 2-node cluster?

Cheers, Sebastian

Henrik, thanks for the interesting blog post; your blog is growing into a major MySQL-related information hub.

Galera replication has out-of-order committing (OOOC) as one possible replication mode. It is not a generally usable mode and we keep it as an optional feature. OOOC can cause issues, e.g. with foreign keys and triggers: something may fire although it shouldn't...

Galera applying is based on knowing the replicated transactions in detail. This information includes the global transaction ID and the set of affected rows. Galera then applies in parallel those transactions which don't depend on each other. In practice, quite a lot of concurrency can be allowed.

Galera 0.8 launches a pre-configured number of applier threads during mysqld startup (4 by default).

OK, I probably forgot to mention that Galera can also control the commit order. Galera can apply many transactions in parallel but hold the committing to happen in the strict order dictated by global transaction IDs. The decision to allow parallel applying is based on looking at row-level access in the transactions.

And it turns out that sheer parallel applying can be effective (mostly due to slow InnoDB paging). OOOC adds even more performance to the applying, as there is no waiting to commit. But OOOC should stay optional, as not all applications can deal with it; e.g. the use cases you listed above should be safe to use with OOOC.
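As an illustration, here is a toy sketch of the scheme as described in these comments (an assumption-laden model, not Galera code): parallel apply is allowed only for transactions whose row sets don't overlap, and in the default mode commits are still released in global transaction ID order.

```python
# Toy model of dependency-checked parallel apply with ordered commit.
in_flight: dict[int, set] = {}   # global trx ID -> set of affected row keys
next_commit = 1                  # next trx ID allowed to commit in strict mode

def can_apply_now(trx_id: int, row_set: set) -> bool:
    """Start applying in parallel only if no earlier, still-running
    transaction touches any of the same rows."""
    return not any(other_id < trx_id and (row_set & rows)
                   for other_id, rows in in_flight.items())

def may_commit(trx_id: int, oooc: bool) -> bool:
    """With OOOC enabled a finished transaction commits immediately;
    otherwise it waits for its turn in global transaction ID order
    (the committer advances next_commit after every commit)."""
    return oooc or trx_id == next_commit
```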

Hi Henrik,
Interesting discussions.

Regarding applications coping with out-of-order commits, inconsistency etc, I would say that there are lots of examples of databases / replication solutions that can offer 'eventual consistency', but not so many *real* examples of applications that can tolerate it happening to their data.

It would be interesting to read some material on the proper design of applications to tolerate stale, temporarily inconsistent data. Perhaps some general rules / patterns can emerge which, if followed, allow these optimisations at the data layer. A kind of BASE application design 101.

Regarding Cluster, I think that anyone looking at a PPT can realise that geo-replication is *conceptually* a bottleneck, as it has a number of single-threaded components. Whether it is currently a *real* bottleneck for users is still debatable. If it is a bottleneck, then whether thread parallelism of the applier is the answer is also debatable.

Frazer

@Henrik
Good discussion! I wrote an article about the virtues of shard-based replication as it deserves a separate discussion. Please feel free to dispute my claims. (http://scale-out-blog.blogspot.com/2011/03/parallel-replication-using-s…)

@Seppo, it would be good to see a blog article on Galera parallelization backed up by some numbers if you have them. Automatic parallelization that mimics the master behavior does not seem very beneficial. However given that you are using certification it seems you just might be able to pull this off. I'm prepared to be impressed. :)

Yep, I saw it. I understand your thinking better now. Generic shard-based replication is much more useful than schema-level replication. Many web and session type applications are already hashing by primary key, and sessions are independent. That's actually cool!

I'm still the naive optimist here - unlike most of the people commenting I'm not actually trying to implement anything like this :-) So I'm naively thinking that

- conflict resolution will take care of all restart problems (you can just go back to a known good position and conflict resolution will discard transactions that were already done).

- deadlocks aren't any bigger problem than they are for the application on the master side. If you hit a deadlock you retry the transaction.

- ordering: now this is an interesting one! I don't know how a replication system should react when it encounters an update before an insert, or an update after a delete for that matter. In theory this is something conflict resolution needs to address too, but it just got much harder than merely sorting out updates among each other.

But a general shard-based approach is actually pretty useful. This is good to know.

I'm adding this here just because this comment thread ended up collecting most of our current understanding of parallel replication; it seems worthwhile to keep it complete.

http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184
Parallel replication of group-committed transactions

Executive Summary:
Kristian has a way to do parallel replication without sharding by schema or otherwise (proving you wrong).
Kristian is smarter than all of us. (...seriously, I really admire his work.)

Kristian's designs are a pleasure to read. I like this one because it uses a sound method of identifying order-independent transactions. The interesting question is how much parallelization of group committed transactions actually helps overall throughput. Even a small amount of variability in transaction duration will result in smaller transactions effectively serializing around the longest transaction in the group. For many statement-based workloads this will lead to 100% serialization.

Ok, so to some extent I'm guilty of living in a world of telecom and web applications where all transactions are simple and short lived operations and there are always plenty of them.

But even with longer-running transactions interleaved... this will still always be faster than the naive single-threaded applying we have now. Today we have SUM(trx duration); when replicating per group commit we have MAX(trx duration).
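A made-up example of what that means: say one group commit on the master contained four transactions taking 10, 12, 15 and 200 ms to apply. Serially that is 237 ms of wall time; applied in parallel per group it is 200 ms - a win, but one dominated by the longest transaction, which is exactly Robert's point above.

```python
# Toy numbers only: wall time to apply one group of transactions.
durations_ms = [10, 12, 15, 200]
serial = sum(durations_ms)     # today's single-threaded applier: 237 ms
parallel = max(durations_ms)   # per-group-commit parallel apply: 200 ms
```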

Also, the group-commit-based parallelization is complementary: for instance, a default setting could use both that *and* schema-based parallelization, and this would then be efficient for a wide range of use cases - many schemas with few transactions per second per schema, or few schemas with a high transaction rate inside each schema. You win either way.

Sudheer (not verified)

Fri, 2011-04-29 11:51

Hi Team,

We are facing a production issue after enabling replication.

Problem description:

We have one MySQL server with DRBD replication. After enabling MySQL replication the server gets slow and the application request queue keeps growing. If we disable MySQL replication the application works fine and the request queue does not grow.

I checked OS-level metrics like CPU, RAM and disk I/O; everything looks normal.

How should we debug this issue?

Is this a problem with MySQL replication combined with DRBD?

Why are the bin-log files so huge (220 GB per day) for a database that grows 10 GB per day?

Additional details after enabling replication:

CPU utilization: 8% max
RAM: 30% of 32 GB
Disk I/O: normal

Server details:

RAM: 32 GB
CPU: 16 cores
OS: Red Hat Linux 5.0

MySQL details:

Version: 5.1
Database size: 100 GB
Database growth per day: 10 GB
Bin-log size: a 1 GB bin-log file is created every 6 minutes, i.e. 9 GB per hour
Binlog_format: Statement
DML on the MySQL server: 80% insert statements, 18% update statements, the rest selects
