A while ago Baron blogged about his utter dislike for MMM, a framework supposedly used as a MySQL high-availability solution. While I have no personal experience with this framework, reading the comments to that blog I'm indeed convinced that Baron is right. For one thing, it includes the creator of MMM agreeing.
Baron's post still suggests - and having spoken with him I know that's what he has in mind - that a better solution could be built, it's just MMM that has a poor design. I'm going to go further than that: Personally, I've come to think that this family of so called clustering suites is just categorically the wrong approach to database high-availability. I will now explain why they fail, and what the right way is instead.
Clustering suites = FAIL
There are quite a few solutions out there that are so called general clustering suites: Heartbeat 1 and 2, MMM, Solaris Cluster... there is something similar for Windows and several proprietary 3rd party solutions for the Unix world have existed. The newest kid in town is a quiltwork of components called Pacemaker, Corosync and Heartbeat scripts - this has apparently replaced Heartbeat as the canonical Linux solution. (You often see this solution combo called "Pacemaker", which I don't understand why, because it's really Corosync that you start as a daemon that then calls all the other components.)
The following picture shows how a database high-availability setup would be implemented with any of these clustering solutions:
The general approach is to setup 2 (or more) database nodes, with some form of replication. The clustering solution does not replicate the data for you, but it can often support your choice of MySQL replication, DRBD, SAN, etc... In startup, and failover situations it generally needs manage both the MySQL instance itself, and the way data is being replicated. It does this via a special MySQL agent which knows how to start, stop and monitor MySQL. Similarly these general purpose clustering solutions then ship other agents that can manage Tomcat, PostgreSQL and so forth...
So what these clustering frameworks do, is to probe at MySQL from the outside to determine if MySQL still seems to be healthy. In practice you always have a master-slave kind of setup where you are mostly interested in probing the master, but of course the framework is also doing some checks to verify that the slave / standby node is available, in case we should need it later.
If the clustering framework notices that the master MySQL instance is not responding, it initiates failover procedures to the standby node (the slave):
What exactly happens here depends on the setup: for MySQL replication you typically just move the virtual ip to point to the other node, where MySQL is already running. For DRBD and SAN you also need to mount disks and other related activities.
On a picture it seems simple enough. What could possibly go wrong?
To some extent I also share the criticism of Baron: I've now seen a few of these, and these things never convince me as being well designed. Commonly the agents are written as shell scripts similar to your traditional Unix init script. And I just don't think a complex thing like a high-availability solution should be implemented in bash. I recently had a not-so-pleasant encounter with this Pacemaker/Corosync quiltwork, and on a regular startup without errors it prints out 5-7 screenfuls of log messages. That's too much - most users will never find anything useful from such log files. The original xml based way of configuring it seems incomprehensible to anyone, so they've developed a second configuration syntax on top of the "native" one. And despite all those log messages, I still managed to make this thing silently fail on startup without any error message. And so forth... And high-availability is a complex topic - this is not just another text editor someone is developing. So issues like those generally don't give me the confidence that this thing will actually work in production and do the right thing when the network is down.
But that doesn't really matter. Because the main problem here is: no matter how well such a solution would be designed, it still wouldn't work. These things are hacks.
I mean, maybe if you want to use this for starting some stateless application servers, sure. If it detects that your tomcat or apache isn't answering, it can try to restart it for you, and if it fails it will try to start it on some spare node instead. Fine, I see how that could work.
But not for a database. If your MySQL instance crashes, it doesn't help if your choice of clustering software just starts a new instance somewhere else, or randomly moves a vip somewhere else. You actually need to replicate the data to begin with, and when failure happens, the failover needs to be done in a manner that protects your data integrity.
So here's what really happens inside these clustering solutions:
- At a specified interval - the heartbeat - the clustering solution will see if your MySQL instance is still running. The default heartbeat is often something like 10 seconds and minimum granularity tends to be one second. Problem: What happens between these seconds the clustering solution is completely unaware of. If you have 5000 trx/sec, you could have fifty-thousand failures before an attempt to fix the error is made.
- The clustering solution takes small peeks into MySQL from the outside, but other than that MySQL remains a black box. Problem: Say that there are network errors. This causes your transactions to fail, and it also causes replication to fail. So which one failed first? If transactions started failing first, and replication was still working until the end, then you are fine and you can fail over to the other node. But if replication failed first, then you will lose data if you fail over, because not everything was replicated. Your clustering software has no idea, it wasn't looking when the error happened.
- Typically the clustering software itself will have some communication going on, which has the benefit that it verifies that the network connection between nodes is ok. This is in itself useful, sure. Problem: But just like above, if the clustering software detects that network has failed, it's still mostly unaware of the state of MySQL. The failover decision is done blindly.
- In a typical setup like above, the clustering software is actually just checking whether there is a network connection between the two MySQL nodes. If yes, that makes it happy. Problem: Nobody is really checking whether the application servers can really connect to those MySQL nodes! This is one of the most classic errors to make in software programming: when testing for error conditions, you are not testing the thing you actually want to know the answer to, but testing something else. Kind of like the guy in the joke who was searching for his keys under the lamp where there was light, not where he actually lost the keys.
And finally, the point in this architecture where a lot of the problems culminate is the failover decision. In this architecture you typically have one writeable master. When something is wrong with that master, you need to do failover. The failover needs to happen atomically: if you have many application servers you cannot have a situation where some writes are still headed for the old master while others are already happening on the new one. Typically this is handled by using a virtual ip that is assigned to one node at a time. Sounds like a simple solution, but in practice it's entirely possible (for poorly implemented clustering solutions) to simply assign the virtual ip to both nodes at the same time! Robert Hodges has a nice blog posts about problems with virtual ip's.
Often failovers are also expensive, for instance with DRBD and SAN the failover procedure itself includes downtime for restarting MySQL, mounting disks and InnoDB recovery. This brings in the problem of false positives: Often these clustering solutions react to small hiccups in the network and start a failover, even when the situation is then quickly restored. I know of many DBA's that simply gave up on clustering solutions because of too many false positives. After all, your hardware + Linux + MySQL will crash less than once a year. If your clustering software has issues many times a year, it is making things worse, not better.
I firmly believe that this approach simply will not work. The committing of transactions, the replication of data and the detection of failures need to be done together by the same piece of software. You cannot outsource the checking of failures to some external software who checks in every 10 seconds if everything is ok. Also the need for doing any failover at all seems to create a lot of tension between the DBA and his cluster. This needs to be fixed...
Why Galera has none of these problems
The above novel gives a background to why I have been so interested in Galera lately. Let's look at how you'd implement a MySQL high-availability cluster with Galera:
Galera implements synchronous multi-master replication. So the first benefit of this is: there is no virtual ip, and I don't need to choose which node is the master and which are the slaves.
Note that the recommended setup for Galera is to have a minimum of 3 nodes. The third can often be left as a passive reference node not actually receiving connections itself. It is just needed for the cluster to elect a quorum in error situations - you always need to have at least 3 nodes in quorum based clustering.
So what happens when one node fails?
Notice the contrast to the previous list of problems:
- Thanks to synchronous replication and Galera's quorum mechanism, no commits are lost anywhere. When the failure happens, it will be detected as part of the replication process. The Galera nodes themselves figure out which node is thrown out of the cluster, and the remaining ones - who "have quorum" - proceed to commit the transaction. Application nodes that connected to the failing node will of course receive errors as their commits fail.
- There is no need for maintaining master-slave state, virtual ip or to do any failover. The application can connect to any of the Galera nodes and if the transaction succeeds, then it succeeds. If it fails, retry with another node.
- As a side comment: since the replication is synchronous there is no slave lag as you are familiar from MySQL replication, which can also cause you to lose data. This is not a weakness of clustering frameworks, but a strength of Galera compared to classic MySQL replication that most people out there still are using.
But wait, this is not yet the whole story.
We had a really nice lunch with Alex and Sakari from Codership (who produce Galera) before we both went on Summer vacation. So Alex pointed out that users are still not happy with the above. You see, even if we now have good data integrity and don't need to hassle with complex external clustering suites that really are just a bunch of init scripts on steroids... users are not happy, because when a node has failed, the application will continue to try to connect to it and you'll get a lot of connection errors at your application server. So in practice people still use either virtual ip or some kind of tcp/ip load balancer in front of Galera.
And it's not just that that's unnecessary and brings back some complexity that is not really needed, in the worst case those solutions risk bringing back a Single Point of Failure when we've just spent so much effort making MySQL highly available.
It then struck me that I may have been the only person in the world who actually knew that this is already a solved problem. It's no wonder, because the needed JDBC parameter is still not really explained anywhere in the MySQL manual. But the feature is there and a few MySQL Cluster users are using it in production. The best documentation you have is a blog I once wrote. While it talks about MySQL Cluster, the same works as is against a Galera cluster too - these two beutiful products share the same nice properties when it comes to high availability.
So in short, when using a Galera cluster, you should use
mysql:loadbalance: in front of your JDBC connection string. This allows you to then give a list of MySQL nodes, which are all writeable masters. The JDBC driver will connect to any one of them and commit the transaction. If a node is not available, it will just try another one. (If a transaction was already in progress, it will fail with an exception, you can then retry it and it will just connect to a new node.) You should read the blog post for details, in particular on how to use the blacklist timeout argument.
The above solution currently only exists for JDBC. I've been following Ulf Wendel's experiments with plugins to the PHP MySQL connector with interest. Even if his team has been focusing on read-write splitting, it seems doing the same for a multi-master cluster is not difficult (hint, hint...).
The beauty of this architecture is not only that it achieves sound data integrity, but also that we have now removed any additional components that are simply not needed anymore. We just have a layer of redundant application nodes and another layer of redundant database nodes, and that's all that's needed. This again proves the principle that the best solutions also tend to be simple solutions. If your solution is complex, you're not doing it right.