We then finally came to the topic that comes naturally to anyone familiar with the MySQL architecture. Could Hadoop, or Hive, or whichever, be plugged into MySQL as a storage engine? And why would you want to do that? And can Timour's work to push down JOIN conditions be of any help? (The last question was interesting since Ted and his team were inherently against talking about JOINs at all :-)
We ended up concluding that 2 things could be interesting and considered "low hanging fruit":
SQL to PIG compiler.
(Btw, it is not immediately obvious if this task needs MySQL at all.)
The premise for this discussion was that the structure and dynamics of something like a Hadoop cluster is so different from anything MySQL does now, that we don't expect the current optimizer to be much useful in actually retrieving data from a NoSQL cluster (Hadoop or HBase or otherwise). By transforming an SQL query to PIG, you could then pass the task on to your PIG implementation and benefit from the optimizer work already ongoing there.
Hadoop as MySQL storage engine
We eventually decided that one could develop a MySQL storage engine that would tap into a Hadoop cluster as a datasource. Ted had many questions about this, and here are the main points to know if you don't know MySQL internals deeply:
- The relevant unit to think about is a 2 dimensional table. The storage engine provides a mechanism to manipulate and query data from one or more 2 dimensional tables. How data is physically stored, is left to the engine implementation.
- So essentially the storage engine can be a source or sink of arbitrary data. As an example I mentioned Jan Kneschke's Wormhole engine, which allows you to create Lua scripts to return arbitrary data. (so it does not actually store any data)
- A Hadoop storage engine could therefore be based on the concept of defining mappings from the Hadoop cluster into MySQL tables. The same Hadoop data could be the source for more than 1 table, the mapping can be rather arbitrary.
- Where a Hadoop user would typically run MapReduce jobs to produce result sets from the original source data, these can be thought of as materialized views in the SQL paradigm. Say I have mapped some data to be visible as MySQL tables a and b, a certain JOIN (with certain conditions) between a and b could be defined as the result set c from a MapReduce job that is periodically run. (The fact that at a given point in time the result set c does not actually reflect a real-time join of a and b is considered a feature, in the "eventual consistency" paradigm.
- An independent topic was how to handle the .frm files, which is MySQL's way of storing the table metadata as a local file in the filesystem. The issue here is that a) your Hadoop "table" already exists and b) in a cluster of MySQL front ends to a Hadoop cluster, it is not trivial to make sure that all local frm files are in sync. This was long a problem in MySQL Cluster too, has been somewhat solved now. Sanja was well familiar with this code, but we did not delve deeper into it.
A simple way to start would be to develop an engine that can expose Avro files as tables in MySQL. Avro is the serialization format that seems to have become standard now in the Hadoop world. (Essentially the source and result of a MapReduce job, would be an Avro file.)
The motivations to expose Hadoop data as MySQL tables are simple: 1) You could run your traditional reporting tools against it and 2) you could possibly have some benefit of having Hadoop data available in an existing database, where the other tables are normal InnoDB tables and 3) as a special case, use the Hadoop table as a data sink where you insert data later to be processed by MapReduce. An interesting test would be to see if such a table would be a better target for log data than Archive engine is. Probably would be.