Document oriented DB basics
eval("data = " + jsondocument); to get the data into a variable (unserialize). For more information about JSON and how a JSON database would work, see the Wikipedia article and for instance the MongoDB manual.
Another format that has been used for document stores is XML documents, and many programming languages can serialize objects into XML.
In my exploration I've used XML, since MySQL/MariaDB has XML XPath functions builtin since 5.1. The below approach could also be used for JSON or any other format, but then you'd need functions to parse JSON documents, and by googling I couldn't even find a User Defined Function that would do that. (And even if I had one, Virtual Columns doesn't seem to support UDF functions.)
Using MariaDB as a document store (schemaless)
MySQL has been successfully used in a document store resembling way. A couple years ago Friendfeed posted an excellent narrative of their migration to what they call a "schemaless" MySQL database. They concluded that this approached allowed them to stay on MySQL as a stable, proven and familiar database, rather than migrating to one of the new NoSQL databases.
I have personally also used this approach sometimes when quickly putting together some PHP site, since I like the flexibility I get when I can just toss new elements into my PHP array without having to ALTER TABLE in the database all the time. However, FriendFeed's experience was that they also got much better performance thanks to the simplicity of the design.
So what you do is that you simply fall back to using MariaDB as a simple key-value store, such as:
CREATE TABLE xmlstore ( id INTEGER UNSIGNED NOT NULL, doc TEXT, PRIMARY KEY (id) ); INSERT INTO xmlstore (id, doc) VALUES (1, '<user> <id>1</id> <username>hingo</username> <name>Henrik Ingo</name> <status time="2010-21-10 13:16">I\'m writing an example XML document</status> <friends> <friend_id>9</friend_id> <friend_id>91</friend_id> <friend_id>92</friend_id> <friend_id>93</friend_id> <friend_id>94</friend_id> <friend_id>95</friend_id> <friend_id>96</friend_id> <friend_id>97</friend_id> <friend_id>98</friend_id> <friend_id>99</friend_id> </friend> </user>');
Now your actual record is completely inside the
doc BLOB. MariaDB doesn't care about the fields in it. Different records can have different fields. If you want to change the "schema", you simply store some different data on the application level and push that into MariaDB instead.
The uptake here compared to a traditional fully normalized RDBMS approach is flexibility, and apparently also performance in FriendFeed's case. Naturally you lose the ability to do any queries on the fields in the XML document, you can just fetch by primary key, which you therefore need to know. This is why document stores are also called key-value databases.
Using Virtual Columns to create secondary indexes for the document
But suppose we also need to fetch this data based on the
username. Thanks to virtual columns, we can actually add an automatically generated column and index it. MySQL 5.1 introduced the
ExtractValue() function where you can use XPATH expressions to extract snippets from an XML document:
MariaDB [test]> DROP TABLE xmlstore; Query OK, 0 rows affected (0.04 sec) MariaDB [test]> CREATE TABLE xmlstore ( id INTEGER UNSIGNED NOT NULL, doc TEXT, username CHAR(10) AS (ExtractValue(doc, '/user/username')) PERSISTENT, PRIMARY KEY(id), KEY (username) ); ERROR 1642 (HY000): Function or expression is not allowed for column 'username' MariaDB [test]>
Ok, so this is where my beutiful plan falls apart. The needed XML function isn't supported for virtual columns :-( It is deterministic, so it should be allowed. Philip has confirmed this is a bug and will be fixed in future releases. (Actually, it seems to have been excluded already in Zhakov's original implementation.)
That was kind of a bummer. While waiting for the bug to be fixed, I can still continue my testing just using
-- 10 is a magic number, it is the length of the string '<username>' MariaDB [test]> CREATE TABLE xmlstore ( -> id INTEGER UNSIGNED NOT NULL, -> doc TEXT, -> username CHAR(10) AS (SUBSTR(doc, -> LOCATE('
', doc)+10, -> LOCATE('', doc)-(LOCATE(' ', doc)+10)) -> ) PERSISTENT, -> PRIMARY KEY (id), -> KEY (username)); Query OK, 0 rows affected (0.01 sec) -- same insert as above... MariaDB [test]> SELECT * FROM xmlstore\G *************************** 1. row *************************** id: 1 doc: <user> <id>1</id> <username>hingo</username> <name>Henrik Ingo</name> <status time="2010-21-10 13:16">I'm writing an example XML document</status> <friends> <friend_id>9</friend_id> <friend_id>91</friend_id> <friend_id>92</friend_id> <friend_id>93</friend_id> <friend_id>94</friend_id> <friend_id>95</friend_id> <friend_id>96</friend_id> <friend_id>97</friend_id> <friend_id>98</friend_id> <friend_id>99</friend_id> </friend> </user> username: hingo 1 row in set (0.00 sec) MariaDB [test]> EXPLAIN SELECT * FROM xmlstore WHERE username='hingo'\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: xmlstore type: ref possible_keys: username key: username key_len: 31 ref: const rows: 1 Extra: Using where 1 row in set (0.00 sec) MariaDB [test]>
So while waiting for
ExtractValue() to become supported, we can still use
SUBSTR() for any non-complex XML document to extract the node that we want to index, and a SELECT on it will use the index. This is now a very useful document oriented database we have created.
Limitations: no secondary indexes for shards
The main limitation I can think of here is that the secondary indexes created with virtual columns cannot be sensibly used if your database is sharded. But now that I think about it, this is true for secondary indexes just in general.
The FriendFeed narrative has a nice example how they used a separate table to implement each of their index. (And each such table is then sharded on its own.) This has the added benefit that it nicely also works around the limitation that MySQL/InnoDB doesn't support online ALTER TABLE operations, so you cannot (easily) add or drop indexes on a large production database, but you can easily create and drop tables! (But see Facebook's Online Schema Change for a nice workaround-tool.) On the other hand, it then becomes a burden for the application layer to maintain all indexes, whereas with virtual columns this is handled by the database. Also in the FriendFeed implementation the secondary "index tables" are not guaranteed to be consistent with the actual data table (because you don't have transactions across shards) which makes it even trickier for the app layer to actually use them.
I guess MySQL Cluster again deserves a mention here: it has transparent sharding, including for secondary indexes. These are consistent across the cluster, they are updated and read within a transaction. And nowadays it even sports ADD INDEX as online operation! I suppose this is yet another technique where MySQL Cluster might just rock!
Coming up next: The benchmark
I mentioned I did some sysbench runs on the above schema. I have already written enough for one post, so I will save that for a separate post. The bet is still on :-)