In this post I will share results of some "benchmarking" I did on the database created in the previous post: MariaDB 5.2: Using MariaDB as a document store and Virtual Columns for indexing. In addition to just playing with the new syntax, I wanted to actually benchmark using virtual columns against some other techniques. If you didn't read that previous post yet, please do so now, so that you know the schema that is being used and the whole point of what we are doing.
The premise for this benchmark was already given last week:
Before I write the next blog, I invite you to guess the result of the benchmark. I had two conflicting rules of thumb as hypothesis:
- SELECTING via a virtual column only one word instead of a large XML document saves bandwidth and is therefore faster than the standard key-value fetch.
- SELECTING via a virtual column only one word instead of a large XML document uses more CPU in the database and is therefore slower.
I tested using both VIRTUAL and PERSISTENT virtual columns, in CPU bound, disk bound and network bound scenarios (i.e. performance dominated by network latency).
Place your bets, the results will be published later this week!
Note that these tests do not use the secondary key on 'username' for anything. Getting a secondary index is the main benefit of the schema I laid out in the previous post. But there is nothing to benchmark, since there isn't really anything to compare against. Without the secondary index you simply cannot filter on username at all (except for a full table scan, which is "not at all").
What I wanted to test rather was: Is there a difference one way or another if I SELECT the username column (which uses virtual columns) instead of the full doc column (which is just a normal key-value get)?
I've published the scripts I used on Launchpad: lp:~hingo/sysbench/mariadb-tests. There are 7 similar tests using the schema from the previous post. (...which you should have read by now :-)
- kv-xml.lua: SELECT the xml document by primary key. (Normal key-value.)
- kv-xml-vcol.lua --vcol-type=persistent: SELECT username column only, by primary key. Use PERSISTENT virtual columns.
- kv-xml-vcol.lua --vcol-type=virtual: SELECT username column only, by primary key. Use VIRTUAL virtual columns.
- kv-xml-viewsubstr.lua: SELECT username column from a VIEW. This compares the performance of using a view and a VIRTUAL virtual column. The view uses the same SUBSTR() method as the kv-xml-vcol.lua test.
- kv-xml-view.lua: SELECT username column from a VIEW. As above, but this test uses ExtractValue() instead of SUBSTR().
- kv-xml-extractvalue.lua: SELECT the username from the XML inside the doc column, using the ExtractValue() function directly in the SELECT statement. (No virtual columns or views involved, but computationally the same effort.)
- kv-xml-substr.lua: SELECT the username from the XML inside the doc column, using the SUBSTR() method that was used in kv-xml-vcol.lua directly in the SELECT statement. (No virtual columns or views involved, but computationally the same effort.)
The above in code:
kv-xml.lua:
  SELECT doc FROM sbtest WHERE id = ?
kv-xml-vcol.lua:
  SELECT username FROM sbtest WHERE id = ?
kv-xml-viewsubstr.lua:
  CREATE VIEW sbtest_v (id, doc, username) AS SELECT id, doc, SUBSTR(...) FROM sbtest ...
  SELECT username FROM sbtest_v WHERE id = ?
kv-xml-view.lua:
  CREATE VIEW sbtest_v (id, doc, username) AS SELECT id, doc, ExtractValue(doc, '/user/username') FROM sbtest ...
  SELECT username FROM sbtest_v WHERE id = ?
kv-xml-extractvalue.lua:
  SELECT ExtractValue(doc, '/user/username') FROM sbtest WHERE id = ?
kv-xml-substr.lua:
  SELECT SUBSTR(...) FROM sbtest WHERE id = ?
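For reference, the table behind all of these tests looks roughly like the sketch below. The exact SUBSTR() expression is in the previous post and the benchmark scripts; the LOCATE()-based expression here is just an illustrative stand-in:

```sql
CREATE TABLE sbtest (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  doc BLOB,
  -- Illustrative extraction expression: '<username>' is 10 characters,
  -- hence the +10 offsets. See the previous post for the real one.
  username VARCHAR(32) AS (
    SUBSTR(doc,
           LOCATE('<username>', doc) + 10,
           LOCATE('</username>', doc) - (LOCATE('<username>', doc) + 10))
  ) PERSISTENT,
  KEY (username)  -- in MariaDB 5.2 only PERSISTENT columns can be indexed
);
```

Changing PERSISTENT to VIRTUAL (and dropping the index) gives the on-the-fly variant tested with --vcol-type=virtual.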
I only ran the tests on my laptop (2 cores), using a default MariaDB 5.2 installation and my.cnf file (including the limitation of max 100 connections). Each run was short, only a few minutes. So I won't share the actual numbers in public, just summarize the behavior and conclusions. Maybe if there is interest, I can later ask Hakan to run the tests on the multicore servers he has access to, to get more reliable numbers.
The only other tweak is that all SELECTs are done with SQL_NO_CACHE to bypass the query cache. The dataset I used was so small that with the query cache on, all variants would have exactly the same performance, being served 100% from the query cache.
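In other words, every benchmark query is issued in this form (the id value is made up for illustration):

```sql
SELECT SQL_NO_CACHE doc FROM sbtest WHERE id = 42;
```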
CPU bound vs Disk bound vs Network bound
- With a small database that fits fully into the buffer pool, all of the 7 tests become CPU bound where MariaDB uses about 130% and sysbench about 65% (leaving the last 5% for the system, mostly X). In this scenario you will see differences with the above queries, which are reported below.
- With a large database the tests are disk bound. All tests have roughly the same performance. The performance is limited by disk seeks, so there is no advantage in SELECTing the virtual column username instead of the full doc BLOB. The hypothesis here was that fetching a smaller string would be faster than fetching the whole blob. This is not true. (It is left as an exercise to the reader to find out how large the BLOB needs to be before virtual columns provide a benefit here. A few megabytes? Gigabytes?)
- I also tested between two laptops, in which case the performance is limited by network latency. Again, the size of the data fetched makes no difference. I was lazy and just used my WiFi, but my prior experience tells me that in a faster network too, it is the round trip that is expensive, not the amount of data transferred.
Comparing the different tests
So it is mainly in the CPU bound case that the tests show any meaningful differences. While I won't share a lot of numbers, the below percentages show relative performance with each approach. (100% is 27242 tps)
kv-xml.lua                                  100%
kv-xml-vcol.lua --vcol-type=persistent       99%
kv-xml-vcol.lua --vcol-type=virtual         100%
kv-xml-viewsubstr.lua                        68%
kv-xml-view.lua                              60%
kv-xml-extractvalue.lua                      71%
kv-xml-substr.lua                            94%
Any test w/ query cache on                  136%
Of course, to get more reliable results, you'd need to do a longer test run on actual servers. But assuming the above numbers would be similar also on a multicore system, we could then draw the following conclusions:
Regarding virtual columns
- SELECTing the username column is exactly as efficient as selecting the whole BLOB; there is no difference one way or the other. This means that if you like the elegance of virtual columns, nothing stops you from using them, and it will save you some space. (Note that we used the SUBSTR() method here.)
- It doesn't matter whether you use VIRTUAL or PERSISTENT columns. There is no penalty in calculating the column value on the fly. (This result is a surprise to me, I expected PERSISTENT to be faster on SELECTs.)
- Comparing kv-xml-vcol.lua and kv-xml-substr.lua, it seems that extracting the username as a virtual column is 5% faster than having the function in the SELECT against a traditional table. 5% is large enough that I believe this difference would also show on a larger system. It is not obvious to me why virtual columns are faster.
- It is probably due to random variation, but it intrigues me that VIRTUAL columns are slightly faster than PERSISTENT ones. I'm tempted to set up a more rigorous test to find out whether this is true or just variation. One rule of thumb in writing high-performance code is that using compact data often results in better utilization of CPU caches, and this can have a much higher impact on performance than many other factors. With VIRTUAL columns you move less data in the table and then compute the column, whereas with a PERSISTENT column there is less computation but you actually move another column between RAM and CPU. So one could speculate that a VIRTUAL column would therefore be 1-2% faster than a PERSISTENT column, even if this is counterintuitive at first. (Even the fact that they are equal in a CPU bound test is counterintuitive, if you ask me.)
Regarding ExtractValue() vs SUBSTR()
- Both with the function in the SELECT and using a VIEW, ExtractValue() is 25% slower than using SUBSTR(). It seems that its implementation isn't as efficient as it could be.
- This result is funny to me! MySQL 5.1 had 5 new features, and this was one of them. When I used to sell MySQL, many customers would ask in their RFQ whether MySQL has "XML support". They didn't specify it more than that, and when asked, they didn't even know what they wanted "XML support" for. So I always answered YES to those questions, and even educated almost 100 engineers at one company in how to use ExtractValue() and UpdateXML(). According to this result, you should probably avoid these functions as much as possible. Use SUBSTR() as a cheat - if you know the schema of your XML beforehand it is safe (albeit ugly) to do so. Or just parse the XML in the application code instead - you should always avoid adding computation to the database if possible.
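To make the "cheat" concrete: assuming the document always contains exactly one <username> element, the two queries below return the same value, yet the SUBSTR() variant was about 25% faster in these tests. (The LOCATE() arithmetic is my illustration; the benchmark scripts contain the exact expression.)

```sql
-- XPath-based extraction (the slower option in these tests):
SELECT ExtractValue(doc, '/user/username') FROM sbtest WHERE id = 1;

-- The SUBSTR() "cheat": safe only if you know the XML layout beforehand.
-- '<username>' is 10 characters long, hence the +10 offsets:
SELECT SUBSTR(doc,
              LOCATE('<username>', doc) + 10,
              LOCATE('</username>', doc) - (LOCATE('<username>', doc) + 10))
FROM sbtest WHERE id = 1;
```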
- I guess we were lucky that ExtractValue() wasn't supported for virtual columns, otherwise I wouldn't have got these interesting results.
There is a rule of thumb that you should only do simple operations in the database, and move all computation to the application layer. This is because the application layer is much easier to scale out than the database. The above results suggest that a cheap operation like SUBSTR() is ok to do inside the database (if there is some benefit, of course), whereas you should watch out for functions that are a drag on performance. If you need to parse an XML document, do it in Java, PHP... not in MariaDB!
- Views are 30+% slower than equivalent virtual columns. This means you should use virtual columns when possible. (Remember that Views can still do many other things a virtual column can't, such as joining two separate tables.)
- In this test, using a View was also much slower than just putting the function in the SELECT. If confirmed in a proper test run on larger systems, there is reason to suspect the implementation of Views is suboptimal.
Query cache: Since I first ran the tests with query cache on, we can see that it added 36% performance to the best case without query cache. This was of course a small dataset, but it can be used as a kind of "upper bound" for the best available performance. When query cache was on, MariaDB would use 100% of CPU and Sysbench around 80%.
XML Support in MySQL/MariaDB
Thanks for the benchmarks :)
warning: rant ahead :)
Re. XML Support in MySQL/MariaDB: IMHO, the tragedy is that MySQL chose to support features that are next to useless.
The specific use case for EXTRACTVALUE and UPDATEXML, supporting a key/value store in an XML blob, is certainly not something I would recommend. There may be some cases where I would, but then I would do my best to keep the logic for searching and analyzing that data outside the database as much as possible.
It is possible to have serious XPath support, but it would require an XML data type with support for specialized XML indexes. But even then, I would probably not use it a lot - the relational model was invented for a reason. If your application requirements are such that you do not require a relational model, but would be served with XML document support, I would probably recommend switching to a DBMS that is designed from the ground up for that purpose.
Now, am I against XML support in RDBMS-es? No, not at all. But I feel that an RDBMS is not a good fit for *storing* and *querying* XML data. Rather, I see XML as a data exchange format. There are many applications, including modern web-applications, but also middleware (web services), that utilize XML as a data format. So what would make sense in my mind is to let the RDBMS do what it is good at, namely storing and querying relational data, and to extend SQL so it can *transform* relational data into XML documents. The solution proposed in the ANSI/ISO SQL standard is actually pretty decent in this regard. (Tip: you can do things like that today in MySQL using Arnold Daniels' excellent lib_mysqludf_xql UDFs, see http://www.mysqludf.org/lib_mysqludf_xql/index.php) The other way around, getting the data from XML so it can be stored in the RDBMS, makes sense too, but primitive XPath support like that offered by EXTRACTVALUE and UPDATEXML will make this task very painful and hard to maintain.
So what would useful XML support look like? I feel that Microsoft came up with the most elegant and useful solution yet in the form of XML-Views - see http://msdn.microsoft.com/en-us/library/aa258637(SQL.80).aspx
The idea is to create a special schema object (XML view) based on an XML Schema. Using standard techniques, the XML Schema that defines the view also defines how the XML structure maps to the structure of the relational database schema. The XML view works both ways: you can either select from it and obtain an XML document based on relational data queried from the RDBMS, or push XML documents into the view and have its data extracted and stored in the DB. Note that this is a completely declarative solution - it doesn't even rely on XPath, it simply uses the structure defined by the XML schema and it does so in a straightforward and natural manner.
A similar solution would be feasible for other document types like JSON or CSV. The big win of using specialized view definitions is that the mapping is stored and backed up in the database itself, which is appropriate since it is logic pertaining exclusively to data. Multiple applications can reuse the logic, minimizing the risk of reinventing the wheel (and making mistakes in the process). This solution also makes it easier to deal with schema changes (both in the XML as well as in the RDBMS) since there is exactly one place where the mapping is defined.
Actually, your blog posts on ExtractValue() and UpdateXML() were how I learned about them some years ago! I think at that time the manual might have been shorter on these topics so your blogs had the best examples.
About XML-Views... I think the concept of using an XML schema with a mapping to 2-dimensional tables is ok. But it is not what I see as interesting. If your canonical data is stored in an RDBMS, then fine. There are many ways to convert it into other forms; XML-Views is one, XML in the app layer is another...
But I'm actually interested in the approach where data is stored as documents in its canonical source form. I've found this to provide a lot of flexibility, which results in much more agile development. Just because I add another variable to a class, I don't need to also 1) ALTER TABLE and 2) modify all INSERT, UPDATE and SELECT statements going to the database. I just continue to serialize the object and that's it.
So your suggestion is to continue using properly normalized relational tables, then "expose" them as XML too. My suggestion with this blog post is 1) you can just use MariaDB as a document store directly, then use virtual columns to extract back relational features as needed (but the XML/JSON document is the canonical data) and 2) you don't need to use one of the specialized document oriented (NoSQL) databases, since using MySQL/MariaDB gives you the same features (and more) in a familiar environment.
Hi Henrik!
I fear my original comment did too little justice to your approach and the work you did on the benchmarks. I apologise for that - I should've taken more time to explain my stance.
Anyway, I did write a blog about those functions, and I'm glad the article was useful to you. But wouldn't you agree that that API is quite clunky? For example, setting up a loop to repeatedly extract data from a set of elements is not very obvious unless you already have XPath skills. Doing non-trivial things in XPath is actually pretty hard... Well, I understand that what one considers convenient or elegant is ultimately a matter of taste, but still... there's got to be a better way. At least, I think so...
I can certainly see that sometimes, a JSON or XML blob is the right approach. But I would probably almost never want to query that data. For things I do not want or need to query, I think a "schemaless" or flexible-schema type of solution is fine. But outside those cases (and I think this is the majority) I feel schemaless is not really a blessing. Sure, with a schema there will be some development overhead to synchronize the object model and the relational schema. But really - the work is always very clear and very straightforward in these cases, and I dare say, doing this bit of work upfront to ensure data quality and consistency in the long term is actually well worth it.
So IMO, the particular use case you describe is actually pretty easy to solve at whatever level you want to. Solutions for problems that were easy in the first place are nice, but don't really make me tick. Solutions to remove the complexity of mapping object serialization or document formats to relational models are in my experience not so trivial, and require quite a bit of architecting. One could take an ORM but typically those are application language dependent and introduce another level of complication since they typically support their own query language on top of SQL. So to have this bit inside the database would really fill a gap IMO.
The E-BLOB pattern can reduce I/O and complexity significantly when there are a lot of optional attributes in your data structure.
An example of this is 'feats' in a video game. There may be dozens or even hundreds of feats, and each has different attributes. If you dedicate a relation per feat, then you could have multitudes of tables with tens of columns each.
Alternatively, a column that contains an E-BLOB, with some column attributes extracted is more manageable.
You could use Flexviews (or triggers) to create "indexes" (as separate attribute tables) and never have to worry about ALTER TABLE or "OSC" at all.
One advantage that virtual columns have over triggers is that the value for each expression is populated immediately. If you want to use triggers, then you have to go to through some pain to create the index in a way that is consistent (in a transactional sense) with the base table.
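As a sketch of what the trigger-based alternative looks like (the table and trigger names are made up for illustration, and a complete solution would also need UPDATE and DELETE triggers):

```sql
-- A separate attribute table acting as the "index":
CREATE TABLE sbtest_username (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  username VARCHAR(32),
  KEY (username)
);

-- Keep it in sync on writes. Unlike a PERSISTENT virtual column,
-- nothing guarantees this stays transactionally consistent for free.
CREATE TRIGGER sbtest_username_ai AFTER INSERT ON sbtest
FOR EACH ROW
  INSERT INTO sbtest_username (id, username)
  VALUES (NEW.id, ExtractValue(NEW.doc, '/user/username'));
```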
I fear my original comment did too little justice to your approach and the work you did on the benchmarks. I apologise for that - I should've taken more time to explain my stance.
No harm done. As I read it, you agreed with me, then introduced something new onto the table. Your knowledge of XML in databases still exceeds mine, and is most welcome!
I can certainly see that sometimes, a JSON or XML blob is the right approach. But I would probably almost never want to query that data.
It seems to me, many "web-scale" environments are more and more like this today. (In fact, telco too, now that I think about it.) Their main priority is to just keep up with the performance requirements for their simple key-value operations. Doing any other queries than that is out of the question anyway because a) it takes away performance from the main priority and b) the dataset is just too large to do any other queries than the simple primary key based ones.
The more complicated analysis is done in a separate database: a data warehouse (snowflake schema and everything) or some Hadoop cluster or whatever. To get the data there you need an ETL process anyway, so it seems to me that for that analysis it doesn't matter what format the operational database uses.
One could take an ORM but typically those are application language dependent and introduce another level of complication since they typically support their own query language on top of SQL. So to have this bit inside the database would really fill a gap IMO.
I agree about the ORMs. I've never got the point of them. Their query language seems equally complex to SQL; I just don't see how they make anyone's life any easier.
While I understand how XML-Views is a nice way to map relational tables to XML, it is not immediately obvious to me how they would make life easier for a Java or PHP developer. I would then just have to instantiate my objects from XML, instead of SQL.
Hi!
"While I understand how XML-Views is a nice way to map relational tables to XML, it is not immediately obvious to me how they would make life easier for a Java or PHP developer. I would then just have to instantiate my objects from XML, instead of SQL."
There are 2 things that I think are important here: if you have to instantiate objects directly from SQL, you need multiple roundtrips to the DB to account for relationships. In addition, you need to post-process the data sets to stitch the related sets back together again.
In some cases, lazy evaluation is fine for that, but in other cases you really want to grab a set of complete objects, including their constituent objects (and possibly their constituent objects, and so on).
If the relationships are such that there is only one type of constituent object each time, you can optimize it into one roundtrip by joining the tables and doing the grouping at the client side to normalize the join result back to objects. But still, this grouping is not something you really want to do when building an application - this should be part of the infrastructure. Plus, this approach doesn't work when you have multiple different constituent objects that are themselves not related to each other. For example, in a social network, "friends" and "hobbies" may be constituent objects of the user object, but they are not related to each other. So writing a join of user, hobbies and friends leads to a cartesian product of all hobbies and friends per user.
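A small example with hypothetical user, friend and hobby tables: a user with 3 friends and 4 hobbies comes back as 3 x 4 = 12 joined rows, each friend repeated for every hobby:

```sql
-- friend and hobby are unrelated to each other, so joining both
-- onto user yields every friend x hobby combination per user:
SELECT u.id, f.friend_name, h.hobby_name
FROM user u
JOIN friend f ON f.user_id = u.id
JOIN hobby  h ON h.user_id = u.id
WHERE u.id = 1;
```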
Now, XML and JSON have no problem expressing these kinds of relationships, and it is relatively straightforward to instantiate objects from these kinds of documents. But efficiently generating these documents is not so straightforward. It's certainly doable and even relatively simple for a single user, but if you want to retrieve a set of users, along with their friends and hobbies, you need multiple queries and processing to merge the sets to reflect the proper nesting. This is something you don't want to bother with when building applications - there are just too many choices to implement it productively, and too many things to mess up.
If the RDBMS were aware of these requirements, data access could be optimized for it. Currently, RDBMSes are typically not capable of doing these things because of the strong focus on SQL. SQL can only deliver tabular results - typically you can't create sub-1NF results. There are some constructs in ANSI SQL that allow sub-1NF results, but from what I have seen so far the syntax is really clunky IMO. I also believe (but I should check) that the mapping of these nested result structures to host language structs or objects is not standardized, and this is where JSON or XML would help.
XML-Views, got it!
Thanks for explaining. I think I got it now.
So with XML-Views I save roundtrips, which is good, and in general it allows me to query "bundles" of data without having to make a big 2 dimensional table out of it. That is useful indeed.
Btw, this is a criticism I've heard before from the NoSQL camp. Some NoSQL databases have the notion of data "bags", which is kind of like a JOIN, but the different tables are returned as separate lists (in one roundtrip), without the cartesian join that SQL traditionally does. I do see how this is more efficient, and it also makes better sense, as you explain above.
So now XML-Views is my RDBMS counterpart for bags :-)
Hi Henrik! Well, XML views is just one way to solve it. Another way would be to extend SQL to truly allow sub-1NF resultsets.
Frankly I don't expect a whole lot of that, since SQL wasn't originally designed to do that.
That doesn't prevent one from designing a query language that can deal with these constructs from the ground up.
I have been looking at MQL, the query language used by Freebase. I think that is a very interesting one, and I think the mapping to the RDBMS is rather natural. It uses JSON rather than XML, but that is mostly a matter of representation, not concept.
I put up a proof of concept implementation of it here: http://code.google.com/p/mql-to-sql/
The wiki explains a lot about how it works, and there's also a link to an online demo (based on the MySQL sakila database).
Virtual Columns vs using Functions in the select clause
Is there a difference between using a virtual column and using the function in the select clause? Are there any additional benefits I get from using virtual columns?
Instead of creating a virtual column, can't I do something like this:
select col1, col2, calculate(col1, col2) as col3 from tab1;
rather than using
select col1, col2, col3 from tab1; -- where col3 is a virtual column.
If you use PERSISTENT virtual columns, then you can index them. This is imho useful in many scenarios.
For the non-persistent variant, it is more syntactic sugar, like using a view instead of repeating the function in each SELECT. But as it turns out, virtual columns are much faster than views.
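For example, sketching with the tab1 names from the question (note that in MariaDB 5.2 a virtual column expression must use deterministic built-in functions, so a stored function like calculate() could not be used directly; col1 + col2 below is a stand-in expression):

```sql
ALTER TABLE tab1
  ADD col3 INT AS (col1 + col2) PERSISTENT,
  ADD KEY (col3);

-- This can now use the index on col3, which the
-- function-in-the-SELECT form never can:
SELECT col1, col2, col3 FROM tab1 WHERE col3 = 100;
```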