Writing a data loader for database benchmarks

A task that I've done many times in my career in databases is to load data into a database as a first step in some benchmark. To do it efficiently you want to use multiple threads. Dividing the work onto many threads requires good comprehension of third grade math, yet can be surprisingly hard to get right. The typical setup is often like this:

  1. The benchmark framework launches N independent threads. For example in Sysbench these are completely isolated Lua environments with no shared data structures or communication possible between the threads.
  2. Each thread gets as input its thread id i and the total number of threads launched N.
  3. I want to create M tables/collections with data. The challenge is to divide M evenly over N threads.
  4. Alternatively I want to create a single table/collection with M records. The challenge is the same and so is the solution. For this blog post I will focus on M collections, and a single thread loads one collection alone.
  5. Typically there's a need to generate names for the tables/collections that are contiguous and deterministic, like collection0, collection1, etc... And the same goes for id's / primary keys of the records.

The naive solution is to just divide M / N. For example, say I want to use 16 threads to create 100 collections. 100 / 16 is 6.25 and rounding down to a full integer that gives 6 collections per thread. The collections need a unique name. Given the input data, what we can do is to use the thread id to compute a unique and contiguous range for each thread to use for names. So for example thread i will know that the threads 0 .. i-1 will create collections 0 to (i-1) * floor(M / N). So continuing from example above, thread nr i=3 should create 6 collections, starting from collection12 and ending at collection17. In Sysbench Lua code it might look like this:


function parallel_prepare (thread_id, num_threads)
    if thread_id > sysbench.opt.num_collections then
        return
    end
    local my_num_collections = math.floor(sysbench.opt.num_collections / num_threads)
    local my_first_collection = thread_id * my_num_collections
    -- unlike python, lua includes the last parameter to for
    for n=my_first_collection, my_first_collection + my_num_collections - 1 do
        -- do the actual work
        ...
    end

Many readers may have noticed the above naive implementation will only create 96 collections, not 100. We need to also account for the remainder of 4 collections that happens when we round down my_num_collections to an integer. The easy solution is to just let the last thread take care of the remainder. The nice thing about this approach is that the math for all the other threads stays the same. So we just add:


    -- last thread takes care of the remainder
    if thread_id == num_threads-1 then
        my_num_collections = my_num_collections + sysbench.opt.num_collections % num_threads
    end

Great! We now have a correct loader for our database benchmark. Time for performance analysis. If we now use the above code to create 100 collections with 16 threads, then threads 0 to 14 will create 6 collections, and thread 15 will create 6+4 = 10 collections. If we simplify and assume that the system isn't saturated in any component, then the total time for the load is the time it take to create 10 collections. That is a 10x speed improvement compared to just using a single thread. But ok, what if we want to go even faster? Let's use 32 threads. Now most threads create 3 collections, and the last thread creates 3 + 4 = 7. So the total time for the load phase is the time it takes to create 7 collections. Let's double again: I'll use 64 threads. Now each thread will create 1 collection, and the last thread will create 1 + 36 = 37 collections. So the total time for the load phase is the time it takes to create 37 collections. Wait, what? In most cases the remainder is not a big deal, and just leaving it to the last thread is fine. But this is not the optimal solution and the example above illustrates a case where it's severely counter productive. The optimal solution of course is to also spread the remainder over multiple threads. By definition the remainder is smaller than the total number of threads, so it is always possible to add 1 collection per thread. But now we have some threads with one collection more than others. This makes that calculation of my_first_collection harder! Here's my Lua code:


function parallel_prepare(thread_id, num_threads)
    if thread_id > sysbench.opt.num_collections then
        return
    end
    local my_num_collections = math.floor(sysbench.opt.num_collections / num_threads)
    -- share the remainder so that each thread gets one each, starting with thread_id 0
    if thread_id <  sysbench.opt.num_collections % num_threads then
        my_num_collections = my_num_collections + 1
    end
    local my_first_collection = thread_id * my_num_collections
    -- For threads larger than the remainder, my_num_collections is 1 smaller, so compensate and add the remainder back
    if thread_id >= sysbench.opt.num_collections % num_threads then
        my_first_collection = my_first_collection + sysbench.opt.num_collections % num_threads
    end
    local my_last_collection = my_first_collection +my_num_collections - 1
    -- unlike python, lua includes the last parameter to for
    for n=my_first_collection, my_last_collection do

Now the total time for the load phase is always optimal ceiling(M / N). For 16 threads it is the time it takes to load 7 collections, for 32 threads it is 4, and for 64 it is 2. Above code is from our repository of Sysbench benchmarks for MongoDB, in particular many_collections.lua and load.lua.

I am a big fan of making things optimal for a model. The model here is that number of docs inserted predicts insert time. Do you a solution for the case where that isn't true?

Add new comment

The content of this field is kept private and will not be shown publicly. Cookie & Privacy Policy
  • No HTML tags allowed.
  • External and mailto links in content links have an icon.
  • Lines and paragraphs break automatically.
  • Web page addresses and email addresses turn into links automatically.
  • Use [fn]...[/fn] (or <fn>...</fn>) to insert automatically numbered footnotes.
  • Each email address will be obfuscated in a human readable fashion or, if JavaScript is enabled, replaced with a spam resistent clickable link. Email addresses will get the default web form unless specified. If replacement text (a persons name) is required a webform is also required. Separate each part with the "|" pipe symbol. Replace spaces in names with "_".
About the bookAbout this siteAcademicAccordAmazonBeginnersBooksBuildBotBusiness modelsbzrCassandraCloudcloud computingclsCommunitycommunityleadershipsummitConsistencycoodiaryCopyrightCreative CommonscssDatabasesdataminingDatastaxDevOpsDistributed ConsensusDrizzleDrupalEconomyelectronEthicsEurovisionFacebookFrosconFunnyGaleraGISgithubGnomeGovernanceHandlerSocketHigh AvailabilityimpressionistimpressjsInkscapeInternetJavaScriptjsonKDEKubuntuLicensingLinuxMaidanMaker cultureMariaDBmarkdownMEAN stackMepSQLMicrosoftMobileMongoDBMontyProgramMusicMySQLMySQL ClusterNerdsNodeNoSQLodbaOpen ContentOpen SourceOpenSQLCampOracleOSConPAMPPatentsPerconaperformancePersonalPhilosophyPHPPiratesPlanetDrupalPoliticsPostgreSQLPresalespresentationsPress releasesProgrammingRed HatReplicationSeveralninesSillySkySQLSolonStartupsSunSybaseSymbiansysbenchtalksTechnicalTechnologyThe making ofTransactionsTungstenTwitterUbuntuvolcanoWeb2.0WikipediaWork from HomexmlYouTube