A task that I've done many times in my career in databases is to load data into a database as the first step of some benchmark. To do it efficiently you want to use multiple threads. Dividing the work across many threads requires nothing more than third-grade math, yet it can be surprisingly hard to get right. The typical setup is often like this:

- The benchmark framework launches N independent threads. For example in Sysbench these are completely isolated Lua environments with no shared data structures or communication possible between the threads.
- Each thread gets as input its thread id *i* and the total number of threads launched, *N*.
- I want to create M tables/collections with data. The challenge is to divide M evenly over the N threads.
- Alternatively I want to create a single table/collection with M records. The challenge is the same and so is the solution. For this blog post I will focus on M collections, and a single thread loads one collection alone.
- Typically there's a need to generate names for the tables/collections that are contiguous and deterministic, like collection0, collection1, etc. The same goes for the ids / primary keys of the records.

The naive solution is to just divide *M / N*. For example, say I want to use 16 threads to create 100 collections. 100 / 16 is 6.25, and rounding down to a full integer gives 6 collections per thread. The collections need unique names. Given the input data, each thread can use its thread id to compute a unique and contiguous range of names. Thread *i* knows that threads *0 .. i-1* will create the *i \* floor(M / N)* collections before its own, so its range starts at collection number *i \* floor(M / N)*. Continuing the example above, thread i=2 should create 6 collections, starting from *collection12* and ending at *collection17*. In Sysbench Lua code it might look like this:

```
function parallel_prepare (thread_id, num_threads)
    if thread_id > sysbench.opt.num_collections then
        return
    end
    local my_num_collections = math.floor(sysbench.opt.num_collections / num_threads)
    local my_first_collection = thread_id * my_num_collections
    -- unlike Python, Lua's for loop includes the upper bound
    for n = my_first_collection, my_first_collection + my_num_collections - 1 do
        -- do the actual work
        -- ...
    end
end
```
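The range arithmetic is easy to check outside the benchmark. Here's a small Python sketch (the function name is mine, not from Sysbench) that computes each thread's range under this naive division:

```python
def naive_ranges(num_collections, num_threads):
    """(first, last) collection index per thread, using naive floor division."""
    per_thread = num_collections // num_threads
    return [(t * per_thread, t * per_thread + per_thread - 1)
            for t in range(num_threads)]

ranges = naive_ranges(100, 16)
covered = sum(last - first + 1 for first, last in ranges)
print(ranges[2])   # (12, 17): thread 2 creates collection12 .. collection17
print(covered)     # 96: only 96 of the 100 collections are covered
```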

Many readers may have noticed that the above naive implementation will only create 96 collections, not 100. We also need to account for the remainder of 4 collections that is left over when we round my_num_collections down to an integer. The easy solution is to let the last thread take care of the remainder. The nice thing about this approach is that the math for all the other threads stays the same. So we just add:

```
-- last thread takes care of the remainder
if thread_id == num_threads - 1 then
    my_num_collections = my_num_collections + sysbench.opt.num_collections % num_threads
end
```

Great! We now have a correct loader for our database benchmark. Time for performance analysis. If we use the above code to create 100 collections with 16 threads, then threads 0 to 14 will create 6 collections each, and thread 15 will create 6 + 4 = 10 collections. If we simplify and assume that no component of the system is saturated, the total time for the load is the time it takes to create 10 collections. That is a 10x speed improvement compared to using a single thread.

But what if we want to go even faster? Let's use 32 threads. Now most threads create 3 collections, and the last thread creates 3 + 4 = 7. So the total time for the load phase is the time it takes to create 7 collections. Let's double again: with 64 threads, each thread will create 1 collection, and the last thread will create 1 + 36 = 37 collections. So the total time for the load phase is the time it takes to create 37 collections. Wait, what?

In most cases the remainder is not a big deal, and just leaving it to the last thread is fine. But it is not the optimal solution, and the example above illustrates a case where it is severely counterproductive. The optimal solution, of course, is to also spread the remainder over multiple threads. By definition the remainder is smaller than the total number of threads, so it is always possible to add at most 1 extra collection per thread. But now some threads have one collection more than others. This makes the calculation of *my_first_collection* harder! Here's my Lua code:
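The scaling behaviour above can be tabulated with a short Python sketch (illustrative only, not from the benchmark): the load time is driven by the busiest thread, which holds floor(M/N) collections plus the whole remainder:

```python
def last_thread_load(num_collections, num_threads):
    """Collections on the busiest thread when the last thread absorbs the remainder."""
    base = num_collections // num_threads
    return base + num_collections % num_threads

for n in (16, 32, 64):
    print(n, last_thread_load(100, n))
# 16 -> 10, 32 -> 7, 64 -> 37: adding threads can make the load phase slower
```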

```
function parallel_prepare(thread_id, num_threads)
    if thread_id > sysbench.opt.num_collections then
        return
    end
    local my_num_collections = math.floor(sysbench.opt.num_collections / num_threads)
    -- spread the remainder so that each thread gets at most one extra, starting with thread_id 0
    if thread_id < sysbench.opt.num_collections % num_threads then
        my_num_collections = my_num_collections + 1
    end
    local my_first_collection = thread_id * my_num_collections
    -- for thread ids at or beyond the remainder, my_num_collections is 1 smaller,
    -- so compensate by adding the remainder back
    if thread_id >= sysbench.opt.num_collections % num_threads then
        my_first_collection = my_first_collection + sysbench.opt.num_collections % num_threads
    end
    local my_last_collection = my_first_collection + my_num_collections - 1
    -- unlike Python, Lua's for loop includes the upper bound
    for n = my_first_collection, my_last_collection do
        -- do the actual work
    end
end
```

Now the total time for the load phase is always the optimal *ceiling(M / N)*. For 16 threads that is the time it takes to load 7 collections, for 32 threads it is 4, and for 64 it is 2. The above code is from our repository of Sysbench benchmarks for MongoDB, in particular many_collections.lua and load.lua.
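A quick way to convince yourself the balanced scheme is correct is to mirror the same index math in Python (a sketch of the Lua above, function name is mine) and check that the per-thread ranges tile 0 .. M-1 exactly, with no thread holding more than ceiling(M / N) collections:

```python
import math

def balanced_range(thread_id, num_threads, num_collections):
    """First and last collection index for one thread, remainder spread over the lowest thread ids."""
    base = num_collections // num_threads
    rem = num_collections % num_threads
    count = base + 1 if thread_id < rem else base
    # threads below the remainder all hold count collections; threads at or
    # beyond it hold base collections, shifted up by the whole remainder
    first = thread_id * count if thread_id < rem else thread_id * base + rem
    return first, first + count - 1

M, N = 100, 64
ranges = [balanced_range(t, N, M) for t in range(N)]
# every collection 0..99 is created exactly once, in order
assert [i for a, b in ranges for i in range(a, b + 1)] == list(range(M))
# the busiest thread holds ceiling(M / N) collections
assert max(b - a + 1 for a, b in ranges) == math.ceil(M / N)
```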


## Optimal

I am a big fan of making things optimal for a model. The model here is that the number of docs inserted predicts insert time. Do you have a solution for the case where that isn't true?

## The above is of course

The above is of course a simplification, in the sense that you can't always add more threads and expect things to get faster. But beyond that I might not have fully understood what you're thinking of. Can you elaborate?

## I think you defined optimal

I think you defined optimal in terms of the number of rows/docs per thread. Rows/docs don't always predict the amount of time (or work) required.

## In this post target is to

In this post the target is to optimize (minimize) the run time of the load phase.