Video x2: Measuring performance variability of EC2

I was recently invited to speak at Fwdays Highload in Kyiv. This was my first ever visit to Ukraine, so I was excited to go and visit this large and beautiful European capital. Over a thousand years ago Vikings would row their boats through the rivers in Russia, and take the Dniepr southward to Kyiv and ultimately Turkey. It was exciting to travel in the footsteps of my forefathers.

My talk isn't really MongoDB specific, rather about an EC2 performance tuning project we did in 2017:

Working in the MongoDB Server Performance Testing team, we use Amazon EC2 for system level testing. This allows us to flexibly deploy and tear down MongoDB clusters of various topologies, day after day. On the other hand, using a public cloud for performance testing can be challenging for repeatability of test results - to put it mildly. We, therefore, ended up spending several months just benchmarking EC2 itself. We compared combinations of different instance types and disks (ephemeral SSD vs PIOPS EBS). In the end, we found that the largest impact in reducing variability came from the same configuration options that we use on physical HW as well: turning off hyperthreading, using numactl and turning off CPU power saving states. Thus, you could argue that blaming "the cloud" for our performance trouble was wrong. It's possible to get similar performance characteristics from EC2 as physical hardware when used correctly, and when used incorrectly, both physical and cloud hardware will perform poorly.

With the new configuration, we've been able to greatly lower variability of our daily performance tests, and increase trust in the test results. For WiredTiger tests, even the worst case is less than 10% min-max range, and MMAPv1 is close to that. We consider this to be below the threshold of performance change that most end users are able to observe anyway, hence it is sufficient for our performance testing purposes.

The results also emphasized a golden rule of performance engineering: measure everything, assume nothing. It turned out the configuration, that was originally used for our performance testing, actually had the worst variability of all configurations we tested!

Fwdays team have now published my talk recording: