Web Performance Calendar

The speed geek's favorite time of year
2020 Edition

Benoit Girard is a Web Performance Engineer @ Facebook and an ex-Mozilla Platform Engineer. Passionate about pushing the limits of the open web, he’s currently working on performance for the new facebook.com.

The goal of running a perf A/B experiment is to validate that some performance metric, such as page load or interaction time, improves. The secondary goal is to establish that other crucial engagement metrics, such as conversions, cancellations, or time spent, also improve. For a site that’s already reasonably well-optimized, it can be very challenging to show that the incremental perf improvements you’re making are impacting those secondary engagement metrics. Let’s discuss why that is, and how we can drive better performance experiments.

A/B experiment setup & fundamentals

Let’s say that you’ve read the Web Perf Calendar and found a cool optimization technique: deferring non-critical resources to load at interaction time. You apply it to your share button and confirm locally that it behaves correctly. You’ve implemented a 50 KB reduction to your JavaScript bundle on startup, going from 500 KB to 450 KB. For your desktop user base this is a noteworthy win, but it’s not earth-shattering. You’re ready to run an A/B experiment anyway!

Before running an experiment, it helps to have a hypothesis that we’re looking to test. In this case, you might reasonably hypothesize that your visually complete (VC) metric will get faster by 50 milliseconds. Now you’re also hoping to see an improvement in your conversions for the call to action sign-up form, but have no idea by how much it might improve. And maybe you’ll see a revenue increase too!?

Metric                        Hypothesis
Visually Complete (VC)        -50 ms
Call to Action Conversion     🤞 Small improvement 🤞
Revenue                       🤞 Small improvement 🤞

But can we even detect the full impact of our experiment? Here we have to think about the statistical power of our A/B test. The minimum detectable effect and related concepts are out of scope for this article, but I encourage you to familiarize yourself with them. Stated in plain English, the minimum detectable effect for your VC metric is the smallest change you can detect with 95% confidence. It is a function of how noisy or variable your Visually Complete metric is in your population, how long you run the experiment, and how big your A/B population is. Your A/B testing framework or a sample size calculator should help you estimate it. Once you’ve done this, you should know how big a population you need and how long you’ll need to run your study. Maybe you don’t know this and are planning to run the largest study you can; that can be fine too.

In the end, you decide to run an A/B test on 10,000 users for one week, collecting about 100,000 startup data points and giving you a minimum detectable effect of 10 ms. Since 50 ms > 10 ms, you should have no problem validating this change in production. However, if you look at your other metrics, you might notice that you need a 5% lift in your call to action to detect any impact, which is far out of reach. Well 😞, 50 ms is 50 ms, so you run your experiment anyway.
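As a sketch of that estimate, a two-sample minimum detectable effect at 95% confidence and 80% power can be computed with the standard formula below. The 565 ms standard deviation is a made-up figure chosen so the result lands near the 10 ms above; plug in your own metric’s variance, and note that real A/B frameworks may use more sophisticated tests.

```python
from statistics import NormalDist

def minimum_detectable_effect(std_dev: float, n_per_arm: float,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest true difference in means a two-sample test can reliably detect.

    MDE = (z_{1-alpha/2} + z_{power}) * sqrt(2 * sigma^2 / n)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return (z_alpha + z_power) * (2 * std_dev ** 2 / n_per_arm) ** 0.5

# Hypothetical: VC standard deviation of 565 ms, 100,000 data points split 50/50.
mde_ms = minimum_detectable_effect(std_dev=565, n_per_arm=50_000)
print(f"{mde_ms:.1f} ms")  # ≈ 10.0 ms, so a 50 ms hypothesis clears it easily
```

Note how the MDE shrinks with the square root of the sample size: quadrupling your population only halves the effect you can detect.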

Reviewing your wins

Friday rolls around; you sip your coffee and check your results in the morning. 🎉 It’s a 65 ms ± 10 ms Visually Complete win! You get a small dopamine rush for landing this win, and you brag to your coworkers at standup. Soon the novelty wears off, and you’re staring at a statistically neutral delta on your call to action and revenue metrics. All you know is that your experiment doesn’t regress or improve your engagement metrics by more than the minimum detectable effect. You know better than to draw conclusions here!

It makes sense when you think about it. A 65 ms VC improvement might only prevent a tiny fraction of abandonments. Suppose that, out of that tiny fraction of saved abandonments, you see 10% more conversions: now you have 10% of a small number, which is even smaller. Perhaps only a small percentage of your call to action conversions turn into a sale and increase revenue, so your win will be even harder to measure against revenue. Each step takes you further below the minimum detectable effect.
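To put rough numbers on that cascade (every rate below is invented purely for illustration), the effect shrinks at each step of the funnel:

```python
# Hypothetical funnel: all rates here are made-up illustrative numbers.
visitors = 100_000
abandon_rate_saved = 0.002   # fraction of visitors a 65 ms win keeps from abandoning
conversion_rate = 0.10       # saved visitors who go on to convert
sale_rate = 0.05             # conversions that become revenue-generating sales

saved_visitors = visitors * abandon_rate_saved        # ~200 visitors
extra_conversions = saved_visitors * conversion_rate  # ~20 conversions
extra_sales = extra_conversions * sale_rate           # ~1 sale
```

Under these assumed rates, 100,000 visitors yield only a handful of extra conversions and roughly one extra sale – far below what a 5% minimum detectable effect can pick out of the noise.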

What do you do now!? It’s tempting to ship it – perhaps even give up and put performance on the back burner. But wait! There’s more we can do!

Enhance! Increase your experiment signal

How do we get above the minimum detectable effect? We’re going to need more wins, a lot more wins! The simple but key insight is that once you have a well-optimized site, no single perf project will cut your load times by 20%, lift your call to action conversions by 5%, and grow your revenue by 2% – but you still need to get there. The key is to chain these wins by grouping individual tests into a single master experiment that can push your metrics above the minimum detectable effect. Even if you can’t detect it in an individual experiment, every time you improve your performance you might be shifting your conversions. Just because you can’t see it does not mean the win isn’t there. Once you chain 5, 10, or 15 of these wins into a single experiment, the improvements stack and get you closer to a statistically significant win!

My recommendation is to work on a set of simpler changes and validate each of them against your high-signal performance metrics. Make sure they’re all improvements. Each time a change is positive, join it with your short-lived group of performance experiments. Once you have sufficient wins to get above the minimum detectable effect for your desired metrics, run your group A/B test. And voilà 🥳! If you stack enough wins together, you may just start to see all kinds of secondary engagement metrics turn positive – even some that you had no proof were connected to performance. Now you can confirm that performance mattered all along, even in places you didn’t know it mattered!
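As a back-of-the-envelope sketch of that stacking (all lifts are hypothetical, and small relative lifts are assumed to add roughly linearly), you can tally wins until the group clears the engagement metric’s minimum detectable effect:

```python
# Hypothetical per-experiment lifts to call to action conversion, in percent.
# Each is a real improvement, but individually undetectable against a 5% MDE.
individual_lifts = [0.4, 0.6, 0.3, 0.5, 0.8, 0.7, 0.4, 0.6, 0.5, 0.9]
engagement_mde = 5.0  # percent lift needed for statistical significance

combined = 0.0
experiments_needed = None
for i, lift in enumerate(individual_lifts, start=1):
    combined += lift  # simplification: small lifts add roughly linearly
    if combined >= engagement_mde and experiments_needed is None:
        experiments_needed = i

print(f"{combined:.1f}% combined lift after {experiments_needed} grouped wins")
```

With these made-up numbers it takes all ten grouped wins to cross the 5% threshold – which is exactly why no single experiment in the group could have moved the engagement metric on its own.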