Web Performance Calendar

The speed geek's favorite time of year
2012 Edition
ABOUT THE AUTHOR

Joshua Bixby (@JoshuaBixby) is president of Strangeloop, which provides website acceleration solutions to companies like eBay/PayPal, Visa, Petco, Wine.com, and O'Reilly Media. Joshua also maintains the blog Web Performance Today, which explores issues and ideas about site speed, user behavior, and performance optimization.

To date, the most compelling stories in the performance arena have all involved some form of A/B testing, in which we compare two groups of users and demonstrate scientifically that faster pages make for better business results. Although the long list of case studies that prove this link has made believers out of many CTOs, some still demand A/B testing to demonstrate the value of acceleration.

If you listen to the tech elite speak at conferences like Velocity, you’ll find that people talk about A/B testing as if it were the most natural and easy thing in the world to do. I often hear about how statistically valid results were derived in “minutes” from “small fractions of the traffic”, and how those results were “easy” to gather and “intuitive” to analyze. My experience with A/B testing for performance has been anything but “easy”.

Over the years, as we at Strangeloop have developed our testing methodologies for our customers, we’ve accumulated some hard-won lessons. Today, I want to share some of these lessons with you.

First, I’m going to share with you a typical customer experience. Second, I’m going to get into some of the finer nuances of testing. Lastly, I want to spend a fair bit of time explaining a process for ensuring your A/B test data is statistically significant.

Real-world A/B testing scenarios

I’ve had the privilege of participating in many A/B tests around performance. These tests try to assess the value of a new performance enhancement (or in my case a suite of performance enhancements). The tests are often structured with two distinct groups, Accelerated vs Unaccelerated (or status quo).

I’ve found that tests typically follow one of two patterns:

Scenario 1: The good one (AKA the one we talk about)

  • Day 0: CTO demands an A/B test to prove the value. The CTO is a bit nervous and wants to start with a small % of traffic accelerated (say 5%).
  • Day 0.5: CTO does not hear any complaints from clients and the site does not crash.
  • Day 1: CTO and analytics person watch the data like hawks. Revenue starts trending UP.
  • Day 2: CTO feels like they’re leaving money on the table and immediately demands that 50% of the traffic be accelerated.
  • Day 5: CTO is now convinced and moves all traffic to the Accelerated segment.

Scenario 2: The bad one (AKA the one we don’t talk about)

  • Day 0: CTO demands an A/B test to prove the value. The CTO is a bit nervous and wants to start with a small % of traffic accelerated (say 5%).
  • Day 0.5: CTO does not hear any complaints from clients and the site does not crash.
  • Day 1: CTO and analytics person watch the data like hawks. Revenue starts trending DOWN.
  • Day 2: CTO feels like they are losing money and immediately demands the test be shut down.
  • Day 3: CTO loses interest and the battle to capture his/her mind starts again.

Although I much prefer scenario 1, both may be deeply flawed. For acceleration tests to be valid, particularly those looking at the value of front-end optimization (FEO), they need to take into account statistical significance, long-term segment isolation, and the challenges of re-segmentation.

Before we get into the problems, let’s start with the basics: actually getting the data…

Challenge 1: Integrating your analytics tool with your platform so that you can actually get the data you’re looking for

Problem: The segmentation platform’s job is basically to separate users into A/B buckets, whether that’s with a solution like ours or with back-end code. But the platform isn’t the analytics tool. So some other tool (often Google Analytics or Omniture) needs to ingest the results of the segmentation and show you the reports, stats, and other views you’re used to seeing from analytics.

Solution: When setting up the test, you need to make sure the analytics tool you’re using to look at the results integrates well with your platform and can give you the data you’re looking for. At Strangeloop, we built integration with Google Analytics right into our product because we have a lot of customers using GA. Those customers are able to see their segmented numbers right alongside the rest of their GA reports.

We’ve also helped our customers integrate our platform into Omniture by revising their existing analytics code to collect our segmentation data.
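
To make the mechanics concrete, here is a minimal sketch in TypeScript of what the client-side glue might look like with classic ga.js custom variables. The variable slot, the 'perfSegment' name, and the reportSegment helper are illustrative assumptions, not Strangeloop’s actual integration or a complete GA setup.

    // Assumes the standard asynchronous ga.js snippet has already defined _gaq.
    declare const _gaq: unknown[][];

    type Segment = 'accelerated' | 'unaccelerated';

    // Record the visitor's segment as a visitor-level custom variable (scope 1)
    // so every GA report can be sliced by Accelerated vs. Unaccelerated.
    function reportSegment(segment: Segment): void {
      _gaq.push(['_setCustomVar', 1, 'perfSegment', segment, 1]);  // must run before the pageview
      _gaq.push(['_trackPageview']);
    }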

Once you’ve figured out how to collect the data, you run into other issues…

Challenge 2: Ensuring that users stay in the same segment

Problem: We have to keep making sure the user stays in the same segment, visit after visit, even when the A/B ratio isn’t changing.

Why? Two reasons:

First, because the benefits of accelerating pages aren’t just one-time benefits. They have a lifespan over a user’s entire relationship with your site. We know that satisfied users are more likely to return to your site, and we also know that when a user group experiences slow pages, it takes about six weeks for your repeat-visitor traffic from this user group to return to normal. If you’re trying to understand user behaviour, but you’re serving users accelerated pages on one visit and unaccelerated pages on the next, you’re not going to gather useful data.

Second, because the browser cache has a memory, and what you put in browser cache changes the entire long-term user experience. If you serve a user an accelerated experience on their first visit, and then put them in the Unaccelerated segment on a return visit, they’re not getting a truly unaccelerated experience because their browser has already cached a bunch of optimized resources.

Solution: When setting up the test, ensure that users who are put in the Accelerated segment on their first visit are served accelerated pages on their return visits, and ditto for those in the Unaccelerated segment.
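
One way to get this stickiness is to persist the assignment in a first-party cookie on the very first visit and honor it ever after. Here’s a rough TypeScript sketch under that assumption; the cookie name, the one-year lifetime, and the 5% ratio are all illustrative, not how any particular platform does it.

    type Segment = 'accelerated' | 'unaccelerated';

    const SEGMENT_COOKIE = 'ab_segment';
    const ACCELERATED_RATIO = 0.05;  // e.g. the cautious 5% starting point from the scenarios above

    function readCookie(name: string): string | null {
      const match = document.cookie.match(new RegExp('(?:^|; )' + name + '=([^;]*)'));
      return match ? decodeURIComponent(match[1]) : null;
    }

    function getSegment(): Segment {
      // Returning visitors keep whatever segment they got on their first visit.
      const existing = readCookie(SEGMENT_COOKIE);
      if (existing === 'accelerated' || existing === 'unaccelerated') {
        return existing;
      }
      // Brand-new visitors are assigned once; the choice is then persisted for a year.
      const segment: Segment = Math.random() < ACCELERATED_RATIO ? 'accelerated' : 'unaccelerated';
      document.cookie = `${SEGMENT_COOKIE}=${segment}; path=/; max-age=${60 * 60 * 24 * 365}`;
      return segment;
    }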

Challenge 3: Tracking users before, during, and after testing

Problem: You need to be able to determine whether the user you’re looking at in your analytics tool visited the site for the first time after you start the experiment or before. Here’s why: Let’s say Bob was on my site two days ago, before I started my A/B test yesterday. Bob comes back to the site today, and I put him in the Accelerated segment. In my analytics tool, Bob shows up as a return visitor. Under the rules of acceleration, I should be doing really well with Bob because acceleration would have really primed his cache. But his first visit was before I was doing good caching. So, in the grand scheme of things, Bob is actually a first-time-accelerated visitor, and not really a repeat visitor. I can’t consider him a user who is supposed to have a good cached-content experience.

Solution: Pragmatically, Bob’s experience is very similar to that of a new visitor who’s visiting the site for the first time today, and that’s the bucket we should probably put him in.
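
One pragmatic way to implement that is to stamp each visitor with a “first seen” date (in the same first-party cookie used for segmentation, say) and only count them as a true repeat visitor if that date falls after the experiment started. A small TypeScript sketch, with the experiment start date and helper name invented for illustration:

    // Illustrative experiment start date; in practice this comes from your test configuration.
    const EXPERIMENT_START = new Date('2012-11-15T00:00:00Z');

    // firstSeen would be read from a cookie stamped on the visitor's very first visit.
    function isTrueRepeatVisitor(firstSeen: Date | null): boolean {
      if (firstSeen === null) {
        return false;  // no stamp at all: this is a brand-new visitor
      }
      // Visitors like Bob, whose first visit predates the experiment, are treated as
      // first-time visitors: their cache was never primed under the test conditions.
      return firstSeen.getTime() >= EXPERIMENT_START.getTime();
    }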

Challenge 4: Tracking and optimizing users before and after re-segmentation

Problem: This poses a similar challenge to #3. Let’s say Alice visited my site for the first time two days ago, a week after I started my A/B test, and segmentation decided to put her in the Unaccelerated segment. But I changed my A/B ratio yesterday, and when Alice arrives back today, I move her to the Accelerated segment. Again, my analytics tool will probably show her as a return visitor, but she has the same problem as Bob in the previous example: her cache hasn’t been primed by all the cool cache thingies that my acceleration solution does.

Solution: We can do one of two things:

  1. Treat Alice like a first-time visitor (and somehow make sure the analytics tool can tell re-segmented users apart; Google Analytics doesn’t have a way that I know of, though one possible workaround is sketched after this list), or
  2. Have the segmentation platform keep Alice in the same segment forever. If we go with #2, we’re basically only segmenting brand-new visitors, and once we pick a segment, they’re staying there. In this case, we should keep in mind that the simple ratio of visitors in segment A to segment B won’t really reflect the A/B ratio we’ve configured in our segmentation platform (but the ratio of brand-new visitors should).
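
For option 1, the TypeScript sketch below shows one way a re-segmented visitor like Alice could be flagged so the analysis can treat her as a first-time visitor in her new segment. The custom-variable slot and the 'resegmented' label are assumptions; as noted above, Google Analytics has no built-in notion of this.

    declare const _gaq: unknown[][];  // defined by the standard ga.js snippet

    type Segment = 'accelerated' | 'unaccelerated';

    function recordSegmentChange(previous: Segment | null, current: Segment): void {
      if (previous !== null && previous !== current) {
        // A visitor-level custom variable marks Alice-style visitors so they can be
        // excluded from the repeat-visitor numbers or analyzed as first-timers.
        _gaq.push(['_setCustomVar', 2, 'resegmented', previous + '->' + current, 1]);
      }
    }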

Challenge 5: What about multivariate (MVT) or A/B/C/D testing?

Now that you have an idea of the complexities of A/B testing, consider the headaches of MVT testing. What if we decided to do an A/B/C/D test (A=Unaccelerated; B=Accelerated with one set of techniques; C=Accelerated with another set of techniques; D=Accelerated with yet another set)? Then these problems become much more complicated.
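
To make the combinatorics concrete, here’s a rough TypeScript sketch of weighted assignment across four segments (the segment names and equal weights are illustrative). Every challenge above still applies, but now per segment: each bucket needs its own sticky assignment and its own cache history, and the sample-size question in Challenge 6 applies to every pairwise comparison.

    // Illustrative A/B/C/D split; weights must sum to 1.
    const SEGMENTS = [
      { name: 'A-unaccelerated',     weight: 0.25 },
      { name: 'B-accelerated-set-1', weight: 0.25 },
      { name: 'C-accelerated-set-2', weight: 0.25 },
      { name: 'D-accelerated-set-3', weight: 0.25 },
    ];

    function assignSegment(): string {
      let roll = Math.random();
      for (const s of SEGMENTS) {
        if (roll < s.weight) {
          return s.name;
        }
        roll -= s.weight;
      }
      return SEGMENTS[SEGMENTS.length - 1].name;  // guard against floating-point drift
    }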

Now, let’s say you’ve figured out all these challenges. Then we have to turn to the sixth and defining issue: statistical significance.

Challenge 6: Ensuring that your test data is statistically significant

At the top of this post, when I described the good and bad testing scenarios, statistical significance was something that neither of the CTOs bothered to look at.

Problem: Many customers want to see that implementing our acceleration solution has a meaningful impact on conversion rates as reported by Google Analytics, Omniture, or similar tools. There’s often a lot of variation in conversion rates, so actual data can look something like this:

Date      Accelerated (%)   Unaccelerated (%)
Nov 15    3.78              3.77
Nov 16    3.77              3.68
Nov 17    3.88              3.70
Nov 18    4.96              4.77
Nov 19    2.53              2.28

So how do you take this set of numbers and determine if they prove a true correlation between acceleration and conversions?

Solution: There’s a statistical test called the chi-squared test that can be used to determine whether a difference like this is statistically significant. Without getting into the math, I’m going to show how to use a free online tool called the Split Test Calculator to get answers. (Before I do that, though, I want to give a major hat tip to Ken Jackson, one of our senior engineers here at Strangeloop, who developed this methodology and wrote a fantastic post about it for our internal blog, which I’m more or less cribbing here.)

First, you’ll need to collect and aggregate data from Google Analytics on the number of visits (or visitors) and the number of goal transactions for both the Accelerated and Unaccelerated segments over the date range of the experiment. The following table shows data for Nov 15 – Nov 19:

                        Accelerated   Unaccelerated
Visitors                485,882       55,423
Goal Transactions       18,985        2,093
Total Conversion Rate   3.91%         3.78%

Then plug the numbers into the form in the Split Test Calculator using the Accelerated numbers for Group A and the Unaccelerated numbers for Group B.

In this case, the Split Test Calculator indicates that there’s no clear winner at a 90% confidence level. In other words, there’s more than a 10% chance that the difference between 3.91% accelerated and 3.78% unaccelerated is just due to variations in the data and is not statistically significant. A useful feature of the calculator is that it then estimates that a clear winner might be determined if data for an additional 101,703 visitors is collected, just a couple more days of data based on the transaction rate for this site.

A 90% confidence level might be considered too high for this type of experiment, and the chi-squared test can be used at lower confidence levels. For this data, the difference is significant at an 85% confidence level. In other words, there is less than a 15% chance that the difference in conversion rates is due to variations in the data.
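
If you’d like to sanity-check the calculator’s verdict yourself, here’s a minimal TypeScript sketch of the arithmetic behind it: the 2x2 chi-squared test, written as the equivalent pooled two-proportion z statistic (z squared is chi-squared with one degree of freedom). The critical values are the standard ones for one degree of freedom, and the data comes from the table above.

    interface SegmentData {
      visitors: number;
      conversions: number;
    }

    function chiSquared(a: SegmentData, b: SegmentData): number {
      const rateA = a.conversions / a.visitors;
      const rateB = b.conversions / b.visitors;
      // Pooled conversion rate under the null hypothesis that both segments convert equally.
      const pooled = (a.conversions + b.conversions) / (a.visitors + b.visitors);
      const variance = pooled * (1 - pooled) * (1 / a.visitors + 1 / b.visitors);
      const z = (rateA - rateB) / Math.sqrt(variance);
      return z * z;
    }

    const accelerated = { visitors: 485882, conversions: 18985 };   // 3.91%
    const unaccelerated = { visitors: 55423, conversions: 2093 };   // 3.78%

    const stat = chiSquared(accelerated, unaccelerated);            // roughly 2.3
    console.log(stat > 2.706 ? 'clear winner at 90% confidence' : 'no clear winner at 90% confidence');
    console.log(stat > 2.072 ? 'clear winner at 85% confidence' : 'no clear winner at 85% confidence');

With these numbers the statistic comes out to about 2.3, which clears the 85% threshold (2.072) but not the 90% one (2.706), matching what the Split Test Calculator reports.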

Key takeaways

  1. Use the Split Test Calculator to determine the number of visitors you need to be confident in your results. Be disciplined and don’t draw conclusions until you have a viable sample.
  2. If you’re planning to perform an A/B test for any acceleration technique that affects the browser cache, make sure you have a good plan for keeping your user groups from getting mixed up.
  3. Similarly, if you plan to change the segmentation ratio during the test, make sure you have a good plan for keeping your user groups from getting mixed up.

As I said at the top of this post, this is wisdom we’ve gleaned over years of testing, and there’s still a lot to learn. I’d love to hear your experiences and the lessons you’ve learned along the way.