Why you're probably reading your performance measurement results wrong (but at least you're in good company)

24th Dec 2011 by Joshua Bixby

ABOUT THE AUTHOR

Joshua Bixby (@JoshuaBixby) is president of Strangeloop, which provides website acceleration solutions to companies like eBay/PayPal, Visa, Petco, Wine.com, and O'Reilly Media. Joshua also maintains the blog Web Performance Today, which explores issues and ideas about site speed, user behavior, and performance optimization.

One of my favourite books of 2011 was Thinking, Fast and Slow by the Nobel Prize-winning psychologist Daniel Kahneman. In his book, Kahneman identifies the two systems of thought that are constantly warring inside our heads:

System 1, which is fast and intuitive, and
System 2, which is slow and logical.

Almost invariably, System 1 is flawed, yet we helplessly rely on it. We also have a painful tendency to think we’re applying System 2 to our thinking, when in fact it’s just an intellectually tarted up version of System 1.

Kahneman offers a nifty little test of this thinking:

“A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. As you know, about 50% of all babies are boys. However the exact percentage varies from day to day. Sometimes it may be higher than 50%, sometimes lower. For a period of 1 year, each hospital recorded the days on which more than 60% of the babies born were boys. Which hospital do you think recorded more such days?”

A. The larger hospital
B. The smaller hospital
C. About the same (that is, within 5% of each other)

The correct answer is B, the smaller hospital. But as Kahneman notes, “When this question was posed to a number of undergraduate students, 22% said A; 22% said B; and 56% said C. Sampling theory entails that the expected number of days on which more than 60% of the babies are boys is much greater in the small hospital than in the large hospital, because the large sample is less likely to stray from 50%. This fundamental notion of statistics is evidently not part of people’s repertoire of intuition.”
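To make the sampling argument concrete, here is a minimal sketch that computes the exact chance of a "more than 60% boys" day for each hospital. It assumes a simple binomial model: exactly 15 or 45 births every day (the puzzle only says "about"), each birth independently a boy with probability 50%. The function name is made up for this illustration.

```python
from math import comb

def prob_more_than_60_percent_boys(births_per_day: int) -> float:
    """Exact P(boys / births > 0.6) when each birth is a fair coin flip."""
    # "More than 60%" means boys * 10 > births * 6; integer arithmetic
    # avoids any floating-point edge cases at the boundary.
    favourable = sum(
        comb(births_per_day, boys)
        for boys in range(births_per_day + 1)
        if boys * 10 > births_per_day * 6
    )
    return favourable / 2 ** births_per_day

small_hospital = prob_more_than_60_percent_boys(15)
large_hospital = prob_more_than_60_percent_boys(45)

print(f"small hospital: {small_hospital:.1%} of days, "
      f"about {small_hospital * 365:.0f} days a year")
print(f"large hospital: {large_hospital:.1%} of days, "
      f"about {large_hospital * 365:.0f} days a year")
```

Under these assumptions the small hospital crosses the 60% line on roughly 15% of days, and the large hospital does so less than half as often. The only thing that differs between the two calculations is the sample size.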

But these are just a bunch of cheese-eating undergrads, right? This doesn’t apply to our community, because we’re all great intuitive statisticians? What was the point of that computer science degree if it didn’t allow you a powerful and immediate grasp of stats?

Thinking about Kahneman’s findings, I decided to conduct a little test of my own, to see how well your average friendly neighbourhood web performance expert is able to analyze statistics. (Identities have been hidden to protect the innocent.) *

The methodology

I asked 10 very senior and well-respected members of our community to answer the hospital question, above. I also asked them to comment on the results of this little test:

The following RUM results capture one day of activity on a specific product page for a large e-commerce site for IE9 and Chrome 16. What conclusions would you draw from this table?

The results

If you had to summarize this table, you would probably conclude “Chrome is faster than IE9.” That’s the story you take away from looking at the table, and you are intuitively drawn to it because that’s the part that’s interesting to you. The fact that the study was done on a specific product page, captures one day of data, or contains 45 timing samples for Chrome is good background information, but it isn’t relevant to the overall story. Your summary would be the same regardless of the size of the sample, though an absurd sample size (e.g. results captured from 2 data points or 6 million data points) would probably grab your attention.

Hospital question results:
On the hospital question, we were better than the undergrads… but not by much. 5 out of 10 people I surveyed got the question wrong.

RUM results:
I was amazed at the lack of focus on the source of the data. Only 2 people pointed out that the sample size was so low that no meaningful conclusions could be drawn from the results, and that averages were useless for this type of analysis. The other 8 all focused on the (assumed) fact that Chrome is faster than IE9, and they told me stories about the improvements in Chrome and how the results are representative of these improvements.

Conclusions

The table and description contain information of two kinds: the story and the source of the story. Our natural tendency is to focus on the story rather than on the reliability of the source, and ultimately we trust our inner statistical gut feel. I am continually amazed at our general failure to appreciate the role of sample size. As a species, we are terrible intuitive statisticians. We are not adequately sensitive to sample size or how we should look at measurement.

Why does this matter?

RUM is being adopted in the enterprise at an unprecedented speed. It is becoming our measurement baseline and the ultimate source of truth. For those of us who care about making sites faster in the real world, this is an incredible victory in a long, protracted battle against traditional synthetic tests.

I now routinely go into enterprises that use RUM. Although I take great satisfaction in winning the war, an important battle now confronts us.

Takeaways

1. We need tools that warn us when our sample sizes are too small.
We all learned sampling techniques in high school or university. The risk of error can be calculated for any given sample size by a fairly simple procedure. Don’t rely on your judgement, because it is flawed. Not only do we need to be vigilant, but we also need to lobby the tool vendors to help us. Google, Gomez, Keynote and others should notify us when sample sizes are too small – especially given how prone we are to error.
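As a rough illustration of what such a warning could look like, the sketch below estimates a 95% margin of error for the mean from the standard error (sample standard deviation divided by the square root of the sample size) and complains when that margin is wide relative to the mean. The function names, the 10% threshold, and the load times are all assumptions made up for this example, and the normal approximation it uses is itself unreliable for very small samples.

```python
import math
import statistics

def margin_of_error(samples, z=1.96):
    """Approximate 95% margin of error for the mean of `samples`.

    Uses the normal approximation (z = 1.96); treat it as a rough guide.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return z * standard_error

def sample_size_warning(samples, max_relative_margin=0.10):
    """Return a warning string when the margin of error is too wide, else None."""
    mean = statistics.fmean(samples)
    margin = margin_of_error(samples)
    if margin > max_relative_margin * mean:
        return (f"only {len(samples)} samples: mean {mean:.0f} ms "
                f"is uncertain by about +/- {margin:.0f} ms")
    return None

# Made-up load times in milliseconds, for illustration only.
few_samples = [2100, 5400, 3300, 9800, 2700]
# The same values repeated 200 times: same spread, far more samples.
many_samples = few_samples * 200

print(sample_size_warning(few_samples))   # a warning string
print(sample_size_warning(many_samples))  # None
```

The spread of the data is the same in both cases; only the number of samples changes, and the margin shrinks with its square root.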

2. Averages are a bad measure for RUM results.
RUM results can suffer from significant outliers, which make averages a bad measure in most instances. Unfortunately, averages are used in almost all of the off-the-shelf products I know. If you need to look at one number, look at medians or 95th percentile numbers.
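A small sketch of how a single outlier distorts the average. The ten load times are invented for this illustration, and `percentile` is a hypothetical helper using the simple nearest-rank definition (other definitions interpolate and give slightly different values).

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up load times in milliseconds: nine ordinary loads and one outlier.
load_times_ms = [1800, 1900, 2000, 2000, 2100, 2100, 2200, 2300, 2400, 60000]

mean_ms = statistics.mean(load_times_ms)
median_ms = statistics.median(load_times_ms)
p95_ms = percentile(load_times_ms, 95)

print(f"mean:   {mean_ms:.0f} ms")
print(f"median: {median_ms:.0f} ms")
print(f"p95:    {p95_ms:.0f} ms")
```

Here the mean comes out at 7880 ms, a load time that none of the ten visits came close to. The median (2100 ms) describes the typical visit, and the 95th percentile (60000 ms) exposes the slow tail.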

3. Histograms are the best way to graph data.
With histograms you can see the distribution of performance measurements and, unlike averages, you can spot outliers that would otherwise skew your results. For example, I took a dataset of 500,000 page load time measurements for the same page. If I went with the average load time across all those samples, I’d get a page load time of ~6600msec. Now look at the histogram for all the measurements for the page, below. Visualizing the measurements in a histogram like this is much more insightful and tells us a lot more about the performance profile of that page.

(If you’re wondering, the median page load time across the data set is ~5350msec. This is probably a more accurate indicator of the page performance and much better than the average, but is not as telling as the histogram that lets us properly visualize the performance profile. As a matter of fact, here at Strangeloop, we usually look at both the median and the performance histogram to get the full picture.)
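For readers who want to try this on their own RUM data, here is a minimal text-histogram sketch. It is not the tooling used for the dataset above; the function names, the 1000 ms bucket width, the 10000 ms overflow cap, and the sample values are all assumptions for illustration.

```python
def histogram_buckets(samples_ms, bucket_ms=1000, cap_ms=10000):
    """Count samples per bucket; anything at or above cap_ms is
    collected in a single overflow bucket so outliers stay visible."""
    counts = {start: 0 for start in range(0, cap_ms + bucket_ms, bucket_ms)}
    for sample in samples_ms:
        start = min(int(sample) // bucket_ms * bucket_ms, cap_ms)
        counts[start] += 1
    return counts

def render_histogram(counts, bucket_ms=1000, width=40):
    """Draw one row per bucket, with bar length scaled to the fullest bucket."""
    peak = max(counts.values()) or 1
    overflow_start = max(counts)
    lines = []
    for start, count in counts.items():
        if start == overflow_start:
            label = f"{start}+"
        else:
            label = f"{start}-{start + bucket_ms - 1}"
        bar = "#" * round(count / peak * width)
        lines.append(f"{label:>12} ms | {bar} {count}")
    return "\n".join(lines)

# Made-up load times in milliseconds, for illustration only.
load_times_ms = [1800, 1900, 2000, 2000, 2100, 2100, 2200, 2300, 2400, 60000]

counts = histogram_buckets(load_times_ms)
print(render_histogram(counts))
```

Even with ten samples, the cluster around two seconds and the lone straggler in the overflow row are obvious at a glance, which a single average would hide.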

* Of course, you’re allowed to call into question the validity of my test, given its small sample size. I’d be disappointed if you didn’t. 😉

Web Performance Calendar