Pat Meenan wrote a nice piece last year about WebPageTest’s testing consistency, which, among a wealth of handy practical advice, stated: “Live sites can have noisy performance characteristics so we usually run anywhere from 5 to 9 runs …”.

While this sounds reasonable, I wanted to quantify the impact of the number of runs on WPT results’ precision. This information can be valuable to anyone relying on WebPageTest to assess their own and their competitors’ websites, which is just about everyone in the web performance community…

Test Setup

Leveraging the WebPageTest APIs, I submitted a total of 35 runs for different sets of popular websites and computed the number of runs that would have been needed to achieve a given precision. Tests were performed for key performance metrics: TTFB, startRender, domContentLoaded, onLoad events as well as speedIndex, for both Desktop (Dulles:Chrome.Cable) and Mobile (Dulles_MotoG:Motorola G – Chrome.LTE) agents, using the public instances of WPT, first view only.
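As a minimal sketch of that submission step (not the author's actual harness), a request against the WebPageTest HTTP API could be assembled like this; the parameter names follow the public `runtest.php` endpoint, but the API key is a placeholder and the per-submission run cap is an assumption:

```python
from urllib.parse import urlencode

WPT_HOST = "https://www.webpagetest.org"

def build_test_request(url, api_key, runs=35, location="Dulles:Chrome.Cable"):
    """Build a runtest.php URL submitting first-view-only runs for `url`.

    Note: the public instance caps the runs per submission, so reaching
    35 runs in practice may require several smaller submissions.
    """
    params = {
        "url": url,
        "k": api_key,        # your WPT API key (placeholder here)
        "runs": runs,
        "fvonly": 1,         # first view only, as in the article
        "location": location,
        "f": "json",         # request a JSON response
    }
    return f"{WPT_HOST}/runtest.php?{urlencode(params)}"
```

The JSON response then points at the test results, from which per-run values of TTFB, startRender, domContentLoaded, onLoad and speedIndex can be collected.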

Median Values Results

In the tables below, a reading of, say, X runs for TTFB at 3% precision means that the median TTFB values (for the 10 sites tested) reported by any number of runs greater than or equal to X would fall within 3% of the median over the full set of runs. In other words, if you want a precision of 3% or better for TTFB, then you ought to run at least X runs.
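To make that definition concrete, here is one way to derive X from a series of collected runs. This is a hypothetical sketch, since the article does not publish its exact computation:

```python
import statistics

def runs_needed(samples, target_precision):
    """
    Given per-run measurements of one metric (e.g. 35 TTFB values),
    return the smallest n such that the median of every prefix of
    length >= n stays within `target_precision` (a fraction, e.g. 0.03)
    of the full-sample median.
    """
    reference = statistics.median(samples)
    needed = len(samples)
    # Walk backwards: find the last prefix length whose median drifts
    # outside the tolerance band; anything beyond it is "enough runs".
    for n in range(len(samples), 0, -1):
        prefix_median = statistics.median(samples[:n])
        if abs(prefix_median - reference) / reference > target_precision:
            break
        needed = n
    return needed
```

Feeding this function the 35 collected values per metric and per site, at target precisions of 1%, 3%, 5% and 10%, would produce the kind of readings shown in the tables.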

Desktop Experience, WPT Instance = Dulles:Chrome.Cable

[Chart: desktop-retail]

[Chart: desktop-biz]

Mobile Experience, WPT Instance = Dulles_MotoG:Motorola G – Chrome.LTE

[Chart: mobi-retail]

[Chart: mobi-biz]

Conclusion

This series of tests confirms our intuition that the number of runs needed to achieve a predetermined precision for a website depends on the nature of the site and the metric under consideration.

This being said, some interesting trends emerged, which should hold true for a majority of websites, for a given WPT agent.

Desktop

  • For a desktop experience, the default 9 runs yielded the following precision:
    • TTFB: around 6% to 8%
    • Other metrics: better than 6%
  • If 3% precision or better is sought, then 20 or more runs are recommended.
  • If only 10% precision is sought, then 7 runs should be enough.

Mobile

  • For a mobile experience, the default 9 runs yielded a precision of 3% or better.
  • If only 5% precision is sought, then 5 runs should be enough.
  • If only 10% precision is sought, then a single run should do.

Comments

Mobile tests seem to require fewer runs than desktop tests to achieve a given precision for a given metric. Although I am not entirely sure what is causing this behavior, it is possibly due to the large number of third-party calls typically found in desktop experiences.

I hope this post will encourage website owners to first calibrate their websites and applications by computing the relationship between precision and number of runs for the metrics they want to track. This is particularly important if business decisions are made based on performance results.
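One possible way to run such a calibration is a resampling sketch like the one below (an approach I am assuming here for illustration, not necessarily the one used for the tables above): collect one generous batch of runs, then estimate the precision any smaller number of runs would buy by repeatedly drawing subsamples.

```python
import random
import statistics

def precision_at(samples, n, trials=1000, seed=0):
    """
    Estimate the precision achievable with n runs: repeatedly draw n of
    the collected runs (without replacement) and report the worst relative
    deviation of the subsample median from the full-sample median.
    """
    rng = random.Random(seed)  # fixed seed for reproducible calibration
    reference = statistics.median(samples)
    worst = 0.0
    for _ in range(trials):
        subsample = rng.sample(samples, n)
        deviation = abs(statistics.median(subsample) - reference) / reference
        worst = max(worst, deviation)
    return worst
```

Evaluating `precision_at` for a range of n values gives a site-specific precision-vs-runs curve, from which the cheapest n meeting a business-level precision target can be read off.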

References:

Graphs above were computed by taking the median number of runs for the top 10 sites per category below. I only gathered data for precisions of 1, 3, 5 and 10 percent, and used smooth rendering. I ran each test 3 times (in parallel) and took average values.

Top 10 retails sites, per Alexa ranking: Amazon.com, Ebay.com, Netflix.com, Etsy.com, Walmart.com, Steampowered.com, Ikea.com, Bestbuy.com, Target.com, Homedepot.com

Top 10 business sites, per Alexa ranking: Paypal.com, Office.com, Alibaba.com, Espn.com, Chase.com, Skype.com, Indeed.com, Bankofamerica.com, Wellsfargo.com, Forbes.com

ABOUT THE AUTHOR


Pierre Lermant is an Enterprise Architect at Akamai Technologies, specialized in improving the performance, availability and operations of customer-facing web applications.