Web Performance Calendar

The speed geek's favorite time of year
2018 Edition

Magic numbers

by Gilles Dubuc
ABOUT THE AUTHOR

Gilles Dubuc (@fullstackjerk) is a senior software engineer at the Wikimedia Foundation.

Guidelines like RAIL are popular in the web performance community. They often define time limits that must be respected, like 100ms for what feels instantaneous, or 1000ms for the limit of acceptable response time.

Prominent people in the performance community keep telling us that there’s a lot of science behind those numbers.

I’ve always been skeptical of that claim, and earlier this year I set out to find out whether there’s any merit to those numbers by conducting an extensive literature review of academic research on web performance perception. Here are some of the findings from that project.

Following the citation trail

If you follow paper citations, some classic papers keep showing up as references. And in the world of web performance, two papers get cited a lot more than any other.

Response Time in Man-Computer Conversational Transactions by Miller (1968) and Response Times: The 3 Important Limits by Nielsen (1993/2014).

Nielsen essentially takes some of the numbers from the Miller paper, brushes the dust off them since they were pre-web, and presents them in a simpler fashion that everyone understands, stating that they apply to the web. What Nielsen doesn’t do, however, is back those numbers up with research of any kind. He simply states these limits as facts, even though no science has been done to prove that they are true. And ever since, the entire web community has believed what a self-proclaimed expert said on the matter and turned it into guidelines. Surely, if an authoritative-looking man with glasses who holds a PhD in HCI states something insistently enough, it must be true.

Jakob Nielsen grinning at how gullible the web community is, probably

Trust me, I know things!

What about the Miller paper? After all, if Nielsen insists that those principles are an absolute truth that hasn’t changed in 50 years, maybe it’s because Miller’s research was so compelling to start with? I think everyone who believes that the numbers found in RAIL and similar guidelines are real should read the Miller paper, the origin of these pervasive magic numbers. Not only does Miller fail to back up any of the magic numbers he states with research of any kind – the paper is really just a giant subjective essay – it also contains gems that Nielsen didn’t seem to find useful to include in his cleaned-up version of it:

If he has made an error that the system can detect, he should be allowed to complete his segment of thought before he is interrupted or told he is locked out. After two seconds and before four seconds following completion of keying in his “thought,” he should be informed of his error and either “told” to try again, or told of the error he made.
Comment: It is rude (i.e., disturbing) to be interrupted in mid-thought. The annoyance of the interruption makes it more difficult to get back to the train of thought. The two-second pause enables the user to get his sense of completion following which an error indication is more acceptable.

Miller advocates intentionally delaying error messages by a whole 2 seconds, in order to avoid disturbing the user’s train of thought. If it sounds silly and dated, it’s because it is, just like the rest of Miller’s paper. Like Nielsen’s article, it means well, but it pulls magic numbers out of thin air. Not a single experiment was conducted, not a single human being was studied or surveyed in the making of these magic numbers. There is no research data to verify the claims.

What happens when you do real science

Are 100 ms Fast Enough? Characterizing Latency Perception Thresholds in Mouse-Based Interaction by Forch, Franke, Rauh, and Krems (2017) looked into one of the most popular magic numbers from the Miller/Nielsen playbook: 100ms as the threshold for what feels instantaneous. Here’s the key result of that study:

The latency perception thresholds’ range was 34-137 ms with a mean of 65 ms (Median = 54 ms) and a standard deviation of 30 ms.

This is quite different from the universal 100ms threshold we keep hearing about. The study goes on to show that subjects with a habit of playing action video games tend to have a lower threshold than others, showing that cultural differences can affect that limit.
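To get a feel for what those numbers imply, here is a minimal back-of-the-envelope sketch in Python (my own illustration, not something from the paper). It assumes, purely for illustration, that individual thresholds are normally distributed with the study’s reported mean and standard deviation (a crude assumption, since the median sitting below the mean suggests a skewed distribution), and estimates how many users would actually notice a 100ms delay:

```python
# Back-of-the-envelope illustration (not from the paper): assume the
# perception thresholds reported by Forch et al. (mean 65 ms, SD 30 ms)
# are roughly normally distributed, and ask what fraction of users would
# actually notice a 100 ms delay.
from statistics import NormalDist

MEAN_MS = 65.0        # mean perception threshold reported in the study
SD_MS = 30.0          # standard deviation reported in the study
GUIDELINE_MS = 100.0  # the "instantaneous" limit from Miller/Nielsen/RAIL

thresholds = NormalDist(mu=MEAN_MS, sigma=SD_MS)

# A user notices a delay once it exceeds their personal threshold, so the
# share of users noticing a 100 ms delay is the CDF evaluated at 100 ms.
share_noticing = thresholds.cdf(GUIDELINE_MS)
print(f"~{share_noticing:.0%} of users would perceive a 100 ms delay")
# Prints ~88% under this (crude) normality assumption.
```

Under that assumption, roughly nine users out of ten would perceive a delay at the very limit that is supposed to feel instantaneous to everyone.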

RAIL Magician

Googler revealing the next iteration of RAIL guidelines

When you think about it, it does make sense that the real threshold is a range that depends on demographics, and that there’s no reason there should be a universal threshold that happens to be a round number. It would be all too magical, wouldn’t it?

Proving universal facts about mankind based on students down the hall

Students in class

Can you spot the person younger than 19 or older than 36?

A major weakness in a lot of the papers I’ve reviewed that do real science, however, is that when actual research on people is done, it’s usually on a group that lacks diversity: often whoever the scientists have easy access to, typically students from the same university. These subjects are educated, proficient with technology, and often have a monetary incentive to participate, all of which obviously skews the results. And yet, after performing a study on a dozen paid students, these research papers will often claim to have proven a universal truth about all human beings.

This is actually true of the study I quoted earlier about the 100ms threshold, with the minor difference that the students earned course credit rather than money. Here’s their description of the study participants:

Twenty students (10 female, age 19-36 years, M = 23.45, SD = 3.32) which were recruited via the local psychology student mailing list took part in the experiment. All participants had normal or corrected-to-normal vision and normally used their right hand for handling computer mice. Participants signed an informed consent sheet at the beginning of the experiment and received partial course credit for participation.

Another very common weakness of the studies I’ve reviewed is that they’re often performed in labs using fake browsers, predetermined browsing scenarios, or by having people watch videos of page loads, all of which are very disconnected from the real experience of browsing the web.

Overall, we should remain skeptical of a study’s results when its experimental setup is questionable in those ways. While the 100ms study disproved the 100ms universality myth with just 20 people, that remains insufficient to prove that the different numbers that emerged are any more universal.

Everything sucks, now what?

Beyond magic numbers, my literature review revealed that very little real science has been done about web performance perception in general.

It is disappointing to find out that we don’t know much about web performance from a scientific perspective. WPO Stats might contain a lot of compelling-looking case studies, but the detailed data behind them is rarely, if ever, shared. And they’re usually about how performance improvements may drive sales, without answering fundamental questions about whether things feel fast to users. Additionally, when performance improvements don’t result in sales or traffic increases, they don’t become a case study or something people announce proudly, which results in a self-selection bias in industry stories of that nature.

My reaction to these disappointing findings from the literature review was to start working on original research of my own, on real Wikipedia users, as part of my work on the Wikimedia Performance Team; the first results will be published early next year. I encourage the web performance community to do the same. The lack of science is a solvable problem: anyone can do original research and publish the data alongside the findings, so that we can all make progress together on understanding how people truly perceive performance. And maybe we’ll be able to come up with new guidelines based on numbers backed by science.

Photo credit: Doc Searls, Tulane Public Relations CC-BY-SA 2.0