Noam Rosenthal (@nomster.bsky.social) is a software engineer for Google Chrome, working on rendering & speed metrics. Co-editor of multiple webperf specs. A seasoned web developer in a past life.
Real-world examples of how over-optimizing for metrics can be at odds with performance.
“When a measure becomes a target, it ceases to be a good measure”
An adage attributed to Charles Goodhart, a British economist.
Overview
In web performance, Goodhart’s law surfaces when optimizing for metrics and optimizing for UX lead to opposite conclusions. As a developer working on browser speed metrics, I’ve encountered this on multiple occasions, and it made me feel a combination of sadness and curiosity. Sadness, because it’s a hurdle in the way of making the web faster for everyone. Curiosity, because it’s a really interesting challenge, technically but also in terms of communication and knowledge. So I wanted to share three examples of Goodhart’s law that I’ve encountered, and try to learn from them.
Cases
Case 1: Lighthouse Cheating
See GoogleChrome/lighthouse#15829 for example. Some snake-oil “performance experts” offer a “100 score in Lighthouse”, while what they do in practice is serve a trimmed-down experience based on detecting Lighthouse (in the user agent or some such). The trimmed-down version can include a different set of images to load and a dramatic reduction in scripts, which results in a much higher score.
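As a rough illustration of the trick, here is a minimal sketch of what such Lighthouse detection can look like. The “Chrome-Lighthouse” user-agent token is how Lighthouse typically identifies itself; the selectors, attributes and script URL below are made up for the example, and this is not the code of any specific product.

```js
// Illustrative sketch only – the data-full-src attribute and script URL are invented.
// Lighthouse typically identifies itself with a "Chrome-Lighthouse" token in the UA string,
// which is what such "score boosters" appear to key off.
const isLighthouse = /Chrome-Lighthouse/.test(navigator.userAgent);

for (const img of document.querySelectorAll('img[data-full-src]')) {
  if (isLighthouse) {
    // Lab run: leave the heavy image out so LCP looks great in the report.
    img.removeAttribute('src');
  } else {
    // Real users still download the full-size image, so nothing gets faster for them.
    img.src = img.dataset.fullSrc;
  }
}

if (!isLighthouse) {
  // Real users also still pay for the third-party scripts that the lab run skips.
  const script = document.createElement('script');
  script.src = '/vendor/heavy-widget.js'; // hypothetical script URL
  document.head.append(script);
}
```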
Offering this kind of tool is a web performance dark pattern: providing people with filtered spectacles when they can’t afford to change the color of reality. Besides not actually improving anything, it destroys the usefulness of Lighthouse as a way to suggest improvements that could actually help.
It is a result of Goodhart’s law. Most web developers do not have the knowledge and bandwidth to look at web performance holistically, so they fall back on industry-standard metrics. And since any metric is an approximation, and therefore flawed, optimizing for the metric is often easier than optimizing for what it is supposed to measure.
Case 2: LCP & INP blind spots
Over the last few years, we’ve had to deal with more than a handful of “blind spots” in Core Web Vitals, and find bespoke solutions for some of them. For example, excluding big low-entropy images from counting as “large” in the LCP case, and INP no-ops: handling events asynchronously to separate the presentation feedback from the event itself.
Those holes become a case of Goodhart’s law only when people deliberately rely on them for better “performance”. For example, some people have suggested yielding immediately for every critical input. While this might quiet down some of the events contributing to INP, it slows down interaction feedback as a whole, resulting in a poorer user experience. In more complex cases where the main thread is very congested, this would also inevitably fail to help the metric itself, as more and more input feedback would be delayed by the general congestion rather than by the event-specific work.
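To make that concrete, here is a sketch of the “yield immediately on every input” pattern; the button and updateSearchResults() are hypothetical stand-ins for real UI work.

```js
// Sketch of the "immediately yield on every input" anti-pattern.
// The #search button and updateSearchResults() are hypothetical.
const button = document.querySelector('#search');

button.addEventListener('click', () => {
  // Deferring all the work to a later task lets the browser paint a frame
  // right away, so this event's contribution to INP looks tiny...
  setTimeout(() => {
    // ...but the user still waits just as long for the results to appear,
    // and on a congested main thread this task queues behind everything else.
    updateSearchResults();
  }, 0);
});
```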
Case 3: Framework benchmarks
Before the era of real user monitoring, people relied heavily on “classic” benchmarks, where some script runs multiple instances of the same operation using different techniques in order to compare them. A recent example I came across is the JS framework benchmark, which is apparently used by all the major and minor DOM frameworks/libraries to benchmark their performance.
This is not bad in itself, and classic benchmarks have their place in the set of performance tools, mainly as a way to catch regressions early in hot-path APIs. In fact, this particular benchmark has prevented browser regressions in DOM APIs on several occasions. Thanks!
However, it becomes a problem when Goodhart’s law comes into play – when frameworks over-optimize for how quickly they can sort a huge list of very simple elements, while overlooking much bigger performance pitfalls like unnecessary layout and parsing.
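For instance, a framework can be excellent at churning out thousands of simple rows and still cause layout thrashing on real pages. A contrived sketch of that kind of pitfall, which the benchmark score never surfaces (the .row class is made up for the example):

```js
// Contrived sketch of layout thrashing – a real-page cost that a
// "create/sort 10,000 simple rows" micro-benchmark never exercises.
// The '.row' class is made up for the example.
for (const row of document.querySelectorAll('.row')) {
  row.style.height = 'auto';        // write: invalidates layout
  const height = row.offsetHeight;  // read: forces a synchronous layout
  row.style.height = `${height}px`; // write again, so every iteration repeats the cycle
}
// Batching all reads before all writes would avoid the repeated forced layouts,
// but no amount of faster DOM-node creation shows up as a fix in the benchmark score.
```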
Summary
It seems like there are plenty of cases of optimizing for the metric instead of for what it stands for. Do those metrics “cease to be good”? I wouldn’t say so categorically. Instead, I would urge developers who work on performance to always be aware of Goodhart’s law, and to remain honest with themselves and with their customers about whether they’re optimizing for the experience or for the metric.
With an optimistic outlook for 2025, I think the web performance community is, slowly but surely, maturing: from a place where it over-optimized for “passing” Core Web Vitals (or Lighthouse before that) to a place of understanding that metrics are (and can only be) an approximate representation of user experience, that the real value lies in improving that experience, and that a metric is a tool, not a goal.