What goes into making a new Web Vital metric

12th Dec 2022 by Annie Sullivan

ABOUT THE AUTHOR

Annie Sullivan (@anniesullie) is a software engineer on Chrome's Web Platform team. She is passionate about building a better performing web for users across the globe. Her tenure as a Googler spans 17 years, with experience on the Toolbar, Docs, Web Search, and Chrome teams. Annie currently leads performance metric development on Chrome. She lives in Michigan with her husband Doug and two sons, and enjoys tinkering with laser cutters, metal etching, and new cooking techniques.

A question I get sometimes is what goes into making a new Web Vital metric. It’s a lot more involved than you might expect!

We start by deciding what to measure.
We figure out the building blocks for the metric, and validate that they’re solid.
We put the building blocks together into a score, and evaluate it.
We work on developer guidance, incorporating feedback as we go.

Deciding what to measure

First we need to decide what to measure. For Core Web Vitals, we’re focused on directly measuring the user experience. We want to measure what the user of a web site observes: when content becomes visible, when the response to an interaction occurs, and so on. This excludes other types of metrics that are valuable in other contexts. Some specific examples:

Core Web Vitals don’t include diagnostic metrics which give developers more technical details on what could be going wrong in their sites. These metrics are important, but there are a lot of them, and which to use depends on your use case. I won’t cover them here, but I wrote last year about how to integrate diagnostic metrics into your workflow.
Core Web Vitals don’t include user behavior metrics like bounce rate, session length, and rage clicks. These are important metrics which are often tied closely to a site’s business. But you need to understand your site’s business to know which ones matter to you and how much.

So what do we measure? User experience metrics which apply to all sites: things like page load time, responsiveness to user input, and frustrating content shifts. When we think about what to measure next, we look at user experience research, user complaints, and areas of developer concern. We also want to keep the overall set of metrics quite small, so we’re not splitting developers’ focus. If a new metric better represents a specific user experience than an existing one, we’ll replace the existing one. If we find that a new metric measures a user experience well, but that user experience isn’t a big problem for web content today, we won’t move forward with it.

We also try to plan ahead for the standards process: what can we measure in Chrome, and what could be specified and supported in other browsers? We might start out with some unknowns here, before we narrow down how we will measure.

Starting with the building blocks

When we think of an area we want to dig into, we start by looking at what signals we can expose in the browser. These are the building blocks of a new metric.

If we’re measuring what is shown to the user, we look at what we should expose about when the browser paints content and when it renders frames to the screen. If we’re measuring user input, we look at what we should expose about how the browser processes input. Since we generally need to tie together events that happen at different stages of the rendering pipeline, we spend a lot of time digging into traces, making sure we understand and correctly report how those events relate in a variety of circumstances. We start by doing this manually, getting feedback from the code experts in design and code review. We then often automate measurement across a large variety of sites, check the metric values for outliers, and dig into the traces.

We eventually come to the key building blocks of the metrics. For Interaction to Next Paint, the key building block is a user interaction. For Cumulative Layout Shift, the key building block is an individual layout shift. When we’ve got the building blocks defined and we’re pretty confident that we’re measuring the right thing, we scale it up a bit.

We add the signals to Chrome’s diagnostics to get a broader view of what they measure; we find pages with outlier and unexpected values and dig in and try to reproduce and understand if they’re correct.

We present our findings to the W3C Web Performance Working Group along with proposals for how the signals could fit into the Performance Timeline. We work to specify and add these building blocks directly to the Performance Timeline when possible: the individual layout shifts for Cumulative Layout Shift, the individual interactions for Interaction to Next Paint, etc. That way sites can collect the same metrics and potentially modify and adapt the high-level metrics to their needs. For example, here’s a great talk from the Excel team on how they’re using interactions in the Event Timing API. They aggregate differently than Interaction to Next Paint does, and we’re happy it suits their use case. Similarly, we based Largest Contentful Paint on the Element Timing API, so that if sites want to measure the timings of different elements on the page, the underlying APIs provide that flexibility. But there are limitations to what can be added to the Performance Timeline, as it has implications for privacy and security. Yoav Weiss gave an excellent talk on the topic, if you’re interested in learning more. This is why sometimes we’re able to report more about the contribution of third parties to performance in the Chrome User Experience Report and in lab tooling than we can in the JavaScript APIs.
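As a rough illustration of how a site could aggregate the same building blocks in more than one way, here is a minimal sketch that works on plain objects shaped like individual layout shifts. The names ShiftEntry, totalShift, and maxSessionWindow are made up for this example, and the one-second gap and five-second cap are assumptions modeled on the session-window definition of Cumulative Layout Shift. It is a sketch, not Chrome's implementation.

```typescript
// Minimal shape of one layout shift building block. The field names mirror
// the LayoutShift entries of the Layout Instability API; the data is made up.
interface ShiftEntry {
  startTime: number;       // milliseconds since navigation start
  value: number;           // score of this single layout shift
  hadRecentInput: boolean; // true if the shift closely followed user input
}

// Hypothetical aggregation 1: sum every shift not triggered by user input.
function totalShift(entries: ShiftEntry[]): number {
  return entries
    .filter((e) => !e.hadRecentInput)
    .reduce((sum, e) => sum + e.value, 0);
}

// Hypothetical aggregation 2: the largest "session window" of shifts.
// A window ends when the gap since the previous shift reaches maxGapMs
// or the window has lasted maxWindowMs (both values are assumptions).
function maxSessionWindow(
  entries: ShiftEntry[],
  maxGapMs: number = 1000,
  maxWindowMs: number = 5000
): number {
  const sorted = entries.slice().sort((a, b) => a.startTime - b.startTime);
  let best = 0;
  let windowSum = 0;
  let windowStart = 0;
  let lastTime = -Infinity; // forces the first counted shift to open a window
  for (const e of sorted) {
    if (e.hadRecentInput) continue; // shifts after user input are not counted
    const startsNewWindow =
      e.startTime - lastTime >= maxGapMs ||
      e.startTime - windowStart >= maxWindowMs;
    if (startsNewWindow) {
      windowSum = 0;
      windowStart = e.startTime;
    }
    windowSum += e.value;
    lastTime = e.startTime;
    best = Math.max(best, windowSum);
  }
  return best;
}

// Made-up shifts: two bursts, plus one shift caused by user input.
const sampleShifts: ShiftEntry[] = [
  { startTime: 0, value: 0.1, hadRecentInput: false },
  { startTime: 500, value: 0.2, hadRecentInput: false },
  { startTime: 3000, value: 0.05, hadRecentInput: false },
  { startTime: 3200, value: 0.3, hadRecentInput: true },
  { startTime: 3400, value: 0.1, hadRecentInput: false },
];

const summed = totalShift(sampleShifts);
const windowed = maxSessionWindow(sampleShifts);
```

In a browser, entries like these would come from a PerformanceObserver observing the layout-shift entry type. The two functions produce different scores for the same page, which is exactly the kind of adaptation the exposed building blocks make possible.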

Putting the blocks together into a score for the page

Each Core Web Vital metric reports a single score for each page load. But usually there are multiple building blocks in a page load. For example, during a page load there can be many paints. We wanted to understand which one might be most meaningful to the user. The last image? Largest text? We brainstormed and evaluated several options, eventually coming to Largest Contentful Paint.

We start this process by coming up with several ideas for summarizing the building blocks into a score. We think about statistics like average, max, and various percentiles. We look at windowing approaches. We weigh the pros and cons of each idea, and implement several permutations of each. Then we usually do some manual evaluation. The process can look something like this:

Come up with several user journeys we want to measure well.
Record videos and Chrome traces of the journeys.
Have users rate the videos of the journeys.
Compute each metric permutation using the trace for each journey, and see which permutations rate the journeys most like the users did, and which least.
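The last step above can be sketched in code. This is only an illustration under stated assumptions: the journey data is invented, the names Journey, Aggregator, and agreementWithUsers are hypothetical, and Spearman rank correlation is one reasonable way to quantify "rates the journeys like the users did", not necessarily the comparison we use.

```typescript
// One recorded journey: building-block values taken from its trace (for
// example, interaction latencies in ms) plus the users' rating of the video.
interface Journey {
  name: string;
  blocks: number[];
  userSlownessRating: number; // assumed scale: 1 (fine) to 5 (very slow)
}

type Aggregator = (values: number[]) => number;

const average: Aggregator = (v) => v.reduce((s, x) => s + x, 0) / v.length;
const maximum: Aggregator = (v) => v.reduce((m, x) => Math.max(m, x), -Infinity);
// Nearest-rank percentile, with p between 0 and 100.
const percentile = (p: number): Aggregator => (v) => {
  const sorted = v.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
};

// Rank values from 1..n, giving tied values the average of their ranks.
function ranks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const result = new Array<number>(values.length);
  let k = 0;
  while (k < order.length) {
    let end = k;
    while (end + 1 < order.length && order[end + 1].v === order[k].v) end++;
    const avgRank = (k + end) / 2 + 1;
    for (let j = k; j <= end; j++) result[order[j].i] = avgRank;
    k = end + 1;
  }
  return result;
}

// Spearman correlation: Pearson correlation computed on the ranks.
// Returns NaN if either input is constant.
function spearman(x: number[], y: number[]): number {
  const rx = ranks(x);
  const ry = ranks(y);
  const mx = average(rx);
  const my = average(ry);
  let cov = 0;
  let vx = 0;
  let vy = 0;
  for (let i = 0; i < rx.length; i++) {
    cov += (rx[i] - mx) * (ry[i] - my);
    vx += (rx[i] - mx) ** 2;
    vy += (ry[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// Score every permutation by how closely it orders journeys like users did.
function agreementWithUsers(
  journeys: Journey[],
  permutations: Record<string, Aggregator>
): Record<string, number> {
  const ratings = journeys.map((j) => j.userSlownessRating);
  const result: Record<string, number> = {};
  for (const name of Object.keys(permutations)) {
    const scores = journeys.map((j) => permutations[name](j.blocks));
    result[name] = spearman(scores, ratings);
  }
  return result;
}

// Invented journeys, for illustration only.
const journeys: Journey[] = [
  { name: "A", blocks: [50, 60, 55], userSlownessRating: 1 },
  { name: "B", blocks: [40, 45, 600], userSlownessRating: 4 },
  { name: "C", blocks: [200, 210, 190], userSlownessRating: 3 },
  { name: "D", blocks: [300, 900, 350], userSlownessRating: 5 },
];

const agreement = agreementWithUsers(journeys, {
  average,
  max: maximum,
  p50: percentile(50),
});
```

With this invented data, journey B has one very slow interaction that users noticed, so the median-based permutation agrees with the user ratings less well than the max-based one. Real evaluations involve many more journeys and a closer look at why a permutation disagrees.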

From there, we usually end up with a few top candidates, and some thoughts about why they matched the user ratings better than the others. We take these top candidates and add them to Chrome diagnostics, so that we can do a larger-scale evaluation. We rank all the sites by how they scored on each permutation of our aggregation, and look at the sites that were ranked most differently. Which score was more fair? Which represented the user experience better? From this we gain a deeper understanding of the tradeoffs of the different aggregations.
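To make the ranking comparison concrete, here is a minimal sketch of finding the sites that two candidate aggregations rank most differently. The site names and scores are invented, and SiteScores, rankBy, and mostDifferentlyRanked are hypothetical names for this example only.

```typescript
// Scores for one site under each candidate aggregation, keyed by
// permutation name. Lower scores are assumed to be better.
interface SiteScores {
  site: string;
  scores: Record<string, number>;
}

interface RankDelta {
  site: string;
  rankA: number;
  rankB: number;
  delta: number; // absolute difference between the two rank positions
}

// Position of each site (1 = best) when sorted by one permutation's score.
// Ties keep their input order; a real analysis may want to handle them.
function rankBy(sites: SiteScores[], key: string): Record<string, number> {
  const sorted = sites.slice().sort((x, y) => x.scores[key] - y.scores[key]);
  const positions: Record<string, number> = {};
  sorted.forEach((s, i) => {
    positions[s.site] = i + 1;
  });
  return positions;
}

// The topN sites whose rank changes most between permutations a and b.
function mostDifferentlyRanked(
  sites: SiteScores[],
  a: string,
  b: string,
  topN: number
): RankDelta[] {
  const rankA = rankBy(sites, a);
  const rankB = rankBy(sites, b);
  return sites
    .map((s) => ({
      site: s.site,
      rankA: rankA[s.site],
      rankB: rankB[s.site],
      delta: Math.abs(rankA[s.site] - rankB[s.site]),
    }))
    .sort((x, y) => y.delta - x.delta)
    .slice(0, topN);
}

// Invented data, for illustration only.
const sampleSites: SiteScores[] = [
  { site: "one.example", scores: { average: 10, max: 50 } },
  { site: "two.example", scores: { average: 20, max: 30 } },
  { site: "three.example", scores: { average: 30, max: 500 } },
  { site: "four.example", scores: { average: 40, max: 45 } },
];

const toInspect = mostDifferentlyRanked(sampleSites, "average", "max", 2);
```

The sites returned are the ones worth opening by hand: they are where the aggregations disagree, so they show most clearly which score is fairer to the actual user experience.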

We also look at outliers with very good or very poor scores to make sure they’re not caused by errors in the metric.

We also make sure to take a look at sites in a variety of different verticals, along with sites using a variety of different technologies, to ensure we’re not missing problems specific to certain categories of sites.

After this, we have a metric definition!

Developer guidance and incorporating feedback

Once we have a metric definition, we still have a lot of work to do to collect and incorporate feedback.

We start with some developer guidance:

We determine a threshold for the metric. As we’ve previously written, we look at two things when defining a threshold. First, a review of user experience research relevant to the metric, to determine what an ideal user experience would be. Second, an analysis to understand what’s feasible for the best web content currently. Sometimes device quality or network connections prevent even the best quality web content from consistently meeting the ideal user experience. So we look at what is consistently achievable by the best web content, and take that into account when setting thresholds.
We integrate the metric into our tooling so developers can try it out. As an example, here’s the tooling integration for Interaction to Next Paint.
We write documentation for the metric on https://web.dev.

Then we collect and incorporate feedback:

We listen to feedback from developers who are measuring the metric for their sites, and working to improve this aspect of their user experience. We get feedback through channels like the web-vitals-feedback@googlegroups.com list and discussions with partners. We listen to the difficulties they encountered, the tooling and documentation they’d like, and the impact on their business metrics.
We continue to present our progress to the W3C Web Performance Working Group, and get feedback there as well. The feedback we get from other browser implementers is especially helpful for ensuring the details of the API and implementation are well designed and portable to other browsers.

And that’s how we come up with a new metric. But the work doesn’t stop there: we continue to monitor and improve the metric. Developer feedback and bug reports are critical for developing, and iterating upon, a metric, so we really would encourage all web developers to send us as much feedback as possible! Feel free to send general feedback to web-vitals-feedback@googlegroups.com, or if you find a specific bug and want to show us a minimized test case, you can report it directly to the Chromium bug tracker.

Web Performance Calendar