Using Diagnostic Metrics

30th Dec 2021 by Annie Sullivan

ABOUT THE AUTHOR

Annie (@anniesullie) is a software engineer with Google. She is passionate about building a better performing web for users across the globe. Her tenure as a Googler spans 16 years, with experience on the Toolbar, Docs, Web Search, and Chrome teams. Annie currently leads development of the Core Web Vitals metrics. She lives in Michigan with her husband Doug and two sons, and enjoys tinkering with laser cutters, metal etching, and new cooking techniques.

User experience vs. diagnostic metrics

A few years ago, Google introduced the Core Web Vitals metrics. If you’ve been following web performance for a while, you know there are already a lot of metrics. Dozens and dozens of them, and new ones are being experimented with continually. So it’s not surprising that one of the main questions we get from web performance enthusiasts about Core Web Vitals is how we chose the metrics, and why we didn’t include various other metrics.

Usually the answer to these questions goes back to goals. The goal is to provide developers with a small number of key metrics which represent critical user experiences on the web. All of the metrics follow a set of guiding principles:

They apply to all websites.
They represent a direct facet of the user experience.
They are measurable in the field.

Most performance metrics don’t directly measure a user experience in a way that applies to all websites. Instead, they’re diagnostic metrics: they help diagnose a user experience problem by narrowing it down in some way, instead of measuring that user experience directly.

Diagnostic metrics by example

Let’s use page load metrics as an example. Largest Contentful Paint (a user experience metric) directly measures when the main content of the page is displayed to the user. We’ve been asked why we didn’t also include Time to First Byte (TTFB), which measures the time between the request for a resource and when the first byte of a response begins to arrive. This point in time during the page’s load isn’t visible to the user. But it can be extremely useful as a diagnostic metric to help you understand the cause of a slow page load, and whether you should focus on the initial network request and server response or on the content after the browser starts loading it. Still, the fact that this point in time isn’t visible to the user is important. TTFB doesn’t account for design decisions site authors may have made. For example, a site using server-side rendering may purposefully trade a slower Time to First Byte for a faster overall page render.

Does that mean that Time to First Byte is a bad metric? Of course not! It measures an important milestone in the page load, and whether you’ve chosen server-side rendering or client-side rendering you still may want to optimize and monitor it. But it’s not measuring a user experience, so you should only focus on it if improving it improves the user experience of your site.

As I said earlier, there are dozens and dozens of metrics like this. And even though they can all provide useful information, you can’t possibly monitor and improve all of them all the time. You’ll need to choose the most important ones for your site and focus on those. So how do you pick?

Choosing diagnostic metrics

As you work on improving your site’s user experience over time, you’ll evolve the set of diagnostic metrics you focus on, and how you use them. But wherever you are in that process, here’s some high-level guidance on how to think about what metrics to look at next.

Start with the user experience problem

Diagnostic metrics help you diagnose a user experience problem. So start with a problem! We made the Core Web Vitals metrics to give you a starting point. Take a look at how many users on your site have experiences beyond the thresholds for “good” to get started.
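As a rough sketch of that first look, the snippet below counts how many page loads fall within the “good” thresholds. It assumes you already collect per-page-load Largest Contentful Paint and Cumulative Layout Shift values somewhere. The `PageLoad` shape, the `shareWithinGood` helper, and the sample numbers are hypothetical; the thresholds are the published “good” limits of 2.5 seconds for Largest Contentful Paint and 0.1 for Cumulative Layout Shift.

```typescript
// Hypothetical shape for one page load collected in the field.
interface PageLoad {
  lcpMs: number; // Largest Contentful Paint, in milliseconds
  cls: number; // Cumulative Layout Shift score (unitless)
}

// Published "good" thresholds for the two metrics used here.
const GOOD_LCP_MS = 2500;
const GOOD_CLS = 0.1;

// Returns the fraction (0..1) of page loads for which isGood returns true.
function shareWithinGood(
  loads: PageLoad[],
  isGood: (load: PageLoad) => boolean
): number {
  if (loads.length === 0) {
    return 0;
  }
  return loads.filter(isGood).length / loads.length;
}

// Made-up sample data, for illustration only.
const sampleLoads: PageLoad[] = [
  { lcpMs: 1800, cls: 0.02 },
  { lcpMs: 2400, cls: 0.25 },
  { lcpMs: 3900, cls: 0.05 },
  { lcpMs: 5200, cls: 0.31 },
];

const goodLcpShare = shareWithinGood(sampleLoads, (l) => l.lcpMs <= GOOD_LCP_MS); // 0.5
const goodClsShare = shareWithinGood(sampleLoads, (l) => l.cls <= GOOD_CLS); // 0.5
```

Whichever metric has the smallest share of good experiences is a reasonable place to start digging.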

Of course, there are more ways to measure user experience than just the Core Web Vitals. If your site has a more product-specific user experience metric that needs improvement, you can focus on that instead. Here are examples of user experience metrics that Airbnb and Facebook developed.

But maybe you’ve just solved a problem. In that case, you probably had to carefully optimize some specific areas of your site, and you want to keep those optimizations as new changes to the site are made. Diagnostic metrics can help here too, allowing you to monitor those areas in detail so they don’t regress.

Think about your use case

One thing my team thinks about a lot in designing metrics is the use cases in which developers will view the metrics and try to improve them. There are three main use cases we think about for web developers trying to improve the performance of their site:

Local debugging: Sitting in front of the site (or a WebPageTest run of the site) with some tools and trying to understand what’s happening right in front of you. This is usually where web performance work begins. Local debugging tools have many, many diagnostic metrics available, which can be both helpful and overwhelming.
Lab regression testing: Running automated tests on a continuous build. These tests try to prevent changes that make the site worse again after you’ve improved it.
Field monitoring and debugging: Understanding the experience users are having on the site as they browse. Field monitoring is difficult because there are a lot of complications to getting accurate performance metrics in the field (privacy restrictions on the data browsers will collect, limitations of analytics tooling). But it’s important to understand if users of your site are having a good experience.

Pick a strategy to start with

Diagnostic metrics don’t provide a complete picture of the user experience; instead they narrow in on a part of it. There are three major strategies they take to narrow things down, and each works better in some use cases than in others. Hopefully thinking through your problem and use case can help you choose the best strategy for your situation.

Summation diagnostics

Summation diagnostics show how individual parts add up to a whole. This is usually the first place to start when digging into a problem. There are a lot of different things we could consider summing to understand a problem. Here are some examples:

We could look at specific points in time that add up to when the Largest Contentful Paint happened. Time to First Byte is one of those points. If we see that most of the time until Largest Contentful Paint is before the first byte of the HTML response, we know that we’ll want to break that time down into smaller parts that sum up to the whole, probably using the components in the NavigationTiming API. Similarly, we could break down moments after the first byte if we saw that most of the time until Largest Contentful Paint occurs after it.

Summation diagnostics don’t need to sum up timings. For Cumulative Layout Shift, it’s more useful to break down individual shifts. Start with the largest shift, and reduce it from there.

Similarly, going back to Largest Contentful Paint, we could look at the time spent in subsystems, which sum up to the whole time to LCP. How much time is spent in downloading JavaScript? Running JavaScript? Downloading images? CSS?
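To make the first of these examples concrete, here is a minimal sketch of a timing breakdown for a single page load. It assumes you have a handful of milestones in milliseconds since navigation start. The first four field names mirror fields in the NavigationTiming API, but the `LoadMilestones` and `LcpPhase` types, the `breakDownLcp` helper, the phase labels, and the sample numbers are all hypothetical. Each phase is simply the time between two consecutive milestones, so the phases add up to the full time to Largest Contentful Paint.

```typescript
// Hypothetical input: milestones for one page load, in milliseconds since
// navigation start. The first four names mirror NavigationTiming fields.
interface LoadMilestones {
  domainLookupStart: number;
  connectStart: number;
  requestStart: number;
  responseStart: number; // first byte of the HTML response arrives
  lcp: number; // Largest Contentful Paint time
}

interface LcpPhase {
  label: string;
  durationMs: number;
  shareOfLcp: number; // fraction of the total time to LCP, 0..1
}

// Splits the time to LCP into phases between consecutive milestones.
function breakDownLcp(m: LoadMilestones): LcpPhase[] {
  // Each entry names a phase and the milestone at which that phase ends.
  const boundaries: Array<[string, number]> = [
    ["before DNS lookup", m.domainLookupStart],
    ["DNS lookup", m.connectStart],
    ["connection setup", m.requestStart],
    ["request sent to first byte", m.responseStart],
    ["first byte to LCP", m.lcp],
  ];
  const phases: LcpPhase[] = [];
  let previous = 0;
  for (const [label, end] of boundaries) {
    const durationMs = end - previous;
    phases.push({
      label: label,
      durationMs: durationMs,
      shareOfLcp: m.lcp > 0 ? durationMs / m.lcp : 0,
    });
    previous = end;
  }
  return phases;
}

// Made-up numbers, for illustration only.
const examplePhases = breakDownLcp({
  domainLookupStart: 20,
  connectStart: 60,
  requestStart: 150,
  responseStart: 900,
  lcp: 2000,
});
// Durations: 20, 40, 90, 750, 1100 (summing to 2000).
```

In this made-up load, 55% of the time to LCP falls after the first byte, so that is the part to break down further.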

Thinking back to use cases, it’s easy to see that summation diagnostics are great in the lab. They show you how things break down, which helps you both pinpoint problems and figure out which of the many lab metrics to look at next. If you’re not sure how to understand your site’s performance, summation diagnostics in the lab are the best place to dig in.

If you’re seeing very different numbers in the lab and the field, summation diagnostics can help you identify problems that users experience which aren’t happening in the lab. But be careful to understand how your data is aggregated. Let’s say you want to dig into what’s happening on the 10% of page loads with the worst Largest Contentful Paint. You can’t just look at the worst 10% of Time to First Byte to split out whether the problem is before or after the first byte, because those pages might be a different 10% than the ones with poor Largest Contentful Paint. Ideally you can look directly at how summation diagnostics break down as a percentage on each page load. But if you’re not able to do that, perhaps you can at least use slicing diagnostics to split out page loads above some specific value of Largest Contentful Paint and look at summation metrics for those separately.
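The aggregation pitfall is easier to see with numbers. The sketch below assumes you can get Time to First Byte and Largest Contentful Paint for the same page load in one record. The `FieldLoad` shape, the two helpers, and the data are hypothetical. It selects the worst 10% of loads by Largest Contentful Paint first, and only then looks at how much of each of those loads was spent before the first byte.

```typescript
// Hypothetical record: both values come from the same page load.
interface FieldLoad {
  ttfbMs: number;
  lcpMs: number;
}

// Returns the slowest `fraction` of page loads by LCP
// (at least one load when the input is not empty).
function worstByLcp(loads: FieldLoad[], fraction: number): FieldLoad[] {
  const sorted = loads.slice().sort((a, b) => b.lcpMs - a.lcpMs);
  const count = Math.max(1, Math.ceil(sorted.length * fraction));
  return sorted.slice(0, count);
}

// For each load, the share of the time to LCP spent before the first byte.
function ttfbShares(loads: FieldLoad[]): number[] {
  return loads.map((l) => (l.lcpMs > 0 ? l.ttfbMs / l.lcpMs : 0));
}

// Made-up data: the load with the worst TTFB (1900 ms) has a fine LCP,
// and the load with the worst LCP (8000 ms) has an unremarkable TTFB.
const fieldLoads: FieldLoad[] = [
  { ttfbMs: 200, lcpMs: 1500 },
  { ttfbMs: 250, lcpMs: 1800 },
  { ttfbMs: 300, lcpMs: 2100 },
  { ttfbMs: 1900, lcpMs: 2400 },
  { ttfbMs: 220, lcpMs: 1700 },
  { ttfbMs: 280, lcpMs: 2000 },
  { ttfbMs: 240, lcpMs: 1600 },
  { ttfbMs: 400, lcpMs: 8000 },
  { ttfbMs: 260, lcpMs: 1900 },
  { ttfbMs: 310, lcpMs: 2200 },
];

const slowestLoads = worstByLcp(fieldLoads, 0.1); // the single 8000 ms load
const slowestTtfbShares = ttfbShares(slowestLoads); // [0.05]
```

Looking at the worst Time to First Byte on its own would point at the 1900 ms load, which isn’t one of the slow loads at all. On the load that is actually slow, only 5% of the time is spent before the first byte.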

Proxy diagnostics

Sometimes the thing you want to measure is difficult to measure directly. There might not be a way to get the metric in all browsers, or maybe you can get the data but it’s really noisy. In those cases, proxy diagnostics stand in for the metric you really want. For example, maybe you are starting to build a site with a lot of images. The site’s not live yet, so you can’t predict exactly how long it will take for users to be able to see every image. Still, you know that the smaller the byte size of the images, the faster they should load. So you can measure and optimize that directly.

Proxy metrics can be really fantastic for lab regression testing. A metric like “were all images well compressed” doesn’t directly measure how long the images took to be visible, but it’s much less noisy. And it’s very actionable for preventing regressions. You can set up continuous integration which checks for uncompressed images. Similarly, while bytes of JavaScript downloaded doesn’t fully measure main thread blocking, it’s very easy to monitor for big jumps and catch unnecessary code being added before it lands.
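Here is a minimal sketch of what such a continuous integration check might look like. It assumes your build can produce a list of output files with their byte sizes; how that list is gathered is left out. The `BuiltAsset` and `ByteBudget` types, the `findBudgetViolations` helper, the file names, and the budget values are all hypothetical, and a per-image byte limit is used here as a simple stand-in for “well compressed”.

```typescript
// Hypothetical description of one file produced by the build.
interface BuiltAsset {
  path: string;
  kind: "image" | "script";
  bytes: number;
}

// Hypothetical budget: a per-image limit and a limit for all scripts combined.
interface ByteBudget {
  maxImageBytes: number;
  maxTotalScriptBytes: number;
}

// Returns one human-readable message per budget violation.
function findBudgetViolations(assets: BuiltAsset[], budget: ByteBudget): string[] {
  const violations: string[] = [];
  let totalScriptBytes = 0;
  for (const asset of assets) {
    if (asset.kind === "image" && asset.bytes > budget.maxImageBytes) {
      violations.push(
        `${asset.path}: ${asset.bytes} bytes exceeds the image budget of ${budget.maxImageBytes}`
      );
    }
    if (asset.kind === "script") {
      totalScriptBytes += asset.bytes;
    }
  }
  if (totalScriptBytes > budget.maxTotalScriptBytes) {
    violations.push(
      `scripts: ${totalScriptBytes} bytes exceeds the script budget of ${budget.maxTotalScriptBytes}`
    );
  }
  return violations;
}

// Made-up build output, for illustration only.
const buildAssets: BuiltAsset[] = [
  { path: "img/hero.jpg", kind: "image", bytes: 480000 },
  { path: "img/logo.png", kind: "image", bytes: 12000 },
  { path: "js/app.js", kind: "script", bytes: 210000 },
  { path: "js/vendor.js", kind: "script", bytes: 150000 },
];

const budgetViolations = findBudgetViolations(buildAssets, {
  maxImageBytes: 200000,
  maxTotalScriptBytes: 300000,
});
// Two violations: img/hero.jpg, and the 360000 bytes of scripts combined.
```

A continuous integration step could fail the build whenever the returned list is not empty, which catches the regression before it lands.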

Slicing diagnostics

Often when you’re doing field monitoring, you’ll see some users having poor experiences, and you get some hints from summation diagnostics, but you can’t quite figure it out. You want more clues to figure out how to reproduce the problem. Slicing diagnostics break up a metric based on some context. You can imagine splitting up Largest Contentful Paint by effective connection type or system RAM. You could also split it by which element was the largest contentful element painted; maybe sometimes users are seeing a different one than you see locally. The same thing goes for Cumulative Layout Shift: you could split by the most shifted element. This article has some great examples of using Google Analytics and BigQuery to do some really helpful slicing diagnostics.
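As a small illustration of slicing, the sketch below groups page loads by effective connection type and reports the 75th percentile of Largest Contentful Paint for each group. It assumes the connection type was recorded alongside each load. The `SlicedLoad` shape, the helper names, and the data are hypothetical, and the nearest-rank percentile is just one simple choice; your analytics tooling may compute percentiles differently.

```typescript
// Hypothetical record: one page load plus the context to slice by.
interface SlicedLoad {
  lcpMs: number;
  effectiveType: string; // for example "4g" or "3g"
}

// Nearest-rank percentile; returns NaN for an empty list.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  const value = sorted[rank - 1];
  return value === undefined ? NaN : value;
}

// Groups loads by effective connection type and returns the p75 LCP per group.
function p75BySlice(loads: SlicedLoad[]): { [slice: string]: number } {
  const groups: { [slice: string]: number[] } = {};
  for (const load of loads) {
    const bucket = groups[load.effectiveType] || [];
    bucket.push(load.lcpMs);
    groups[load.effectiveType] = bucket;
  }
  const result: { [slice: string]: number } = {};
  for (const slice of Object.keys(groups)) {
    result[slice] = percentile(groups[slice] || [], 75);
  }
  return result;
}

// Made-up data, for illustration only.
const slicedLoads: SlicedLoad[] = [
  { lcpMs: 1200, effectiveType: "4g" },
  { lcpMs: 3000, effectiveType: "3g" },
  { lcpMs: 1500, effectiveType: "4g" },
  { lcpMs: 4200, effectiveType: "3g" },
  { lcpMs: 1800, effectiveType: "4g" },
  { lcpMs: 5100, effectiveType: "3g" },
  { lcpMs: 2100, effectiveType: "4g" },
  { lcpMs: 6400, effectiveType: "3g" },
];

const lcpP75BySlice = p75BySlice(slicedLoads); // { "4g": 1800, "3g": 5100 }
```

The same grouping works for any other context you record with each page load, such as system RAM or the element that was the largest contentful paint.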

Putting it all together

I hope you found this view of diagnostic metrics useful! Here are some resources to help you implement a strategy that works for your use case:

This article on custom metrics goes over web performance APIs you can use to generate all sorts of diagnostic metrics for any of the use cases I covered above.
If you want to dig deeper into what breakdown metrics you should look into, I recommend learning about the critical rendering path and why things are often blocked on the main thread. This will help you understand the reasons behind diagnostic metrics available in web performance tools, and which ones likely apply to your site.
This article on measuring and debugging field performance gives some details about how to implement slicing strategies.

Web Performance Calendar