Web Performance Calendar

How using Server-Timing API helped bring > 70% perf improvement

Nishu Goel — Mon, 01 Jan 2024 00:32:26 +0000

When working on the web, we all end up in situations where the experience feels a bit janky for the user. You might notice it while dogfooding, users might report the slowness, or you might see it yourself in your performance metrics or tools.

In such a situation, the first step is to check the obvious things: Are the images lazy-loaded? Is the bundle size too big? Is the main thread blocked? What does the Performance panel recording show? Is the data coming from the backend taking longer than it should?

If it’s one of the first problems, we’d fire up our favorite devtools and try to find the root cause. If the data is slow, we can see how long it takes from the duration of the network requests. And if the backend does turn out to be slow, the next step is to dig into the logs to find which piece of code in the endpoint takes the time, to embed some performance information in the response, or to use other performance tools.

We had a similar situation at epilot, where we use Elasticsearch to look up, for example, all the products related to an order, as well as all the other entities that the order itself is a relation of.

This meant looking for both the direct relations of the order and its reverse relations.

In Elasticsearch terms, this meant:

1. Getting the list of related products of the order.
2. Searching the documents in Elasticsearch for all those related IDs.
3. Searching all other documents for ones that have this order as a relation (the reverse relation described above).
4. Returning all the found relations.

Showcasing two products as relations of an order

Entities and relationships

This worked great for the first few years and got us exactly what we needed, quickly. However, as we scaled, the number of documents indexed in ES grew too, which made step 3 above expensive and slow! The logic looked for an entity ID in all the indexed documents to find out whether it was a relation or not. For orders with a lot of products, the endpoint took seconds to respond.

We had to fix the logic or update the design as soon as possible.

The first attempt was to make the ES query more specific about what it was looking up, then check whether the change helped and how the results looked for entities with many relations. This meant digging through a huge amount of CloudWatch logs to see the total time taken and whether it was any better.

We changed the search query to an IDs query, which looks only at the IDs of the indexed documents rather than searching through all the fields of those ES documents.

GET /_search
{
  "query": {
    "ids" : {
      "values" : ["1", "4", "100"]
    }
  }
}

We expected this to solve the problem completely and be very fast. Unfortunately, that wasn’t the case.

What we were missing was visibility into which part of the code was really adding to the endpoint’s response time. We needed to observe the different steps of the process, find every section where the time could be reduced, and confirm whether ES was really the culprit.

This is where the Server-Timing API came into the picture: something that would break the request down into pieces and show us what was expensive. The API lets the backend pass timings specific to a request to the browser. This meant we could now see, right in the network requests, what the response times looked like, which part of the function took the most time, and so on.

To support this, the backend needs to send a Server-Timing response header containing timings for the methods we want to measure. Something like this:

HTTP/1.1 200 OK
Server-Timing: db;dur=53, app;dur=47.2

Each metric in the Server-Timing header can carry three pieces of information:

metric name
duration
description of the metric

In the above example:
db: the metric name
dur=53: the time in milliseconds the db metric took (for instance, the time taken to fetch some data from the database)
app: another metric name, with dur=47.2 being its duration

The header can carry multiple metrics separated by commas, which makes it informative while staying very lightweight. It is still recommended to keep metric names as short as possible.
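To make the header syntax concrete, here is a minimal TypeScript sketch that serializes a list of metrics into a Server-Timing value: parameters within a metric are separated by semicolons and metrics by commas. The `ServerTimingMetric` type and `formatServerTiming` function are illustrative names, not part of any library, and the sketch assumes descriptions contain no double quotes or backslashes.

```typescript
// Illustrative shape of one metric: name, optional duration (ms), optional description.
type ServerTimingMetric = {
  name: string;
  duration?: number;
  description?: string;
};

// Serializes metrics into a Server-Timing header value, e.g. "db;dur=53, app;dur=47.2".
// Assumption: descriptions contain no double quotes or backslashes, so they
// can be wrapped in quotes without escaping.
function formatServerTiming(metrics: ServerTimingMetric[]): string {
  return metrics
    .map(({ name, duration, description }) => {
      const parts = [name];
      if (duration !== undefined) {
        // Round to two decimals to keep the header small.
        parts.push(`dur=${Math.round(duration * 100) / 100}`);
      }
      if (description) {
        parts.push(`desc="${description}"`);
      }
      return parts.join(";");
    })
    .join(", ");
}
```

For example, `formatServerTiming([{ name: "db", duration: 53 }, { name: "app", duration: 47.2 }])` produces the same value as the header shown above.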

An example of what the returned Server-Timing header looks like:

db; dur=142.715967; desc="getRelationsForEntity",flt; 
dur=36.777609; desc="filterEntityListByAccess",pgn; 
dur=22.96549; desc="getPaginatedRelations",hyd; 
dur=0.64605; desc="hydrateRelations",ddp; 
dur=0.13311699999999999; desc="dedupeRelations",total; dur=721.388338

This information then shows up as a more visual and helpful view in the Timing tab of the request in the Network panel, and it is also available in JavaScript through the PerformanceServerTiming interface.

The three properties map to PerformanceServerTiming as follows:

“name” -> PerformanceServerTiming.name
“dur” -> PerformanceServerTiming.duration
“desc” -> PerformanceServerTiming.description

The W3C Server Timing working draft has more information.
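Going the other way, the sketch below parses a header value into objects with the same `name`, `duration`, and `description` property names as PerformanceServerTiming, which can be handy in a Node.js test or log processor where the browser interface is not available. It is a simplified, hypothetical parser (`parseServerTiming` is not a library function): it respects commas and semicolons inside quoted descriptions, but it does not handle backslash escapes and ignores parameters other than `dur` and `desc`.

```typescript
type ParsedServerTiming = { name: string; duration: number; description: string };

// Splits `input` on `separator`, ignoring separators inside double quotes.
function splitOutsideQuotes(input: string, separator: string): string[] {
  const parts: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === '"') inQuotes = !inQuotes;
    if (ch === separator && !inQuotes) {
      parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

// Parses a Server-Timing header value into PerformanceServerTiming-like objects.
// Missing or non-numeric durations default to 0, missing descriptions to "".
function parseServerTiming(header: string): ParsedServerTiming[] {
  return splitOutsideQuotes(header, ",")
    .map((metric) => splitOutsideQuotes(metric, ";").map((part) => part.trim()))
    .filter((params) => params[0] !== "")
    .map(([name, ...params]) => {
      const result: ParsedServerTiming = { name, duration: 0, description: "" };
      for (const param of params) {
        const eq = param.indexOf("=");
        if (eq === -1) continue;
        const key = param.slice(0, eq).trim().toLowerCase();
        // Strip surrounding quotes from quoted values.
        const value = param.slice(eq + 1).trim().replace(/^"(.*)"$/, "$1");
        const asNumber = Number(value);
        if (key === "dur" && Number.isFinite(asNumber)) result.duration = asNumber;
        if (key === "desc") result.description = value;
      }
      return result;
    });
}
```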

Timings tab to see server response timings

These values are standalone metrics rather than a sequence of timings, so they do not appear in a waterfall pattern.

This means we can now tackle both frontend performance problems and backend slowness in one place: no more jumping into the logs to find out why a response was slow, and then heading back to the devtools to find the frontend problem.

We can also read this data through the Navigation Timing (PerformanceNavigationTiming) and Resource Timing (PerformanceResourceTiming) APIs and send it to our analytics tools to create metrics and monitors.

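const performanceObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.serverTiming is an array of PerformanceServerTiming objects
    // (for cross-origin resources it stays empty unless the response
    // also sends a Timing-Allow-Origin header that allows this origin)
    console.log('Server Timing', entry.serverTiming);
  }
});

// the document itself
performanceObserver.observe({type: 'navigation', buffered: true});
// subresources such as fetch/XHR calls to the API
performanceObserver.observe({type: 'resource', buffered: true});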
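Before sending the data to an analytics tool, it can help to flatten each entry’s `serverTiming` array into one row per metric. This is only a sketch of one way to shape that data: `ServerTimingRow` and `toServerTimingRows` are hypothetical names, and the input type lists just the fields the function reads, so real PerformanceResourceTiming or PerformanceNavigationTiming entries can be passed in directly.

```typescript
// Only the fields this sketch reads from a PerformanceResourceTiming /
// PerformanceNavigationTiming entry.
type EntryWithServerTiming = {
  name: string; // URL of the document or resource
  serverTiming: ReadonlyArray<{ name: string; duration: number; description: string }>;
};

// Hypothetical row shape for an analytics pipeline.
type ServerTimingRow = { url: string; metric: string; durationMs: number; description: string };

// Produces one row per Server-Timing metric per entry.
function toServerTimingRows(entries: ReadonlyArray<EntryWithServerTiming>): ServerTimingRow[] {
  const rows: ServerTimingRow[] = [];
  for (const entry of entries) {
    for (const timing of entry.serverTiming) {
      rows.push({
        url: entry.name,
        metric: timing.name,
        durationMs: timing.duration,
        description: timing.description,
      });
    }
  }
  return rows;
}
```

In the browser this could be called from the observer callback, for example with `list.getEntries() as PerformanceResourceTiming[]`, and the resulting rows batched with `navigator.sendBeacon()`.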

Below is an example of a method whose steps take different amounts of time during the request. Later in this article, we will walk through setting up and measuring the server response times together.

Server response timings for a request endpoint in Browser devtools

This was especially helpful for the Elasticsearch example I shared earlier, as we needed to know which change improved which part of the implementation and whether it moved us closer to better response times. The Server-Timing header does not improve the timings directly, but it helps identify what to work on.

What makes the API unique is that it’s very flexible and lets the server communicate important information to the browser, even beyond server timings. It would be amazing to have it supported in Safari as well, which currently lacks support for the API.

Coming back to how we brought the API response time down to 400 ms, compared to 1.8 s earlier:

These are the two versions of the API. When we measured the ES implementation with server timings after switching to the IDs query on Elasticsearch, most of the time was spent in one function: searchRelationsForEntity.

Server timing indicating huge time taken by method searchRelationsForEntity

Now we knew this had to do with ES, but where exactly? We timed specific steps inside this method.

Measuring further showed that entitySearch.searchEntities was taking up the most time, so we needed a solution that would not have to go through all the ES documents to get the related entities.

This is where changing the design made sense: we moved to a graph table design in DynamoDB, which stores all the relationships as mappings and returns all the related entity IDs, whether they are direct or reverse relations.

Now we could choose whether to search ES with all the IDs we had or simply batch-get the entities from the database, and handle concerns such as pagination (an added advantage of ES) separately.

This DynamoDB-based approach saved us from looking through all the indexed documents for the IDs we needed, and it also took unnecessary load off Elasticsearch, load that would eventually have affected other, more straightforward searches as well.
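The key idea of such a graph table is that each relation is stored once as an edge and can be read in both directions. The in-memory TypeScript sketch below only illustrates that adjacency-list idea: it does not use the DynamoDB API, and the item layout (`pk` for the source entity, `sk` for the related entity, plus a reverse lookup standing in for an inverted secondary index) is an assumption for illustration, not epilot’s actual schema.

```typescript
// One edge item: the source entity (pk) relates to the target entity (sk).
type RelationItem = { pk: string; sk: string };

// In-memory model of an adjacency-list table. In DynamoDB both lookups would
// be keyed queries (table key and inverted index), not the array scans used here;
// the filters only model which items each query would return.
class RelationGraph {
  private items: RelationItem[] = [];

  // Store a relation once, as a single edge item.
  addRelation(fromId: string, toId: string): void {
    this.items.push({ pk: fromId, sk: toId });
  }

  // Direct relations: items whose partition key is the entity.
  directRelations(entityId: string): string[] {
    return this.items.filter((item) => item.pk === entityId).map((item) => item.sk);
  }

  // Reverse relations: items whose sort key is the entity (the inverted index).
  reverseRelations(entityId: string): string[] {
    return this.items.filter((item) => item.sk === entityId).map((item) => item.pk);
  }

  // All related entity IDs in both directions, deduplicated, in insertion order.
  allRelations(entityId: string): string[] {
    return Array.from(new Set(this.directRelations(entityId).concat(this.reverseRelations(entityId))));
  }
}
```

With the related IDs in hand, fetching the entities themselves (batch get or an IDs search) becomes a separate, bounded step.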

This is how the server timings looked with the DynamoDB approach:

Server timing indicating reduced times after switching to Dynamo to fetch entity relations

This was an amazing improvement, clearly visible thanks to having this performance information right in the browser devtools. It also helped us improve the code in general, gaining a few more milliseconds by refactoring and using more performant ways to filter, hydrate, and paginate.

Implementation of Server-Timing as a middleware for AWS Lambda

If you look for libraries that support the Server-Timing header, you will find plenty, but most of them are for Express and Node.js, for example:

server-timing
server-timings etc.

We needed one for the serverless use case that works with AWS Lambda, for example together with middy.

Step 1. Create a `withServerTimings()` middleware for AWS Lambda

import middy from '@middy/core';
import * as Lambda from 'aws-lambda';

export const withServerTimings = () => {

  return {
    // add the Server-Timing header to the response after the handler has run
    after: (request: middy.Request) => {
      const response = request.response as Lambda.APIGatewayProxyStructuredResultV2 | undefined;
      if (!response) {
        return;
      }
      try {
        const headers: unknown[] = [];
        /**
          This is where the headers with the different metrics are saved
          and retrieved from. getServerTimingHeader() turns the recorded
          metrics into the header string (not shown here).
        */
        const timings = getServerTimingHeader(headers);
        response.headers = {
          ...response.headers,
          'Server-Timing': timings,
        };
      } catch (e) {
        console.debug('Middleware Error: Could not record server timings', e);
      }
    },
  };
};

This creates a withServerTimings() middleware that can then be applied like this:

Step 2. Apply the middleware to the request handler

return middy(handler)
  .use(...)
  .use(withCors())
  .use(withServerTimings())
  .use(...)

With this in place, your backend already sends a Server-Timing header from its endpoints.

The next step is to actually time our operations and store the results in a shared timing variable, which can then be passed to a method such as getServerTimingHeader.

Step 3. Implement timing for your methods

// one timer entry; value is filled in (in milliseconds) once the timer ends
type Timing = {
  name: string;
  description: string;
  startTime: [number, number];
  value?: number;
};

const times = new Map<string, Timing>();

// starting the timer on a method
const startTime = (name: string, description?: string) => {
  times.set(name, {
    name,
    description: description || '',
    startTime: process.hrtime(),
  });
};

// ending the timer and recording the duration of the method
// (the description given to startTime is the one that is kept)
const endTime = (name: string, _description?: string) => {
  const timeObj = times.get(name);
  if (!timeObj) {
    console.warn(`No such name ${name}`);
    return undefined;
  }

  // process.hrtime(start) returns the elapsed time as [seconds, nanoseconds]
  const duration = process.hrtime(timeObj.startTime);
  const value = duration[0] * 1e3 + duration[1] * 1e-6;

  timeObj.value = value;
  times.delete(name);
  return timeObj;
};

// finally turning the object returned by endTime into a Server-Timing metric
const toMetric = ({ name, value, description }: Timing) =>
  description
    ? `${name}; dur=${value ?? 0}; desc="${description}"`
    : `${name}; dur=${value ?? 0}`;

This would then be used as:

Step 4. Usage of timers of the methods

startTime('db', 'getRelationsForEntity');
const relations = await getRelationsForEntity({
  ...
});
endTime('db', 'getRelationsForEntity');
  .
  .
  .
startTime('pgn', 'getPaginatedRelations');
const { relations: paginatedRelations, hits } = getPaginatedRelations({
  ...
});

endTime('pgn', 'getPaginatedRelations');
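One thing to watch out for with start/end pairs is an early return or a thrown error between them, which leaves a timer running and the metric missing. A try/finally wrapper avoids that. The sketch below is hypothetical and not part of lambda-server-timing: `ServerTimings`, `timed`, and `header` are illustrative names. It keeps the recorded metrics in a per-invocation object rather than module-level state, because a warm Lambda execution environment reuses module-level variables across invocations, and it uses `performance.now()` (a global in Node.js 16+) instead of `process.hrtime()`.

```typescript
type RecordedMetric = { name: string; dur: number; desc?: string };

// Hypothetical per-invocation recorder: create one per request.
class ServerTimings {
  private readonly metrics: RecordedMetric[] = [];
  private readonly createdAt = performance.now();

  // Runs fn and records its duration, even if it throws or returns early.
  async timed<T>(name: string, desc: string, fn: () => Promise<T> | T): Promise<T> {
    const start = performance.now();
    try {
      return await fn();
    } finally {
      this.metrics.push({ name, desc, dur: performance.now() - start });
    }
  }

  // Builds the header value and appends a "total" metric for the whole invocation.
  header(): string {
    const total = performance.now() - this.createdAt;
    return this.metrics
      .concat([{ name: "total", dur: total }])
      .map((m) =>
        m.desc
          ? `${m.name}; dur=${m.dur.toFixed(2)}; desc="${m.desc}"`
          : `${m.name}; dur=${m.dur.toFixed(2)}`,
      )
      .join(", ");
  }
}
```

A handler would create `const timings = new ServerTimings()` at the start of each invocation, wrap calls as `await timings.timed("db", "getRelationsForEntity", () => getRelationsForEntity(...))`, and set `timings.header()` as the Server-Timing value before returning.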

And there we go: our methods are timed, and the Server-Timing header looks like this:

db; dur=139.99; desc="getRelationsForEntity",flt; dur=36.68; desc="filterEntityListByAccess",
pgn; dur=39.37; desc="getPaginatedRelations",hyd; dur=0.61;
desc="hydrateRelations",ddp; dur=0.130; desc="dedupeRelations",
total; dur=741.53

Visual representation of the server-timing headers

All the timing implementation and setting up the middleware can be a hassle, and should be a one-time job. I created a package out of this implementation called lambda-server-timing:

lambda-server-timing (npmjs.com)
GitHub – NishuGoel/lambda-server-timing (github.com)

It exports:
withServerTimings: the middleware
startTime: starts measuring a function or set of functions
endTime: ends the timing of the function or set of functions and records the metric

import {withServerTimings, startTime, endTime } from 'lambda-server-timing';

Let’s time our functions down!


INP meets Puppeteer

Tsvetan Stoychev — Sun, 31 Dec 2023 23:17:53 +0000

Introduction

You are probably wondering: “What does a user-centric metric like INP have to do with Puppeteer? Isn’t Puppeteer a tool that lives in the synthetic world?” Or: “Why is Tsvetan, who works on mPulse, a Real User Monitoring system, messing with synthetic tools?”

The truth is that mPulse and other Real User Monitoring systems are great at giving us an early warning that something is going wrong, but we still need to “zoom in” to understand the root cause.

In this article I would like to walk you through an approach we developed to help an mPulse customer. We wanted to estimate how a specific change in a JavaScript file would affect the INP score. The customer’s website was very complex and many factors were affecting the INP: a complex DOM, CSS animations, reflows, DOM updates on user interaction, a JavaScript library that does bot and threat detection, and more.

I won’t name the customer, but with the help of my imagination I will wrap this in a nice story.

As a result of our work the customer implemented a small fix for one of the problems. This happened in the middle of November and we noticed an improvement of INP by a few percent globally. It’s not a massive improvement but it’s a move in the right direction.

Together with our customer we identified an easy fix for one of the cases where a reflow was affecting the INP. We waited a few days in order to get over 10 000 page hits for each of the page types (Homepage, Flights Search and Flights Listing) and then we compared the “before vs. after” results.

Below is a table that shows the INP improvement of the top 5 countries for Homepage, Flights Search and Flights Listing.

Page Type	P10	P25	P50	P75	P95
Homepage	6%	12%	10%	8%	4%
Flights Search	7%	10%	11%	12%	9%
Flights Listing	10%	16%	17%	14%	15%

I quickly learned that INP fixes tend to be complex and require a lot of brainstorming and planning from the engineering team that maintains a website.

My prediction for 2024 is that web developers will spend more and more time in the Performance tab of the browser developer tools and will master the art of improving the responsiveness of their websites.

And finally, before I dive in: I hope this article will inspire, and save time for, other web performance geeks who are decoding the mysteries of INP.

The problem of the imaginary customer

Earlier this year we added monitoring of INP to Boomerang JS and we started collecting INP data together with a few customers that were interested. One convenient thing we do in mPulse when we work on such proof of concepts is to start sending data from Boomerang JS to mPulse even before we are ready to display the new data points in the mPulse dashboards. Then with the help of Data Science tools and specially crafted SQL we query the mPulse RAW data and we create statistics and analyze trends. This helps us mostly to understand the data and to get an idea how we should display the data later in the mPulse dashboards.

A customer that sells airplane tickets online (let’s pretend, for the sake of storytelling), and with whom I meet frequently, noticed very high INP values in the CrUX data and asked us whether we could find more details in the RAW mPulse data.

We ran a few reports with our Data Science tools and we found interesting things.

Clearly, there was something going on with the input#datepicker element.

Credits to my colleague James Bricknell for setting up the Data Science tools, helping me to generate the INP chart and brainstorming together with me.

The RAW data/Boomerang beacons

In the collected mPulse RAW data we could find the following:

et.inp – the recorded INP duration.
et.inp.t – a timestamp marking the INP start time.
et.inp.e – CSS selector of the DOM element that a user interacted with.

An example of a RAW Boomerang JS beacon:

{
    ...
    "et.inp": 128,
    "et.inp.t": 8465,
    "et.inp.e": "input#datepicker",
    ...
}
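The kind of question those Data Science reports answered, which element is behind the slow interactions, can be sketched in a few lines of TypeScript over beacons shaped like the one above. This is an illustration only, not mPulse’s SQL or API: `InpBeacon` and `slowInteractionsBySelector` are hypothetical names, and the 200 ms default reflects the Core Web Vitals guidance that an INP at or below 200 ms is considered good.

```typescript
// Only the beacon fields this sketch uses.
type InpBeacon = { "et.inp": number; "et.inp.e": string };

type SelectorStats = { selector: string; beacons: number; slow: number; slowShare: number };

// Groups beacons by the CSS selector of the interacted element and counts how
// many recorded an INP above thresholdMs. Selectors with the most slow
// interactions come first.
function slowInteractionsBySelector(beacons: InpBeacon[], thresholdMs = 200): SelectorStats[] {
  const bySelector = new Map<string, { beacons: number; slow: number }>();
  for (const beacon of beacons) {
    const stats = bySelector.get(beacon["et.inp.e"]) ?? { beacons: 0, slow: 0 };
    stats.beacons += 1;
    if (beacon["et.inp"] > thresholdMs) stats.slow += 1;
    bySelector.set(beacon["et.inp.e"], stats);
  }
  return Array.from(bySelector, ([selector, s]) => ({
    selector,
    beacons: s.beacons,
    slow: s.slow,
    slowShare: s.slow / s.beacons,
  })).sort((a, b) => b.slow - a.slow);
}
```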

Root cause analysis

We knew that the high INP values were caused when a visitor was clicking/tapping on the date picker input:

We could give the new LoAF API a try to further understand why this element was causing an issue, but initially it was more convenient to stick to the tools available in the browser as we were comfortable with those and felt they could identify the issue. Together with our customer we started recording and analyzing Chrome Performance traces.

A lot was going on on the page when the date picker was displayed: a bunch of AJAX calls, tracking pixels sending events, CSS animations, new DOM elements being inserted, and bot detection algorithms running in the background.

The performance traces were really noisy and everything that was happening on the page was adding a few milliseconds here and there to the INP.

We decided to start reducing the noise:

From Chrome DevTools, we blocked a few requests that were loading Google Tag Manager and other telemetry-related scripts.
With Chrome Local Overrides, we commented out a few CSS animation rules.
With Chrome Local Overrides, we commented out a few JS code paths that were causing DOM manipulations and reflows.
We work on modern developer machines, but the customer’s real users were mostly on Android phones. That’s why we artificially slowed down the CPU by 4x, to see clearly in the performance traces what was affecting the INP.

The Pretty Print functionality of Chrome DevTools was really helpful when modifying the CSS and JS files, because they were minified and, of course, had no indentation.

For example, this is what minified source code looks like:

And this is how the same source code looks after applying Pretty Print:

Every time we changed something, we reloaded the page, clicked the date picker input, and wrote down the INP value reported by the official Core Web Vitals Chrome extension.

Let’s measure (manually)

One of the performance bottleneck candidates was the function getBoundingClientRect() which was called on interaction with the date picker element.

A good list of things that cause reflows can be found here: https://gist.github.com/paulirish/5d52fb081b3570c81e3a

First we measured the INP 10 times on the original, unmodified page, and then 10 times on a version where the getBoundingClientRect() call was removed.

Note: For the final fix, our customer’s engineers replaced getBoundingClientRect() with something that doesn’t cause reflows, but we just wanted to explore the maximum possible room for optimization here.

Our setup was:

4x slowed down CPU.
Removed getBoundingClientRect() with the help of Chrome Local Overrides.

We repeated this manual check about 10 times in order to calculate the 75th percentile.*

* Measuring 100 times instead of 10 would of course give us more confidence in the numbers, but in this case 10 runs were good enough.

Puppeteer for the win

We had a few more ideas that we wanted to validate but doing this manually was not convenient anymore. It was time consuming and we just wanted to try new ideas as quickly as possible.

So… why not use Puppeteer and script everything we were already doing manually?

I created and shared a boilerplate Puppeteer project that automates everything we needed: https://github.com/ceckoslab/inp-measure-puppeteer

And here are a few of the challenges that had to be solved:

There was a Cookie Consent popup. We had to bypass it.
We had to slow down the CPU 4 times.
To be on the safe side, we also wanted to simulate the User-Agent and viewport size of a popular Android mobile device.
We had to use an overwritten version of some of the customer’s assets.
We had to simulate interaction with the element in question in order to cause a slow INP.
We had to measure INP.

1. Bypassing the Cookie Consent popup

The customer was using Optanon Consent, and we figured out which cookies are created when a visitor gives consent. We identified 2 cookies, OptanonConsent and OptanonAlertBoxClosed, and created a simple helper function that sets these cookies with Puppeteer:

module.exports = class Cookies {

  constructor() {};

  // TODO: Modify cookie values, domain and path

  static async setOptanonConsent(page) {
    await page.setCookie({
      "name": "OptanonConsent",
      "value": "ADD YOUR COOKIE VALUE HERE",
      "domain": ".example.com",
      "path": "/",
      "secure": true,
      "sameSite": "None" // or 'Strict' or 'Lax'
    });

    await page.setCookie({
      "name": "OptanonAlertBoxClosed",
      "value": "2023-11-07T12:55:33.827Z",
      "domain": ".example.com",
      "path": "/",
      "secure": true,
      "sameSite": "None" // or 'Strict' or 'Lax'
    });
  }

};

2. Slow down the CPU 4 times

This one is self-explanatory; we just had to call the following function:

await page.emulateCPUThrottling(4);

3. Simulating the User-Agent and screen size of a popular Android phone

We created the following helper function:

// Define the device properties for Samsung Galaxy A51
const GalaxyA51 = {
  name: "Galaxy A51",
  userAgent: "Mozilla/5.0 (Linux; Android 10; SAMSUNG SM-A515F) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/12.0 Chrome/79.0.3945.136 Mobile Safari/537.36",
  viewport: {
    width: 1080 / 2,   // Viewport width in CSS pixels: the screen's 1080 physical pixels divided by the device scale factor.
    height: 2400 / 2,  // Viewport height in CSS pixels: the screen's 2400 physical pixels divided by the device scale factor.
    deviceScaleFactor: 2, // The device scale factor.
    isMobile: true,    // Whether the meta viewport tag is set to mobile.
    hasTouch: true,    // Whether the device supports touch events.
    isLandscape: false // Whether the device is in landscape mode.
  }
};

module.exports = class DeviceEmulation {

  constructor() {};

  static async emulate(page) {
    await page.emulate(GalaxyA51);
    await page.emulateCPUThrottling(4);
  }

};

4. Override customer assets

Below is the snippet that allowed us to override the customer’s assets. In this example we are “waiting” for a JavaScript file named example-js-of-interest.js. When such a request was intercepted, we served the content of the local file overrides/example-js-of-interest.js instead of the original file.

const fs = require("fs");
const path = require("path");

// Enable request interception
await page.setRequestInterception(true);

// Add event listener to intercept requests
page.on("request", (interceptedRequest) => {
  // Check if the request is for the resource you want to override
  if (interceptedRequest.url().endsWith("example-js-of-interest.js")) {
    console.log("Intercepted and overriding: " + interceptedRequest.url());

    // Create a response from a local file
    const overrideContent = fs.readFileSync(path.join(__dirname, "overrides", "example-js-of-interest.js"), "utf8");
    interceptedRequest.respond({
      status: 200,
      contentType: "application/javascript; charset=utf-8",
      body: overrideContent
    });
    return;
  }

  // Allow all other requests to continue normally
  interceptedRequest.continue();
});

5. We had to simulate interaction with the element in question in order to measure INP.

const elementToInteractWith = "input#datepicker";

// Wait for the element to be present in the DOM
await page.waitForSelector(elementToInteractWith);

// Click the input and make sure it has focus
await page.click(elementToInteractWith);
await page.focus(elementToInteractWith);

6. Measure the INP

We literally copied and reused the minified code of the Web Vitals library from here: https://unpkg.com/web-vitals@3.5.0/dist/web-vitals.iife.js

Instrumentation:

module.exports = class CWV {

  constructor() {};

  static async attachCWV_Lib(page) {

    await page.evaluateHandle(() => {
      // Including the CWV library
      window.webVitals = function(e){"use strict";var n,t,r,i,o,a=-1,c=function(e){addEventListener("pa..
    });
  }

}

Measure:

 // Execute JS code after the timeout
  await page.evaluateHandle(() => {
    window.webVitals.getINP(function(info) {
      if (info.value) {
        console.log("inp: " + info.value);
      }
      else {
        console.log("inp: not measured");
      }
    },
    {
      reportAllChanges: true
    }
    );
  });

The results

Run	Original INP (ms)	INP without getBoundingClientRect() (ms)
1	264	176
2	240	176
3	248	160
4	240	168
5	240	176
6	248	168
7	232	160
8	232	176
9	224	168
10	224	160
P75	246	176
Great results, I have to say. At the 75th percentile, the INP was reduced by 70 ms, which is about 28.5%.
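The P75 values in the table can be reproduced with the common linear-interpolation definition of a percentile (the one used by Excel’s PERCENTILE.INC and NumPy’s default): sort the samples, take the fractional rank p × (n − 1), and interpolate between the two neighbouring values. For the original runs that is rank 6.75, so 240 + 0.75 × (248 − 240) = 246 ms. A small TypeScript helper, where the function and variable names are chosen for this sketch:

```typescript
// Percentile with linear interpolation between closest ranks
// (same definition as Excel PERCENTILE.INC / NumPy's default "linear" method).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = p * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (rank - lower) * (sorted[upper] - sorted[lower]);
}

// INP values (ms) from the table above.
const original = [264, 240, 248, 240, 240, 248, 232, 232, 224, 224];
const withoutGbcr = [176, 176, 160, 168, 176, 168, 160, 176, 168, 160];

const p75Original = percentile(original, 0.75);
const p75Without = percentile(withoutGbcr, 0.75);
console.log(p75Original, p75Without, p75Original - p75Without); // 246 176 70
```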

It’s important to note that the simulated results don’t completely match the CrUX data for a few reasons:

Real users’ CPUs are sometimes much faster or much slower than the 4x-throttled CPU we used for our experiment.
Also, the suggested INP fix was deployed in the middle of November, and by the end of November the P75 INP had dropped by 25 ms globally. This means the fix was live for only the second half of November. I would expect even better results for a full month, so let’s see what December looks like.
The engineers replaced getBoundingClientRect() with IntersectionObserver, which improved performance. However, in the simulation where we removed getBoundingClientRect(), we also skipped a few DOM manipulations, which skewed the results a bit.
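The article does not show the customer’s final code, so the following is only a sketch of the general pattern with hypothetical names: read an element’s box from an IntersectionObserver entry instead of calling getBoundingClientRect() synchronously in the interaction handler. The observer delivers its first entry asynchronously after `observe()`, using layout the browser has already computed, so it does not force a synchronous reflow. The observer constructor is injectable only so the function can be exercised outside a browser.

```typescript
// Minimal structural types so the sketch also type-checks without the DOM lib.
type RectLike = { top: number; left: number; width: number; height: number };
type EntryLike = { boundingClientRect: RectLike };
type ObserverLike = { observe(target: unknown): void; disconnect(): void };
type ObserverCtor = new (callback: (entries: EntryLike[]) => void) => ObserverLike;

// Resolves with the element's bounding box from the first IntersectionObserver
// notification, then stops observing.
function measureRect(
  element: unknown,
  Observer: ObserverCtor = (globalThis as any).IntersectionObserver,
): Promise<RectLike> {
  return new Promise((resolve) => {
    const observer = new Observer((entries) => {
      observer.disconnect();
      resolve(entries[0].boundingClientRect);
    });
    observer.observe(element);
  });
}
```

In a page this would be used as `measureRect(document.querySelector("input#datepicker")).then((rect) => ...)`, with the positioning work moved into the callback.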

The END

Congratulations! You reached the end of this article. I hope that by reading it you have gotten inspiration and new ideas. I don’t have super powers and I can’t see in the future but something makes me think that 2024 will be the year of the INP and testing whether improvements to address INP issues are having the desired effect. Automating this through the likes of Puppeteer as we have done here, can hopefully help with this.

Once again, you can find the Puppeteer boilerplate project that automates measuring INP here: https://github.com/ceckoslab/inp-measure-puppeteer

And finally, special thanks to Barry Pollard for proofreading this article and making great suggestions, and to my colleague James Bricknell for helping me with the Data Science tools.

Lessons Learned from Building WebPerfDemo

Shane Niebergall — Sun, 31 Dec 2023 21:56:34 +0000

Last year I came across Jeremy Thomas’ demonstration of Web Design in 4 minutes where he guides the reader step-by-step through beautifying a bland web page. It was powerfully effective, and I began thinking how neat it would be if there was a similar walkthrough based on web performance.

WebPerfDemo is the result of that inspiration, but it was no easy feat. Along the way I learned a lot of lessons that may be useful to the webperf community.

Most metrics need a median

When I first started using SpeedCurve years ago, it annoyed me that 3 runs was the minimum number of tests per URL. I had a lot of URLs to test, and tripling that was eating into my monthly allowance of checks.

After building WebPerfDemo, I understand why.

Every time you run a test, there is an absurd amount of variability. One page load may have had a busy network. The next load your CPU was busy in the background. Refresh again and this time the CDN was having a hiccup. This became apparent when I was testing a step of WebPerfDemo that was shaving off tons of page weight, yet sometimes the page load was slower than the previous step. How could a 2 MB page load faster than a 1 MB page? When there’s so much variability from load to load, even on the same page, it became a challenge to prove to the user that we were making an improvement.

The obvious solution is to load a page multiple times and use the median: at least 3 loads, preferably 5. This keeps outliers from skewing the result and reduces the noise, giving you more consistent data to work with.
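A median-of-N helper is only a few lines. This TypeScript sketch (the sample numbers in the usage note are made up for illustration) shows why one slow outlier barely moves the median, whereas it would drag an average up.

```typescript
// Median of several page-load measurements, e.g. LCP in ms from 3 or 5 runs.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: the middle value; even count: the mean of the two middle values.
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

For five hypothetical loads of 1850, 1920, 4100, 1880, and 1900 ms, `median` returns 1900, while the mean would be 2330 because of the single 4100 ms outlier.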

I actually didn’t end up doing this with the demo, as I figured it would be a bit jarring for a page to load 5 times before you understood what was going on. Instead I added another metric, page weight, which was more stable than the core web vitals.

How to scale inefficiencies

Most websites are optimized for efficiency in order to scale. However, I was doing the opposite: I wanted to show a horribly inefficient website to many visitors. How can you do that without killing the server?

Luckily Cloudflare came to my rescue.

Using their Workers, I could easily fake a server-side delay. By serving everything through a Worker proxy, I could decide whether to slow down an HTML response (to simulate a slow backend) or speed up an image (to simulate a CDN). This was accomplished with this snippet:

// Delay the response if necessary
if (msDelay > 0) {
  await new Promise(resolve => setTimeout(resolve, msDelay));
}

However, this solution came with its own hurdles: if you take a look at Cloudflare’s order of operations, you’ll notice that caching is done before Workers. If I were to use Workers, I wouldn’t be able to take advantage of server-side caching, and every user request would hit my measly shared server. Not ideal.

Luckily, domains are cheap, and I realized that if I hosted my content on one domain that is cached and then used the Worker to load that content, I’d be golden. The webperfdemo.com domain is served by the Worker, and I use another domain to host the actual content, which is cached. I wonder if anyone else has had to do this to reorder Cloudflare’s order of operations.
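A minimal sketch of that two-domain setup as a Cloudflare Worker in TypeScript, assuming module syntax: the Worker fetches the page from the content domain, which per the setup above is the cached one, and then adds the artificial delay from the snippet earlier. The `CONTENT_ORIGIN` host name and the `delay` query parameter are hypothetical stand-ins, and the upstream fetch is injectable only so the function can be tried outside Workers (Node.js 18+ also provides `fetch`, `Request`, and `Response`).

```typescript
// Hypothetical domain that hosts the real, cacheable content.
const CONTENT_ORIGIN = "https://content.example.com";

type Fetcher = (input: string, init?: RequestInit) => Promise<Response>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Proxies the request path to the content domain, then delays the response by
// `?delay=<ms>` (a hypothetical parameter) to simulate a slow backend.
async function handle(request: Request, upstreamFetch: Fetcher = fetch): Promise<Response> {
  const url = new URL(request.url);
  const msDelay = Number(url.searchParams.get("delay") ?? 0);

  const upstream = await upstreamFetch(CONTENT_ORIGIN + url.pathname, {
    method: request.method,
    headers: request.headers,
  });

  // Delay the response if necessary
  if (msDelay > 0) {
    await sleep(msDelay);
  }
  return upstream;
}

// Worker entry point (module syntax).
export default {
  fetch: (request: Request) => handle(request),
};
```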

Browsers and Servers are inherently efficient

One of my goals was to show the difference that network compression makes, i.e., load a page with no compression and then load it again with gzip/brotli enabled. Naturally, that should shave 50-60% off the size of the text-based resources.

Yet doing this in the wild proved difficult. I assumed it would be as simple as stripping the ‘accept-encoding’ header from the request, but various layers of the transport kept compressing the result anyway. After some struggle I gave up; if anyone has ideas on how to make this work, I’d love to hear them.

Browsers and servers also want to cache things as much as possible, and rightfully so. However, to make this a true demo, I needed to ensure that each step used a fresh copy of its resources. I first thought about renaming each resource for each step, but that wouldn’t prevent the same step from using the cache. I then explored appending a unique query string to each resource request, but that was messy. I ended up overriding the ‘cache-control’ headers.

Header for most requests, to ensure they aren’t cached:

cache-control: public, no-store

Header for the last step to show repeat-view performance:

cache-control: public, max-age=604800
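In a Worker, a Response returned by `fetch()` has immutable headers, so overriding Cache-Control means copying the response first. A small TypeScript sketch using the two header values above; `withCacheControl` and its `repeatView` flag are hypothetical names for marking the last step.

```typescript
// Returns a copy of `response` with the Cache-Control values described above:
// no caching for the regular steps, one week (604800 s) for the repeat-view step.
function withCacheControl(response: Response, repeatView: boolean): Response {
  // Copy first: the headers of a fetched Response cannot be modified in place.
  const copy = new Response(response.body, response);
  copy.headers.set(
    "cache-control",
    repeatView ? "public, max-age=604800" : "public, no-store",
  );
  return copy;
}
```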

Some Core Web Vitals are ‘lifespan’ metrics

My original plan was to showcase each of the Core Web Vitals in the table of metrics. The Google Chrome team provides a small JavaScript library that makes it easy to grab the web vitals of the page being loaded, so this should have been no trouble.

But when I started playing with it I realized that some of the metrics weren’t being reported. Time to first byte (TTFB) and largest contentful paint (LCP) were reliable, but the others were reluctant to give a value. That’s when I learned about lifespan metrics – those that don’t report upon page load, but keep recording until the page is unloaded because they measure the entire lifespan of the page.

Metrics like cumulative layout shift (CLS) and first input delay (FID) aren’t ones that report upon page load. In fact, this caused so much confusion that some people filed an issue with the web-vitals library.
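Here is a minimal sketch of why CLS is a lifespan metric, following the session-window definition documented on web.dev: shifts less than 1 second apart join the same window, a window spans at most 5 seconds, shifts right after user input are ignored, and the largest window wins. The record shape `{ startTime, value, hadRecentInput }` mirrors fields of the browser’s layout-shift entries, and the numbers are made up.

```javascript
// Simplified layout-shift records: time (ms), score, and whether the shift
// happened right after user input (such shifts are excluded from CLS).
function computeCLS(shifts) {
  let largestWindow = 0;
  let windowValue = 0;
  let windowStart = -Infinity;
  let previousTime = -Infinity;

  for (const { startTime, value, hadRecentInput } of shifts) {
    if (hadRecentInput) continue;
    // Start a new session window after a 1s gap or once the window spans 5s.
    if (startTime - previousTime >= 1000 || startTime - windowStart >= 5000) {
      windowValue = 0;
      windowStart = startTime;
    }
    windowValue += value;
    previousTime = startTime;
    largestWindow = Math.max(largestWindow, windowValue);
  }
  return largestWindow;
}

// Made-up shifts: two small ones during load, a bigger one 20 seconds later.
const duringLoad = [
  { startTime: 100, value: 0.05, hadRecentInput: false },
  { startTime: 300, value: 0.05, hadRecentInput: false },
];
const laterShift = { startTime: 20000, value: 0.2, hadRecentInput: false };

console.log(computeCLS(duringLoad)); // value at load time
console.log(computeCLS([...duringLoad, laterShift])); // grows over the page's lifespan
```

A value read at load time is only a snapshot; the reported CLS can keep changing until the page is hidden or unloaded.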

I tried to mimic some interaction with the page in order to trigger interaction to next paint (INP), but couldn’t find a solution that worked reliably. This is the struggle of lab environments that try to measure these metrics without a real user. Ultimately I gave up and limited the metrics I reported to TTFB, LCP, Page Complete, and Weight, which can all be reported on load.

AI / LLMs are not going to take our jobs

I’m a backend developer – front end is my weakness. So when I needed to come up with a design for our sample page, I decided to use some outside help. In the past I’ve bought templates or hired a designer, but with the advances in artificial intelligence and large language models, I decided to give them a try.

ChatGPT was my tool of choice, and I was impressed with its rough draft of a typical web page. But when I needed to make any alterations, it was evident that it was no pro. Many times I had to edit its proposed HTML, CSS, and JavaScript to accomplish what I was looking for. It got me 80% of the way there, but it was up to me to finish the last 20%.

Fortunately, that last 20% is the hardest part – which means that ChatGPT is not going to render web developer jobs obsolete anytime soon. Tweaking the details of a page until it is exactly what the user is looking for requires some finesse that the LLMs simply don’t have yet.

Instead, these advances should be looked at as tools that will assist us, not replace us. The sooner we accept and adopt these tools, the better prepared we will be for the future.

Tip of the Iceberg

I hope that the demo I’ve created is helpful to some, and if it is well-received I’ll plan on updating it every year. Web performance is a field that literally changes every day, and what I’ve demonstrated in my example is just the tip of the iceberg.

Thanks for reading, and thanks for doing your part in making the web a faster, and better, place.

Case Sensitive URLs

Robert Boedigheimer — Sun, 31 Dec 2023 21:44:06 +0000

Web developers often spend a lot of time and effort optimizing their web pages to perform better. One of my favorite optimizations is properly setting content expirations to specify how long the client should consider the content to be “fresh”. Here are some good references on caching: MDN and RFC 9111. These techniques allow the browser to reuse content from its local cache or a shared network cache, which is often faster and reduces load on the origin server. An important thing to remember is that browsers treat the path portion of URLs as case sensitive (most web servers do as well; Microsoft IIS does not).

Mixed Case URLs

Here is an example web page that demonstrates some variations in image URL case.
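The caching impact can be sketched without a browser. HTTP caches key responses by URL, and while the WHATWG URL parser lowercases the scheme and host, it preserves the case of the path. The URLs and the `toLowerCasePath` helper below are hypothetical, and lowercasing a path is only safe if the server really serves the same resource at the lower case URL.

```javascript
// Hypothetical references to the same image with different casing.
const imageUrls = [
  "https://Example.com/images/logo.png",
  "https://example.com/Images/Logo.png",
  "https://example.com/images/logo.png",
];

// Count distinct cache keys; parsing normalizes scheme and host only.
function distinctCacheKeys(urls) {
  return new Set(urls.map((u) => new URL(u).href)).size;
}

// Enforce an "all lower case" convention on the path.
function toLowerCasePath(u) {
  const url = new URL(u);
  url.pathname = url.pathname.toLowerCase();
  return url.href;
}

console.log(distinctCacheKeys(imageUrls)); // 2: "Images/Logo.png" is a separate download
console.log(distinctCacheKeys(imageUrls.map(toLowerCasePath))); // 1
```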

Impact on Caching

Use your favorite tracing tool to see the additional downloads due to mixed case URLs.

Fiddler Classic (Windows only)
Fiddler Everywhere (Windows, Mac, Linux)
Browser devtools (F12)

Conclusion

Caching content is a great performance improvement, and an easy way to ensure that browsers are not redownloading the same resource is to standardize how URLs are cased. Production web pages probably don’t include the same resource multiple times on the same page, but without rules, developers commonly use their own style for URLs across a website. To avoid that, I have adopted all lower case URLs. Don’t put in all the effort to cache resources, only to have them go unused due to differently cased URLs.

Measuring, monitoring and optimizing TTFB with Server timing

Vinicius Dallacqua — Sun, 24 Dec 2023 23:07:31 +0000

Sometimes performance opportunities are easy to find and are among the usual suspects: image size; uncompressed assets; bundle size; preconnect; prefetch; etc. But sometimes the root cause is harder to find, and it may lie on the other side of the server-client boundary. How can you identify and address those opportunities, and what does the platform provide to help you break down your network graph from the other side of the wire?

Why is my TTFB so slow?

When breaking down slow response times, you may find yourself staring at a black box, with TTFB as one of the main culprits for your loading time problems, or in some cases for slow responsiveness in interactions that fetch data after load.

Some of the teams I work with faced that very problem: our backend responses were taking too long. Whilst the application in question uses a BFF (Backend For Frontend), most of the interactions on that BFF go through other internal services, Graphs and APIs, each of which could be a candidate for the problem, alongside cache misses and other external factors. And since those are owned by separate teams and are separate dependencies, how can you ensure you are reaching out to the right team about the right problem?

Using the platform to help understand TTFB

The platform has two great tools to help us break down and understand what happens across the network boundary and identify problems that might be hindering your TTFB.

`Performance` interface and high resolution timestamp

The Performance interface and its performance.now() method are key pieces for accurately measuring function execution time in both the browser and Node.js, which helps us time our operations with precision. MDN has a great section on why high resolution time is important and how it differs from Date.now(). In summary, performance.now() returns a high resolution timestamp that we can use to mark the start and end of our function calls and measure the total execution time of our server operations.
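A minimal sketch of timing a synchronous operation with performance.now(), which is available as a global in browsers and in Node.js 16+. `measureSync` is a hypothetical helper name; note that browsers may coarsen the timer’s resolution for security reasons.

```javascript
// Time a synchronous operation with performance.now(), which returns a
// monotonic timestamp in milliseconds with a fractional part.
function measureSync(fn) {
  const start = performance.now();
  const result = fn();
  const durationMs = performance.now() - start;
  return { result, durationMs };
}

const { result, durationMs } = measureSync(() => {
  let total = 0;
  for (let i = 0; i < 1_000_000; i++) total += i;
  return total;
});

// Date.now(), by contrast, returns whole milliseconds of wall-clock time.
console.log(`sum=${result} in ${durationMs.toFixed(3)}ms (Date.now(): ${Date.now()})`);
```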

`Server-Timing` HTTP header

The Server-Timing header is part of the Performance API, and it allows us to communicate our server metrics to the browser’s developer tooling when investigating our network requests. It lets us send one or more metrics describing the server-response part of the network request, or TTFB from the web-vitals perspective.

Basic Format

The Server-Timing header is composed of a comma-separated list of metrics. Each metric in the list can have up to three components, separated by semicolons:

Metric Name: A token representing the name of the metric.
Duration (optional): Represented by dur=<value>. A numeric value representing the time taken, typically in milliseconds.
Description (optional): Represented by desc="<value>". A human-readable description of the metric.

Server-Timing: name;dur=duration;desc="description"

Here’s an example of a Server-Timing header with multiple metrics:

Server-Timing: db;dur=0.53, cache;dur=0.15;desc="Cache readout", fs;dur=0.800;desc="File System read"
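To make the format concrete, here is a small sketch that parses a header value like the one above into name, duration, and description fields, defaulting to 0 and an empty string the way PerformanceServerTiming entries do. `parseServerTiming` is a hypothetical helper, and it does not handle backslash-escaped quotes inside descriptions, so treat it as an illustration of the grammar rather than a spec-complete parser.

```javascript
// Split `text` on `sep`, ignoring separators inside double-quoted strings.
function splitOutsideQuotes(text, sep) {
  const parts = [];
  let current = "";
  let inQuotes = false;
  for (const ch of text) {
    if (ch === '"') inQuotes = !inQuotes;
    if (ch === sep && !inQuotes) {
      parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

// Parse a Server-Timing header value into { name, duration, description } entries.
function parseServerTiming(header) {
  return splitOutsideQuotes(header, ",").map((metric) => {
    const [name, ...params] = splitOutsideQuotes(metric, ";").map((p) => p.trim());
    const entry = { name, duration: 0, description: "" };
    for (const param of params) {
      const eq = param.indexOf("=");
      if (eq === -1) continue;
      const key = param.slice(0, eq).trim().toLowerCase();
      const value = param.slice(eq + 1).trim().replace(/^"(.*)"$/, "$1");
      if (key === "dur") entry.duration = Number(value);
      else if (key === "desc") entry.description = value;
    }
    return entry;
  });
}

const metrics = parseServerTiming(
  'db;dur=0.53, cache;dur=0.15;desc="Cache readout", fs;dur=0.800;desc="File System read"'
);
console.log(metrics[1]); // { name: 'cache', duration: 0.15, description: 'Cache readout' }
```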

Besides the Network tab in developer tools, your Server-Timing metrics can also be accessed via the serverTiming property of the PerformanceResourceTiming interface, as PerformanceServerTiming entries.

To access those entries you can use a PerformanceObserver as shown below:

const observer = new PerformanceObserver((list) => {
  list.getEntries().forEach((entry) => {
    entry.serverTiming.forEach((serverEntry) => {
      console.log(
        `${serverEntry.name} (${serverEntry.description}) duration: ${serverEntry.duration}`
      );
      // Logs "cache (Cache Read) duration: 23.2"
      // Logs "db () duration: 53"
      // Logs "app () duration: 47.2"
    });
  });
});

["navigation", "resource"].forEach((type) =>
  observer.observe({ type, buffered: true })
);

Putting the pieces together

Let’s create a simple function to time how long different parts of the backend request take and store the results, to later convert them to the format accepted by the Server-Timing header.

// server.js
import { performance } from "node:perf_hooks"; // also available as a global in Node.js 16+

// Time the execution of `fn` and store the result under `timings`
async function time(
  fn = async () => "noop",
  { name = "some_request", description, timings = [] } = {}
) {
  const timerStart = performance.now();
  // `fn` can be a function returning a (promise of a) value, or a promise itself
  const promise = typeof fn === "function" ? fn() : fn;

  const result = await promise;

  const totalTime = performance.now() - timerStart;
  timings.push({ name, description, duration: totalTime });
  /**
   *    Example values stored in `timings` array:
   *    [
   *        { name: "db_read", description: "find user in DB", duration: 0.123 },
   *        { name: "some_request", description: undefined, duration: 0.456 },
   *    ]
   */
  return result;
}

// Simple conversion util to turn the data in the timings array into a string
// matching the accepted format for the Server-Timing header
function timingStringValue(timings = []) {
  return timings
    .map(
      ({ name, description, duration }) =>
        `${name};${description ? `desc="${description}";` : ""}dur=${duration}`
    )
    .join(", ");
}

// Simple Node.js route handler example; `params` holds the route parameters
// and `myORM` is a placeholder for your data layer
export async function routeHandler({ params }) {
  const timings = [];
  const userId = params?.userId;
  const user = userId
    ? await time(() => myORM.users.query({ id: userId }), {
        timings,
        name: "db_read",
        description: "find user in DB",
      })
    : null;

  return new Response(JSON.stringify(user), {
    headers: { "Server-Timing": timingStringValue(timings) },
  });
}

The `time` function breakdown

This asynchronous function is used to measure the time it takes to execute a given server operation (fn). It has two parameters:

fn: The operation function or a promise. If a function is provided, it should return a promise to be awaited.
An options object containing properties representing the different options for the Server-Timing header and a timings array to store the different timings as we collect them:
- name: A string indicating the type of operation
- description: A description of the operation.
- timings: An array to store timing data for the different operations and later use to send as data for the Server-Timing header.

Now this function is but a simple example with its own shortcomings such as:

The timings array is mutated in place, which makes it harder to store timings for operations on other layers of the application.
The name property is not sanitized, even though the Server-Timing header does not accept metric names containing spaces.
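The second shortcoming can be addressed with a little sanitizing. The sketch below, with hypothetical helper names, replaces any character that is not valid in an HTTP token (the grammar Server-Timing metric names use) with an underscore, and escapes quotes and backslashes inside the description’s quoted string.

```javascript
// Replace characters that are not allowed in an HTTP token (RFC 9110 tchar),
// so names like "db read" become "db_read".
function toServerTimingToken(name) {
  return String(name).replace(/[^!#$%&'*+\-.^_`|~0-9A-Za-z]/g, "_") || "unnamed";
}

// Escape backslashes and double quotes for use inside a quoted-string.
function escapeQuotedString(text) {
  return String(text).replace(/["\\]/g, "\\$&");
}

// Format one { name, description, duration } timing as a Server-Timing metric.
function formatMetric({ name, description, duration }) {
  let metric = toServerTimingToken(name);
  if (description) metric += `;desc="${escapeQuotedString(description)}"`;
  if (typeof duration === "number") metric += `;dur=${duration.toFixed(2)}`;
  return metric;
}

const header = [
  { name: "db read", description: 'find "user" in DB', duration: 1.5 },
  { name: "cache", duration: 0.25 },
]
  .map(formatMetric)
  .join(", ");
console.log(header);
```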

For that reason I recommend using or building a helper such as timing.server.ts from Kent C. Dodds’ EpicStack. EpicStack is a Remix community stack, but the code in its timing.server.ts utils is universal and can be used on any Node.js backend to quickly add server timing to your responses.

Let’s visualize how our code above would look using such a util lib to abstract away managing the timings.

import { makeTimings, time } from "./utils/timing.server.ts";

export async function routeHandler({ params }) {
  const timings = makeTimings("root loader");
  const userId = params?.userId;
  const user = userId
    ? await time(() => myORM.users.query({ id: userId }), {
        timings,
        // Epic Stack's helper names these options `type` and `desc`
        type: "db read",
        desc: "find user in DB",
      })
    : null;

  return new Response(JSON.stringify(user), {
    // timings.toString() serializes everything collected into the header format
    headers: { "Server-Timing": timings.toString() },
  });
}

The image below shows how a request in the Network tab displays the breakdown of our TTFB in a Server Timing section when we click on it and navigate to the Timing tab.

Server timings in the wild

Choose your backend flavour

Since Server-Timing is an HTTP header, its usage is not restricted to Node.js backends. In fact, you can use it with Laravel as a middleware, with Rails as a gem, or with your favourite backend language and framework. As long as you provide the correct header and metrics format, the browser APIs and developer tools will be able to capture them.

RUM tools

You can use third-party RUM tools to capture and monitor your Server-Timing data in the wild and get a better understanding of your TTFB from your users’ perspective. Tools like RUMvision and SpeedCurve provide great support out of the box for your monitoring needs.

Security and privacy considerations

It’s important to note, however, that you should avoid exposing sensitive backend information through the Server-Timing header. See the Privacy and Security Considerations section for Server-Timing on MDN for more details.
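One possible approach, sketched here under the assumption that your server can tell trusted internal requests apart (how it does so is up to you), is to send the detailed breakdown only to those requests and a single coarse total to everyone else. `serverTimingFor` and its `internal` option are hypothetical.

```javascript
// timings: [{ name, duration }] collected while handling the request.
function serverTimingFor(timings, { internal = false } = {}) {
  if (internal) {
    // Full breakdown for trusted, internal requests only.
    return timings.map(({ name, duration }) => `${name};dur=${duration.toFixed(1)}`).join(", ");
  }
  // Everyone else sees a single aggregate that reveals less about internals.
  const total = timings.reduce((sum, { duration }) => sum + duration, 0);
  return `total;dur=${total.toFixed(1)}`;
}

const timings = [
  { name: "db", duration: 12 },
  { name: "cache", duration: 3.5 },
];
console.log(serverTimingFor(timings)); // total;dur=15.5
console.log(serverTimingFor(timings, { internal: true })); // db;dur=12.0, cache;dur=3.5
```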

Analysis of INP performance using real-world RUMvision data

Rick Viscomi — Sun, 24 Dec 2023 02:36:20 +0000

If you’re reading this, you probably already know that Interaction to Next Paint (INP) will become the new Core Web Vital metric for responsiveness in March 2024. INP has already been talked about a few times in this year’s Perf Calendar by Brian, Ivailo, and Sander, so if you need a refresher on what it is and how to optimize it, I’d definitely recommend checking out their articles first.

This article will explore something similar to Brian’s post, in which he used RUM data to understand how INP performs on different page types across thousands of websites. For this post I’d like to take it a step further and look at not only how INP performs, but also what some of the common characteristics of slow INP interactions are—using attribution data from real user experiences.

No, I won’t be using HTTP Archive or the Chrome UX Report datasets (much). Instead, I’ve partnered with RUMvision to pore over some of the INP data they’ve collected in the field. So for the rest of this intro, I’ve asked the good folks at RUMvision to talk about their company and the dataset that they provided. All credit for the RUMvision section that follows goes to Erwin Hofman, Jordy Scholing, and Karlijn Löwik.

About RUMvision and its dataset

RUMvision is a real-time page speed and UX monitoring solution (real user monitoring) built on top of Google’s web-vitals library. As a relatively new, and therefore flexible, RUM solution, it has reported INP per template, per device memory, per element, and as an INP breakdown since its launch in early 2022.

However, the Chrome development team acknowledged that INP insights were lacking, specifically about which scripts, whether first-party or third-party, were causing a poor INP. As a result, when the LoAF API origin trial was announced, RUMvision quickly recognized the added value. The RUMvision team began working on the experimental LoAF API in July 2023, and it was made available to customers who requested it in August 2023. As the first RUM solution to use this new method of tracking INP, there is now a large dataset to dive into based on real user experiences. And it’s a game-changer!

What exact RUM data was used for this analysis?

Rick Viscomi (Google) and Erwin Hofman (RUMvision) discussed publishing the results of real user monitoring data at the performance.now() conference. As a result, RUMvision granted Google access to its records—specifically, aggregated performance data from LoAF-enabled user experiences—to assist in the writing of this article. No data on any individual, unaggregated user experiences was shared.

This article will look at data from October 1st, 2023 to December 1st, 2023. The dataset comprises 6,363,644 pageviews, with 3,540,750 on mobile, 2,687,336 on desktop, and 135,558 on tablet.

What is LoAF?

Long Animation Frames (LoAF) are a performance timeline entry for diagnosing sluggishness and poor INP. You can observe or query when work and rendering block the main thread, and which scripts were the potential culprits. LoAF, a revamp of Long Tasks, aims to help with this. A LoAF is an indication that the browser was congested at a specific point in time, such that it took a long time from the start of a task until updating the rendering (or until it was clear that no render was necessary).

Because busy, “LoAF-heavy” sequences can potentially cause delayed response to interactions, and the LoAF entries themselves contain information about what was blocking, e.g. long scripts or layout, LoAF has the potential to become a powerful tool, allowing RUM solutions like RUMvision to diagnose these types of performance issues.

“Have you had a look at LoAF yet? A great way to track down the long-running JavaScript issues on your site in the field.” – Barry Pollard on LinkedIn

How RUMvision collects LoAF data

For a website or webshop owner, configuring an origin trial can feel difficult. However, if you’re using RUMvision it’s really simple: the RUMvision JavaScript snippet can be configured to enable the trial for all Chromium visitors, simply by turning it on in the settings.

As a performance monitoring solution, RUMvision is careful not to contribute to unnecessary poor page performance, which would be ironic. Therefore, it doesn’t track all LoAF entries. Instead, it focuses on those LoAFs that occur during user interaction, or to be more precise, a user interaction that resulted in a long LoAF and caused web vitals to report INP.

Here’s how it works:

Within a LoAF, there could have been multiple tasks batched together, even those caused by different JavaScript sources.
Within a LoAF that happened during an INP interaction, RUMvision only grabs the single script with the longest duration.
The information is then categorized into delay, compile, and execution time for each script.
RUMvision uses the execution time, which is often the highest of the three, to represent the pure JavaScript execution time in its Third Parties dashboard.
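The selection step above could look roughly like this. The record shape `{ sourceURL, duration, delay, compile, execution }` is a simplified, hypothetical stand-in for the script details a LoAF exposes, not the exact API fields or RUMvision’s implementation.

```javascript
// Simplified, hypothetical script records for one long animation frame:
// duration = total time attributed to the script, split into delay,
// compile and execution (all in ms).
function longestScriptExecution(scripts) {
  if (scripts.length === 0) return null;
  // Keep only the single script with the longest duration.
  const longest = scripts.reduce((a, b) => (b.duration > a.duration ? b : a));
  // Report its pure JavaScript execution time.
  return { sourceURL: longest.sourceURL, executionMs: longest.execution };
}

const frameScripts = [
  { sourceURL: "https://example.com/app.js", duration: 80, delay: 5, compile: 10, execution: 65 },
  { sourceURL: "https://thirdparty.example/tag.js", duration: 140, delay: 20, compile: 15, execution: 105 },
];
console.log(longestScriptExecution(frameScripts));
```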

All other distilled information gathered is still available in RUMvision’s technical dashboards.

Overview of a third party impact dashboard within RUMvision

In terms of reporting, RUMvision also tracks other LoAF information. However, when it comes to identifying the exact third-party vendor and JavaScript file, there’s a balance. This balance is between not collecting more data than necessary and still providing actionable information.

A JavaScript file needs to have a certain number of occurrences before it’s included in the INP impact scores that RUMvision shows. This threshold ensures that when a file is identified as impacting INP, the correlation is strong and reliable. This information is crucial in determining whether INP issues are mainly caused by third-party or first-party JavaScript and identifying specific third-party scripts to focus on.

This data collection strategy by RUMvision aids site owners and other stakeholders in understanding performance issues without getting overwhelmed by the technicalities in Chrome’s DevTools. However, there comes a point where developers need to step in and act on this data, which could involve further investigations or making changes to the website’s JavaScript.

With the cases published so far, RUMvision has demonstrated success in this approach, enabling detailed analysis and drawing broader conclusions about web performance.

Analysis

Thank you RUMvision! Now let’s look at the data.

To start, we’ll try to answer some basic questions about INP performance in the field. Later, we’ll dig deeper into some of the more advanced diagnostic metrics to try to understand the common reasons why INP performance can be slow.

How does INP perform?

We already have an idea of how INP performs, thanks to public data from the Chrome UX Report (CrUX). HTTP Archive tracks the percentages of origins having good and poor INP. An origin will be assessed as “good” if at least 75% of all experiences across all pages are less than 200ms—keeping in mind that INP only represents one of the slowest out of many interactions on a page.
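The assessment rule can be sketched as a function over a list of INP values in milliseconds, using the published thresholds (good at or below 200ms, poor above 500ms). The nearest-rank percentile used here is one common definition; other tools may use a different one, so exact values can differ slightly. All function names are hypothetical.

```javascript
// INP thresholds for Core Web Vitals: good <= 200ms, poor > 500ms.
const GOOD_MS = 200;
const POOR_MS = 500;

function rateInp(ms) {
  if (ms <= GOOD_MS) return "good";
  if (ms <= POOR_MS) return "needs-improvement";
  return "poor";
}

// Nearest-rank percentile (one of several common definitions).
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// An origin is "good" when its 75th percentile INP is within the good threshold.
function assessOrigin(inpValues) {
  return rateInp(percentile(inpValues, 75));
}

// Share of experiences in each rating bucket.
function ratingShares(inpValues) {
  const shares = { good: 0, "needs-improvement": 0, poor: 0 };
  for (const ms of inpValues) shares[rateInp(ms)] += 1 / inpValues.length;
  return shares;
}

const sample = [50, 80, 120, 150, 180, 190, 250, 600];
console.log(assessOrigin(sample)); // good
console.log(ratingShares(sample));
```

Note that the sample still contains a poor experience; the origin passes because at least 75% of experiences are good.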

Distribution of origins having good/poor INP (Sources: CrUX, HTTP Archive)

As of the November 2023 dataset we’re seeing 65.5% of origins having good INP on mobile and 3.7% with poor INP. On desktop devices, the same figures are 96.7% and 0.6%. It’s clear that there’s a significant challenge with INP on mobile devices.

One limitation of the CrUX dataset is that it doesn’t aggregate user experiences together across origins. We could try to use the coarse ranking information about each origin to try to better approximate the navigation-weighted percentage of experiences having good/poor INP—but that’s exactly what the RUMvision dataset can tell us!

Histogram of all navigations’ INP values (Source: RUMvision)

The INP histogram data from RUMvision shows us how fast individual INP experiences are, combined across desktop and mobile. Coincidentally, the percentage of navigations under the good threshold is exactly the same as the percentage of origins having good INP on mobile: 65.5%.

Because we know that INP tends to be especially bad on mobile, let’s split out these navigations by device type and group them by INP score.

Distribution of navigations having good/poor INP (Source: RUMvision)

According to the RUMvision data, only 51% of navigations on mobile have a good INP while 18% are poor. The situation is much better on desktop: 86% of experiences are good, and 5% are poor. To put it a different way, if you’re an average mobile user, your chances of experiencing a good INP are about 50/50.

While there will always be differences between CrUX and RUM datasets, it’s encouraging to see the RUMvision data corroborating the trends in CrUX where sites tend to struggle most with INP on mobile.

Let’s keep digging!

When does the interaction responsible for a page’s INP occur?

Now that we know how INP tends to perform, we can start to use RUMvision’s diagnostics to help explain poor INP values. One useful diagnostic is the time when the interaction responsible for the page’s INP occurs.

Distribution of the time at which INP occurs (Source: RUMvision)

For both desktop and mobile, there is almost a 50/50 split between interactions causing INP before and after the 10 second mark into the visit. More specifically, for INP on desktop, the interactions occur within the first 10 seconds 48% of the time, and after the 10 second mark 52% of the time. For mobile, it’s 42% and 58%.

10 seconds is also when we can generously expect most pages to have already finished loading. During that time there may be a lot of main thread work to initialize the page, like parsing markup and styles, compiling and executing JavaScript, and laying out the page. Beyond the 10 second mark, though, you wouldn’t expect the page to be busy doing much of any of that work at all. So it stands to reason that when an interaction is counted as the INP after 10 seconds, it would be less affected by main thread contention—and if there’s less contention, INP should be faster, right?

That’s not quite what we see if we dig a bit deeper into the data. When we look at how INP tends to perform when the interaction is before or after the 10-second mark, the earlier interactions have slightly better performance in aggregate. One possible explanation is that when a user stays on a page for longer, they also tend to interact with it more, so there are more chances to incur a slow experience.

Another clue may be in the long tail of the distribution. 15% of desktop and 17% of mobile navigations have their INP interaction occurring after 60 seconds. Sure, maybe some users are sticking around and engaging with pages after a minute, but another theory is that this segment is influenced at least in part by SPAs. As Erwin noted in an earlier post, INP can grow over the lifetime of a user’s session on an SPA. There is an experimental web platform API to correct for that, but RUMvision wasn’t using it in production at the time this data was collected. So it’s plausible that some of these late INP interaction times are the side-effects of soft navigations.

Where is the interaction time spent?

We now know that about half of all INP interactions on mobile are slower than 200ms—but where is that time being spent? The RUMvision dataset also includes additional diagnostic data to break down the INP time into three phases: input delay, processing time, and presentation delay.

Independent distributions of INP breakdown metrics on mobile (Source: RUMvision)

The median value for each breakdown metric is 25ms of input delay, 39ms of processing time, and 53ms of presentation delay. We can’t exactly add these up and say that the median INP value is 25+39+53 = 117ms since each of these are independent distributions, but it does give us an idea of how much time is spent in each submetric.

The relative performance of these phases varies widely between the lower and upper percentiles. In the 10th and 25th percentiles, processing time is non-existent, meaning that at least 25% of INP values on mobile are the result of unhandled interactions. However, in the upper percentiles, processing time grows out of control very quickly: by 3.3x from the 50th to 75th percentiles and another 2.7x from the 75th to 90th percentiles, from 39ms to 128ms and 344ms respectively.

Similarly, input delay is another sleeper metric in the sense that it’s still relatively under control up through the 75th percentile, taking no longer than 61ms. But at the 90th percentile, we can say that 10% of INP values on mobile have an input delay of at least 238ms, blowing past the threshold for a good interaction. One possible explanation is that interactions with long processing times may be contributing to subsequent interactions’ input delays. For example, if a text input handler is not properly debounced, interactions may stack up each time a key is pressed.
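A common way to keep handlers from stacking up is debouncing: the expensive work only runs once input pauses for a moment. This is a minimal, generic sketch; `debounce` and the 20ms wait are illustrative. The browser still renders each keystroke in a native text input right away; only the expensive JavaScript work is deferred.

```javascript
// Run `fn` only after `waitMs` have passed without another call.
function debounce(fn, waitMs) {
  let timerId;
  return (...args) => {
    clearTimeout(timerId);
    timerId = setTimeout(() => fn(...args), waitMs);
  };
}

// Simulate three quick keystrokes: only the last value triggers the work.
const handled = [];
const onInput = debounce((value) => handled.push(value), 20);
["h", "he", "hel"].forEach((value) => onInput(value));
setTimeout(() => console.log(handled), 100); // [ 'hel' ]
```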

Let’s make one adjustment to our analysis to only look at the distributions of INP breakdown metrics on mobile when the INP is slower than the “good” threshold of 200ms.

Independent distributions of INP breakdown metrics on mobile with slow INP (Source: RUMvision)

At first glance, it looks odd that some percentiles’ breakdown metrics don’t sum up to at least 200ms, but remember that these are independent distributions.

Now we can more clearly see that processing time is the dominant phase of problematic INP interactions. Half of them have a processing time of at least 120ms, which leaves very little time left over for any input or presentation delays.

What’s also interesting about these results is that even when INP is too slow to be considered good, no single breakdown metric stands out as being consistently slow across all percentiles. Each one always has at least 50% of navigations in which it’s not solely responsible for exceeding the 200ms threshold, but they all exceed it at some point. Input delay is longer than 200ms at the 90th percentile, processing time at the 75th percentile, and presentation delay at the 90th percentile. In other words, for each breakdown metric, at least 10% of navigations with slow INP have that metric alone exceeding the 200ms threshold.

These results demonstrate the importance of task management strategies like yielding and deferring code execution to help drive down interaction delays and processing times on mobile. These results also show that presentation delay is solely responsible for slow INP performance at least 10% of the time, so rendering optimization techniques like reducing the DOM size and rendering less content on the client are still important.
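Yielding can be sketched as breaking a long loop into chunks and handing control back to the browser between them. This sketch prefers scheduler.yield() when it exists (it was still experimental in Chrome at the time of this analysis) and falls back to a setTimeout(0) promise; the function names and the 50ms budget are assumptions, not a prescribed API.

```javascript
// Hand control back to the event loop so pending input and rendering can run.
function yieldToMain() {
  if (globalThis.scheduler?.yield) return globalThis.scheduler.yield();
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Process items in chunks, yielding whenever the current chunk has run
// longer than `budgetMs`, so no single task blocks the main thread for long.
async function processWithYielding(items, handleItem, budgetMs = 50) {
  let chunkStart = performance.now();
  for (const item of items) {
    handleItem(item);
    if (performance.now() - chunkStart > budgetMs) {
      await yieldToMain();
      chunkStart = performance.now();
    }
  }
}

let processed = 0;
processWithYielding(Array.from({ length: 10_000 }, (_, i) => i), () => {
  processed += 1;
}).then(() => console.log(`processed ${processed} items`));
```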

Does it matter how good your device is?

One of the most fundamental tensions in computer science is the space-time tradeoff, which describes how algorithms with limited memory usage will take longer to perform a given task compared to those that use more memory. Low-end mobile devices inherently have relatively small memory capacities, so by this principle, we’d expect to see degraded INP performance.

Distribution of mobile INP performance by device memory (Source: RUMvision)

These results show that devices with 8 and 4 GB of memory make up over 90% of the dataset, 2 GB devices make up 8%, and 1 GB devices make up less than 1%. Given the massive differences in sample sizes, it’s a bit hard to compare their INP performance. So let’s normalize it.

Normalized distributions of mobile INP performance by device memory (Source: RUMvision)

This is the same underlying data, but we’re just stretching each row proportionally. So for example, we can say that 55% of navigations on 1 GB mobile devices have a poor INP. And sure enough, as the memory class increases, the relative percentage of navigations with poor INP decreases. Devices with the most memory, capped by the API to 8 GB, only have 11% of navigations with poor INP.
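Normalizing a table like this just divides each row’s counts by that row’s total. The sketch below uses a hypothetical `normalizeRows` helper and made-up counts, not the RUMvision data.

```javascript
// Divide each row's counts by that row's total so groups with very different
// sample sizes can be compared as proportions.
function normalizeRows(countsByGroup) {
  const normalized = {};
  for (const [group, counts] of Object.entries(countsByGroup)) {
    const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
    normalized[group] = Object.fromEntries(
      Object.entries(counts).map(([bucket, n]) => [bucket, total ? n / total : 0])
    );
  }
  return normalized;
}

// Made-up counts for illustration only.
const counts = {
  "1 GB": { good: 2, needsImprovement: 3, poor: 5 },
  "8 GB": { good: 600, needsImprovement: 300, poor: 100 },
};
console.log(normalizeRows(counts)["1 GB"].poor); // 0.5
console.log(normalizeRows(counts)["8 GB"].poor); // 0.1
```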

The great thing about the open web is that a public website can be visited by anyone on just about any device. But to provide users on low-end devices with great experiences, it’s clear from the data that there’s a lot more we can do. The good news is that the INP optimization techniques that benefit users on low-end devices will also (to some degree) benefit those on high-end devices as well.

Who are the common culprits?

The last thing we’ll look at is script attribution from LoAF. The LoAF API gives us a way to identify animation frames that take longer than 50ms to render and see where that time is spent, down to the scripts that execute and who is responsible for them. Using this data, we can aggregate the most popular script hosts to see which ones tend to have the poorest INP performance. “Most popular” here refers to all hosts that make up at least 0.1% of the total navigations.
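The aggregation can be sketched as follows, assuming hypothetical per-navigation records of the form `{ host, inpMs }`, where `host` is the host of the script LoAF attributed the INP interaction to. Hosts below the popularity threshold (0.1% of navigations, as in this analysis) are dropped, and the rest are sorted by their share of poor (above 500ms) INP.

```javascript
// navigations: hypothetical records { host, inpMs }, one per navigation.
function rankHostsByPoorInp(navigations, minShare = 0.001) {
  const byHost = new Map();
  for (const { host, inpMs } of navigations) {
    const stats = byHost.get(host) ?? { host, total: 0, poor: 0 };
    stats.total += 1;
    if (inpMs > 500) stats.poor += 1; // "poor" INP is above 500ms
    byHost.set(host, stats);
  }
  return [...byHost.values()]
    .filter((s) => s.total / navigations.length >= minShare) // popularity threshold
    .map((s) => ({ ...s, poorShare: s.poor / s.total }))
    .sort((a, b) => b.poorShare - a.poorShare); // poorest first
}

// Made-up sample with a high threshold so the rare host gets filtered out.
const sample = [
  { host: "a.example", inpMs: 650 },
  { host: "a.example", inpMs: 720 },
  { host: "b.example", inpMs: 120 },
  { host: "b.example", inpMs: 540 },
  { host: "c.example", inpMs: 90 },
];
console.log(rankHostsByPoorInp(sample, 0.3).map((s) => `${s.host}: ${s.poorShare}`));
// [ 'a.example: 1', 'b.example: 0.5' ]
```

Grouping by category works the same way, with a host-to-category lookup replacing `host` as the map key.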

Distribution of INP performance by LoAF script attribution host (Source: RUMvision)

There are 41 distinct hosts that meet the popularity threshold. “www.googletagmanager.com” is the most popular, but we’ve sorted the chart from poorest to least poor, so it actually appears somewhere in the middle. The host attributed to the biggest proportion of poor INP experiences is “cdn-4.convertexperiments.com” from Convert, an A/B testing tool with the tagline “Optimize for better site experiences”. 60% of the 17k navigations that attribute the INP interaction to this LoAF host have experiences slower than the “poor” threshold of 500ms.

The second host from the top is “dashboard.heatmap.com” from Heatmap, which does exactly what you think it does. The website describes it as a way to “increase revenue faster with ecommerce metrics on every element, blazing-fast website speed, revenue-tracked Heatmaps, Scrollmaps, and Screen Recordings”. 55% of the 2k navigations with this host attributed to the INP interaction have poor INP.

The third host is “cdn-swell-assets.yotpo.com” from Yotpo, which is an “eCommerce retention marketing” platform. 42% of the 12k navigations attributing INP to this host have poor INP. For what it’s worth, “staticw2.yotpo.com” also shows up on this list and only 16% of its 4k navigations are poor.

Many of the other hosts in this list are not so easily recognizable, so let’s group the results by the category of services that they provide.

Distribution of INP performance by LoAF script attribution category (Source: RUMvision)

The top three categories attributed to the largest proportion of slow INP performance are A/B testing, user reviews, and user behavior. 56% of the navigations whose INP is attributed to a script in the A/B category have poor experiences. This category only has two hosts in it: Convert, which we’ve looked at before, and “dev.visualwebsiteoptimizer.com”. The latter does pretty well on its own at 57% good INP, but it’s in poor company. The user reviews category is 35% poor and consists of the two Yotpo hosts above. And the user behavior category is 29% poor consisting of a few hosts: dashboard.heatmap.com (55% poor), script.hotjar.com (33% poor), cdn.noibu.com (20% poor), and www.clarity.ms (11% poor).

Conclusions

There’s so much more in this dataset that we haven’t explored yet, but even from this brief analysis, we’ve learned a lot about how INP performs in the field and some of the most common ways it can slow down the user experience.

We looked at the distribution of INP experiences on desktop and mobile and corroborated the CrUX dataset findings that mobile experiences tend to struggle more with INP.
We looked at when the interaction responsible for the INP happens and found a 50/50 split around the 10-second mark, with the interactions happening earlier performing slightly better.
We broke down the interaction time into its phases and found that processing time quickly becomes the most problematic area, but either of the other phases can also be disastrous for INP performance at least 1 in 10 times.
We saw the degree to which device memory plays a role in INP performance, with low-memory devices performing 5 times worse than high-end devices.
We explored how often the most popular script attribution hosts from LoAF are associated with poor INP performance. Third-party scripts that do A/B testing, user reviews, and user tracking tend to perform the worst.

While many of the takeaways here apply mainly to the web as a whole, there are also some lessons for individual site owners who may be struggling with their INP performance. Most importantly, RUM data is critical to understanding how poor the user experience may be and how to improve it. If your website is in the CrUX dataset, it can tell you how slow your INP is, but not why it’s slow. Collecting diagnostic metrics in the field like interaction time, delay and processing phases, device memory, and script attribution paints a more complete picture of the user experience and offers clues as to how to improve it.

This analysis also shines a light on where INP time tends to be spent, and which optimizations are most commonly needed to improve it. It’s no surprise that task management strategies like yielding play a central role in improving INP performance, but now we have a much better idea of the extent to which it matters. Even though the presentation delay might not always be where most of the interaction time is spent, it’s still problematic in many situations, and some sites will need to invest in reducing their DOM complexity and curbing their client-side rendering to get their INP under control.

INP is still a relatively new metric and many developers are just starting to look at it for the first time as it gears up to become a Core Web Vital metric in March 2024. If that’s you, the best place to start learning about optimizing INP is definitely the INP docs on web.dev. As you become more familiar with it and start to make some improvements to your own site’s performance, consider sharing your wins with the community in the form of a case study or blog post. Seeing how it’s actually done is a great way to demystify the unfamiliar.

Just one more thing…

Noam Rosenthal — Fri, 22 Dec 2023 21:25:12 +0000

Some things happen in a document while navigating away, right before switching to the new document, especially in same-origin navigation scenarios. We’re not currently measuring them in Navigation Timing. Perhaps we should?

Blind spot

This came up when discussing the ongoing deprecation of the unload event.

Up until now, we measured the unloadEventStart and unloadEventEnd in navigation timing. However, unload is being deprecated, and regardless of that, several other things happen during this time period that are currently a blind spot in navigation timing.

The deactivation flow

What we call the “deactivation flow” or “document swap” works roughly like this, in a same-origin navigation:

1. We start receiving the bytes for the response. This is captured by responseStart.
2. We receive the headers for the new document, with a status that tells us that we’ve passed all possible redirects and this navigation should be committed (200 in the normal case). Some HTTP response statuses don’t get us this far, e.g. 204/205.
3. With the new cross-document view-transitions feature, we now have an opportunity to capture the current state of the document for the transition, which might take some milliseconds.
4. We fire pagehide and visibilitychange events, which might introduce their own delays.
5. The old document is unloaded.
6. The new document is activated.

Note that the steps following (2) can happen in parallel to downloading the new document’s HTML. However, in the same-origin case they occur on the same thread as the new document, which might cause delay in CPU-bound scenarios.

Additional proposed attributes

To help with this, there is a proposal to add additional attributes to navigation timing and to resource timing:

PerformanceResourceTiming.finalResponseHeadersEnd: will be measured after step 2. This has been discussed in the past.
PerformanceNavigationTiming.deactivationStart: will be measured after step 3.
PerformanceNavigationTiming.deactivationEnd: will be measured after step 5.
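To make the proposal concrete, here is a minimal sketch of how a page might read these values if they ship. deactivationStart and deactivationEnd are only proposed attributes, not part of Navigation Timing today, so the sketch feature-detects them and always reports the existing unloadEventStart/unloadEventEnd as well. The helper name deactivationBreakdown is hypothetical.

```javascript
// Sketch: summarize the "document swap" period from a navigation timing entry.
// deactivationStart/deactivationEnd are PROPOSED attributes (assumed names from
// the proposal above); unloadEventStart/unloadEventEnd exist in Navigation Timing.
function deactivationBreakdown(entry) {
  const result = {
    unload: entry.unloadEventEnd - entry.unloadEventStart,
    deactivation: null, // stays null wherever the proposed attributes are missing
  };
  if (typeof entry.deactivationStart === "number" &&
      typeof entry.deactivationEnd === "number") {
    result.deactivation = entry.deactivationEnd - entry.deactivationStart;
  }
  return result;
}

// In a browser, the entry comes from the Performance Timeline.
if (typeof performance !== "undefined" && typeof performance.getEntriesByType === "function") {
  const [nav] = performance.getEntriesByType("navigation");
  if (nav) console.log(deactivationBreakdown(nav));
}
```

Feature detection keeps the same code working before, during and after any rollout. Browsers without the new attributes simply report deactivation as null.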

Thoughts, comments?

Have you been facing this? Are you seeing speed issues that might stem from the deactivation flow in the wild? Are you measuring this in ways other than navigation timing? Please let us know…

Thanks and happy holidays!

Digging through Chrome traces: an introduction with an example

Annie Sullivan — Thu, 21 Dec 2023 22:04:12 +0000

Introduction to Chrome tracing

What is a Chrome trace?

If you’re just getting started learning about tracing, you should read this fantastic introduction from the Perfetto tracing documentation. I like their definition:

Tracing involves collecting highly detailed data about the execution of a system. A single continuous session of recording is called a trace file or trace for short.

The article goes on to explain that traces contain enough data to construct a timeline of events. Chrome traces contain all sorts of data, both high level and low level. Some examples include:

Network logs
Rectangles showing the coordinates of layouts and paints
Information about Core Web Vitals metrics
JavaScript sampling profiler data

The Perfetto article mentions that application code is instrumented. In Chrome, the TRACE_EVENT macros are used throughout the code to instrument wherever an engineer wants trace data. The number of different types of data, combined with the fact that anyone can add trace events, means there is an enormous amount of trace data. To help manage this, traces are logged with a category. If you are recording a trace in Perfetto or the old chrome://tracing UI, you can select which categories to record to narrow the data down to just the things you care about. But most web developers instead choose to trace through the DevTools Performance Panel, which selects a set of categories on their behalf.

Tracing through the DevTools Performance Panel

When you record runtime performance in Chrome DevTools, you’re recording a trace. And recording a trace this way has some ergonomic benefits for web developers over Perfetto and the chrome://tracing UI:

The trace will only include info from the current tab. Normally Chrome traces include information from all of Chrome’s tabs, as well as other processes like the ones that show Chrome’s UI.
The performance panel automatically only enables trace categories relevant to web developers, so you don’t need to comb through the long list of category names and guess the ones that might be useful.
The performance panel UI is tuned for web developers. It shows screenshots of what the page looked like throughout the recording. It includes a visualization of JavaScript execution measured via CPU sampling (flamecharts, top-down, bottom up charts, etc.). Overall, it’s designed to focus on the most common problems of web pages instead of Chrome internals.

So you can view traces right from DevTools. When would you want to use a different viewer? The main reason is that sometimes you just get a big blank gap in the performance panel trace, because the performance problem is in a Chrome system which isn’t in the list of categories DevTools has enabled in tracing. There are some great articles about this situation; one by Nolan Lawson and one by Jeremy Rose. But starting from Chrome 103, there is a new experiment to enable more of the non-DevTools relevant tracing events. Note some very slow or prolific events (those in the internal disabled-by-default categories) will not be included even with this setting, and experiments may be removed in the future. You can turn it on in the experiments section of settings:

Definitely enable this setting if you’re seeing blank gaps in DevTools performance panel!

Tracing through chrome://tracing and the Perfetto UI

As mentioned, even with this flag on, DevTools performance panel traces don’t include the disabled-by-default events, which can generate a large volume of data and have a big performance impact while they are running. To view those, you need to turn to alternative tracing tools.

There are some other reasons you might want to use a different trace viewer. Maybe you really do want to see multiple tabs in one trace. Maybe you want a more Chromium-focused view, where you can easily link events back to Chromium source code. Maybe you want to search and query events faster. Maybe you want system tracing events from the OS.

You may have heard of chrome://tracing, and if you have used that in the past, then you can still use it for these things. But especially if you’re just learning, you should be aware that chrome://tracing is deprecated in favor of Perfetto, which is faster, more stable, has an easier to use UI, and has much better support for custom metrics and queries. If you’re interested in tracing, I highly recommend reading the docs, or you can just go ahead and use the quickstart guide to recording Chrome traces in Perfetto.

Tracing through WebPageTest and other automated tools

You can also collect traces from WebPageTest runs! You can see the configuration options under the Chromium tab in the Advanced configuration options:

You type in a comma-separated list of tracing categories. The tracing categories are the same ones you check when starting tracing in Perfetto. The resulting trace will be linked in the first column of each individual run:

WebPageTest is amazing for performance automation, and you can even use this extension to export a recorder session as a WebPageTest custom script. But if you want to write your own tool to automate collecting traces, you can collect traces using the Chrome DevTools Protocol with automation tools like Puppeteer, Selenium, or Playwright.

Trace file format

Originally Chrome recorded traces in the Chrome JSON format, containing a long array of events. Now when you record a trace in Chrome, internally it uses Perfetto’s protocol buffer based format. This enables Chrome to record traces with lower overhead and a larger event buffer. But if you want to poke at the files manually or in a web app, JSON is easier. When you download from the Chrome DevTools, you get the Chrome JSON format, and Perfetto traces are easy to convert. I’ll give some info on the JSON file format and different ways to poke through trace files in my example app below.
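Everything that follows starts from parsed JSON, so here is a minimal loader sketch. The Trace Event Format allows two top-level shapes: an object whose traceEvents property holds the events (what DevTools saves) and a bare array of events. Accepting both takes only a couple of lines. The name loadTraceEvents is hypothetical and does not come from any library.

```javascript
// Sketch: accept either the JSON Object Format ({"traceEvents": [...]})
// or the JSON Array Format ([...]) and return the array of events.
function loadTraceEvents(text) {
  const parsed = JSON.parse(text);
  if (Array.isArray(parsed)) return parsed;                 // bare array of events
  if (parsed && Array.isArray(parsed.traceEvents)) {
    return parsed.traceEvents;                              // DevTools-style object
  }
  throw new Error("Unrecognized trace format: expected an array or {traceEvents: [...]}");
}
```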

Processing Chrome JSON Traces: A small custom application

The application

As my team works on the new soft navigation metrics, I really wanted an easy way to analyze a captured trace file, visualizing when soft navigations are detected and the paint events afterward. I’ll walk through how I built it and some things I learned about poking at traces and paint events along the way.

When I was working on understanding how soft navigation detection and follow-up paints work, I found myself doing the following a lot:

Open up DevTools performance panel and start recording
Do one or more soft navigations on the page
Look at a timeline that’s focused on those soft navigations: which were detected? When did they start? What paints did the API report as Largest Contentful Paint candidates?

If you want to try out the finished tool, you can use it here or view the source code here. The screenshot below shows an example of the timeline it generates when you give it a trace:

It shows timestamps in milliseconds since page load at the bottom.
It highlights the starts of soft navigations in green.
It highlights the Largest Contentful Paint candidates in yellow.
Hovering over the soft navigation start and LCP candidates provides more details.

Choosing a trace file format and UI

Perfetto is an amazing tracing system with tremendous capabilities. But for this particular project, when I looked at my requirements, I felt that DevTools performance panel traces were a better fit than Perfetto because:

I am only focusing on one web page at a time. I want to start and stop recording in the same tab while I’m interacting with that page, and DevTools performance panel is set up for that use case.
I want to look at filmstrips, and DevTools is already recording what’s needed for filmstrips.
If this tool turns out to be useful for reporting bugs and discussing subtleties of soft navigations, I want to make it easy for web developers to use, and DevTools helps with that.
I want to quickly look at many mobile websites, and DevTools mobile emulation supports that use case.
I want to make my viewer using HTML and JavaScript, and the Chrome JSON format that DevTools saves makes that easy.

Chrome JSON trace basics

As I mentioned before, the trace files have an array of events. When saved from DevTools it looks like this, one event per line and arguments in alphabetical order:

{"traceEvents": [
  {"args":{"argName":"argValue"...},"cat":"comma,separated,category,names","name":"eventName","ph":"M","pid":0,"tid":0,"ts":0},
  …
]}

Here are the fields in every event:

args: an event-specific dictionary of key/value pairs
cat: a comma separated list of the trace categories for this event
name: the name of the trace event
pid: the id of the process this event occurred in. You can read more about Chrome’s multi-process architecture here, but it’s okay to ignore in devtools traces if you’re just getting started.
ph: each event type has one or more phases, explained in this old documentation. We can ignore these for our example today, but the doc has details on more complex event types.
tid: the thread id of the thread this event occurred in. You can also read more about it in the above link about Chrome’s multi-process architecture.
ts: the timestamp when the event was emitted. Timestamps are in microseconds.
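Two of these fields have small traps once you process events in code. First, cat is a list, so membership should be checked after splitting on commas. A substring check for "devtools.timeline" would also match "disabled-by-default-devtools.timeline". Second, ts is in microseconds, while the timeline described earlier shows milliseconds. Here is a sketch of two hypothetical helpers:

```javascript
// Sketch: exact category membership. `cat` is a comma-separated list, so split it
// instead of using a substring test that could match a longer category name.
function hasCategory(event, category) {
  return (event.cat || "").split(",").includes(category);
}

// Sketch: `ts` is in microseconds; convert to milliseconds relative to an origin
// timestamp (for example, the first event you care about).
function relativeMs(event, originTs) {
  return (event.ts - originTs) / 1000;
}
```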

To make the UI above, I’ll need to find several things in the trace:

When do soft navigations happen?
What are the subsequent largest contentful paint candidates?
Can we show them on the filmstrip screenshots we see in DevTools performance panel?

So we’ll want to find events that have info on soft navigations and largest contentful paint candidates, and also the filmstrip screenshots. The easiest way is to search for event names and look at their args to see what details they contain. Let’s start digging!

Using grep or Ctrl+F

Since the trace files are JSON, you can easily grep or jq on the command line or use Ctrl+F to find what you’re looking for in your editor of choice. I prefer to do the latter since the trace can have very long lines. I format the trace in VSCode; you can do the same in your editor of choice or use an online pretty printer. I’ll show the pretty-printed results I see when I search the text. If you want to follow along with the exact trace I used, you can find it pretty printed or with the default event-per-line formatting in the GitHub repo for the project.

First: where are the soft navigation events? Ctrl+F and start typing! By the time I had typed “softna”, my search had narrowed down to a few events in my trace that look like this:

    {
      "args": {
        "frame": "2D636F83A7E185549B7E7C07168FB95D",
        "navigationId": "4a69b464-1e72-4cfa-a50d-3786ecf6387e",
        "url": "https://m.youtube.com/@MrBeast"
      },
      "cat": "scheduler,devtools.timeline,loading",
      "name": "SoftNavigationHeuristics_SoftNavigationDetected",
      "ph": "I",
      "pid": 69051,
      "s": "t",
      "tid": 259,
      "ts": 193220148841,
      "selfTime": 0
    },

Looking at it, each one is a soft navigation. We can see that:

ts is the timestamp of the navigation
args.url is the URL being navigated to
args.frame contains the frame this occurred in; if we’re digging into pages that have iframes we can account for this, but let’s start by assuming everything’s in the main frame.
args.navigationId is a unique id for each soft navigation
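These observations translate directly into code. The following sketch collects soft navigations in timestamp order. It assumes the event name and args fields match the example above; they come from Chrome internals and could change between versions. The helper name is hypothetical.

```javascript
// Sketch: extract soft navigations from a trace's event array, sorted by time.
// The frame is kept so callers can later filter to the main frame.
function softNavigations(traceEvents) {
  return traceEvents
    .filter((e) => e.name === "SoftNavigationHeuristics_SoftNavigationDetected")
    .map((e) => ({
      ts: e.ts,                         // microseconds
      url: e.args?.url,
      navigationId: e.args?.navigationId,
      frame: e.args?.frame,
    }))
    .sort((a, b) => a.ts - b.ts);
}
```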

Having a unique id for each soft navigation might mean we can associate it with the paints! Let’s Ctrl+F 4a69b464-1e72-4cfa-a50d-3786ecf6387e and find out! There are three matching events; I’ll paste just the final one (final candidate) for brevity:

     {
      "args": {
        "data": {
          "candidateIndex": 3,
          "imageDiscoveryTime": 7296.5,
          "imageLoadEnd": 7299.5,
          "imageLoadStart": 7297.5999999940395,
          "isMainFrame": true,
          "isOutermostMainFrame": true,
          "navigationId": "4a69b464-1e72-4cfa-a50d-3786ecf6387e",
          "nodeId": 2229,
          "size": 24250,
          "type": "image"
        },
        "frame": "2D636F83A7E185549B7E7C07168FB95D"
      },
      "cat": "loading,rail,devtools.timeline",
      "name": "largestContentfulPaint::Candidate",
      "ph": "R",
      "pid": 69051,
      "s": "t",
      "tid": 259,
      "ts": 193220415759
    },

Looks like these are exactly what we were hoping for: Largest Contentful Paint candidates 1, 2, and 3. Looking at args.data, we can see the first two are type text and the last is type image. They each have a size, which is width x height, but they don’t have any coordinates we could use to paint a screenshot. However, they each have a nodeId; let’s look for that! If we just Ctrl+F and type “2229”, we’ll get a lot of partial matches in timestamps. But we can use regular expression search, or just search for “ 2229,”. The node id shows up in several unrelated events like PageEvacuationJob started and ProfileChunk, but we also see this:

    {
      "args": {
        "data": {
          "dom_node_id": 2229,
          "frame": "2D636F83A7E185549B7E7C07168FB95D",
          "image_url": "https://yt3.googleusercontent.com/NP3n...",
          "is_image": true,
          "is_image_loaded": true,
          "is_in_main_frame": true,
          "is_in_outermost_main_frame": true,
          "is_svg": false,
          "object_name": "LayoutImage",
          "rect": [12, 52, 400, 52, 400, 115, 12, 115]
        }
      },
      "cat": "loading",
      "name": "PaintTimingVisualizer::LayoutObjectPainted",
      "ph": "I",
      "pid": 69051,
      "s": "t",
      "tid": 259,
      "ts": 193220402724,
      "tts": 18637038,
      "selfTime": 0
    },

So the args.data.nodeId from largestContentfulPaint::Candidate matches args.data.dom_node_id in PaintTimingVisualizer::LayoutObjectPainted. But it’s hard to find automatically because they have different names. And taking a step back, they use different naming conventions too (camelCase vs underscores). An important thing to keep in mind with traces is that they have tons of data because it’s easy for any engineer to add data, but different engineers and especially different teams don’t necessarily coordinate on trace events. So you’ll see issues like this with inconsistent naming, and even inconsistent ways to represent a rectangle, as we’ll see below.
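Because the two events name the node id differently, the join has to spell out both field names. The sketch below builds an index of paint events keyed by dom_node_id once, rather than scanning the whole trace for every candidate. A node can be painted more than once, so for each candidate it takes the most recent paint of that node at or before the candidate's timestamp. The function name and return shape are hypothetical.

```javascript
// Sketch: link each LCP candidate (args.data.nodeId) to a paint event of the same
// node (args.data.dom_node_id), choosing the latest paint at or before the candidate.
function linkCandidatesToPaints(traceEvents) {
  const paintsByNode = new Map(); // dom_node_id -> paint events
  for (const e of traceEvents) {
    if (e.name !== "PaintTimingVisualizer::LayoutObjectPainted") continue;
    const id = e.args?.data?.dom_node_id;
    if (id === undefined) continue;
    if (!paintsByNode.has(id)) paintsByNode.set(id, []);
    paintsByNode.get(id).push(e);
  }
  for (const list of paintsByNode.values()) list.sort((a, b) => a.ts - b.ts);

  return traceEvents
    .filter((e) => e.name === "largestContentfulPaint::Candidate")
    .map((candidate) => {
      const paints = paintsByNode.get(candidate.args?.data?.nodeId) || [];
      const earlier = paints.filter((p) => p.ts <= candidate.ts);
      const paint = earlier.length > 0 ? earlier[earlier.length - 1] : null;
      return { candidate, paint }; // paint is null when no matching paint was recorded
    });
}
```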

Searching the Chromium codebase for details

PaintTimingVisualizer::LayoutObjectPainted–that looks like exactly what we need! There are a lot of arguments, but what we probably want is rect. But what is a rect? Why does it have 8 numbers? This is where Chromium code search comes in handy. There’s only one hit in the C++ code for PaintTimingVisualizer::LayoutObjectPainted. It’s in PaintTimingVisualizer::DumpTrace(). That method has the value with the rect passed in; if we click on it we can see it’s called by DumpImageDebuggingRect and DumpTextDebuggingRect in the same file. You can see those both call RecordRects(), which calls CreateQuad(), which makes a rectangle:

void CreateQuad(TracedValue* value, const char* name, const gfx::QuadF& quad) {
  value->BeginArray(name);
  value->PushDouble(quad.p1().x());
  value->PushDouble(quad.p1().y());
  value->PushDouble(quad.p2().x());
  value->PushDouble(quad.p2().y());
  value->PushDouble(quad.p3().x());
  value->PushDouble(quad.p3().y());
  value->PushDouble(quad.p4().x());
  value->PushDouble(quad.p4().y());
  value->EndArray();
}

And if you click on gfx::QuadF you’ll see it defines a quad by four corners:

  constexpr explicit QuadF(const RectF& rect)
      : p1_(rect.x(), rect.y()),
        p2_(rect.right(), rect.y()),
        p3_(rect.right(), rect.bottom()),
        p4_(rect.x(), rect.bottom()) {}

So that means the eight values in the rectangle are the four corners, in order:

p1: x, y
p2: right, y
p3: right, bottom
p4: x, bottom

Here x, y is the top-left corner and right, bottom is the bottom-right corner. So now we can get the viewport coordinates for each LCP candidate!
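Given that corner order, turning a rect array into something easier to draw only needs p1 and p3. Here is a small sketch with a hypothetical quadToRect helper, using the image candidate's rect from the example trace:

```javascript
// Sketch: convert the 8-number quad written by CreateQuad()
// ([x, y, right, y, right, bottom, x, bottom]) into x/y/width/height.
function quadToRect(quad) {
  const [x, y, right, , , bottom] = quad; // p1 = (x, y), p3 = (right, bottom)
  return { x, y, width: right - x, height: bottom - y };
}

console.log(quadToRect([12, 52, 400, 52, 400, 115, 12, 115]));
// → { x: 12, y: 52, width: 388, height: 63 }
```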

Writing some JavaScript to parse the trace

Now that we know exactly the data we want, let’s look at some quick JavaScript code to pull out the trace data that we need. Array filter and find help a lot here:

// traceData: the text of the trace file downloaded from DevTools
const traceJson = JSON.parse(traceData);
const traceEvents = traceJson.traceEvents;
const lcpEvents = traceEvents.filter((e) =>
    e.name === 'largestContentfulPaint::Candidate' &&
    e?.args?.data?.nodeId !== undefined);
for (const l of lcpEvents) {
  // The same node id is called dom_node_id on the paint event.
  const p = traceEvents.find((e) =>
      e.name === 'PaintTimingVisualizer::LayoutObjectPainted' &&
      e?.args?.data?.dom_node_id === l.args.data.nodeId);
  if (!p) continue; // no paint recorded for this node
  const rect = p.args.data.rect; // [x, y, right, y, right, bottom, x, bottom]
  const top = rect[1];           // or rect[3]
  const left = rect[0];          // or rect[6]
  const height = rect[5] - top;  // or rect[7] - top
  const width = rect[2] - left;  // or rect[4] - left
  // Now do something with this position!
}

Some of you might feel more comfortable diving straight into code like this to help you traverse the JSON data, rather than manually searching and grepping through it. Others may be more comfortable looking through the JSON initially, especially if you’re less used to the structure. Regardless, at some point, you’ll definitely want to switch to code. (And if you find yourself really wanting to write queries instead of grepping or array processing, that’s a great time to look at Perfetto’s trace processing and metrics capabilities.)

Lather, rinse, repeat!

All right, we know the coordinates within the viewport of the paint for each LCP candidate! But if we want to draw it on top of a screenshot, we also need to know:

What the viewport size is

Ctrl+F “viewport” gives us a PaintTimingVisualizer::Viewport event with the same rect format as the paint rects, which makes the viewport 412×915.
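Once the screenshot's size is also known, a paint rect in viewport pixels (as x, y, width, height) can be mapped onto a screenshot of a different resolution by scaling each axis independently. This is a minimal sketch. It assumes the screenshot shows exactly the viewport and nothing else, such as browser UI, which is worth verifying against your own traces. scaleRect and the 824×1830 screenshot size in the example are hypothetical.

```javascript
// Sketch: map a rect from viewport coordinates to screenshot pixel coordinates,
// assuming the screenshot covers exactly the viewport.
function scaleRect(rect, viewport, screenshot) {
  const sx = screenshot.width / viewport.width;
  const sy = screenshot.height / viewport.height;
  return {
    x: rect.x * sx,
    y: rect.y * sy,
    width: rect.width * sx,
    height: rect.height * sy,
  };
}

// Hypothetical 2x screenshot of the 412x915 viewport:
console.log(scaleRect({ x: 12, y: 52, width: 388, height: 63 },
                      { width: 412, height: 915 },
                      { width: 824, height: 1830 }));
// → { x: 24, y: 104, width: 776, height: 126 }
```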

What the screenshot size is

Ctrl+F “screenshot” gives a whole lot of events like this:

    {
      "args": {
        "snapshot": "/9j/4AAQSkZJRgA……"
      },
      "cat": "disabled-by-default-devtools.screenshot",
      "id": "0x1b2a",
      "name": "Screenshot",
      "ph": "O",
      "pid": 15397,
      "tid": 259,
      "ts": 11391427572
    },

But notice the screenshot event has no dimensions! The args.snapshot field is a base64-encoded JPEG. We can either render an image in HTML with a data URI like "data:image/jpeg;base64,/9j/4AAQSkZJRgA……" or use an image library to get its dimensions. Since my app is a web page anyway, I did the former.
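If you need the dimensions outside the browser and don't want an image library, the JPEG header already carries them. The /9j/ prefix of the snapshot decodes to the bytes FF D8 FF, which is the JPEG start-of-image marker followed by the start of the next marker. The frame size sits in the first start-of-frame (SOFn) segment. The sketch below is simplified and is not a full JPEG parser. It assumes well-formed segments with no fill bytes between markers, and it uses the atob global available in browsers and Node 16+. The function name is hypothetical.

```javascript
// Simplified sketch: return {width, height} from a base64-encoded JPEG,
// or null if no start-of-frame segment is found. Not a full JPEG parser.
function jpegSizeFromBase64(b64) {
  const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
  if (bytes[0] !== 0xff || bytes[1] !== 0xd8) return null; // no SOI marker: not a JPEG
  let i = 2;
  while (i + 8 < bytes.length) {
    if (bytes[i] !== 0xff) return null; // expected a marker here; give up
    const marker = bytes[i + 1];
    // SOF0..SOF15 carry the frame size; C4 (DHT), C8 (JPG) and CC (DAC) share
    // the range but are not frame headers.
    const isSOF = marker >= 0xc0 && marker <= 0xcf &&
      marker !== 0xc4 && marker !== 0xc8 && marker !== 0xcc;
    if (isSOF) {
      // Segment layout: FF Cn | length (2) | precision (1) | height (2) | width (2)
      return {
        height: (bytes[i + 5] << 8) | bytes[i + 6],
        width: (bytes[i + 7] << 8) | bytes[i + 8],
      };
    }
    const length = (bytes[i + 2] << 8) | bytes[i + 3]; // includes the 2 length bytes
    i += 2 + length; // skip the marker and its segment
  }
  return null;
}
```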

Which screenshot to line up with which paint

First, the screenshot timestamp doesn’t exactly match the LCP timestamp. And if we look back, the LCP timestamp doesn’t match the PaintTimingVisualizer event for the node either! The reason is that displaying a frame in Chrome is a complicated process. First, code on the renderer’s main thread does layout and decides how objects will paint; this code notifies the PaintTimingVisualizer as it runs, emitting a timestamp right away. Then Largest Contentful Paint waits for the presentation timestamp, which is when the GPU believes the pixels actually appear on screen. And the code that collects screenshots does its own estimation of the timestamp. The good news is that things generally line up if we anchor on the timestamp of the LCP candidate and assume that the paint with the timestamp immediately before it and the screenshot with the timestamp immediately after it are the right ones.
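That anchoring rule is straightforward to express in code. This sketch assumes two inputs, each already sorted by ts: the paint events, narrowed to the candidate's node, and the Screenshot events. Binary search keeps the lookups fast on long traces. The function names are hypothetical.

```javascript
// Sketch: latest event with ts <= target in an array sorted by ts (or null).
function lastAtOrBefore(sorted, ts) {
  let lo = 0, hi = sorted.length - 1, found = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid].ts <= ts) { found = sorted[mid]; lo = mid + 1; }
    else hi = mid - 1;
  }
  return found;
}

// Sketch: earliest event with ts >= target in an array sorted by ts (or null).
function firstAtOrAfter(sorted, ts) {
  let lo = 0, hi = sorted.length - 1, found = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid].ts >= ts) { found = sorted[mid]; hi = mid - 1; }
    else lo = mid + 1;
  }
  return found;
}

// Anchor on the LCP candidate: paint just before it, screenshot just after it.
function alignCandidate(candidate, sortedPaints, sortedScreenshots) {
  return {
    paint: lastAtOrBefore(sortedPaints, candidate.ts),
    screenshot: firstAtOrAfter(sortedScreenshots, candidate.ts),
  };
}
```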

A simple app for a custom view into a trace

I was chatting with Andy Davies after performance.now() about poking into traces like this. He suggested that writing some JavaScript to parse the JSON and using web components to build a viewer is a quick and easy solution. I tried it and I’m pretty happy with how quickly it came together. Here is the source code. Perhaps as you were reading you noticed it doesn’t yet handle iframes, out of order frame timestamps, and other edge cases. With a viewer like this, when I hit a site with an edge case like this it is pretty clear visually, since the highlight for LCP doesn’t line up correctly. So it’s easy to adjust as I go.

You could imagine adjusting this to use the screenshots in other ways. Maybe to visualize layout shifts. Or maybe you could work on ideas for a different loading metric based on the paint events. One that measures something closer to visual completeness, or some fraction of the viewport.

What about productionizing this?

I hope this blog post was a good intro into how to poke into traces, and rearrange the screenshot and painting features into something more customized for your use case. But if your use case is something that needs to scale, you should consider broadening your tooling outside of the JSON produced by DevTools performance panel.

You may have noticed that WebPageTest has bigger screenshots than the ones produced in the traces, and it can show them at 60fps granularity. If you want to have that kind of quality filmstrip without severely impacting performance, you’ll need to video record the screen instead of dumping screenshots into a trace. That’s exactly what WebPageTest does, and you may want to consider hosting your own instance.

Also, you may find that writing JavaScript code to filter and link events from an array of JSON goes from simple and easy to slow and error-prone the more features you include and the larger the traces you want to process. At that point you’ll really want to revisit Perfetto. Its support for trace-based metrics and SQL provides a more systematic way to filter and join traces, and its Batch Trace Processor can scale your queries to thousands of traces. It also has C++ and Python libraries for processing traces.

Show us your ideas!

Often people talk to me about ideas for different loading metrics or tweaks to Largest Contentful Paint. Could we use the paint rects to measure when the entire first viewport is painted instead of just looking at the Largest Contentful Paint? Could we combine adjacent paints with some heuristic? Could we combine heuristics around network and JavaScript usage with the painted rects? I’d love to see people show me how these ideas pan out on a filmstrip. Or better yet, show the Web Performance Working Group.

The Golden Rule of Web Performance and Different Performance Engineering Specializations

Alex Podelko — Thu, 21 Dec 2023 07:00:00 +0000

Performance engineering, though rather a narrow field by itself, has many well-established specializations. While the main generic principles are the same, it is surprising how small the overlap in specific skills is: when working in one specialization, it is easy to miss what is going on in another, making them almost isolated silos. While I am addressing rather trivial topics here, they fall somewhere between these specializations and may be worth discussing to make sure that we all are on the same page.

The Golden Rule of Performance Engineering

Let’s start from Steve Souders’ Golden Rule: “80-90% of the end-user response time is spent on the frontend. Start there.” Tim Kadlec makes some interesting points in his update on the subject. Actually, this statement bothered me for a long time because wrong conclusions could be drawn from it without understanding all its aspects.

First of all, this statement specifies what you need to do to improve response time. It may appear trivial – basically, it is an application of Amdahl’s law to the Web. I believe that the genius of Steve Souders here was not only in pointing out where we should spend effort improving response time – but in pointing out that response time matters by itself and we should improve it. Not that it was a completely novel idea either – but he started a real movement with his books and the Velocity conference, creating a separate discipline. Before that, practically nobody cared. Well, of course, there were some discussions that response time shouldn’t be too long – but the time spent in the client’s browser was mostly the client’s problem. It used the client’s resources. What concerned performance professionals was the server – and its utilization. Most efforts were focused on the server side – those were the resources that you needed to provide and pay for. Performance optimization efforts were usually evaluated by how many resources were saved – and the frontend didn’t impact that (except for some exotic – at least then – cases; more blurred now, as Tim Kadlec elaborated in the above-mentioned post). Of course, the business impact of performance was discussed much earlier (see, for example, Business Case for SPE) – but it wasn’t the mainstream understanding then (and, actually, even now – although there has been good progress since).

The main change was understanding that improving response time has significant business value completely independent from backend costs – and bad response times may have devastating impact on business. Later it was elaborated, for example, in Tammy Everts’ book Time is Money: The Business Value of Web Performance (or her older presentation on the topic). The WPO Stats web site has many good examples. A little more background info can be found in Business Case for Performance.

Wrong Conclusions Only

The worst conclusion that may be drawn from the Golden Rule is that backend time doesn’t matter because it doesn’t add much to end-to-end response time. Even if we assume that the backend is properly optimized and scales seamlessly (a very big if, as it is often not the case), every improvement in backend time may yield huge savings in infrastructure costs. Assuming that all backend time is server processing (not waiting in a queue), it is quite possible that decreasing backend time from 5% to 4% of response time may save 20% of infrastructure costs – while, of course, saving only 1% of response time. Optimizing the frontend won’t help with saving backend cloud costs – which is still the major concern of such disciplines as Cloud Economics and FinOps.
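Spelling out the arithmetic makes the asymmetry clear. This sketch uses the paragraph's own assumptions: all backend time is server processing, and infrastructure cost is proportional to it. The inputs are the illustrative percentages from the text, not measurements, and backendSavings is a hypothetical helper.

```javascript
// Illustrative arithmetic only. Assumes infrastructure cost is proportional to
// backend processing time; inputs are backend shares (%) of the original response time.
function backendSavings(backendPctBefore, backendPctAfter) {
  const cut = backendPctBefore - backendPctAfter; // percentage points of response time
  return {
    infraCostSavedPct: (cut * 100) / backendPctBefore, // relative cut in backend work
    responseTimeSavedPct: cut,                          // share of end-to-end time saved
  };
}

console.log(backendSavings(5, 4));
// → { infraCostSavedPct: 20, responseTimeSavedPct: 1 }
```

The same one-point cut is a 20% reduction of the backend's own work but only 1% of what the user waits for. That is why the two goals call for different optimization priorities.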

Unfortunately, rather few systems scale seamlessly, and many have different resource limitations. When you hit these resource limitations, response times skyrocket until the system becomes unusable or crashes (resilience engineering tries to address that by making sure that systems handle it in a nicer way). In this case it is not important at all whether backend time was 5% or 15% – overall response times will skyrocket as their backend part skyrockets (while the frontend part will remain the same).

Frontend vs Backend Performance

While in some cases the difference between frontend and backend response times may get a little blurred, performance engineering for frontend and backend remain completely different specializations. And the main difference is single-user vs multi-user.

Single-user is not something specific to the Web frontend – it applies to anything from desktop and mobile applications to any backend component when you look at the performance of a single user. It doesn’t mean that it is trivial – you still may have all kinds of multi-threading effects as execution environments and technologies become extremely sophisticated. Quite a lot of performance engineering principles may be re-used – although each technology has so many idiosyncrasies of its own that it would take some time to move between technologies, even for experienced performance professionals.

Single-user performance engineering is a must in every technology – if something is slow for one user, it won’t be any better for multiple users. But multi-user aspects add new dimensions – you need to think about throughput, capacity, scalability, resource utilization, contention, and many other things. Actually, most books concentrate on multi-user performance – and many are very heavy on math and queueing theory, which may not be trivial to apply. Two good books to start with to understand multi-user performance implications are Every Computer Performance Book: How to Avoid and Solve Performance Problems on The Computers You Work With by Bob Wescott and Fundamentals of Performance Engineering; You can’t spell firefighter without IT Perfect by Keith Smith and Bob Wescott.
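Two of the most basic multi-user relationships, the Utilization Law (U = X * S) and Little’s Law (N = X * R), already show how throughput connects single-user timings to capacity. A sketch with hypothetical numbers; the function and field names are illustrative.

```javascript
// Utilization Law: U = X * S  (throughput x service time per request)
// Little's Law:    N = X * R  (throughput x response time = requests in the system)
// All inputs are hypothetical; times are in seconds, throughput in requests/second.
function capacityCheck({ throughputPerSec, serviceTimeSec, responseTimeSec }) {
  const utilization = throughputPerSec * serviceTimeSec;
  const concurrentRequests = throughputPerSec * responseTimeSec;
  // Maximum throughput a single resource can sustain before U reaches 100%.
  const maxThroughputPerSec = 1 / serviceTimeSec;
  return { utilization, concurrentRequests, maxThroughputPerSec };
}

console.log(capacityCheck({ throughputPerSec: 80, serviceTimeSec: 0.01, responseTimeSec: 0.05 }));
```

A request that takes 10 ms for one user looks harmless, but at 80 requests per second the resource is already 80% busy and cannot go beyond 100 requests per second.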

Backend Performance

I suspect that term is used only by the Web Performance crowd (for something outside their interest – although Web Servers are somewhat in the middle and fall into both categories). People who work on backend performance (which usually focuses on applications, application servers, databases, containers, and similar) usually refer to it simply as performance – as performance engineering started as a discipline back when there were dumb terminals. I was able to trace the history of performance engineering at least back to 1966 when System Management Facilities (SMF) were introduced (which is, basically, instrumentation and tracing). By the way, Response time in man-computer conversational transactions by Robert Miller was published in 1968 – starting a long conversation on what a good response time is. The first performance professionals were performance analysts and capacity planners charged with efficient usage of mainframe resources.

Performance/load testing appeared at the end of the 80s in response to the spread of distributed systems that didn’t offer much insight into system performance beyond system monitoring. The only way to ensure the performance of an application was to test it under load, so tools were created to generate synthetic load. Designing and implementing this load was a separate craft, which needed to be supplemented by traditional performance engineering skills to properly analyze results and provide recommendations. It had more overlap with performance analysis and capacity planning (first of all in workload characterization) than with functional testing.

Cloud computing brought back centralization, a specific price tag for resources, and flexibility of deployments – which increased demand for, practically, the same performance analysts and capacity planners, but under new names. They are usually referred to as performance engineers, efficiency engineers, cloud economists, and FinOps professionals. Demand for traditional performance testers somewhat decreased, but Continuous Performance Testing became a new trend.

All these groups are separate specializations with limited overlap – usually only in most generic performance principles. The specific sets of skills are quite different – each group with their own interests, events, and organizations.

Holistic View vs Specializations

I have always advocated a holistic view of performance and context-driven performance engineering, as performance depends on all underlying parts and their collaboration. If you address each performance silo separately, you may have gaps where performance issues happen. However, modern technologies have become so sophisticated (and change so quickly) that you can’t have deep expertise across all of them. So, we probably should talk about a combination of generic performance professionals who can see a holistic view of performance and develop a performance engineering strategy, and professionals specializing in specific areas (from databases to Web Performance). Of course, we are talking about large systems where performance is critically important – small startups probably can’t afford dedicated performance specialists, and responsibilities get spread across other members of the team.

Beyond soft-navigations: tracking your SPA’s TTFB

Erwin Hofman — Wed, 20 Dec 2023 03:50:32 +0000

SPA site owners most likely are aware of the gap in Core Web Vitals data for their site. Most metrics (FCP, LCP, FID) are only reported by the browser once, so can’t be measured even if sites were willing to put in extra effort. TTFB is similar, but could potentially be calculated for soft navigations. And reporting CLS and INP, while technically possible, comes with other challenges.

SPA site owners waiting for Core Web Vitals to fully support their sites

Introduction

In summary, Core Web Vitals metrics won’t treat SPA route changes the same way as traditional page loads (Multi Page Applications, or MPAs).

In an article from September 2021, Google already asked itself what it was doing to ensure MPAs do not have an unfair advantage compared to SPAs.

One of their answers to their own question is as follows:

Design new APIs that enable better SPA measurement.

Which is exactly what Google has been doing by working on and introducing a soft navigations experiment. If you’re new to soft navigations: this is the term used by the web community to distinguish SPA navigations from MPA navigations.

RUMvision

I’m a developer by origin, nowadays working both as a consultant and involved with RUMvision’s JavaScript. This JavaScript is responsible for collecting and submitting web-vitals data and additional metrics and dimensions. And our JavaScript is a layer on top of Google’s web-vitals library.

Origin trials

Soft navigation measurement support in the browser is in an experimental phase, also known as an origin trial. Origin trials allow developers, RUM providers and other web enthusiasts to try out new features and APIs and provide feedback to the people who came up with a new web platform idea. You could say it allows you to experiment today with features that could be part of a browser tomorrow.

Not every first concept is a success. Luckily, that hasn’t really been the case so far, and soft navigations are in fact nailing it

Origin trials are an area of focus of RUMvision because we believe they offer us a way to get ahead of the curve for our customers, while also allowing us to give feedback on upcoming APIs. So far, we’ve embedded 4 origin trials.
Obviously, we only work on origin trials that make sense for a RUM provider like us: we don’t look at every origin trial, but we do strongly consider ones that can help us provide more insights to our customers.

The Long Animation Frames (LoAF) API and the Soft Navigations API are two examples. And if you ask me, both are masterpieces. Origin trials and web API incubations in general are something I easily took for granted in the past, but it’s very interesting to be more involved in them. And quite fun to work with too.

Soft navigations

In this article, I will talk a bit about soft navigations and especially elaborate on how we built a layer on top of Google’s web-vitals soft-nav branch.

The soft navigations explainer page describes this perfectly, as well as why the community needs a soft navigation API.

The SPA web-vitals challenge

The often-heard Core Web Vitals complaint from SPA (and PWA) site owners is that their sites are being penalized compared to non-SPAs.

Missing and incorrect data

The reason is that Core Web Vitals does not report metrics specifically for soft navigations. Well, it does, kind of. For example, CLS and INP are measured throughout the whole page lifetime. But they are typically attributed to the initial “hard navigation” URL, as the URL or route change isn’t picked up by Core Web Vitals APIs. This means you typically get one CLS and INP measure across the whole of the SPA, rather than one per page.

This could mean that when looking at Google Search Console, CrUX APIs, PageSpeed Insights or other data coming from CrUX, you might end up looking at data that doesn’t reflect what users consider page views, unlike for MPAs.

“But SPAs are much better in real life”

Mainly seeing experiences by landing URL is unfortunate, though, as instant subsequent page navigations are a promise that comes with most SPA frameworks (though whether that is actually true is a totally different discussion).

At RUMvision we have started measuring TTFB for soft navigations and often do see a much smaller time for these:

A screenshot from RUMvision, showing different TTFBs for different navigation types

As the soft navigation TTFB shows a better value, this could indicate that SPAs are penalized by web vitals heuristics. But are they, when it comes to the Core Web Vitals?

As a matter of fact, the Google team already improved the playing field somewhat by changing LCP, CLS and INP characteristics:

LCP used to re-report a removed and re-hydrated candidate. Google changed this behaviour in January 2021 (Chrome 88). Elements that were the LCP but are then removed are still considered the LCP if no bigger element is being rendered.
One of the changes to the CLS metric is that, as of May 2021, it is tracked in session windows of at most 5 seconds. So pages only get the worst burst of CLS, rather than the full accumulation. This helps longer-lived pages like SPAs that previously experienced an ever-increasing CLS number.
For INP, the very highest interaction latencies are already ignored (one for every 50 interactions), which again benefits longer-lived pages like SPAs, as they tend to get more interactions within a single page life cycle.
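The CLS and INP adjustments listed above can be illustrated with a simplified sketch. The session-window rules (a new window starts after a gap of 1 second or once a window reaches 5 seconds; CLS is the largest window) and the INP rule (skip one of the highest interactions for every 50 interactions) follow the published metric definitions, but this is not the web-vitals implementation, and the input shapes are assumptions.

```javascript
// Simplified CLS: group layout shifts into session windows and report the worst window.
// Each shift is assumed to look like { startTime (ms), value, hadRecentInput }.
function clsFromShifts(shifts) {
  let worst = 0;
  let sessionValue = 0;
  let sessionStart = -Infinity;
  let lastShift = -Infinity;
  for (const s of shifts) {
    if (s.hadRecentInput) continue; // shifts right after user input are excluded
    const gapTooLong = s.startTime - lastShift >= 1000;
    const windowTooLong = s.startTime - sessionStart >= 5000;
    if (gapTooLong || windowTooLong) {
      // Start a new session window.
      sessionValue = 0;
      sessionStart = s.startTime;
    }
    sessionValue += s.value;
    lastShift = s.startTime;
    worst = Math.max(worst, sessionValue);
  }
  return worst;
}

// Simplified INP: ignore one of the highest latencies per 50 interactions.
function inpFromLatencies(latencies) {
  if (latencies.length === 0) return 0;
  const sorted = [...latencies].sort((a, b) => b - a);
  const index = Math.min(sorted.length - 1, Math.floor(latencies.length / 50));
  return sorted[index];
}
```

Because only the worst window and a near-worst interaction count, a long-lived SPA is not punished simply for staying open longer.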

So the CLS and INP metrics account somewhat for the use cases of SPAs.

However, the LCP metric is genuinely harder to pass when all other page navigations within your site are soft navigations. That’s because the LCP number will be based on those initial navigations where both DNS lookup and render blocking stylesheets weren’t cached yet, leading to a higher TTFB, FCP and LCP.

For MPAs, this initial heavy hit can be offset by subsequent page loads within the site (where the connection is already made and common site assets like CSS, JS and logos are already cached). SPAs do not get these lighter, subsequent LCPs included, and so only get the full initial hit.

The web-vitals library

I do acknowledge the responsiveness challenge that comes with most JavaScript frameworks. And maybe even layout shifts due to page transitions. However, that should not be attributed to limitations on how performance metrics are collected and reported (but instead should maybe be attributed to how the site is experienced).

As mentioned previously, it is already possible to segment some of the metrics (CLS and INP) by soft navigation route, but when trying to do that through the web-vitals library, it did come with other challenges:

You can either let the library report metrics once per navigation, or report all changes and triage incoming data yourselves. This could be used to report on soft navigations in a custom way.
Unfortunately, even then some metrics (LCP, FCP, FID, TTFB) are not reported for subsequent navigations.
Even for those that are (CLS, INP) only larger values are reported, meaning you only get a partial view of these metrics for soft navigations.

Luckily, as the soft navigations API was introduced, web-vitals introduced the soft-nav branch which you might want to use instead.

This made it easier to collect LCP, CLS and INP for other navigations as well. But the TTFB would always result in 0ms.
CLS and INP of a previous page will be reported before the new TTFB is being reported. This makes it easier to send them to your analytics endpoint in a chronological way.
This branch includes a unique ID per page as metric.navigationId, which allows you to tie together page-specific data.

In other words: the soft-nav branch of web-vitals allows you to measure soft navigations with minimal effort and attribute them back to the appropriate route.

The code below (latest at time of writing) can be used to report metrics including soft navigations.

(function() {
  // Load the attribution build of web-vitals from the soft-navs branch.
  var script = document.createElement('script');
  script.src = 'https://unpkg.com/web-vitals@soft-navs/dist/web-vitals.attribution.iife.js';
  script.onload = function() {
    // reportSoftNavs: true also reports each metric for soft navigations.
    webVitals.onTTFB(console.log, {reportSoftNavs: true});
    webVitals.onLCP(console.log, {reportSoftNavs: true});
    webVitals.onCLS(console.log, {reportSoftNavs: true});
    webVitals.onINP(console.log, {reportSoftNavs: true});
  };
  document.head.appendChild(script);
}());

Do note that to use this soft-navs branch, you need to either enable Experimental Web Platform features or participate in the soft navigations origin trial.
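Before turning on reportSoftNavs, it may be worth checking whether the browser actually exposes soft navigation entries. A minimal sketch: PerformanceObserver.supportedEntryTypes is a standard API, while the 'soft-navigation' entry type name is taken from Chrome’s experiment and the function name is hypothetical.

```javascript
// Returns true when the environment lists 'soft-navigation' as a supported
// performance entry type, i.e. the flag or origin trial is active.
// The supported types can be passed in, which also keeps the check testable.
function supportsSoftNavigations(
  supportedEntryTypes = typeof PerformanceObserver !== 'undefined'
    ? PerformanceObserver.supportedEntryTypes || []
    : []
) {
  return supportedEntryTypes.includes('soft-navigation');
}

console.log(supportsSoftNavigations());
```

When this returns false, falling back to a non-soft-navigation setup avoids silently collecting incomplete SPA data.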

You still need to send the data to an endpoint yourself and correctly batch page-specific data together to prevent it from being attributed to the wrong page. The web-vitals documentation shares a code example to give you a head start, though, and this is also where a RUM provider like RUMvision can help.
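A minimal batching sketch keyed on the metric.navigationId that the soft-navs branch adds. The class name, the /rum endpoint mentioned in the comment and the payload shape are hypothetical; the send function is injected so that it can be navigator.sendBeacon in a browser.

```javascript
// Collects web-vitals metrics per navigationId so that CLS/INP of a previous
// soft navigation are never mixed with the metrics of the next one.
class NavigationBatcher {
  constructor(send) {
    this.send = send;         // e.g. (body) => navigator.sendBeacon('/rum', body)
    this.batches = new Map(); // navigationId -> array of { name, value }
  }

  add(metric) {
    const id = metric.navigationId;
    if (!this.batches.has(id)) this.batches.set(id, []);
    this.batches.get(id).push({ name: metric.name, value: metric.value });
  }

  // Send every batch except the one for keepNavigationId (if given).
  flush(keepNavigationId) {
    for (const [id, metrics] of this.batches) {
      if (keepNavigationId !== undefined && id === keepNavigationId) continue;
      this.send(JSON.stringify({ navigationId: id, metrics }));
      this.batches.delete(id); // deleting the current Map entry during for...of is safe
    }
  }
}
```

Since the soft-navs branch reports CLS and INP of the previous page before the new TTFB, calling flush(newId) when a metric with a new navigationId arrives, and flush() when the page is hidden, keeps each beacon scoped to one page.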

The start of embedding SPA tracking in RUMvision

Until the web-vitals library published the soft-navs branch, tracking even a limited set of CWV metrics in SPAs needed quite a bit of customization, and even then had severe limitations.

But with the soft-nav branch of the web-vitals library being published, it became way easier. On top of that, soft navigation experiments entered a second trial. So, over at RUMvision, we decided to start working on v4 of our tracking snippet to incorporate soft navigations.

Who should fix the 0ms TTFB?

Although the web-vitals library is tackling some challenges already, it was still missing some data.

Mainly, TTFB was known to always report 0 milliseconds. To be honest, all the concerns described over at developer.chrome.com are justified (they’ve spent way more time in this area, after all). Their reasoning for reporting a TTFB of 0ms is based on the following (in my opinion, valid) concerns:

Does a fetch request always actually happen for a soft nav?
And if it happens, which one should be picked?
And was it actually related to the navigation?
Could the LCP be painted even when the fetch request fails?

All questions above are tricky to answer and even vary per stack and framework. For a general-purpose library, taking an opinionated stance on this for the general use case is understandable. The Google team also tries to keep the library as small as possible, so adding a lot of extra code to tackle this could grow the library more than desired.

With this in mind, how could we still end up in a situation where we do collect TTFB? This is where RUM providers can make a difference. Because:

Building on top of the web-vitals library, we know which framework is going to run that JavaScript;
Or allow site owners to add additional pointers as to which resource should be considered the main TTFB resource.

How RUMvision added TTFB support to SPA tracking

We’ve tested different stacks to learn more about which request should be considered the most important one, for example, which resource actually contains the contents for the upcoming page transition. We tested sites running on NextJS, NuxtJS, Angular, Gatsby and custom PWAs.

No uniform way

And even within NextJS, you can find different scenarios when it comes to soft navigations and possible TTFB candidates:

One prefetched all data up-front;
One just started a request on click;
Another one prefetched on hover, but then did two additional calls to the same URL.

Especially the last one leaves you, well... unsure as to which one to consider the most important request.

And while the initiatorType within NextJS will always be fetch, I also ran into link (when prefetched) and xmlhttprequest within other stacks.

There just doesn’t seem to be a uniform way of telling which resource to use for TTFB calculations. Once again confirming the web-vitals concerns and answering why they chose to report 0ms as TTFB.

Configuring per platform

Site owners could already share their tech stack with RUMvision, for example to import Server-Timing metrics and dimensions that are often exposed by the CDN, host, stack or plugins that a site is using.

NextJS

Knowing the stack means we could use this information to automate parts of the TTFB analysis as well. We just learned that the exact moment resources are loaded can be inconsistent, within NextJS and in SPAs in general.
But the pathname is often predictable. A NextJS example is as follows:

/_next/data/url-of-actual-page.json

As a result, NextJS is where our experimentation began as the setup proved to be the easiest. This phase was successful, so we moved on to other stacks.
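Because the pathname is predictable, a resource can be matched to the visited page by its URL. A minimal sketch with a hypothetical helper name. It assumes NextJS data URLs of the form /_next/data/<buildId>/<page>.json (the build ID segment is part of real NextJS data URLs, which the simplified example above leaves out) and that the root page maps to index.json.

```javascript
// Hypothetical helper: is resourceUrl the NextJS data request for pagePathname?
// Assumes URLs like /_next/data/<buildId>/<page>.json and skips the build ID segment.
function isNextDataRequestFor(resourceUrl, pagePathname) {
  const { pathname } = new URL(resourceUrl, 'https://example.com');
  const prefix = '/_next/data/';
  if (!pathname.startsWith(prefix)) return false;
  const rest = pathname.slice(prefix.length); // "<buildId>/<page>.json"
  const slash = rest.indexOf('/');
  if (slash === -1) return false;
  const pagePart = rest.slice(slash); // "/<page>.json"
  // The root page is assumed to be served as /index.json; drop a trailing slash otherwise.
  const page = pagePathname === '/' ? '/index' : pagePathname.replace(/\/$/, '');
  return pagePart === `${page}.json`;
}
```

Comparing the full page part rather than only the suffix avoids matching /men/shoe.json when the visited page is /shoe.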

Patterns for other platforms

Based on additional research on other platforms, we decided to introduce endpoint patterns per template type, because we soon discovered that category pages had different endpoints than product pages.

An example we saw on a NuxtJS website is as follows:

/api/categories/{3}?lang={0}&slug={2}

We already supported regular expressions as part of identifying and grouping data per page template. We extended that feature and allowed site owners to also provide a pattern for API endpoints for pages falling in that page template group.

With the pathname of a visited page in mind, our code will then translate it back into an exact API endpoint. A full transformation example can be found in our docs.
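The RUMvision docs describe the actual transformation. Purely as an illustration, here is one way such a pattern could be expanded, assuming (hypothetically) that {n} refers to the n-th non-empty segment of the visited page’s pathname. The function name is also hypothetical.

```javascript
// Hypothetical expansion of an endpoint pattern such as
// "/api/categories/{3}?lang={0}&slug={2}": each {n} is replaced by the n-th
// non-empty segment of the visited page's pathname (an assumption for this sketch).
// Placeholders without a matching segment are left untouched.
function expandEndpointPattern(pattern, pagePathname) {
  const segments = pagePathname.split('/').filter(Boolean);
  return pattern.replace(/\{(\d+)\}/g, (match, index) => {
    const segment = segments[Number(index)];
    return segment === undefined ? match : segment;
  });
}

console.log(expandEndpointPattern('/api/categories/{3}?lang={0}&slug={2}', '/en/c/shoes/42'));
// prints "/api/categories/42?lang=en&slug=shoes"
```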

The exact API endpoint (or a substring of it) is then used to filter the list of resources and return the correct fetch request(s). That request is then used for TTFB purposes.

Analyzing new fetch requests

The more technical explanation is that we observe and save all upcoming resources in the following way:

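const fetchResources = [];
// Collect every resource timing entry (including entries buffered before the
// observer was created) so they can be matched against a soft navigation later.
const resourceObserver = new PerformanceObserver(function(list) {
  list.getEntries().forEach(function(e) {
    fetchResources.push(e);
  });
});
resourceObserver.observe({
  type: 'resource',
  buffered: true
});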

But such a list could easily grow into 50 or even 200 resources for a single user interaction/soft navigation. That’s why we ask site owners to either specify their API endpoint or initiatorType of the API resources (or both) to prevent our script from running into max buffer size issues.
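That filtering step could look roughly like this. The option names stand in for what a site owner would configure and are hypothetical, as is the cap of 50 entries.

```javascript
// Keep only resource entries that could be the "main" request of a soft navigation,
// based on site-owner configuration (option names are hypothetical).
const MAX_CANDIDATES = 50; // illustrative cap to keep the array small

function makeResourceFilter({ initiatorTypes = [], endpointSubstring = '' } = {}) {
  return function isCandidate(entry) {
    const typeOk = initiatorTypes.length === 0 || initiatorTypes.includes(entry.initiatorType);
    const urlOk = endpointSubstring === '' || entry.name.includes(endpointSubstring);
    return typeOk && urlOk;
  };
}

function addCandidate(candidates, entry, isCandidate) {
  if (!isCandidate(entry)) return;
  candidates.push(entry);
  // Drop the oldest entries once the cap is exceeded.
  if (candidates.length > MAX_CANDIDATES) {
    candidates.splice(0, candidates.length - MAX_CANDIDATES);
  }
}
```

In the observer callback shown earlier, each entry would then go through addCandidate(fetchResources, e, isCandidate) instead of being pushed unconditionally.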

But even with that filtering, and with periodically truncating the array, we could still be left with more than one resource in the fetchResources array.
An example: when navigating to a product listing page, a framework could already eagerly fetch data of all products listed on that page. It’s then hard to tell which request was related to the next user interaction, if we don’t have additional patterns to work with.

Which is the reason why we introduced endpoint patterns that were described earlier.

Waiting for the LCP

But when should the TTFB be reported? The web-vitals library won’t wait for any specific resource when dispatching the 0ms TTFB. And even when the (empty) TTFB is reported by the web-vitals library, we have no guarantee that the resource we are expecting has already been downloaded.

So we intercept and delay the reporting of the TTFB until the LCP is reported.

As both web-vitals and soft navigations are Chromium-only APIs anyway, we thought this solution was a safe bet (for now).

Because once the LCP is known, we have the timing information of the LCP (such as startTime and the actual file). And if there was in fact a dependency on a fetch request, we can assume that it should be finished by the time the LCP is reported.
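The delayed reporting can be sketched as a small gate that holds the TTFB metric of a navigation until the LCP with the same navigationId has been seen. The class name and the finalize callback are hypothetical.

```javascript
// Holds TTFB metrics until the LCP of the same navigation is known, then hands
// both to a caller-supplied finalize function (names are hypothetical).
class TtfbGate {
  constructor(finalize) {
    this.finalize = finalize;     // (ttfbMetric, lcpMetric) => void
    this.pendingTtfb = new Map(); // navigationId -> TTFB metric
    this.lcpSeen = new Map();     // navigationId -> LCP metric
  }

  onTTFB(metric) {
    const lcp = this.lcpSeen.get(metric.navigationId);
    if (lcp) this.finalize(metric, lcp);
    else this.pendingTtfb.set(metric.navigationId, metric);
  }

  onLCP(metric) {
    this.lcpSeen.set(metric.navigationId, metric);
    const ttfb = this.pendingTtfb.get(metric.navigationId);
    if (ttfb) {
      this.pendingTtfb.delete(metric.navigationId);
      this.finalize(ttfb, metric);
    }
  }
}
```

It would be wired up with something like webVitals.onTTFB((m) => gate.onTTFB(m), {reportSoftNavs: true}) and the equivalent call for onLCP.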

Reporting the TTFB

Once we know the LCP, we can finalize the TTFB as well. That is the moment we will transform the current URL into an API endpoint pattern, loop through the remaining fetchResources and retrieve one or multiple entries.

With those entries, we check which of them finished downloading before the LCP resource started loading. We do this by comparing entry.responseEnd with lcp.attribution.lcpResourceEntry.startTime (or with the LCP value if attribution wasn’t enabled).

That could still result in multiple TTFB candidates. We decided to pick the last one matching the above clause.
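The selection rule described above can be sketched as follows: keep entries whose responseEnd lies before the LCP resource’s startTime (or before the LCP value without attribution, as described above), then take the last one in observed order. The function name is hypothetical, and it assumes all times are on the same timeline.

```javascript
// Pick the TTFB candidate: among matching resource entries, keep those that finished
// downloading before the LCP resource started loading, and take the last one.
function pickTtfbEntry(entries, lcpMetric) {
  const lcpResource = lcpMetric.attribution && lcpMetric.attribution.lcpResourceEntry;
  // Without attribution (or without an LCP resource) fall back to the LCP value.
  const cutoff = lcpResource ? lcpResource.startTime : lcpMetric.value;
  const finishedBefore = entries.filter((e) => e.responseEnd < cutoff);
  return finishedBefore.length > 0 ? finishedBefore[finishedBefore.length - 1] : null;
}
```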

Having a single TTFB candidate at this moment, we supplement it with additional information before reporting it to RUM.

For example:

If the TTFB’s entry.responseEnd happened before the actual soft navigation startTime, it was fully prefetched;
If not, but if the entry.startTime happened before the soft navigation startTime, there was an attempt to prefetch it, but wasn’t done downloading (for whatever reason, which will be collected via other dimensions, such as bad internet connectivity);
If it doesn’t meet the above scenarios, we consider it a request in a normal flow, but add an index number (to cover the case described earlier where multiple entries were dispatched; this helps determine whether other requests sat in between or were downloaded simultaneously, giving developers pointers as to where to start debugging).
Additionally, we will share if the LCP might have been depending on the TTFB entry, or (when entry.responseEnd is smaller than lcp.attribution.lcpResourceEntry.startTime) not.
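The first three labels in the list above could be derived roughly like this. The label strings and the function name are hypothetical; softNavStart stands for the startTime of the soft navigation.

```javascript
// Classify the chosen TTFB entry relative to the soft navigation start time,
// following the list above (label strings are hypothetical).
function classifyTtfbEntry(entry, softNavStart, matchingEntries) {
  // Fully downloaded before the navigation started.
  if (entry.responseEnd < softNavStart) return { kind: 'prefetched' };
  // Started before the navigation, but still downloading when it started.
  if (entry.startTime < softNavStart) return { kind: 'prefetch-attempted' };
  // Normal flow: record the entry's position among the matching requests.
  return { kind: 'normal', index: matchingEntries.indexOf(entry) };
}
```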

A screenshot from RUMvision, showing the server response time (so, not full TTFB) for different resource priorities

In the case of the screenshot, files were clearly prefetched up front. And given the prefetch (finished) state, it works. But chances are that simultaneously and eagerly prefetching so many files up front could mean that not only prefetched files, but maybe also files within the critical path are impacted.

TTFB and LCP sub-parts

We now also have the info to change sub-parts of the TTFB and LCP metrics. For example, the LCP that happened after a soft-navigation will likely have a resourceLoadDelay.
But unlike hard-navigations, web-vitals can’t attribute any delay to the TTFB of a main document.

By this time, we do, so we alter the resourceLoadDelay (or elementRenderDelay when the LCP is not an image) and set the timeToFirstByte sub-part (which, as explained, would otherwise always be 0, as the default reported TTFB is also 0ms).
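A sketch of that adjustment, using the attribution field names mentioned above. The function, the use of lcpResourceEntry to detect an image LCP, and clamping at zero are assumptions of this sketch.

```javascript
// Move the soft navigation TTFB out of the LCP delay sub-part into timeToFirstByte.
// Field names follow the LCP attribution sub-parts named above; the logic is a sketch.
function applySoftNavTtfb(attribution, ttfbMs) {
  const adjusted = { ...attribution, timeToFirstByte: ttfbMs };
  // Image LCPs carry the delay in resourceLoadDelay; text LCPs in elementRenderDelay.
  const key = attribution.lcpResourceEntry ? 'resourceLoadDelay' : 'elementRenderDelay';
  // Clamp at zero so a sub-part never becomes negative.
  adjusted[key] = Math.max(0, (attribution[key] || 0) - ttfbMs);
  return adjusted;
}
```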

When it comes to the TTFB itself, we also calculate attribution timing to mimic the web-vitals way of reporting such metric data. But we alter it in two ways:

waitingTime

We calculate the difference between the TTFB’s entry.startTime and the actual startTime of when the soft navigation happened. This gets reported as the TTFB’s waitingTime (which already is around for hard-navigation TTFB).

Because if there’s a delay, site owners would want to know. But if that delay is below an even higher INP, it would otherwise not be reported in a consistent way, making it harder to be aware of bottlenecks.

resourceLoadTime

resourceLoadTime originally is an LCP sub-part and isn’t around for TTFB entries.

But TTFB represents the time that the first bytes of the original request are returned. And not when it was fully done downloading aka responseEnd. However, that is actually what might be important if your SPA needs its full contents to be able to act on it and render images and text.

Without this information, site owners would still have the other TTFB sub-parts, but could be blind to issues with downloading the contents. Such issues could creep in over time, as the site’s traffic grows, responses get larger, the underlying architecture slows down or audience conditions change.
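Both extra sub-parts reduce to differences between timestamps. A minimal sketch; the function name is hypothetical, and it assumes resourceLoadTime is measured from responseStart (first byte) to responseEnd (last byte).

```javascript
// Compute the two extra TTFB sub-parts from a resource timing entry.
// softNavStart: startTime of the soft navigation; all times in ms on the same timeline.
function ttfbExtraParts(entry, softNavStart) {
  return {
    // Time between the soft navigation starting and the request being issued
    // (negative when the request was started before the navigation, e.g. prefetched).
    waitingTime: entry.startTime - softNavStart,
    // Time from the first byte to the last byte of the response.
    resourceLoadTime: entry.responseEnd - entry.responseStart,
  };
}
```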

Non-soft navigation SPA tracking

This article is already longer than expected, so I will keep this short. We did need additional code to track SPA navigations that don’t meet the soft navigation API heuristics.

Our docs elaborate on this as well, but we will then fall back to using pushState or replaceState (to be configured by site owners). And we will set the reportAllChanges flag when using the web-vitals library, to then apply additional triage to beacon metric information at the right moment.
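Falling back to history changes usually means wrapping pushState and replaceState. A minimal sketch; the function name and callback shape are hypothetical, and the history object is passed in so the example also runs outside a browser (in a page it would be window.history).

```javascript
// Wrap history methods so route changes can be detected when the soft navigation
// heuristics do not fire. Which methods to wrap is configurable, mirroring the
// site-owner setting mentioned above (parameter shape is hypothetical).
function watchRouteChanges(historyObj, onRouteChange, methods = ['pushState', 'replaceState']) {
  for (const method of methods) {
    const original = historyObj[method];
    historyObj[method] = function (...args) {
      const result = original.apply(this, args);
      // args[2] is the optional new URL passed to pushState/replaceState.
      onRouteChange({ method, url: args[2] });
      return result;
    };
  }
}
```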

Conclusion

Within the sites where this is running already, we’ve seen very positive and consistent results. And while there are many RUM providers out there that will collect data of all resources, we’ve been able to pinpoint it a bit more, relate resources to the correct URL and LCP and shape it into web vitals heuristics.

Still experimental

Despite its consistency, I would still call it experimental, just like the soft navigation API and the soft-navs branch. But putting it out there already might help other RUM providers, and might help us come up with improved heuristics on our end too.

Measuring soft navigation Core Web Vitals means it likely will not mirror CrUX data

The goal of the web-vitals library is different though:

The web-vitals library is a tiny (~1.5K, brotli’d), modular library for measuring all the Web Vitals metrics on real users, in a way that accurately matches how they’re measured by Chrome and reported to other Google tools

This is one reason the soft-navs branch is just that, a branch, and has not been merged into the main branch yet. We don’t know how (or even if) soft navigation Core Web Vitals will be reflected in CrUX.

RUM data can already differ from CrUX data. Tracking SPA navigations could widen this gap, showing either more positive numbers or not.

In either case, SPA owners at least are able to measure this with the web-vitals soft-navs branch. Or benefit from the work that RUM providers do on top of this (in our case, TTFB included).

Although such data might not reflect your Core Web Vitals assessment nor its SEO value, you still want to know about things impacting your UX and revenue
