Web Performance Calendar

The speed geek's favorite time of year
2023 Edition
ABOUT THE AUTHOR

Ethan Gardner (ethangardner) is a full-stack engineer with expertise in front-end development, focused on creating high-performance web applications, identifying process efficiencies, and elevating development teams through mentorship. He's currently a Principal Engineer working in the media industry where he provides guidance and vision for new and evolutionary engineering ideas.

Earlier this year, I was exploring data on the sites I work on and found that a 500ms reduction in a session’s median First Contentful Paint (FCP) results in a 20% increase in user engagement. By definition, conversion events, such as purchases, only occur in engaged sessions. Given my employer’s desire to boost conversions, improving FCP became a point of focus, since a faster FCP raises the probability of an engaged session.

There are plenty of front-end optimizations that can improve FCP, but today I want to talk about a different piece of the puzzle and the shortcoming I discovered in our HTTP caching strategy while I was poking around.

A page cache primarily improves Time to First Byte (TTFB), and because FCP depends on TTFB, gains there flow through to downstream metrics. The specific page cache strategy I’m going to cover is pre-caching, also called cache priming or cache warming, which involves preemptively putting items into the cache before they are requested by a user. This can be a great way to reduce expensive computations and strain on the origin server, but once a site gets to a certain scale, it requires more thought than simply keeping everything in a warmed cache.

Some background

I primarily work on read-heavy, magazine-style sites. For the sake of this discussion, it’s helpful to know the following about each site:

  • Contains up to 1M URLs in the sitemap
  • Is proxied with Cloudflare
  • Is hosted on AWS

The URL count includes the auto-generated content taxonomy items from the CMS, such as author pages, categories, and tags. However, plenty of URLs are editorial content, such as articles, forum posts, and collections that have been curated by editorial staff.

Mo’ URLs, Mo’ Problems

For any site, a sudden influx of traffic can bring the database to a crawl if requests are hitting the origin server, especially if expensive queries and computations are involved. A common proactive solution is to pre-cache pages so that they are sitting ready in cache, and the work on the origin server can be avoided entirely.

However, pre-caching comes with some issues and trade-offs. The easiest strategy is to parse the sitemap and pre-cache its URLs in batches. Tools like Optimus Cache Prime, or W3 Total Cache in the WordPress space, work this way: you point them at a sitemap and they crawl every URL in it. Most tools of this type let you set the batch size and crawl interval, but caching in this manner means you might cache things that aren’t necessary. Also, if the batch size is too large or the interval too short, you run the risk of CPU thrashing, which has the same effect as a sudden influx of traffic that has to hit the origin.
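To make that concrete, here’s a minimal sketch of the sitemap-driven approach in Python, using only the standard library. The sitemap URL is a placeholder, and it assumes a single, non-indexed sitemap; real tools also add politeness features like concurrency limits and retries.

import time
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
BATCH_SIZE = 10          # URLs crawled per batch
INTERVAL_SECONDS = 900   # 15 minutes between batches

def fetch_sitemap_urls(sitemap_url):
    """Collect the <loc> entries from a standard sitemap."""
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in tree.getroot().findall(".//sm:loc", ns)]

def prime(urls):
    """Request each URL so the proxy caches the response."""
    for start in range(0, len(urls), BATCH_SIZE):
        for url in urls[start:start + BATCH_SIZE]:
            urllib.request.urlopen(url).read()
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    prime(fetch_sitemap_urls(SITEMAP_URL))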

The other danger is pre-caching the wrong things. In a site with many URLs, some rarely get visited while others are extremely popular. The home page of the site probably gets more traffic than the auto-generated author profile page for someone who contributed 1 article 30 years ago, so in terms of pre-caching, the home page is much more important. Additionally, a critical step in the conversion funnel might not receive much traffic, but you still want it pre-cached because of its importance.

Another concern for large sites is the crawl rate. When I looked at the settings for my employer’s pre-cache solution, it was still set at the default of 10 URLs every 15 minutes. For a site with 1M URLs, that’s laughably ineffective. At that rate, it would take almost 3 years for the cache priming bot to traverse the entire sitemap. By the time the last URL is pre-cached, the first URLs have long expired.
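For reference, the back-of-the-envelope math behind that estimate:

urls = 1_000_000
urls_per_batch = 10
minutes_per_batch = 15

total_minutes = urls / urls_per_batch * minutes_per_batch  # 1,500,000 minutes
total_days = total_minutes / 60 / 24                       # ~1,042 days
print(f"{total_days / 365:.1f} years")                     # ~2.9 years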

The “pre-cache everything” strategy could be perfect for a smaller site with a few hundred URLs and content that rarely changes, but once a site hits a certain scale, the wheels fall off.

Analytics-based pre-caching

If you use an HTTP cache, there is likely some type of logging for cache hits and misses. I mentioned before that the sites are proxied with Cloudflare, and their analytics API exposes information about traffic, URLs, and cache status. In other words, it’s a perfect source of data for deciding which URLs to pre-cache. If you don’t happen to use Cloudflare, that’s OK: other CDN vendors offer similar APIs, and caching software like Varnish or Redis exposes comparable hit/miss information through its own stats and logs. Since the data comes from an API, it’s possible to run a query on a schedule, keep the resulting data in a persistence layer, and crawl the URLs that you want to pre-cache. Cloudflare’s API happens to be GraphQL, and the query for the first page of results looks like this:

query ($tag: string, $date: Date) {
    viewer {
        zones(filter: { zoneTag: $tag }) {
            httpRequestsAdaptiveGroups(
                orderBy: [count_DESC]
                limit: 250
                filter: { 
                  date: $date,
                  edgeResponseContentTypeName: "html",
                  cacheStatus_in: ["hit", "expired", "miss"]
                }
            ) {
                count
                data: dimensions {
                    cacheStatus
                    clientRequestPath
                    edgeResponseContentTypeName
                }
            }
        }
    }
}

In Cloudflare nomenclature, the zoneTag is the ID of the DNS zone, which you can find in their dashboard. We’re ordering the query results by event count and only returning URLs whose content type is HTML, where the cacheStatus is “hit” (in the cache with a valid TTL), “expired” (in the cache but past its TTL), or “miss” (not in the cache and served from the origin). The date is a dynamic variable for the previous day.

For pre-caching purposes, we only want URLs that were cache-eligible in the first place, which is why the statuses are limited to hit, expired, and miss. Additionally, we limit the requests to HTML because static assets like images, CSS, and JS have a comparatively longer TTL and don’t involve any database operations the way a request for a dynamic HTML page does.
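As a sketch of how this might be wired up on a schedule, the query can be posted to Cloudflare’s GraphQL endpoint (https://api.cloudflare.com/client/v4/graphql) with an API token. The environment variable names and the idea of saving the query to a local file are assumptions for illustration.

import datetime
import json
import os
import pathlib
import urllib.request

CF_GRAPHQL_ENDPOINT = "https://api.cloudflare.com/client/v4/graphql"
QUERY = pathlib.Path("cache_stats.graphql").read_text()  # the query shown above

def fetch_cache_stats():
    """Run the analytics query for yesterday and return the result groups."""
    yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    payload = json.dumps({
        "query": QUERY,
        "variables": {"tag": os.environ["CF_ZONE_TAG"], "date": yesterday},
    }).encode()
    req = urllib.request.Request(
        CF_GRAPHQL_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['CF_API_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    zones = body["data"]["viewer"]["zones"]
    return zones[0]["httpRequestsAdaptiveGroups"] if zones else []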

The resulting data will look like this:

{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "count": 6130,
              "data": {
                "cacheStatus": "miss",
                "clientRequestPath": "/",
                "edgeResponseContentTypeName": "html"
              }
            },
            {
              "count": 5310,
              "data": {
                "cacheStatus": "hit",
                "clientRequestPath": "/",
                "edgeResponseContentTypeName": "html"
              }
            },
            {
              "count": 1610,
              "data": {
                "cacheStatus": "expired",
                "clientRequestPath": "/",
                "edgeResponseContentTypeName": "html"
              }
            },
            {
              "count": 930,
              "data": {
                "cacheStatus": "miss",
                "clientRequestPath": "/sponsoredpost/2023/10/03/foster-positive-client-relationships",
                "edgeResponseContentTypeName": "html"
              }
            },
            {
              "count": 660,
              "data": {
                "cacheStatus": "miss",
                "clientRequestPath": "/howto",
                "edgeResponseContentTypeName": "html"
              }
            }
          ]
        }
      ]
    }
  },
  "errors": null
}

We can easily take this structured data and insert it into a database where the paths can be aggregated and queried. There may be some false positives due to things having an incorrect content-type on the origin, so it would be best to handle that logic before the data is written to the database. When I ran this query, for example, I found instances of JSON endpoints that were mistakenly being served with content-type: text/html; charset=UTF-8 instead of a proper application/json value.
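A minimal sketch of that step, assuming SQLite as the persistence layer and a crude looks_like_html_page() heuristic for weeding out the mislabeled endpoints:

import sqlite3

def looks_like_html_page(path):
    """Crude heuristic for paths that are clearly not HTML pages."""
    return not path.endswith((".json", ".xml", ".txt"))

def store_rows(db_path, day, groups):
    """Flatten the API response and persist one row per (day, path, status)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS cache_stats (
               day TEXT, path TEXT, cache_status TEXT, request_count INTEGER
           )"""
    )
    for group in groups:
        dims = group["data"]  # "data" is the alias for dimensions in the query
        if dims["edgeResponseContentTypeName"] != "html":
            continue
        if not looks_like_html_page(dims["clientRequestPath"]):
            continue  # e.g. a JSON endpoint mistakenly served as text/html
        conn.execute(
            "INSERT INTO cache_stats VALUES (?, ?, ?, ?)",
            (day, dims["clientRequestPath"], dims["cacheStatus"], group["count"]),
        )
    conn.commit()
    conn.close()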

Cache distribution

If you use Cloudflare, the HTTP cache is local to each data center. This means that if a response is cached in the IAD data center and someone hits the ORD data center, the response won’t be in cache there and has to be fetched from the origin.

To handle pre-caching across multiple regions, you can use AWS Simple Notification Service (SNS) and deploy Lambda functions to strategic regions. SNS can distribute cross-region events to Simple Queue Service (SQS) queues or Lambda functions, so the crawl of the pre-cached URLs can be replicated in multiple locations. There was some initial concern about a possible mismatch between the AWS regions and Cloudflare’s data centers, but Cloudflare has a solution for that with their tiered cache. However, you should test and monitor the cache hit rate to be certain you’re getting the desired outcome.
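As a sketch, the regional warmer could be an SNS-triggered Lambda handler along these lines (the message shape with a urls list is an assumption carried through the rest of these examples):

import json
import urllib.request

def handler(event, context):
    """Warm each URL delivered via SNS; deployed per region so the nearest
    Cloudflare data centers get populated."""
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        for url in message["urls"]:
            try:
                # A plain GET through the proxy is enough to populate the edge cache.
                urllib.request.urlopen(url, timeout=10).read()
            except Exception as exc:
                print(f"failed to warm {url}: {exc}")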

The URLs we want to publish to SNS for pre-caching can be selected with a few parameters, such as an aggregate count of the “expired” and “miss” statuses for each URL over the last two weeks, sorted by request count. This accounts for trends in traffic and pares the list of pre-cache-eligible URLs down to a manageable size.
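Continuing the SQLite assumption from earlier, the selection and publish step might look something like this (the topic ARN and domain are placeholders):

import json
import sqlite3
import boto3

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:precache-urls"  # placeholder

def select_candidates(db_path, limit=500):
    """Top URLs by 'expired' + 'miss' volume over the last two weeks."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT path, SUM(request_count) AS total
             FROM cache_stats
            WHERE cache_status IN ('expired', 'miss')
              AND day >= date('now', '-14 days')
            GROUP BY path
            ORDER BY total DESC
            LIMIT ?""",
        (limit,),
    ).fetchall()
    conn.close()
    return [path for path, _ in rows]

def publish(paths):
    """Fan the candidate URLs out to the regional warmers via SNS."""
    boto3.client("sns").publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"urls": [f"https://example.com{p}" for p in paths]}),
    )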

What about overall traffic?

The most popular URLs can be treated as a special case. In the caching world, there is a problem called a cache stampede, also known as a thundering herd or dog piling, where a simultaneous influx of traffic hits a URL that is not in the cache or whose entry has expired. Because those requests arrive before a response has been written to the cache, they all fall through to the origin for the same (potentially expensive) computation.

As a countermeasure to the cache stampede, there are some things you always want to have sitting in the cache. One way to do this is to eagerly replace the cached item before its TTL expires. You can treat these special URLs as a priority queue by adding SQS between SNS and the Lambda functions from the earlier solution.

Instead of sending notifications directly to Lambda, SNS sends them to either a high-priority or a standard-priority SQS queue, so the priority URLs get regenerated ahead of the ones in the analytics-driven dataset described in the previous section. A Lambda function sits behind each queue and is still responsible for processing the messages as before.
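One way to sketch that routing is with an SNS message attribute plus SQS subscription filter policies, so messages flagged as priority land in the high-priority queue (the attribute name and values here are assumptions):

import json
import boto3

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:precache-urls"  # placeholder

def publish_batch(urls, priority="standard"):
    """Tag the message so subscription filter policies can route it."""
    boto3.client("sns").publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"urls": urls}),
        MessageAttributes={
            "priority": {"DataType": "String", "StringValue": priority}
        },
    )

# Each SQS subscription on the topic carries a filter policy such as
# {"priority": ["high"]} or {"priority": ["standard"]}, and the Lambda behind
# the high-priority queue can be given more concurrency so those URLs are
# refreshed first.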

Further improvements

We started with an ineffective pre-cache strategy and now have something much narrower in scope but also much more effective. However, there is a potential shortcoming in the solution so far: it only reacts to past data. In some cases, it would be even better to use that historical data to make predictions about future traffic patterns.

It seems like right now many people are trying to retrofit AI into their applications, and I think training a model to do predictive analytics on traffic patterns makes a lot of sense. Traffic can have seasonality to it, so you may want to be proactive about what you’re pre-caching, and this is where a predictive model can be immensely helpful. For example, the Thanksgiving holiday in the US is the 4th Thursday in November, and the traditional meal is turkey. If you waited for historical data, you might miss the pre-cache window for a super popular recipe post: a lot of people look at the recipe on Thanksgiving Day itself, and the data collection runs one day behind.

If there were a predictive model in place, it would help discern the URLs that have a probability of being popular around a certain date. By storing the information returned from the cache analytics API in our own database, we have the flexibility to pursue this option if we ever want to go down that road.

2 Responses to “An Informed Pre-caching Strategy for Large Sites”

  1. David

    It sounds like modern static / Jamstack would be a useful approach here?

  2. Ethan Gardner

    It depends. As with any decision, there are constraints and tradeoffs. There were a lot of factors that went into planning this like team size, skills, business priorities, potential disruptions, cost, and level of effort. There are dynamic features in the app and user events that cause cache invalidation. For small sites, you can pretty easily generate everything statically, push to the CDN in the build pipeline, and invalidate the cache. For 1M URLs, the build time would be really long if it was a static build. In that case, I would want to do SSR or ESR, and the problem would still exist of how to get the right things in the cache and at the right locations, as well as distributing the database in the ESR scenario.

    In some cases, the platform (e.g. Netlify) tries to handle it for you by keeping as much in cache on the edge as possible, but you’ll still have misses. Using analytics to identify where the misses are happening and selectively put higher priority URLs into the cache can still be a helpful tactic to fill the gaps when you can’t keep everything out there on the CDN.
