Web Performance Calendar

The speed geek's favorite time of year
2025 Edition
ABOUT THE AUTHOR

Ethan Gardner (@ethangardner.com) is a full-stack engineer with expertise in front-end development, focused on creating high-performance web applications, identifying process efficiencies, and elevating development teams through mentorship.

When I explain the difference between lab (aka synthetic) and field data to people, one of the things I mention is that the lab allows for testing under repeatable, controlled conditions. Each test run offers an apples-to-apples comparison with previous tests, and results are available almost immediately. On the other hand, field data measures actual user experience, but it requires time to collect enough sample data for the results to be relevant.

One thing I have always been curious about is how shifts in synthetic data impact field metrics. For example, if I reduce a lab metric, is there any way to anticipate how that might change the field metric without having to wait for data collection? As I was taking Frank Kane’s Machine Learning, Data Science and Generative AI with Python course, I learned about XGBoost, a library that excels at prediction tasks. As I explored its capabilities, I wondered if it could be used to model the relationship between lab and field data in a meaningful way.

What is XGBoost?

As the name suggests, XGBoost is based on a technique called gradient boosting. In gradient boosting, models are trained sequentially, and each new model learns from the ensemble’s previous errors by fitting to the residuals of the predictions made so far. With enough training data and boosting rounds, XGBoost can make increasingly accurate predictions for a specific target variable.
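
To make the idea concrete, here is a minimal, illustrative sketch of gradient boosting for regression with squared-error loss, using shallow scikit-learn trees as the weak learners. The data is synthetic and the hyperparameters are arbitrary; this is not part of the experiment, just the core loop.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine wave (purely illustrative).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=500)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant baseline
trees = []

for _ in range(100):
  residuals = y - prediction                     # errors of the current ensemble
  tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
  prediction += learning_rate * tree.predict(X)  # nudge predictions toward the residuals
  trees.append(tree)

print(f"Mean absolute error after boosting: {np.mean(np.abs(y - prediction)):.3f}")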

XGBoost is also computationally efficient, which means I can run it on my local machine without having to spin up a cloud environment with a GPU attached to it.

Getting data

My goal for this experiment was to see if I could predict the Largest Contentful Paint value at p75 based on a URL’s synthetic data. To do this, I needed a list of pages that have both field and lab data to train the model. For the lab data, I used the HTTP Archive’s summary and payload tables, and the Chrome User Experience Report (CrUX) API was the source of the page-level p75 LCP field data. I saved the results to CSV files and joined them by page using a pandas DataFrame.
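
For reference, this is roughly how the page-level p75 LCP can be pulled from the CrUX API for each URL. Treat it as a sketch: the CRUX_API_KEY environment variable and the example URL are placeholders, and error handling and rate limiting are omitted.

import os
import requests

CRUX_ENDPOINT = "https://chromeuxreport.googleapis.com/v1/records:queryRecord"
API_KEY = os.environ["CRUX_API_KEY"]  # placeholder: supply your own key

def fetch_p75_lcp(page_url):
  """Returns the p75 LCP (in ms) for a single page from the CrUX API."""
  resp = requests.post(
    f"{CRUX_ENDPOINT}?key={API_KEY}",
    json={"url": page_url, "formFactor": "PHONE"},
  )
  resp.raise_for_status()
  metrics = resp.json()["record"]["metrics"]
  return metrics["largest_contentful_paint"]["percentiles"]["p75"]

print(fetch_p75_lcp("https://www.example.com/"))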

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

synthetic = pd.read_csv("./data/pages-10k-with-summary-mobile-20251101.csv")
crux = pd.read_csv("./data/crux-api-results.csv")

merged_df = pd.merge(synthetic, crux, on="page")
print('Rows:', len(merged_df))
print('Columns:', len(merged_df.columns))
    Rows: 6800
    Columns: 95

Reducing noise

The resulting dataset contains 95 columns, but not all of them impact LCP. To reduce noise, I pared down the columns to include only the most relevant fields. I also noticed that there were rows that did not have lcp_p75 values, so those rows needed to be dropped from our dataset.

Once the data was pared down to the relevant columns, I could isolate the prediction column and run it through the train_test_split function. In this case, I used 80% of the data to train the model and reserved the remaining 20% for testing.

I created a couple of helper functions since I planned to call them multiple times during the training and evaluation process. XGBoost’s native training API uses a special data type called DMatrix, which I also set up in the prepare_model_inputs function.

PREDICT_COL = "lcp_p75"
features = [
  "SpeedIndex",
  "maxDomainReqs",
  "numDomains",
  "bytesJS",
  "bytesCss",
  "reqJS",
  "reqImg",
  "bytesImg",
  "bytesJpg",
  "bytesPng",
  "bytesWebp",
  "numDomElements",
  "renderStart",
  "fullyLoaded",
  "TTFB",
  "reqHtml",
  "bytesHtml",
  "visualComplete",
  "onLoad",
  "onContentLoaded",
  "gzipSavings",
  "numRedirects",
  "fcp",
  "lcp",
  PREDICT_COL,
]
def prepare_model_inputs(df, target_col, model_features, test_size=0.20, random_state=42):
  """
  Cleans the data and splits it into training and testing datasets.
  """
  df_clean = df.dropna(subset=[target_col]).copy()

  x_features = [col for col in model_features if col != target_col]
  X = df_clean[x_features]
  y = df_clean[target_col]

  X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state
  )

  dtrain = xgb.DMatrix(X_train, label=y_train)
  dtest = xgb.DMatrix(X_test, label=y_test)

  return {
    "X_train": X_train,
    "X_test": X_test,
    "y_train": y_train,
    "y_test": y_test,
    "dtrain": dtrain,
    "dtest": dtest
  }


def train_and_evaluate(dtrain, dtest, y_test, xgb_params, num_boost_round, verbose_eval):
  """
  Trains the XGBoost model and evaluates its performance.
  """
  model = xgb.train(
    params=xgb_params,
    dtrain=dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtrain, "train"), (dtest, "test")],
    verbose_eval=verbose_eval,
  )

  y_pred = model.predict(dtest)
  mae = mean_absolute_error(y_test, y_pred)
  r2 = r2_score(y_test, y_pred)

  return {
    "model": model,
    "y_pred": y_pred,
    "mae": mae,
    "r2": r2
  }

Now that our functions are defined, we can pass in parameters and begin to train the model.

params = {
  "objective": "reg:squarederror",
  "max_depth": 6,
  "learning_rate": 0.005,
  "subsample": 0.5,
  "colsample_bytree": 0.8,
}

inputs = prepare_model_inputs(merged_df, PREDICT_COL, features)
results = train_and_evaluate(inputs["dtrain"], inputs["dtest"], inputs["y_test"], params, 5000, False)

Examining accuracy

At this point, we have no idea about the quality of the predictions. There are two statistical measures that can help us gauge the accuracy: mean absolute error (MAE) and the R-squared score.

Mean absolute error indicates the average absolute difference between predictions and actual values.

The R-squared score tells us how much of the statistical variance can be explained by the model. A maximum value of 1 (or 100%) signifies perfect predictions, a 0 means the model does no better than always predicting the mean of the target, and negative values mean it does worse than that baseline. A “good” R-squared value depends on what you’re trying to predict.
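
As a quick illustration of what these two measures capture, both can be computed by hand and checked against scikit-learn. The numbers below are made up; they are not from the dataset.

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([2400, 3100, 1800, 2900], dtype=float)  # hypothetical actual LCP p75 values (ms)
y_pred = np.array([2600, 2800, 2000, 3300], dtype=float)  # hypothetical predictions (ms)

mae = np.mean(np.abs(y_true - y_pred))          # average absolute miss
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot

assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(f"MAE: {mae:.2f}, R-squared: {r2:.2f}")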

Let’s see how we did:

print(f"Mean Absolute Error: {results['mae']:.2f}")
print(f"R-squared: {results['r2']:.2f}")
    Mean Absolute Error: 796.84
    R-squared: -0.24

Analysis

The initial results were not good at all. The mean absolute error indicates predictions were off by an average of nearly 800ms, roughly a 32% margin of error relative to the 2500ms “good” LCP threshold.

The larger concern is the negative R-squared score, which means the model performed worse than simply predicting the mean; in other words, its features aren’t a good predictor of LCP at p75. The options at this point are to tweak the training parameters or to add new features to the training data. Since the R-squared value was negative, I started by adding more information.

I was curious about the difference between the CrUX LCP metric and the HTTP Archive’s LCP metric. In other words, I was interested in seeing whether the device and connection profile used for the HTTP Archive crawl is representative of the experience of a real user at the 75th percentile.

merged_df["lab_vs_field_lcp_diff"] = merged_df["lcp_p75"] - merged_df["lcp"]
print(f"Lab vs. Field Average Difference: {merged_df['lab_vs_field_lcp_diff'].mean():.2f}")
print(f"Lab vs. Field Median Difference: {merged_df['lab_vs_field_lcp_diff'].median():.2f}")
print(f"Lab vs. Field Standard Deviation: {merged_df['lab_vs_field_lcp_diff'].std():.2f}")
    Lab vs. Field Average Difference: -3624.82
    Lab vs. Field Median Difference: -1827.00
    Lab vs. Field Standard Deviation: 7110.11

Comparing lab and field data showed me that there was significant variance. On average, the lab LCP was over 3.5 seconds higher than the real user LCP at the 75th percentile. Initially, I was tempted to simply feed this calculated difference into the model, but unfortunately, that information wouldn’t be known ahead of time if we were going to deploy the model in the real world.

Instead, we can use historical origin data from CrUX and add this to our dataset. I’m adding some raw metrics directly from the query, but I’m also engineering some features. These engineered features combine two or more columns to help XGBoost pick up relationships, e.g. df_with_origin["lcp_p75_origin_diff"] = df_with_origin["p75_lcp_origin"] - df_with_origin["lcp"]. The hope is that the new features might help correct the negative R-squared value from before.

origin_details = pd.read_csv("./data/origin-summary-20251101.csv")

origin_details['page'] = origin_details['origin'] + '/'
df_with_origin = pd.merge(merged_df, origin_details, on="page")
df_with_origin["lcp_p75_origin_diff"] = df_with_origin["p75_lcp_origin"] - df_with_origin["lcp"]
df_with_origin["fcp_p75_origin_diff"] = df_with_origin["p75_fp"] - df_with_origin["fcp"]
df_with_origin["img_bytes_ratio"] = df_with_origin["bytesImg"] / df_with_origin["bytesTotal"]
df_with_origin["js_bytes_ratio"] = df_with_origin["bytesJS"] / df_with_origin["bytesTotal"]
df_with_origin["css_bytes_ratio"] = df_with_origin["bytesCss"] / df_with_origin["bytesTotal"]
df_with_origin["ttfb_lcp_ratio"] = df_with_origin["TTFB"] / df_with_origin["lcp"]
df_with_origin["lcp_diff"] = df_with_origin["lcp_p75"] - df_with_origin["lcp"]
df_with_origin["avg_resource_size"] = df_with_origin["bytesTotal"] / df_with_origin["reqTotal"]

The extra features below will increase the column count considerably and give our model more information to work with.

features.extend([
  "fast_ttfb",
  "avg_ttfb",
  "slow_ttfb",
  "fast_fcp",
  "avg_fcp",
  "slow_fcp",
  "fast_lcp",
  "avg_lcp",
  "slow_lcp",
  "p75_lcp_origin",
  "_4GDensity",
  "_3GDensity",
  "_2GDensity",
  "slow2GDensity",
  "low_rtt",
  "medium_rtt",
  "high_rtt",
  "p75_rtt",
  "lcp_p75_origin_diff",
  "fcp_p75_origin_diff",
  "_cpuCommitLoad",
  "_cpuEventDispatch",
  "_cpuFunctionCall",
  "_cpuHTMLDocumentParserFetchQueuedPreloads",
  "_cpuIdle",
  "_cpuLayerize",
  "_cpuLayout",
  "_cpuMarkDOMContent",
  "_cpuMarkLoad",
  "_cpuPaint",
  "_cpuParseHTML",
  "_cpuPrePaint",
  "_cpuUpdateLayoutTree",
  "_cpuTimes",
  "_cpuTimesDoc",
  "_lighthousePerformance",
  "_lighthouseTBT",
  "_renderBlockingCSS",
  "_renderBlockingJS",
  "img_bytes_ratio",
  "js_bytes_ratio",
  "css_bytes_ratio",
  "ttfb_lcp_ratio",
  "avg_resource_size"
])
data = df_with_origin[features]

print('Rows:', len(data))
print('Columns:', len(data.columns))
    Rows: 6797
    Columns: 69
inputs = prepare_model_inputs(data, PREDICT_COL, features)
results = train_and_evaluate(inputs["dtrain"], inputs["dtest"], inputs["y_test"], params, 5000, False)

print(f"Mean Absolute Error: {results['mae']:.2f}")
print(f"R-squared: {results['r2']:.2f}")
    Mean Absolute Error: 499.36
    R-squared: 0.66

Improving model accuracy with hyperparameter tuning

Adding more features to the training data improved the model’s accuracy and corrected the negative R-squared value from before. Going from -0.24 to an R-squared of 0.66 is a drastic improvement, and the new features also reduced the mean absolute error by 37%.

At this point, I’m pretty happy with the results. However, when I initially set the parameters for the model, I took a wild guess, and I wanted to see if there was a more optimal combination of parameters that might enhance the model’s accuracy.

This can be done programmatically in a number of ways. I used RandomizedSearchCV from scikit-learn because it offers a balance between speed and quality. This process explores different combinations of parameters and selects the set that yields the best performance according to a specified scoring metric (in our case, mean absolute error).

This will still run locally, but it will take a while to churn through the sampled parameter combinations, since each candidate set is cross-validated as the search looks for the best-performing combination.

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
from scipy.stats import uniform, randint
xgb_model = XGBRegressor(objective='reg:squarederror', random_state=42)

# These ranges are chosen to explore different model complexities and learning behaviors.
param_distributions = {
  'n_estimators': randint(500, 5000),
  'learning_rate': uniform(loc=0.005, scale=0.045),
  'max_depth': randint(3, 8),
  'subsample': uniform(loc=0.5, scale=0.2),
  'colsample_bytree': uniform(loc=0.7, scale=0.2)
}

random_search = RandomizedSearchCV(
  estimator=xgb_model,
  param_distributions=param_distributions,
  n_iter=25,
  scoring='neg_mean_absolute_error',
  cv=3,
  verbose=500,
  random_state=42,
  n_jobs=-1
)

print("Starting RandomizedSearchCV... This may take a while.")
random_search.fit(inputs["X_train"], inputs["y_train"])
print("RandomizedSearchCV finished.")
print(f"Best parameters found: {random_search.best_params_}")
print(f"Best Mean Absolute Error found during cross-validation: {-random_search.best_score_:.2f}")
    Best parameters found: {'colsample_bytree': np.float64(0.8689067697356303), 'learning_rate': np.float64(0.03862940495618213), 'max_depth': 7, 'n_estimators': 4493, 'subsample': np.float64(0.6930510614528276)}
    Best Mean Absolute Error found during cross-validation: 543.77
best_params = random_search.best_params_

results = train_and_evaluate(inputs["dtrain"], inputs["dtest"], inputs["y_test"], best_params,
                             best_params["n_estimators"], False)
print(f"Mean Absolute Error: {results['mae']:.2f}")
print(f"R-squared: {results['r2']:.2f}")
    Mean Absolute Error: 494.27
    R-squared: 0.72

Analysis of tuned model performance

After performing hyperparameter tuning with RandomizedSearchCV, we have new performance metrics for our XGBoost model that we can compare to the previous ones. There was a slight improvement in accuracy, as shown by the reduction in mean absolute error, but it’s not dramatic. Our R-squared value is now 0.72, which indicates that the features we added allow the model to explain a substantial share of the variance in LCP at p75.

Since the improvement from parameter tuning was marginal, further gains will likely require different strategies.

More extensive feature engineering

As mentioned previously, I engineered some features earlier, but all of them were simple differences and ratios of existing columns. To improve the model, I could add more engineered features, such as logarithmic transforms, which can reduce the impact of statistical outliers in the data; a sketch follows below.
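
A sketch of what that might look like, reusing the column names from the dataset above; the log_ prefix is my own naming, np.log1p is used so zero values are handled safely, and the new columns would still need to be appended to the features list.

import numpy as np

# Hypothetical log-transformed versions of a few heavy-tailed byte and timing columns.
for col in ["bytesTotal", "bytesImg", "bytesJS", "fullyLoaded"]:
  df_with_origin[f"log_{col}"] = np.log1p(df_with_origin[col])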

Different model architectures

While XGBoost is powerful, other models might be better suited to this specific problem. Other gradient boosting libraries, such as LightGBM and CatBoost, might give better results, or we could use a neural network.
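
Swapping in an alternative is cheap to try. Here is a rough sketch using LightGBM’s scikit-learn-style regressor on the same train/test split, assuming the lightgbm package is installed; the hyperparameters are arbitrary starting points rather than tuned values.

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error, r2_score

lgbm = LGBMRegressor(n_estimators=2000, learning_rate=0.01, random_state=42)
lgbm.fit(inputs["X_train"], inputs["y_train"])

y_pred = lgbm.predict(inputs["X_test"])
print(f"MAE: {mean_absolute_error(inputs['y_test'], y_pred):.2f}")
print(f"R-squared: {r2_score(inputs['y_test'], y_pred):.2f}")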

Collecting more data

Sometimes, the limitation is the data itself. For this experiment, I limited the dataset size to minimize runtime and stay within BigQuery’s free tier limits. If we added more data, we could potentially get better results.

Summary

In the end, we were able to create a reasonable predictor of LCP at p75 from a synthetic dataset. It doesn’t mean that the model is perfect, but it could help anticipate what a shift in the synthetic metrics might do to the real user experience without having to wait for field data to be collected.
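
As a hypothetical example of how the model could be used for that kind of what-if question, the sketch below takes one page from the test set, halves its image bytes (a simplification that ignores the knock-on effects on other columns), and compares the predicted field LCP p75 before and after.

# Uses the trained Booster and the test split from earlier in the article.
sample = inputs["X_test"].iloc[[0]].copy()
baseline = results["model"].predict(xgb.DMatrix(sample))[0]

optimized = sample.copy()
optimized["bytesImg"] *= 0.5
optimized["img_bytes_ratio"] *= 0.5  # approximate adjustment to keep the engineered ratio consistent
projected = results["model"].predict(xgb.DMatrix(optimized))[0]

print(f"Predicted LCP p75 before: {baseline:.0f} ms, after: {projected:.0f} ms")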