Teaching agents how to use the browser and test software - Web Performance enthusiast
AI is kinda everywhere!
Everyone is shipping AI features or Agents, but few of those systems can reason about the complex reality of performance data.
This session is a deep dive story into the journey of building a Performance AI assistant and agentic workflows on top of a fork of Chrome DevTools. We’ll explore practical lessons from DevTools internals on how to communicate performance data to Agents: teaching LLMs how to interpret performance data; which signals matter most; and how to transform telemetry into actionable insights generated by a LLM.
Now, we do have some metrics being established to measure performance when it comes to LLM inference, such as Tokens Per Second and Time To First Token. But this writeup is scoped differently: it’s a peek behind the curtain at how I, as an engineer obsessed with performance, ended up building a performance-specialized AI agent on top of DevTools internals. It’s also about how that journey helped me dive deeper into the different kinds of telemetry DevTools gathers, whilst gaining new insights into how we may use AI tools to solve our problems, and ultimately experimenting with what may be to come for the web as a platform.
This has been my hyperfocus for over 1.5 years now, so buckle up for a wild ride.
The trace file – from noise to signal


It might not look like it, but those two images represent the same data.
After digging deep into DevTools source code I started appreciating the art of data visualization. Behind any compelling graph we have a hidden layer of complexity. Transforming verbose data into visualizations is a hard job, and I believe that the DevTools team is doing a phenomenal job at it.
There’s so much effort involved in transforming raw data into information, and to a certain extent that is the job of everyone in this audience here. As engineers, and especially as performance specialists.
We have in this audience possibly the most apt minds for the task of transforming those visualizations and data into actionable insights, but the average developer struggles to understand and interact with this much information (referring to Perf panel UI). To some, what should be information becomes noise, and extracting insights becomes quite hard.
For those who have been keeping up with the updates since last year, the Chrome team is fully aware of the signal to noise problem, and has been on an absolute roll improving and simplifying how to extract insights out of DevTools, and I’m so happy to see it! The barrier to entry is becoming progressively lower and DevTools is becoming increasingly easier to use thanks to the incredible work being done.
For quite a while now, much of my personal objective has been to facilitate performance work at different companies: setting up pipelines and tooling, but ultimately falling short due to the poor signal-to-noise ratio of the dashboards and external tools used, which creates friction and reduces the potential value of those efforts.
One of my recurring observations has been around the insight vs confusion ratio for developers of different levels of expertise when tackling the data displayed by the profiler. Which led me to create PerfLab as an attempt to bridge that gap. After experimenting with that initial fork and learning lots about the trace file, I ended up venturing into creating my own performance specialized agent. The purpose was to experiment with new forms of interaction with UI and data based on context with the help of AI, whilst also figuring out how to autonomously extract insight and detect problems represented by the data on trace files.
The core goal was to significantly reduce the barrier to entry for developers and teams of any expertise level, and to experiment with the most minimal amount of UI that a ‘DevTool’ could be. Another pattern I was curious to explore was displaying information progressively based on a given task, or generative UI. I’ve noticed that building for agents means less UI is needed upfront, enabling the interface to become as ‘rich’ as it needs to be, according to the problem it needs to solve.
But, similarly to how performance is a hard vertical for developers, performance data is not something LLMs are known to be good at. They lack the analytical capabilities and the deep knowledge of telemetry data necessary to provide insights and actionable points for real apps.
How an AI Agent ‘reads’ traces and why it matters
Maybe you’re like me before I embarked on this journey and think that LLMs are good at extracting any kind of information out of text-based conversation and data.
And though that is partially true, LLMs are not good at extracting meaning and insights out of data without context or previous training. And for that, let’s think of the word “context” as your primary concern when dealing with LLMs.
If we are trying to extract insights out of a trace file, we have to overcome the problem that it contains a vast amount of telemetry data that can be used for all sorts of insights and visualizations (as it is for the Timeline Panel), data which, at this point in time, LLMs aren’t really trained on. But also, previous knowledge aside, handling that amount of data is a complex problem to solve when building agents.

First, you have the length of the JSON you would need to make available to the agent, as trace files contain a large number of events that represent all sorts of things that happened during the recording, from interactions to network requests. Each event carries different attributes that increase the total size of each individual entry to be ‘read’ and analyzed by the LLM, eating into its context window. And remember, our key metric revolves around ‘context’; this is why many AI tools have either visual or textual ways to help you stay on top of this ‘AI Vital’.
Second, there is ‘hidden’ meaning in different attributes of each event, which the underlying LLM would be happy to try and ‘guess’ based on its previous training (and get confidently wrong). Some entries represent groupings of ‘nested’ events, which adds even more complexity to how each individual entry is interpreted, and to the larger meaning of the whole grouping.
As an example we can look at an AnimationFrame event, which has beginning and end events to represent the larger grouping, and within it events that represent each individual segment of an animation frame. Those in turn are used as attribution for an interaction in the Performance panel, or surface within LoAF (Long Animation Frame) entries from the JS API.
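To make this concrete, here is a heavily simplified sketch of what such events can look like once a trace is serialized to JSON. The shapes and values are illustrative, loosely following the general Trace Event format (name/cat/ph/ts/dur/args), not an exact dump of what Chrome emits:

```ts
// Illustrative only: a trimmed-down view of trace events as the agent would 'read' them.
// Field names follow the general Trace Event format; real traces carry many more
// fields and thousands of entries.
type TraceEvent = {
  name: string;   // e.g. 'AnimationFrame', 'EventTiming', 'FunctionCall'
  cat: string;    // category, e.g. 'devtools.timeline'
  ph: string;     // phase: 'X' complete, 'b'/'e' nestable async begin/end, ...
  ts: number;     // start timestamp in microseconds
  dur?: number;   // duration in microseconds (for complete events)
  pid: number;
  tid: number;
  args?: Record<string, unknown>; // the 'hidden' meaning usually lives here
};

const events: TraceEvent[] = [
  // A grouping: begin/end pair that wraps everything belonging to one animation frame.
  { name: 'AnimationFrame', cat: 'devtools.timeline', ph: 'b', ts: 1_000_000, pid: 1, tid: 1 },
  // Work nested inside that frame; read in isolation it tells you very little.
  { name: 'FunctionCall', cat: 'devtools.timeline', ph: 'X', ts: 1_000_200, dur: 48_000, pid: 1, tid: 1,
    args: { data: { functionName: 'handleClick', url: 'https://example.com/app.js' } } },
  { name: 'AnimationFrame', cat: 'devtools.timeline', ph: 'e', ts: 1_060_000, pid: 1, tid: 1 },
];
```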
But third, and most important: there are events, metrics and measures that are composed of, or related to, events spread across different timespans.

Such as the events you see connected above, within an interaction. We see an event acting as an initiator for events that sit quite far away time-wise, and that is responsible for the presentation delay attribution: a measure composed of several tasks, which together make up the attribution for a metric, INP.
I’ve spent a good amount of time earlier in my research investigating different model architectures that could fit a custom pipeline to perform prediction and anomaly detection on data represented in these trace files. From Graph Attention Networks to Pyraformer and beyond, as the trace file is not only time-related but also hierarchical in nature. But since this is a side-project and not my actual job, I had to be very self-critical and experiment with pre-existing models, and since we are talking about performance I’ll try not to steer this too heavily into those subjects.
Let’s get back to perf: how do you take a trace file and help an agent understand the telemetry information within it? And how do you go about using that data while respecting the limitations of the models currently available? Even the largest models have a limited context window, and it is generally true that the more you add to it, the worse the performance and outcome you get out of the model, with early signs of degradation present before you even fill 50% of a model’s context window.
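One practical consequence: before handing any slice of trace data to a model, it helps to sanity-check how much of the context window it will consume. The sketch below uses a crude characters-divided-by-four heuristic (a rough approximation, not a real tokenizer) and is purely illustrative:

```ts
// Rough, assumption-laden budgeting helper: ~4 characters per token is a crude
// approximation; use a real tokenizer for anything serious.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(payload: unknown): number {
  return Math.ceil(JSON.stringify(payload).length / APPROX_CHARS_PER_TOKEN);
}

function fitsBudget(payload: unknown, maxTokens: number, safetyRatio = 0.5): boolean {
  // Stay well below the window; quality tends to degrade long before it is full.
  return estimateTokens(payload) <= maxTokens * safetyRatio;
}

// Usage: only serialize the slice of the trace the current question actually needs.
// fitsBudget(interactionEvents, 128_000) === true → safe-ish to include.
```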
Everything is data
Let’s frame this from the human side a bit and think about the role of a performance engineer. In very abstract and oversimplified terms, that role is to analyze telemetry data, build insights on top of it and report back with findings and possible fixes. Hopefully this specialized engineer will also be part of implementing the fixes, so there’s a tighter iteration loop and alignment with the original objective.
There are two sides to this: one is the knowledge and the other is the actual data presented. Both need to be of good quality to solve the problem at hand successfully. So, as a performance engineer, one manages to trace, analyze and fix performance issues by contextually framing knowledge against telemetry data. In the most rudimentary possible abstraction, an agent uses a similar context engineering strategy, via prompting and tool calling, to represent knowledge and telemetry data.
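As a rough sketch of that mapping (using the AI SDK’s v4-style API; the prompt, tool name and data shapes are my own illustrative assumptions, not DevTools’): the ‘knowledge’ lives in the system prompt, and the telemetry is exposed through a tool the model can call on demand.

```ts
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// 'Knowledge': the domain grounding lives in the system prompt.
const PERF_KNOWLEDGE = `You are a web performance analyst.
INP is composed of input delay, processing duration and presentation delay.
Attribute long tasks to the scripts that ran inside them.`;

// Hypothetical stub standing in for slicing the parsed trace engine output.
async function fetchInteractionSlice(interactionId?: string) {
  return [{ name: 'EventTiming', durMs: 480, interactionId: interactionId ?? 'slowest' }];
}

// 'Data': telemetry is fetched on demand through a tool, keeping context usage scoped.
export async function explainSlowInteraction(question: string) {
  return generateText({
    model: openai('gpt-4o-mini'),
    system: PERF_KNOWLEDGE,
    prompt: question,
    maxSteps: 3, // let the model call the tool, then answer
    tools: {
      getInteractionEvents: tool({
        description: 'Returns the events belonging to one interaction in the trace.',
        parameters: z.object({ interactionId: z.string().optional() }),
        execute: async ({ interactionId }) => fetchInteractionSlice(interactionId),
      }),
    },
  });
}
```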
Back to the AI parts: the thing about working with agents is the limited amount of information you can ‘fit’ into the prompts, the way the attention mechanism works, and the various problems with ‘remembering’ what is in the context window and the decay of quality as it grows. Which generally means that the more focused the context window is on one singular objective, the better the outcome you’ll have.
So, taking this insight into consideration, I experimented with segmenting the trace data into different problem spaces to optimize context and facilitate information retrieval. That way I could narrow the main objective of a given request to one part of the trace file and get a much better analysis out of it.
Now, I could have just handed the agent segments of the trace file in an ad-hoc way. If you have been reading up on AI systems you might have stumbled upon Retrieval Augmented Generation, or RAG. But for most embedding models, extracting meaning out of a given text or document is trained on top of semantic meaning, and that directly determines the quality of retrieving the significant parts during the semantic or similarity search that powers the retrieval step of a RAG system.
So, I wasn’t sure how effective pre-existing embedding models would be at encoding information from trace files. Remember the second and third points mentioned earlier: within a trace file you not only have meaning spread across different sequential events, but possibly across different points in time, and some events are correlated and should not be used in isolation. Yeah, that meant that in order to get to the relevant parts of the trace I’d have to figure out a different way to drive the model towards the right parts of it and fetch the telemetry data needed.
See, I told you this would be a wild ride
Trace engine and agent ‘routing’
Let’s take a closer look at the ‘engine’ that powers the insights and all the data wrangling. Whilst building PerfLab, one of my very early interactions with DevTools code was through its trace engine: the internal piece that parses the trace file and transforms it into a structured output with the different information segmented into sub-categories. Things like insights, which back then were only starting to be developed in DevTools, and the different areas of interest, such as Interactions and Network events.



By the time I started building my own agent, the trace engine was a lot more mature and had better scoped insights. So I extracted it alongside a separate, and much newer, piece of the DevTools internals related to the internal ‘Ask AI’ tools, to help me parse parts of the trace data into a context engineering step for the agent to use (more on that later). This way I could better classify the trace file into a well structured format, ready to be used by the agents. But that left me with a problem: now I had an object in memory and needed my agent to know which parts of it to use according to the user’s request.
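To give an idea of the shape of that in-memory object, here is a rough, hypothetical sketch. The field names are mine, purely for illustration; the real trace engine output is far richer and organized differently:

```ts
// Hypothetical shape of the structured output kept in memory after parsing a trace.
// The real DevTools trace engine data is organized differently; this only illustrates
// the 'segmented into sub-categories' idea the router needs to navigate.
interface ParsedTrace {
  insights: Array<{ name: string; summary: string; estimatedSavingsMs?: number }>;
  interactions: Array<{ id: string; type: 'click' | 'keydown' | 'pointer'; durationMs: number }>;
  network: Array<{ url: string; mimeType: string; durationMs: number; renderBlocking: boolean }>;
  mainThread: Array<{ taskName: string; startMs: number; durationMs: number }>;
}

// The routing problem in one line: given a user request, which of these
// sub-categories (and only those) should be serialized into the agent's context?
```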
At this point in time I was still reading papers and learning more about different model architectures, and stumbled upon the Mixture of Experts (MoE) architecture. It segments a large amount of learned information across different sub-networks of ‘experts’, where each specializes in a subset of the training data, with a router in front that decides which ‘expert’ or ‘experts’ to activate depending on the incoming ‘request’. This increases the total model capacity for learning different concepts, but also improves efficiency, since only a few of the experts are activated on any given request.


Using a smaller model to act as a router and ‘pick’ a workflow based on request
So I set out to kinda emulate the MoE architecture with a small LLM acting as a router, performing a lightweight classification of the incoming request to decide which agent or workflow to trigger. That helped me segment perf data across different specialized agents, and also simplified how I built the information retrieval from the structured output generated by the trace engine, by focusing each specialized agent on a different part of it.
The actual implementation I ended up coding was limited to only a couple of objectives and a few of all the possible trace parts, but it proved to be very effective.
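A minimal sketch of that router idea, assuming the AI SDK’s generateObject and a small model (the workflow names and categories below are mine, for illustration):

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// The router only decides *where* to send the request; it never sees the full trace.
const route = z.object({
  workflow: z.enum(['interactions', 'network', 'main-thread', 'general-question']),
  reason: z.string(),
});

export async function routeRequest(userRequest: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'), // a small, cheap model is enough for routing
    schema: route,
    prompt: `Classify this performance question into the workflow best suited to answer it:\n"${userRequest}"`,
  });
  return object; // e.g. { workflow: 'interactions', reason: 'asks about a slow click' }
}
```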
I started playing around with different library and framework setups, from using the pure AI SDK to eventually landing on Mastra and the AI SDK together. The reason is that, especially back at AI SDK v4, we had very few primitives to code agents with the AI SDK alone. The simple arrangement of tools and maxSteps left too much control to the LLM and, especially earlier this year, it was known that LLMs struggled to perform tool calls correctly and consistently. That also made context management really hard, as I had to expose the agent to different tools to slice through and serialize the trace engine output.
But also, since analyzing trace files would have a predictable number and sequence of steps to be performed, a workflow was a better fit for the job. So relying on workflows instead of ‘just’ tools and leaning on the ‘router’ agent to direct requests to the appropriate workflow and sub-agents yielded a much better and more reliable result in the end.
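As a rough illustration of why a fixed workflow fits better than free-form tool calling, here is the shape of such a pipeline written as plain TypeScript steps: a stand-in for the actual Mastra workflow API, reusing the ParsedTrace shape and routeRequest helper from the sketches above, with the remaining helpers stubbed as hypothetical:

```ts
// Plain-TypeScript stand-in for a Mastra-style workflow: a fixed, predictable
// sequence of steps, where only the final analysis step is LLM-heavy.
type AnalyzeInput = { tracePath: string; question: string };

async function parseTrace(tracePath: string): Promise<ParsedTrace> {
  // Step 1 (deterministic): run the extracted trace engine. Stubbed for the sketch.
  return { insights: [], interactions: [], network: [], mainThread: [] };
}

async function runSpecializedAgent(slice: unknown, question: string): Promise<string> {
  // Step 3 (LLM): a narrowly-prompted sub-agent receives only the selected slice.
  return `analysis of "${question}" over ${JSON.stringify(slice).length} bytes of context`;
}

export async function analyzeTrace({ tracePath, question }: AnalyzeInput) {
  const parsed = await parseTrace(tracePath);          // step 1: parse
  const { workflow } = await routeRequest(question);   // step 2: route (see earlier sketch)
  const slice =
    workflow === 'interactions' ? parsed.interactions :
    workflow === 'network'      ? parsed.network :
    workflow === 'main-thread'  ? parsed.mainThread :
    parsed.insights;                                   // fallback for general questions
  return runSpecializedAgent(slice, question);         // step 3: analyze
}
```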
Engineering context for performance data
Within DevTools we now have several internal agents that are focused on different aspects of the vast amounts of telemetry that it exposes. From network calls, to parts of the call tree and insights. And we already touched on a few of the reasons why that is a good strategy.
(Again) Breaking down a complex problem into sub-parts that different (sub-)agents can focus their ‘attention’ on is a great way to improve the quality of the outcome when working with agents on a difficult problem, or when ingesting large amounts of data to extract information.
As briefly mentioned a moment ago, in my own experiments I applied a similar strategy to my performance agent and its sub-agents, utilizing Mastra workflows written as different steps that take care of parsing and analyzing the different possible parts of the output generated by the trace engine. In those steps we can spawn agents to tackle different tasks, or branch into async or conditional processing based on the ‘handover’ of each previous step/agent.
With this ‘specialized agents’ strategy it gets simpler to write better prompts that focus each agent on a narrower objective, giving much better quality insights as a result. The broader the objective of an agent, the more grounding and prompting are needed, which in turn occupies more of the context window, leading to confusion, lost-in-the-middle problems, hallucinations and so on.
To a certain extent, this is what happens between the recently announced DevTools MCP server and the agent behind whichever client uses it. DevTools exposes, via different tools, ways to interact with its internal agents and drive a web page; the orchestrating agent behind the consuming client decides which tool to call and when, based on the given context, whilst consuming the ‘handover’ output it gets back.
Which brings us to the next topic!
Using MCP to connect coding agents with specialized performance insights
By now I believe that most of us have at least heard of Model Context Protocol, or MCP. Similar to other competing standards, MCP allows agents to interact with external services or even other agents with remote tool calling via HTTP or STDIO.
This adds another layer of possibility and also another layer of complexity. The possibility is that agents can now work as orchestrators and delegate specialized tasks to remote services or external agents via remote tool calling, receiving artifacts as handovers.
Noticing a pattern here?
The complexity is that those tools also compete with everything else in the context for ‘attention’ and excessive usage of tools will add to the ‘death by a thousand cuts’ of your context window. With MCP we can connect our (coding) agents to tools like Sentry, Linear and many others. But also, since its public release on September 23rd, DevTools!
If you missed the announcement, DevTools’ official MCP server allows agents to debug pages and gather insights straight from an instance of Chrome! With all the intelligence Chrome is building into its internal agents, we can now have a workflow that delegates the telemetry gathering and information processing parts of the ‘problem space’ directly to DevTools’ specialized agents, making it possible to create even more powerful development and investigative workflows.
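As a hedged sketch of what wiring that up can look like from the orchestrator side, using the MCP TypeScript SDK; the server launch command and the tool name called at the end are assumptions, so check the server’s documentation for the real ones:

```ts
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// Launch the DevTools MCP server over stdio; package name/flags may differ.
const transport = new StdioClientTransport({
  command: 'npx',
  args: ['chrome-devtools-mcp@latest'],
});

const client = new Client({ name: 'perf-orchestrator', version: '0.1.0' });
await client.connect(transport);

// Discover what the server exposes; each tool description also lands in the
// orchestrating agent's context, so fewer (relevant) tools is better.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Call one of the exposed tools (name and arguments here are illustrative).
const result = await client.callTool({
  name: 'performance_start_trace',
  arguments: { reload: true },
});
```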
This is a more ‘autonomous’ approach than the workflow I initially developed, since the decision-making is entirely up to the orchestrating agent and how it decides which tool to call based on current context and tool descriptions.
In my own experiments, building an MCP client and servers for DevTools and V0 and using Claude Code as the orchestrating agent, I’ve been able to reach quite interesting levels of automation: prompting V0 via MCP to create a new ‘toy app’ or prototype, having DevTools profile and extract insights from the resulting sandbox link, and handing those insights back to V0 for course correction and possible fixes. Though far from ‘production ready’ MVPs, those experiments already proved to me that there are exciting patterns here for automating telemetry-driven ‘course correction’ work.
One caveat when integrating MCP into your development workflow: since MCP tool definitions also end up in the context window, and just as it’s preferable to optimize that window by narrowing the objective and consuming a minimal amount of tokens, you should only keep active the set of tools needed for the task you are trying to achieve.
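A small illustration of that idea, reusing the MCP client from the previous sketch: filter the discovered tools down to an allowlist before handing them to the orchestrating agent (the tool names are, again, illustrative):

```ts
// Only surface the tools the current task actually needs; everything else is
// context-window noise. Names below are illustrative, not the server's real list.
const PERF_TASK_ALLOWLIST = new Set([
  'navigate_page',
  'performance_start_trace',
  'performance_stop_trace',
]);

const { tools } = await client.listTools();
const activeTools = tools.filter((t) => PERF_TASK_ALLOWLIST.has(t.name));
// Hand only `activeTools` to the orchestrating agent for this task.
```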
Now that we have some background on how I built an AI agent that can answer questions and give insights about performance and trace data, and have experimented with different levels of automation using agents and tools, let’s look at how to use that knowledge effectively when asking LLMs to help you analyze performance data, or even profile a web page and hand you insights.
When using AI tools and agents to help you gather insight or speed up your workflow, it’s important to narrow a task down to a concise and clear objective. When iterating, keep in mind that how you steer the agent, and the kind of data it needs to fetch and hold in context, will affect the outcome.
So here’s an important tip: keep your session focused and well structured, delegate research and planning to sub-agents or use a separate session for those stages, and generate a Markdown artifact that the agent producing the actual work will consume as a reference.
“Trust but verify”: You are the expert in the loop
Utilizing agents and AI tools can bring new capabilities and help you iterate over tasks faster. But, as with any tool out there, there are constraints and techniques to learn and consider in order to use them well.
The more I research and learn about building AI agents and applications that use them in different workflows and scenarios, the more I understand that the people who use them are the key part of the puzzle. There’s a considerable amount of effort involved in ensuring alignment and quality, and it is always up to us to use our tools appropriately.
I’m a believer in using tools to augment, speed up, or even expand our capabilities as developers and experts. At the very least, I see it as an opportunity to experiment with new ways to build and think about software, enabling exciting new patterns and discoveries. But, as the ‘H’ part of HITL (human in the loop), we are still responsible for what we ship.
Whether shipping code or producing investigative work, we must verify the content and artifacts generated by agents.