Media Rehydration, Or: How To Kill The Middle Mile

27th Dec 2022 by Tobias Baldauf

ABOUT THE AUTHOR

Tobias Baldauf

Tobias Baldauf (@tbaldauf) toys with web performance at trivago. He creates innovative web performance tools, new image optimization algorithms and speaks at conferences. He's a proud dad, tries to be a mindful vegetarian and loves making music. Find out more at who.tobias.is

AI Content Generators & Training Data

AI-generated content is all the rage these days. Highly creative images flood our screens and a chatbot passes an AWS exam. These new generations of text/image-to-image and language models are easy to try online & run on local devices.

With these tools so easily available, new communities share “prompts” to replicate generated content, often images, given the same trained diffusion model and version. These prompts contain the (often hilarious) text prompt, some generation parameters for styling and output quality, and finally the “seed”, a number that initializes the generator’s randomness and thus allows us to exactly replicate the output. Here’s an example:

Mad Max (movie) fighting Dinosaur jurassic park, Steps: 42, Sampler: PLMS, CFG scale: 7, Seed: 1985738629, Size: 512x512, Model hash: 7460a6fa
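To make the structure of such a shared prompt explicit, here is a minimal Python sketch that splits the example above into the text prompt and its parameters. The function name `parse_generation_string` is hypothetical, and the sketch assumes that parameters always follow the prompt as comma-separated "Key: value" pairs, which is how this example looks but is not a formal specification.

```python
import re

EXAMPLE = (
    "Mad Max (movie) fighting Dinosaur jurassic park, Steps: 42, "
    "Sampler: PLMS, CFG scale: 7, Seed: 1985738629, Size: 512x512, "
    "Model hash: 7460a6fa"
)


def parse_generation_string(text: str) -> dict:
    """Split a shared prompt string into the text prompt and its parameters.

    Assumption: the parameters start at the first ", Key: value" pair and
    the text prompt before it contains no such pair.
    """
    match = re.search(r",\s*[A-Za-z ]+:\s", text)
    if match is None:
        # No parameters found, so the whole string is the text prompt.
        return {"prompt": text.strip()}
    result = {"prompt": text[:match.start()].strip()}
    for pair in text[match.start():].split(","):
        if ":" in pair:
            key, value = pair.split(":", 1)
            result[key.strip()] = value.strip()
    return result


parsed = parse_generation_string(EXAMPLE)
print(parsed["prompt"])
print(parsed["Seed"], parsed["Model hash"])
```

Everything needed to replicate the image is in that one short dictionary, provided the model identified by the hash is available.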


The output that can be successfully generated depends largely on the contents & quality of the training data set. Stable Diffusion uses the Laion2B-en data set which can also be explored visually. The variety of the training data set in combination with descriptions and metadata gives Stable Diffusion and similar tools the power to generate many different outputs sufficiently well.

While tools like Stable Diffusion, DALL-E, Imagen, Make-A-Scene and eDiff-I for images & videos or ChatGPT for texts and code are sure to take us on an interesting journey regarding copyright, verifiable media and eventually people’s livelihoods, they also have a side effect that has been little discussed so far and that holds huge potential.

Dance the “Rumba” around the Middle Mile

Let’s assume you are a manufacturer of apparel, most famous for its iconic shoes. Your e-commerce business is growing globally, especially your shoe “Rumba”, which shoppers can customize with colored laces, soles etc. while the base of the shoe remains identical. Images showing the “Rumba” in all its hundreds of possible combinations of different lacing and sole colors are kept in your DAM in Amsterdam (great IX) while your CDN vendor delivers these assets for you globally.

The CDN has to navigate the volatility of the global Internet and move your contents through multiple IXes and ISPs, a transit for which it pays money and in turn makes you pay more money. As traffic flow ebbs and swells and seasonal trends affect which color combinations are currently popular, different subsets of the “Rumba” image inventory move in and out of cache across CDN Edge nodes and their parents. Therefore, your Origin offload for the “Rumba” image inventory tops out at ~80% and you keep paying to move your image contents through the CDN Middle Mile. As your brand is very conscientious regarding perceived quality, your gallery images for the “Rumba” average out at ~50KB per asset. Because it is your most popular shoe worldwide, this results in ~10TB of monthly transitory image traffic for the “Rumba” image inventory.
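For a feeling of what these numbers mean in requests, here is a small back-of-the-envelope sketch in Python. It assumes decimal units (1KB = 1000 bytes), that every asset is exactly the average size, and that the ~80% offload is measured in requests rather than bytes; none of these details are stated in the scenario.

```python
ASSET_SIZE_BYTES = 50 * 1000       # ~50KB per gallery image
MIDDLE_MILE_BYTES = 10 * 1000**4   # ~10TB of monthly transitory traffic
ORIGIN_OFFLOAD = 0.80              # share of requests answered from cache


def monthly_request_estimate(traffic_bytes, asset_bytes, offload):
    """Return (cache_misses, total_requests) implied by the figures above."""
    # Every cache miss moves one full asset through the Middle Mile.
    cache_misses = traffic_bytes / asset_bytes
    # Misses are the (1 - offload) share of all requests.
    total_requests = cache_misses / (1 - offload)
    return cache_misses, total_requests


misses, total = monthly_request_estimate(
    MIDDLE_MILE_BYTES, ASSET_SIZE_BYTES, ORIGIN_OFFLOAD
)
print(f"cache misses per month:   {misses:,.0f}")
print(f"total requests per month: {total:,.0f}")
```

Under these assumptions, roughly 200 million image fetches per month cross the Middle Mile, out of about a billion requests overall.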

After hearing about AI Image Generators, one of your webperf-minded developers has an idea: she takes the most homogeneous gallery images of the “Rumba”, which only differ in lacing and sole colors, and uses them as a training data set. After figuring out a few snags along the way, she arrives at her desired result: given a specific prompt including Seed, she can now make the AI Image Generator recreate any combination of the “Rumba” lacing and sole colors with almost perfect recall, visually indistinguishable from the original images.

She then proceeds to the 2nd phase of her idea: she deploys the AI Image Generator and her “Rumba” training data set in the regions “Europe-West” and “Asia-East” on GPU-powered hardware. She uses “Europe-West”, which is geographically close to her, as her own instance: there she generates prompts with seeds for all possible combinations of “Rumba” lacings and soles and performs visual QA on the results.

On “Asia-East”, an important growth market, she introduces a clever bit of Edge logic that applies to each image request for the “Rumba” that is a cache miss and would otherwise have resulted in a request all the way through the Internet back to the DAM Origin in Amsterdam. She translates the asset request into a hash that she can look up in a local Key-Value Store. This store contains the prompts that she has previously visually QA’ed and is thus certain will result in visually indistinguishable output. The request now causes an image to be generated on a GPU in “Asia-East” based on the matched prompt plus seed, which is then delivered back to the Edge and cached.
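The Edge logic described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `RehydratingEdge`, the helper `asset_key`, the paths and the stand-in functions are all hypothetical, the Key-Value Store is a plain dictionary, and the fallback to the Origin for unknown assets is an addition that the scenario does not spell out.

```python
import hashlib
from typing import Callable, Dict


def asset_key(path: str) -> str:
    """Translate an asset request path into a stable lookup hash."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


class RehydratingEdge:
    """Toy Edge node: cache first, then regional rehydration, then Origin."""

    def __init__(
        self,
        prompts: Dict[str, str],
        generate: Callable[[str], bytes],
        fetch_origin: Callable[[str], bytes],
    ) -> None:
        self.prompts = prompts            # stands in for the local KV store
        self.generate = generate          # stands in for the regional GPU
        self.fetch_origin = fetch_origin  # stands in for the Middle Mile
        self.cache: Dict[str, bytes] = {}

    def handle(self, path: str) -> bytes:
        key = asset_key(path)
        if key in self.cache:
            return self.cache[key]        # cache hit, nothing else to do
        prompt = self.prompts.get(key)
        if prompt is not None:
            body = self.generate(prompt)  # rehydrate inside the region
        else:
            body = self.fetch_origin(path)  # no QA'ed prompt: go to Origin
        self.cache[key] = body
        return body


# Usage with stand-in functions; no real GPU or Origin is involved.
origin_calls = []


def fake_generate(prompt: str) -> bytes:
    return b"IMG:" + prompt.encode("utf-8")


def fake_origin(path: str) -> bytes:
    origin_calls.append(path)
    return b"ORIGIN:" + path.encode("utf-8")


qa_prompts = {
    asset_key("/rumba/red-laces-white-sole.jpg"):
        "rumba shoe, red laces, white sole, Seed: 1234",
}
edge = RehydratingEdge(qa_prompts, fake_generate, fake_origin)
rehydrated = edge.handle("/rumba/red-laces-white-sole.jpg")
fallback = edge.handle("/rumba/unknown-combo.jpg")
```

Only the request without a QA’ed prompt reaches the Origin stand-in; the other one never leaves the region.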

All traffic is now being kept in the region, the CDN Middle Mile is not traversed and CDN traffic bills begin to plummet because all that now needs to go over the wire is a few hundred bytes of plaintext prompts as Edge KV DBs sync their data. Your developer happily names her new invention “Media Rehydration”, based on her prior and mostly unpleasant experiences with JS and the DOM.
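To put “a few hundred bytes” into perspective, the following Python sketch compares the size of the example prompt from earlier in this article with the ~50KB average image. It assumes 1KB = 1000 bytes and ignores any overhead of the KV sync protocol, which would add to the prompt side.

```python
PROMPT = (
    "Mad Max (movie) fighting Dinosaur jurassic park, Steps: 42, "
    "Sampler: PLMS, CFG scale: 7, Seed: 1985738629, Size: 512x512, "
    "Model hash: 7460a6fa"
)
IMAGE_BYTES = 50 * 1000  # ~50KB average gallery image

prompt_bytes = len(PROMPT.encode("utf-8"))
ratio = IMAGE_BYTES / prompt_bytes

print(f"prompt: {prompt_bytes} bytes, image: {IMAGE_BYTES} bytes")
print(f"the image is roughly {ratio:.0f}x larger than its prompt")
```

The prompt also only has to be synced once per asset and region, whereas the image crosses the Middle Mile again every time it falls out of cache.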

The All-Seeing Eye Of The CDN

To a CDN vendor, such a development might seem worrying. However, there’s opportunity, too: as a CDN usually sees all of its customers’ image data and their traffic patterns, it is a rather trivial task for a CDN to run similarity analysis on a customer’s image contents grouped by paths, size and metadata. Thus, a CDN could quickly generate a hitlist of image asset groups, ranked by possible traffic savings and by similarity, that have a high chance for almost perfect recall. It could then introduce “Media Rehydration” to its customers as a paid add-on for traffic cost savings and resilience.
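Such a hitlist could be produced along the following lines. This Python sketch is a guess at one possible approach, not a description of what any CDN does: it assumes that every asset already has a 64-bit perceptual fingerprint (how that is computed is out of scope here), groups assets by their parent path only, and uses a hypothetical similarity threshold of 0.9.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass
class Asset:
    path: str
    fingerprint: int    # assumed 64-bit perceptual hash of the image
    monthly_bytes: int  # Middle Mile traffic caused by this asset


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def group_similarity(assets: List[Asset]) -> float:
    """Mean pairwise similarity in [0, 1]; 1.0 means identical fingerprints."""
    pairs = list(combinations(assets, 2))
    if not pairs:
        return 0.0
    total = sum(1 - hamming(a.fingerprint, b.fingerprint) / 64 for a, b in pairs)
    return total / len(pairs)


def hitlist(assets: List[Asset], min_similarity: float = 0.9) -> List[Tuple[str, float, int]]:
    """Rank sufficiently similar path groups by their possible traffic savings."""
    groups: Dict[str, List[Asset]] = {}
    for asset in assets:
        prefix = asset.path.rsplit("/", 1)[0]
        groups.setdefault(prefix, []).append(asset)
    ranked = []
    for prefix, members in groups.items():
        similarity = group_similarity(members)
        if similarity >= min_similarity:
            savings = sum(m.monthly_bytes for m in members)
            ranked.append((prefix, similarity, savings))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


inventory = [
    Asset("/shoes/rumba/red-white.jpg", 0xFFFF0000FFFF0000, 6_000_000),
    Asset("/shoes/rumba/blue-black.jpg", 0xFFFF0000FFFF0003, 4_000_000),
    Asset("/blog/team.jpg", 0x0000000000000000, 9_000_000),
    Asset("/blog/office.jpg", 0xFFFFFFFFFFFFFFFF, 9_000_000),
]
candidates = hitlist(inventory)
```

In this made-up inventory only the “Rumba” group qualifies, because the blog images cause more traffic but are far too different from each other.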

The CDN only has to deploy GPU-enabled infrastructure across its multiple regions, e.g. in all areas in which it already keeps Parent Cache servers. Here, incoming Edge requests for uncached images could be matched with the pre-stored prompts plus seeds that exist per customer for each group of highly similar images. The desired image could then be rehydrated from the prompt, delivered back to the Edge and cached at the intermediate tier.

Given the already image-heavy traffic of the Internet and the rise of AI-generated videos in the near future, which could multiply the traffic savings if processed as described above, CDNs would save huge amounts of their own Middle Mile traffic, thus reducing their own infrastructure and operating costs.

Cold Storage

Another promising application of “Media Rehydration” in the future could be storage of image and video inventory that is part of long-tail, e.g. seasonal, contents and thus is not getting any live traffic at the moment. DAM and Cloud Storage providers could run the same kind of analysis on customers’ image inventory as described above. If they discover contents of sufficient size and similarity and are certain that they can achieve almost perfect recall, they could train a model on that inventory and efficiently store it together with the prompts and seeds. For the unlikely event that the images are requested again in the future, this could be significantly cheaper and easier than keeping the actual byte-for-byte image binary data.

Hot Servings

An even more long-term future opportunity could be Media Rehydration on client devices: in a few years’ time, mobile devices will be powerful enough to run AI Media Generators locally and we might ship apps that contain pre-trained models for media rehydration. All we’d have to send over the network to these client devices would then be a few hundred bytes of plaintext prompts for them to show media to the user, instead of having to ship dozens or even hundreds of megabytes of media data.

Patching Voyager

While nowhere close to a realistic application today, let’s take our imagination one step further: AI Content Generators can also write and code, as seen in the most recent iteration of ChatGPT. Like image generators, they take a prompt to produce an output and can reliably reproduce the same output given the same prompt and seed.
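The property that matters here is determinism: same prompt plus same seed gives the same output. The Python sketch below shows that property with a toy stand-in, not with a real model. The function `toy_generate` is hypothetical and just derives pseudo-random bytes from prompt and seed; a real generator would additionally need identical model weights, sampler settings and numerics on both sides.

```python
import hashlib
import random


def toy_generate(prompt: str, seed: int, length: int = 32) -> bytes:
    """Deterministic stand-in for a generator: same inputs, same bytes."""
    # Derive the random state from both the prompt and the seed.
    digest = hashlib.sha256(f"{seed}:{prompt}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return bytes(rng.randrange(256) for _ in range(length))


first = toy_generate("reroute power around panel 2", seed=1985738629)
second = toy_generate("reroute power around panel 2", seed=1985738629)
other_seed = toy_generate("reroute power around panel 2", seed=42)

print(first == second)      # identical prompt and seed
print(first == other_seed)  # same prompt, different seed
```

Whoever holds the same generator only needs the prompt and the seed to arrive at exactly the same bytes as the sender.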

There’s a deep-space exploration probe currently heading towards the Oort cloud, which makes it the human-made object furthest from Earth. One of its solar panels gets struck by a micro-meteorite and suddenly a software patch is required to reroute power and system routines. Bandwidth to the deep-space probe is ~0.16 bit/s (0.00016 kbit/s), which makes even a ~1KB software patch extremely slow to send: it would take ~14.2h, assuming no packet loss.

If the probe contained enough computing power and had been launched with a highly specialized pre-trained ChatGPT-esque model, an engineer could instead work out a prompt plus seed locally that reliably produces the necessary ~1KB of code. The engineer would then only need to send a few dozen bytes of prompt plus seed to provide the probe with the software patch. The probe would recreate the patch locally from the prompt and apply it to itself to continue its successful journey into interstellar space.
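The transfer times behind this thought experiment are simple arithmetic. The Python sketch below reproduces the ~14.2h figure, which corresponds to a 1024-byte patch on a link of about 0.16 bit/s, and compares it with a prompt. The link rate is the value that matches the ~14.2h figure, and the 48-byte prompt is a made-up stand-in for “a few dozen bytes”; protocol overhead, retransmissions and signal travel time are ignored.

```python
LINK_BITS_PER_SECOND = 0.16  # rate that matches the ~14.2h figure
PATCH_BYTES = 1024           # the ~1KB software patch
PROMPT_BYTES = 48            # assumed size of prompt plus seed


def transfer_hours(payload_bytes: int, bits_per_second: float) -> float:
    """Hours needed to push a payload through the link without any loss."""
    return payload_bytes * 8 / bits_per_second / 3600


patch_hours = transfer_hours(PATCH_BYTES, LINK_BITS_PER_SECOND)
prompt_hours = transfer_hours(PROMPT_BYTES, LINK_BITS_PER_SECOND)

print(f"patch:  {patch_hours:.1f} h")
print(f"prompt: {prompt_hours * 60:.0f} min")
```

With these assumed sizes, sending the prompt takes well under an hour instead of more than half a day.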

tl;dr:

Training AI content generators on specialized data sets until they reach almost perfect recall and then deploying them close to users can circumvent the middle mile and thus improve resilience and web performance and reduce operating costs.

Web Performance Calendar