Web Performance Calendar

The speed geek's favorite time of year
2014 Edition
ABOUT THE AUTHOR

Andrea Giammarchi

Andrea Giammarchi (@WebReflection) is currently a Senior Software Engineer at Twitter, having previously worked at Facebook and at Nokia, where he was mainly focused on the mobile HTML5 HERE map engine. An active JS community contributor and web-standards promoter, Andrea has often talked about performance optimization, especially for the mobile world, backed by 6+ years of real-world experience.

Andrea is also passionate about the modern Internet of Things and about JavaScript performance on such constrained systems.

Some of you might remember my post from last year, entitled Boosting UX via Delayed, Non Blocking, Enhancements; this year we are going to explore another technique, one that is neither debouncing nor throttling. It is probably better suited for non-atomic I/O operations on IoT devices or, generally speaking, slow machines with a low amount of RAM, but it will work like a charm even on your daily computer or home server.

Asynchronous I/O

We can ask a generic drive (HD, SSD, SD card, …) to read the same file many times; here is a dumb representation of such an operation:

[chart: multiple read requests for the same file arrive during the access time, are held in a queue, and are then served one after another]

It does not matter when the read requests for the same file happen: the I/O layer will hold every request and satisfy each of them, one after another, until no reads are pending anymore.
As we can see from the dumb chart, there is an access time between the first request and its first delivery; during this time, all other requests for the same file are simply held in a queue that starts being resolved once access to the file has been released.
We can benchmark this operation via node.js through the following code:

#!/usr/bin/env node
 
// file-io.js
 
// verify at least one file name has been specified
if (process.argv.length < 3) {
  console.warn('./file-io.js filename [loop times]');
  process.exit(1);
}
 
// benchmark variables
var
  file = process.argv[2],
  times = parseInt(process.argv[3] || 100, 10),
  fs = require('fs'),
  result = {
    err: 0,
    ok: 0
  }
;
 
// begin!
console.time('file I/O');
for (var i = 0; i < times; i++) {
  // read N times the file asynchronously
  fs.readFile(file, function (err, ok) {
    // increment errors or OKs count
    if (err) {
      result.err++;
    } else {
      result.ok++;
    }
    // if the sum is equivalent to times
    if (result.err + result.ok === times) {
      // stop the benchmark
      console.timeEnd('file I/O');
      // and show what happened
      console.log('OK: ' + result.ok);
      console.log('Errors: ' + result.err);
    }
  });
}

Saved as file-io.js and made executable via chmod +x file-io.js, it allows a basic test via ./file-io.js file-io.js to see what happens.
The file is benchmarking itself: a very small file that will most likely be parked in our HD cache.

$ ./file-io.js file-io.js

file I/O: 5ms
OK: 100
Errors: 0

Not bad for my good old MacBook Pro, right? Now let’s try with the ECMAScript 6 PDF, a file of around 7MB that probably does not fit in the most common and cheap HD caches.

$ ./file-io.js ~/Downloads/es6.pdf

file I/O: 560ms
OK: 100
Errors: 0

OK, half a second to serve them all is not bad either … how about we increase the number of requests to 500 instead?

$ ./file-io.js ~/Downloads/es6.pdf 500

file I/O: 927ms
OK: 245
Errors: 255

Almost 1 second of delay and more errors than successes … what the hell happened? Maybe the file was too big?

$ ./file-io.js file-io.js 500

file I/O: 18ms
OK: 245
Errors: 255

So how can this be possible? Let’s try to understand what happens behind the scenes in order to satisfy this operation:

  1. enough memory is needed to hold the whole number of requests (point of failure)
  2. some access time is needed to reach the requested file the first time (which might trigger the previous point of failure)
  3. while the first request is being served, other requests might keep queuing in the meanwhile and trying to access the file (a different point of failure)
  4. once the file is free for another read, the next request will be satisfied

What actually happened, in terms of errors, is EMFILE, open 'file-io.js', meaning that we basically reached the maximum number of files our process is allowed to open asynchronously at the same time.
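
As a quick check (assuming a POSIX shell; the exact default varies per OS), the per-process limit on open file descriptors can be printed via ulimit. A classic OS X default of 256 would explain why roughly 245 of the 500 reads succeed before EMFILE kicks in:

$ ulimit -n

256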

Pre-bouncing Asynchronous I/O

While I’m pretty sure there’s a better term to describe such an operation, the TL;DR version of this technique is that it batches requests during the access time and releases all batched requests at once. During the release, other requests to the same file can be batched again.

How the batching works is straightforward: if there is already an entry for a given file in a dictionary, add the callback to that entry and retain it until the file has been read. This means fewer asynchronous I/O operations are performed while more clients can ask for the same file around the same time.
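
Before introducing any utility, a minimal sketch of this dictionary-based batching could look like the following (batchedReadFile is just an illustrative name, not part of any library):

// minimal sketch: a plain dictionary of pending callbacks per path
var fs = require('fs');
var pending = Object.create(null);
 
function batchedReadFile(path, callback) {
  if (pending[path]) {
    // a read for this path is already in flight: just queue up
    pending[path].push(callback);
  } else {
    // first request for this path: create the batch and read once
    pending[path] = [callback];
    fs.readFile(path, function (err, data) {
      // free the entry first, so new requests start a fresh batch
      var callbacks = pending[path];
      delete pending[path];
      // release the whole batch at once with the same result
      callbacks.forEach(function (cb) {
        cb(err, data);
      });
    });
  }
}

This is essentially the pattern that the following utility abstracts away.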

The holdon utility

Instead of creating some ad-hoc cache mechanism for each case, I’ve been using a more abstract and generic utility whose aim is to batch and release any amount and type of variables stored as property names, requiring very few changes to the previous file.

Such utility is called holdon.

First of all, you can install it via npm install holdon in the same folder where you created the file-io.js file.
We can now require it and create a cache with a single key: the callback used during a fs.readFile operation.

// create a cache with a callback key
// this key will retain all requests per each batch
// of I/O operations performed on the same file
var cache = require('holdon').create(['callback']);

We can also add a function whose aim is to act as an intermediary between the direct fs.readFile call and the callback invoked once the file has been read.

// ask for a path and pass the callback
// will retain the callback during the loop
function readFile(path, callback) {
  // shortcut: true only if created first time
  if (cache.add(path, callback)) {
    // so actually read the file once
    // and once the operation is done
    fs.readFile(path, function (err, res) {
      // clean the cache and satisfy all requests
      // ( shortcut: remove returns the object )
      cache.remove(path).callback.forEach(function (callback) {
        // satisfy all callback with the result
        callback(err, res);
      });
    });
  }
}
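
To make the batching visible before re-running the benchmark, here is a small hypothetical check built only on the wrapper above: several concurrent calls for the same path should trigger a single fs.readFile, and all callbacks should be released together with the same result.

// hypothetical check: three concurrent requests, one actual disk read
readFile('file-io.js', function (err) { console.log('first served', !err); });
readFile('file-io.js', function (err) { console.log('second served', !err); });
readFile('file-io.js', function (err) { console.log('third served', !err); });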

Our loop will now look basically the same, except it will use the readFile function instead of calling fs.readFile directly.

// begin!
console.time('file I/O');
for (var i = 0; i < times; i++) {
  readFile(file, function (err, ok) {
    // increment errors or OKs count
    if (err) {
      result.err++;
    } else {
      result.ok++;
    }
    // if the sum is equivalent to times
    if (result.err + result.ok === times) {
      // stop the benchmark
      console.timeEnd('file I/O');
      // and show what happened
      console.log('OK: ' + result.ok);
      console.log('Errors: ' + result.err);
    }
  });
}

Without wasting extra time retesting things that were already working before, let’s benchmark things that were not working at all, like 500 clients accessing the same small file.

$ ./file-io.js file-io.js 500

file I/O: 2ms
OK: 500
Errors: 0

OK, it takes half of the time and it serves 5 times the number of clients … this is not bad at all, is it?
Now, how about that big file that was taking 1 second, with errors, when accessed 500 times?

$ ./file-io.js ~/Downloads/es6.pdf 500

file I/O: 6ms
OK: 500
Errors: 0

Right, we eliminated 100% of the errors and we started serving the same content about 100 times faster. Shall we try with 20000 requests instead? What about 100000? I’ll let you enjoy the fact that we bypassed the slow HD path: we are now holding clients in RAM so, as long as we have enough RAM to hold these clients, we can serve as many simultaneous requests for the same file as we like. We dropped the concurrent I/O access in favor of a single non-blocking read operation.

Testing on a Raspberry Pi

It’s always easy to compare things on powerful machines, but it’s usually in constrained circumstances that we find out how bad, or how actually good, a generic solution aimed at boosting performance really is (like … for real! Not only theoretically …).

I’ve found the Raspberry Pi to be a good reference platform for this purpose, since it has a low amount of RAM and it runs entirely off an SD card, where no fast or large HD cache is present, and where the gap in milliseconds between different techniques is a more reliable indicator of possible gains on equivalently constrained target hardware, such as the Intel Edison, the BeagleBone Black, the Arduino Yún MIPS Linaro system, and many others.

Following are some Raspberry Pi results using the normal version, based on direct fs.readFile.

# small file, default 100
$ ./file-io.js file-io.js

file I/O: 195ms
OK: 100
Errors: 0

# small file, forcing 500
$ ./file-io.js file-io.js 500

file I/O: 918ms
OK: 500
Errors: 0

# big file, forcing 60
$ ./file-io.js es6.pdf 60

file I/O: 2959ms
OK: 60
Errors: 0

Curiously, this small board managed to read the small file 500 times without any error.
However, the big file test had to be capped at 60 operations, since with 70 or more it was crashing every single time.
Remember the point of failure about what can be held in RAM? We reached it here.
Now let’s see what holdon can do here:

# small file, default 100
$ ./file-io.js file-io.js

file I/O: 39ms
OK: 100
Errors: 0

# small file, forcing 500
$ ./file-io.js file-io.js 500

file I/O: 47ms
OK: 500
Errors: 0

# big file, forcing 60
$ ./file-io.js es6.pdf 60

file I/O: 82ms
OK: 60
Errors: 0

# big file, forcing 500
$ ./file-io.js es6.pdf 500

file I/O: 89ms
OK: 500
Errors: 0

We now have the ability to serve the same file simultaneously to 500 clients in less than 100 milliseconds on a Raspberry Pi: concurrent file serving achievement unlocked!

… and not only files

The entire post is based on serving a single file, reading it all at once.
While this was an easy way to show, via benchmarks, the results and advantages of pre-bouncing identical requests, it does not do justice to the real-world potential this technique has.

As an example, the reason the cache should be freed once a file has been read is that, if we use the same approach with streams, we can satisfy all clients asking for that file between the initial request and the last one performed just before the stream starts sending packets. All successive requests to the same file will perform the operation from scratch, mixing regular async I/O with holdon and ensuring the best of both worlds. Moreover …

  • used with streams, it can batch concurrent requests per each chunk of the file distributed over the network.
    In this way files will not be held entirely in RAM but in small chunks, one per group of devices requesting the same chunk.
  • if your node server transforms or compresses files at runtime, this technique could be used to perform such an operation once for all requests related to that specific file (Markdown files or any runtime-generated GET); see the sketch after this list
  • combined with watch techniques, we could use a cache to transform and optimize every updated file once, queueing possible requests and serving them at once, creating our own little CDN-like infrastructure
  • on the client side, it can satisfy multiple requests to the same resource at once, still taking advantage of the browser cache but avoiding the network stack entirely if several parts of the app need the same static asset or resource (i.e. map tiles, static JSON archives, the same GET request)
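
As a rough sketch of the server-side case (a hypothetical, deliberately naive static server reusing the readFile wrapper from before; the port and path handling are illustrative only):

var http = require('http');
 
http.createServer(function (req, res) {
  // concurrent GETs for the same path share a single disk read
  readFile('.' + req.url, function (err, data) {
    if (err) {
      res.writeHead(404);
      res.end('Not Found');
    } else {
      res.writeHead(200);
      res.end(data);
    }
  });
}).listen(8080);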

Caveats

As with everything shiny, there are caveats to this technique. Actually, probably only one, but it’s a very important one: reads are not atomic anymore.
In a few words, if something like the following happens:

  1. User1 asks to read file.txt; a holdon cache entry is generated
  2. User2 asks to write file.txt; we let node I/O handle the case
  3. User3 asks to read file.txt; holdon puts it in the cached queue, waiting for the first release

User3 will never see the changes User2 made, because those changes happened after User1 asked for the file but before the file was actually opened: the system took, say, 1 second to open the file, and in the middle User2, before User3, managed to queue a write for that change … This might sound like an edge case, but the fact that holdon makes read operations not really atomic anymore could cause trouble if atomicity is what you are looking for.
In that case, please use regular node.js and system I/O: it has been great, robust, and stable until now, and I am sure it will keep serving your intents properly without going too fancy 😉

In Summary

There is no universal silver bullet when it comes to performance, and the hosting system’s constraints might be the key to choosing between various techniques that solve similar problems in different ways.

What’s important to remember is that benchmarks on the real targeted hardware are a very good indication of success or failure. We should always keep this in mind, especially when it comes to Mobile Web development: not everyone has the latest Android or iOS device, so grab a cheap second-hand device and use it as a performance reference for anything mobile related, where RAM is low and I/O is slow.

The same rule applies to software oriented to developer boards: it’s very possible to have robust micro systems able to scale for your house’s and your small business office’s needs; we just need to find the right approach to solve the problem.