Writing JavaScript benchmarks isn’t as simple as it seems. Even without touching the subject of potential cross-browser issues, there are a lot of pitfalls — booby traps, even — to look out for.

This is part of the reason why I created jsPerf, a simple web interface that makes it easy to create and share test cases comparing the performance of different code snippets. There’s no need to worry about the benchmarking logic yourself: just enter the code you would like to benchmark, and jsPerf will create a test case for you which can be run across different browsers and devices.

Behind the scenes, jsPerf was initially using a JSLitmus-based benchmarking library which I named Benchmark.js. More and more features were added, and recently, John-David Dalton rewrote the whole thing from scratch. Benchmark.js has been getting better ever since.

This article will shed some light on the various gotchas in writing and running JavaScript benchmarks.

Benchmarking patterns

There are a lot of ways to run benchmarks on JavaScript snippets to test their performance. The most common pattern is the following:

Pattern A

var totalTime,
    start = new Date,
    iterations = 6;
while (iterations--) {
  // Code snippet goes here
}
// totalTime → the number of milliseconds it took to execute
// the code snippet 6 times
totalTime = new Date - start;

This places the code to be tested inside a loop and executes it a predefined number of times (in this case, 6). After that, the start time is subtracted from the end time to get the total time taken to perform the operations. Pattern A is used in the popular SlickSpeed, Taskspeed, SunSpider, and Kraken benchmark suites.

The problem(s)

As browsers and devices get faster, benchmarks that use fixed iteration counts have a greater chance of producing 0 ms results, which are unusable.

Pattern B

Another approach is to calculate how many operations are performed in a period of time. This has the advantage of not requiring you to pinpoint a number of iterations, as in the previous example.

var hz,
    period,
    totalTime,
    startTime = new Date,
    runs = 0;
do {
  // Code snippet goes here
  runs++;
  totalTime = new Date - startTime;
} while (totalTime < 1000);

// period → how long each operation takes, in seconds
period = (totalTime / 1000) / runs;

// hz → the number of operations per second
hz = 1 / period;

// can be shortened to
// hz = (runs * 1000) / totalTime;

This snippet executes the test code for about a second, i.e. until totalTime is greater than or equal to 1000 ms. Pattern B is used in Dromaeo and the V8 Benchmark Suite.

The problem(s)

When benchmarking, results will vary from run to run due to garbage collection, engine optimizations, and other background processes. Because of this variance, a benchmark should be run several times to get an average result. The V8 Benchmark Suite only runs each benchmark once; Dromaeo runs each benchmark five times, but could do more in an effort to reduce its margin of error. One way to make room for more repetitions would be to lower the minimum time a benchmark runs from 1000 ms to 50 ms (assuming a non-buggy timer), allowing more time for repeated runs.
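
For example, the whole measurement could simply be repeated a number of times and the results averaged. A minimal sketch, where measure is a hypothetical function assumed to wrap the pattern B snippet above and return an ops/sec figure:

function averageHz(measure, sampleSize) {
  var sum = 0,
      index = sampleSize;
  // repeat the complete measurement and average the results
  while (index--) {
    sum += measure();
  }
  return sum / sampleSize;
}

// e.g. average five complete runs, as Dromaeo does:
// var hz = averageHz(measureSnippet, 5);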

Pattern C

JSLitmus is built around a combination of both these patterns. It uses pattern A to loop a test n times, but employs adaptive test cycles: n is dynamically increased until a minimum test time is reached, as in pattern B.
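
A minimal sketch of such adaptive test cycles (an illustration, not JSLitmus’s actual code):

var n = 1,
    minTime = 1000,
    elapsed = 0,
    startTime,
    iterations;

// keep doubling the iteration count until the test runs long enough
while (elapsed < minTime) {
  startTime = new Date;
  iterations = n;
  while (iterations--) {
    // Code snippet goes here
  }
  elapsed = new Date - startTime;
  if (elapsed < minTime) {
    n *= 2;
  }
}

// hz → the number of operations per second
var hz = (n * 1000) / elapsed;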

The problem(s)

JSLitmus avoids the issues of pattern A but shares the problems of pattern B. In an effort to increase result accuracy, JSLitmus calibrates results by taking the fastest of 3 empty test runs and subtracting that from each benchmark result. Unfortunately, this technique — while intended to remove overhead cost — actually muddies the end result, because “best of 3” is not a statistically valid method. Even if JSLitmus ran benchmarks multiple times and subtracted the calibration average from the benchmark result average, the end result’s increased margin of error would swallow any hope of increased accuracy.

Pattern D

The drawbacks of patterns A, B, and C can be avoided by using function compilation and loop unrolling.

var x = 1,
    y = "1",
    iterations = 1000;

function test() {
  x == y;
}

while (iterations--) {
  test();
}

// ...would compile to →

var hz,
    runs = 1000, // matches the unrolled iteration count
    startTime = new Date;

x == y;
x == y;
x == y;
x == y;
x == y;
// ...

hz = (runs * 1000) / (new Date - startTime);

This pattern compiles the tests into large unrolled functions, avoiding both loop overhead and the need for calibration.

The problem(s)

However, it also has its downsides. Compiling functions like this can drastically increase memory usage and CPU load: when you repeat a test a few million times, you’re essentially building a very large string and compiling a massive function from it.

Another caveat when using loop unrolling is that tests can exit early via a return statement. There’s no point in compiling a million-line function that will return at line 3 anyway. It’s necessary to detect early exits and fall back to the while loop pattern (A) with loop calibration when needed.
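
A rough sketch of such detection (assuming Function.prototype.toString returns the function’s source, which holds in the browsers of interest here):

// does the test contain a `return` statement?
function hasEarlyExit(testFn) {
  return /\breturn\b/.test(String(testFn));
}

if (hasEarlyExit(test)) {
  // fall back to the while loop pattern (A) with loop calibration
} else {
  // safe to compile the test unrolled
}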

Function body extraction

In Benchmark.js, a slightly different technique is used. You could say it uses the best parts of patterns A, B, C, and D. Because of memory concerns, we’re not unrolling loops. In order to reduce factors that might make results less accurate, and to allow tests to access local methods and variables, we extract the function body for each test. For example, when code like this is tested:

var x = 1,
    y = "1",
    iterations = 1000;

function test() {
  x == y;
}

while (iterations--) {
  test();
}

// ...would compile to →

var x = 1,
    y = "1",
    iterations = 1000;
while (iterations--) {
  x == y;
}
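
The extraction itself can be done by decompiling the function. A minimal sketch (extractBody is a hypothetical helper, not Benchmark.js’s actual API, and it assumes Function.prototype.toString returns the function’s source):

// grab everything between the outermost curly braces
function extractBody(fn) {
  var source = String(fn);
  return source.slice(source.indexOf('{') + 1, source.lastIndexOf('}'));
}

// extractBody(test) → '  x == y;' (give or take whitespace)
// ...which can then be recompiled inside a loop:
var compiled = Function('iterations',
  'while (iterations--) {' + extractBody(test) + '}');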

After that, Benchmark.js uses a similar technique to JSLitmus: we run the extracted code in a while loop (pattern A), repeat it until a minimum time is reached (pattern B), and repeat the whole thing multiple times to produce statistically significant results.

Some things to consider

Inaccurate millisecond timers

In some browser/OS combinations, the timers may be inaccurate because of various issues.

For example:

“When Windows XP boots, the typical default clock interrupt period is 10 milliseconds, although a period of 15 milliseconds is used on some systems. That means that every 10 milliseconds, the operating system receives an interrupt from the system timer hardware.”

Some older browsers (e.g. IE, Firefox 2) rely on the internal OS timers, meaning that every time you call new Date().getTime() it will just fetch it directly from the operating system. Obviously, if the internal timer only gets updated every 10 or 15 milliseconds, the uncertainty in the measurement increases and the accuracy of test results decreases significantly. We need to work around this.

Luckily, it’s possible to use JavaScript to get the smallest unit of measure. After that, we can use a little math to reduce the percentage uncertainty of our test results to 1%. To do this, we need to divide the smallest unit of measure by 2 to get the uncertainty. Let’s say we’re using IE6 on Windows XP, and the smallest unit of measure is 15 ms. In this case, the uncertainty equals 15 ms / 2 = 7.5 ms. We want this number to signify only 1%, so we just divide it by 0.01, which gives us the minimum test time required: 7.5 / 0.01 = 750 ms.
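
Here’s a rough sketch of how the smallest unit of measure can be detected and the minimum test time derived from it (a simplified illustration, not Benchmark.js’s exact code):

// detect the timer's smallest unit of measure by timing a full tick
function getTimerResolution() {
  var measured,
      begin = new Date().getTime();
  // wait for the clock to tick over, then time the next full tick
  while (begin == (measured = new Date().getTime())) { }
  begin = measured;
  while (begin == (measured = new Date().getTime())) { }
  return measured - begin;
}

var resolution = getTimerResolution(), // e.g. 15 on IE6/Windows XP
    uncertainty = resolution / 2,      // ± half the smallest unit
    minTestTime = uncertainty / 0.01;  // 7.5 / 0.01 → 750 ms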

Alternative timers

When run with the --enable-benchmarking flag, Chrome and Chromium expose chrome.Interval, which can be used as a high-resolution microsecond timer.
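
A usage sketch, assuming the browser was started with that flag:

var timer = new chrome.Interval;
timer.start();
// Code snippet goes here
timer.stop();
// elapsed time in microseconds
var elapsedMicroseconds = timer.microseconds();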

While working on Benchmark.js, John-David Dalton stumbled across Java’s nanosecond timer and exposed it to JavaScript using a tiny Java applet. It would be interesting to see if there are more possibilities here using other browser plugins.

Using a higher-resolution timer allows for shorter test times, which allows for larger sample sizes, which in turn produces a smaller margin of error in the results.

Firebug disables Firefox’s JIT

Enabling the Firebug add-on effectively disables all of Firefox’s high-performance just-in-time (JIT) native code compilation, meaning you’ll be running the tests in the interpreter. In other words, your tests will run much slower than they would otherwise. You should always remember to disable Firebug before running benchmarks in Firefox.

The same goes for other browser developer tools, like WebKit’s Web Inspector or Opera’s Dragonfly, although their impact appears to be much smaller. Avoid having these open when running benchmarks, as they might influence the results.

Browser bugs and features

Benchmarks that have some kind of looping mechanism are susceptible to various browser quirks, as IE9’s dead code removal recently demonstrated. Bugs in Mozilla’s TraceMonkey engine, or Opera 11’s caching of qSA (querySelectorAll) results, can also throw a wrench into benchmark results. It’s important to keep this in mind when creating test cases.

Statistical significance

Most benchmarks/benchmarking scripts produce results that aren’t statistically significant. John Resig wrote about this before in his article on JavaScript benchmark quality. In short, it’s necessary to consider the margin of error of each result, and reduce it as much as possible. A larger sample size, composed of completed test runs, helps to reduce the margin of error.
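
As an illustration, here’s a simplified sketch of computing the mean and margin of error for a sample of ops/sec measurements. It uses the normal-distribution critical value for a 95% confidence level; a real implementation would use Student’s t-distribution for small samples:

function analyze(sample) {
  var size = sample.length,
      mean = 0,
      variance = 0,
      index;
  for (index = 0; index < size; index++) {
    mean += sample[index];
  }
  mean /= size;
  for (index = 0; index < size; index++) {
    variance += Math.pow(sample[index] - mean, 2);
  }
  variance /= size - 1;
  // margin of error = critical value × standard error of the mean
  var moe = 1.96 * Math.sqrt(variance / size);
  return { mean: mean, moe: moe, rme: (moe / mean) * 100 };
}

// analyze([1000, 1050, 980, 1020, 990]).rme → relative margin of error
// in percent; a larger sample size shrinks it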

Cross-browser testing

If you want to run benchmarks in different browsers and get reliable results, be sure to test in the real browsers. Do not rely on Internet Explorer’s compatibility modes — these differ from the actual browser versions they’re emulating.

Also, be aware that rather than limiting a script by time like all other browsers do, IE (up to version 8) limits a script to 5 million instructions. With modern hardware, a CPU-intensive script can trigger this in less than half a second. If you have a reasonably fast system, you may run into “Script Warning” dialogs in IE, in which case the best solution is to modify your Windows Registry to increase the allowed number of operations. Luckily, Microsoft provides an easy way of doing this; all you need to do is run a simple “Fix It” wizard. What’s even better is that this silly limitation is removed in IE9.

Conclusion

Whether you’re just running some benchmarks, writing your own test suite, or even coding your own benchmarking library — there’s more to JavaScript benchmarking than meets the eye. Benchmark.js and jsPerf are updated weekly with small bug fixes, new features, and clever tricks that improve the accuracy of the test results. If only the popular browser benchmarks would do the same…

ABOUT THE AUTHOR
Mathias Bynens photo

Mathias Bynens (@mathias) works as a freelance web developer in Belgium. He likes HTML, CSS, JavaScript and WPO. To help with those last two things, he created jsPerf a while ago.

John-David Dalton photo

John-David Dalton (@jdalton): my first JavaScript project was a Super Mario Bros. game engine I made in high school. I have always been drawn to JavaScript and other ECMAScript based languages. I spend most of my time tinkering with JavaScript frameworks, fixing bugs and running benchmarks. I love interacting with the JavaScript community and try to help as much as possible. I have a bachelors degree in Multimedia Instructional Design, an awesome wife, and a puppy.

20 Responses to “Bulletproof JavaScript benchmarks”

  1. Tweets that mention Performance Calendar » Bulletproof JavaScript benchmarks -- Topsy.com

    [...] This post was mentioned on Twitter by Matthew Podwysocki, Stoyan Stefanov, Jon Fox, jsPerf and others. jsPerf said: Bulletproof JavaScript benchmarks, by @jdalton and @mathias: http://perfplanet.com/201023 [...]

  2. John Haugeland

    The correct approach is to set iterations to ten thousand and stop involving known-problematic approach B.

    Please leave benchmarking to people who know what omicron time is.

  3. JavaScript Magazine Blog for JSMag » Blog Archive » Apple iAd is PastryKit 2.0, Game engines, Google Body Browser

    [...] – Pure JavaScript node.js PostgreSQL client Bulletproof JavaScript benchmarks (by Mathias Bynens and John-David Dalton) – takes a look at various ways to measure JavaScript performance accurately Thoughts on [...]

  4. Balázs Galambosi

    John Haugeland wrote:
    > The correct approach is to set iterations to ten thousand
    > and stop involving known-problematic approach B.

    Are you serious? This is a very unstable approach. For some tests you may end up getting the results for next Christmas. :) Especially if you want to test in older browsers. The hardest part is to be able to get results in a reasonable amount of time for unequally performant browsers. Think about it.

  5. Mathias Bynens

    Balázs: Your comment explains it very well :) However, I suspect John Haugeland was trolling, since the drawbacks of pattern B are described in the article.

  6. Jorge

    There’s another couple of things to consider when timing:

1.- The OS scheduler is switching among tasks/processes all the time. When your (JS) app is deallocated (frozen), the ms clock keeps ticking.

    2.- You never know when a GC cycle is going to kick in, but when it does, it freezes your app for a while too.

    These two things produce quite a lot of jitter in the timings, not because the clock is inaccurate, but because the code you’re trying to benchmark is not being run smoothly and continuously, but in bursts.

    You can see the effect very clearly by running this : https://gist.github.com/761127

    Jorge.

  7. Max

    I think the best pattern here is actually a variant of the A + B approach — basically doing something like running pattern B for 100ms, calculating the number of executions per second at the end, and using that to run a fixed number of iterations using the pattern A approach. Best of both worlds: you have the low overhead of pattern A, plus you’re guaranteed to hit a particular ballpark in terms of test duration.

  8. John-David Dalton

    @Jorge – There is a bit of deviation which is why benchmark.js averages 80 calculations of the min time needed to run a test and repeats a test run several times to create a sample size (5 through 120+, time permitting), calculates the standard deviation and margin of error (adding test runs to the sample until 5 seconds have passed).

    @Max – What you describe is how Pattern C works.

  9. Max

    @John-David

Not quite. What pattern C does is run a test n times, then n*2 times, then n*4 times, then n*8 times, etc. until a minimum time is reached. My way is to run it for 100ms (by checking how long it’s been running in every iteration — pretty fast actually), then do some grade school math to figure out what n should be to get around 1000ms of test time in. I suppose the outcome is about the same as pattern C, but it takes 1100ms instead of…well, something a lot less deterministic.

  10. John-David Dalton

@Max I see you are getting into the nitty-gritty of it. I was speaking from more of a generalized approach of guessing how many iterations are needed after an initial iteration(s). But yes, your approach would work and is close to how benchmark.js does it (after 5 iterations, with checks for 0ms results and those that approach Infinity ops/sec).

  11. Benchmark | osg.js

    [...] were written to compare the matrix inversion functions of osgjs and CubicVR and can be run. An article explains the concepts implemented at Benchmark.js and references an informative article from John [...]

  12. Travis Paul

I cannot get the --enable-benchmarking flag in Chromium to work, nor can I find any documentation about such a flag, please see my post linked below:

    http://groups.google.com/a/chromium.org/group/chromium-discuss/browse_thread/thread/02c2ee81f53362e0#

  13. Mathias Bynens

    Travis, the --enable-benchmarking flag is documented here: http://peter.sh/experiments/chromium-command-line-switches/#enable-benchmarking

    And here’s an explanation of how to run Chrome with command-line flags: http://www.chromium.org/developers/how-tos/run-chromium-with-flags

  16. thea

Hi,

    Are there any options to set the max samples for testing? I only found minSamples in your documentation.

    Thanks!!!
