[TL;DR]: Your NIC offload settings can interfere with mechanisms such as Early Flush, HEAD flush, and progressive HTML flushing, inducing first-byte delays

Background & Motivation

Like all webperf geeks, I read the Khan Academy article with great interest. While the article's main takeaway about compression was well taken, I could not grok why each JS resource had such a long green bar (TTFB) even after discounting other effects.

The first step was to reproduce the problem, so I went ahead and downloaded their core experiment files, which can be found at:

  • Per File : https://ka-spdytest.appspot.com/perfile-files/index.html
  • Manual: https://ka-spdytest.appspot.com/manual-files/index.html

Sure enough, I could reproduce the same behavior on my end, which can be pictured as follows:

I understand that KA runs their application on Google App Engine (GAE) whereas I was trying it on nginx 1.9.7 with HTTP/2 support, but the core problem for me remained: why is the first byte for small JS files on the order of 40 ms+ when my server is around 7 ms away?

Debugging it Further

Since nginx gives you detailed timers, I enabled those to see if the delay was on the nginx side, but I found that every request was served instantaneously from the perspective of the webserver, so the delays were downstream of it.

The next step was to take packet captures on both the server side and the client side to see where the delay occurred, and to look at the TCP sequence graph to see if anything stuck out.

Nothing major showed up there except for some minor sporadic hiccups, whereas our WPT waterfall consistently showed a first-byte delay for every single JS file.

I tried some permutations of Nagle's algorithm (TCP_NODELAY), TCP_NOPUSH, and sendfile, to no major effect.
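For reference, those knobs map directly to nginx directives. A minimal sketch of the combinations I mean (the directive names are standard nginx; the values shown are just one permutation):

```nginx
# nginx.conf, http or server context
sendfile    on;   # serve static files via sendfile(2)
tcp_nopush  on;   # TCP_NOPUSH/TCP_CORK: send headers together with the file data
tcp_nodelay on;   # TCP_NODELAY: disable Nagle's algorithm (keepalive connections)
```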

Where Else Can Delays Occur?

Now let us think about other places where delays can occur. Between the IP stack and the network interface controller (NIC) lies the driver queue. This queue is typically implemented as a fixed-size first-in, first-out (FIFO) buffer. Packets added to the driver queue by the IP stack are dequeued by the hardware driver
(and sent across a PCI bus to the NIC hardware for transmission). The queue holds SKBs (socket kernel buffers), which are like file descriptors except that they point to socket data.
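You can inspect the driver queue length and NIC ring buffer sizes on a Linux box; a quick sketch (the interface name eth0 is an assumption, substitute your own):

```shell
# Transmit queue length of the interface, in packets (the "qlen" field)
ip link show eth0

# NIC ring buffer sizes (typically requires root)
ethtool -g eth0
```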

In order to avoid the overhead associated with a large number of packets on the transmit path, modern kernels implement several optimizations,
like TCP segmentation offload (TSO) and generic segmentation offload (GSO), which save CPU cycles and bus bandwidth by letting the stack hand the NIC segments much larger than the MTU. Here's a high-level view of the offload components:
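You can see which of these offloads your NIC currently has enabled with ethtool (again, eth0 is an assumed interface name):

```shell
# List offload settings; each line reads e.g. "tcp-segmentation-offload: on"
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
```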

The Solution

Now the key point to take from the above discussion is that the offload features are effective only if your application's buffer writes are larger than the MTU (you can instrument your application's write sizes to verify whether this is true). In the case of our experiment, the JS files mostly fit in a single packet each. I disabled TSO, GSO, and the other offload features and tried the same URL, with the following results:
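A sketch of both steps on a Linux web tier (the interface name eth0 and the nginx process are assumptions; note that ethtool changes do not persist across reboots):

```shell
# Observe the write sizes your application hands to the kernel
strace -f -e trace=write,writev,sendto,sendfile -p "$(pgrep -o nginx)"

# Turn off the segmentation offloads (requires root)
ethtool -K eth0 tso off gso off gro off
```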

As you can see, all the first-byte overhead vanishes and all JS files take less than 10 ms to download. The leap of faith that it must be TSO+GSO causing the delay was based on my two previous experiences with the same symptom:

  • When Facebook moved away from a COTS load balancer to home-grown software, the early flush was similarly delayed due to TSO+GSO being on
  • When you progressively render HTML by flushing chunks of HTML, it is also necessary to turn off these TCP offload features

Final Thoughts & Conclusion

If your waterfall optimization tries to move a staircase-type waterfall to a left-aligned one, as follows,

then you might be flushing buffers correctly in application land, and your logs might tell you it is happening, but it is good to check your offload settings with a tool like ethtool.

Please disable TSO, GSO, and the other offload features on your web tier if you care about performance to the last degree. The extra CPU load is minimal compared to the latency gains. If you are running an FTP server or a big object store, you may have a case for keeping them on, but for most of the modern web this is an anti-pattern.


Paddy Ganti (@paddy_ganti) loves solving web performance problems. He worked on HTTP request reduction at Facebook and is currently working on byte reduction at an SDAD company called Instart Logic. He is totally at home dealing with DNS, TCP, and HTTP issues when not compelling his customers to tune their websites for optimal performance. You can reach him at paddy.ganti@gmail.com