Web Performance Calendar

The speed geek's favorite time of year
2014 Edition
ABOUT THE AUTHOR

Paddy Ganti (@paddy_ganti) loves solving web performance problems. He worked on HTTP request reduction at Facebook and currently works on byte reduction at a SDAD company called Instart Logic. He is totally at home dealing with DNS, TCP and HTTP issues when not compelling his customers to tune their websites for optimal performance. You can reach him at paddy.ganti@gmail.com.

Season’s Greetings!

There are many reasons why your first byte can be slow, but I am going to talk about a very specific interaction that is well known to network geeks but could use some circulation among front-end developers, for it happens to sit in the critical path of the browser. In particular, it tends to affect the boundaries at which the SSL record layer hands off control to the HTTP layer.

It is not a bad idea to refresh your TCP basics before diving into the rest of this article.

Background & Motivation

First, let us understand a problem called Silly Window Syndrome (SWS) that begat both Nagle's algorithm and delayed acknowledgements as solutions.

Visually the problem can be pictured as a sender dribbling out tiny segments: the 1-byte data packet, its ACK, and the window update each carry roughly 40 bytes of IP and TCP headers.

In other words, we spend 120 bytes of header overhead to transport 1 byte of data. To prevent this gross inefficiency/underutilization of the network, the following distributed solution was implemented:

Solution to SWS Problem:

  • Receiver
    – avoid advertising small TCP window advances
    – delay acknowledgements
  • Sender
    – delay transmission of partially filled segments until all previously transmitted data has been acknowledged

The BSD folks developed the receiver-side solution, and the sender side was implemented by John Nagle. The deadlock introduced by the interaction of the two components was not noticed until they were deployed in real-world production networks.

Delayed Acknowledgements

Traditional TCP implementations hold back acknowledgements for up to 200 ms, hoping to piggy-back the ACK on a data-bearing outgoing packet. However, an immediate ACK is generated after every second full-size (MSS-sized) packet. The motivation for delayed ACKs is to save bandwidth. In exchange, we incur the following costs:

* During slow start and congestion avoidance, congestion window growth is driven by the number of acknowledgements received. For bulk transfers, you get an immediate ACK only every second packet, so the sender sees half as many ACKs. This means that slow start is significantly slower than the "double cwnd every round trip" that you read about in textbooks. The naive expectation is to send 2, 4, 8, 16 segments over the first 4 round trips, whereas in reality you would see windows of 2, 3, 5, 8 segments with this in effect (see the toy model after this list). The effect on congestion avoidance is also significant: the usual TCP throughput bound (for high bandwidth, large receiver windows, low-but-nonzero packet loss) is reduced by a factor of 1/sqrt(2) with delayed ACKs on.

* Various timer-related anomalies. For instance, if you were to let RTO fall below 200 ms, the sender may decide a packet is lost when the receiver was simply holding back an ACK. Hence, TCP stacks generally never let RTO fall below 200 ms, even when this would otherwise be desirable, such as on a small-RTT link.
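To make the first point concrete, here is a toy model (an illustrative sketch, not any stack's actual code) where the window grows by one segment per ACK received, and delayed ACKs deliver roughly one ACK per two segments:

  import math

  def windows_per_rtt(rtts, delayed_acks, cwnd=2):
      """Model slow start: cwnd grows by 1 segment per ACK received."""
      out = []
      for _ in range(rtts):
          out.append(cwnd)
          # With delayed ACKs the receiver acks every 2nd segment (the odd
          # segment out is eventually acked by the delay timer).
          acks = math.ceil(cwnd / 2) if delayed_acks else cwnd
          cwnd += acks
      return out

  print("textbook:", windows_per_rtt(4, delayed_acks=False))  # [2, 4, 8, 16]
  print("delayed :", windows_per_rtt(4, delayed_acks=True))   # [2, 3, 5, 8]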

Linux adapts the delayed ACK timeout to the inter-arrival spacing of segments, sampling down to a small interval like 15-20 ms, and enforces a typical maximum that is much less than 200 ms (a delay of around 40 ms is common). The Windows network stack, however, sets the maximum timeout at 200 ms, ramping up to it rapidly during the connection lifetime.

Note that in the case of request/response traffic such as HTTP, there is no hope of piggybacking ACKs on data anyway, so one of the key motivations for delayed ACKs is gone. On relatively high-bandwidth links, the bandwidth cost of the additional ACKs is small. Overall, it is almost certain that delayed ACKs are more complicated to implement than the (small) gain warrants.
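If you control the receiver on Linux, you can opt out per socket. A minimal sketch, assuming a Linux build of CPython that exposes socket.TCP_QUICKACK (the flag is not sticky, so it is typically re-armed around every read; the peer address is hypothetical):

  import socket

  sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  sock.connect(("192.0.2.1", 8080))  # hypothetical peer for illustration
  while True:
      # Ask the kernel to ACK immediately; it may silently fall back to
      # delayed mode, hence setting the option before each read.
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
      data = sock.recv(4096)
      if not data:
          break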

Nagle’s Algorithm

The first standard describing this algorithm is RFC 896, which says: when new data arrives from the user while any transmitted segment remains unacknowledged, send no new segments (of any size!). This worked well. However, RFC 1122 relaxed the original central clause and allows you to send out full-sized segments early. The size of the outstanding packet then became controversial among implementers: what constitutes a full-sized packet (MSS vs. PMTUD, the receiver not knowing the sender's MSS, etc.)? I won't go into the gory details of why the MSS calculation is messy, but be assured that Nagle's algorithm is not consistently implemented across stacks thanks to this confusion.
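In pseudocode form, the RFC 1122-style sender rule boils down to the following (a minimal sketch, not any particular stack's implementation):

  def nagle_may_send_now(bytes_buffered, mss, send_window, unacked_bytes):
      # RFC 1122 relaxation: full-sized segments may always leave at once.
      if bytes_buffered >= mss and send_window >= mss:
          return True
      # With nothing in flight, even a small segment may leave.
      if unacked_bytes == 0:
          return True
      # Otherwise hold the small segment until outstanding data is acked.
      return False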

One could argue that Nagle is irrelevant to today’s Internet, and that floods of small packets are no longer a problem worth solving. This argument fails on at least three grounds:

  1. Many people connect to the network over wireless links, which are usually both slow and shared.
  2. Even on fast links, excessive use of small packets makes inefficient use of expensive resources such as routers.
  3. Nagle's algorithm is a useful firewall against sloppy applications or complex bugs that would otherwise send too many tiny packets.

There was a proposal to rethink the Nagle algorithm which seems promising. Mac OS X is the only OS that integrated this modification into its kernel, but given that this is a server-side option, it would be interesting to see the same on Linux to gauge its impact.
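The gist of the Minshall & Mogul modification, as I read the draft, is to delay only when a small segment is already in flight, rather than whenever any data is unacknowledged. A sketch:

  def minshall_may_send_now(bytes_buffered, mss, small_segment_unacked):
      # Full-sized segments always go out immediately.
      if bytes_buffered >= mss:
          return True
      # A small segment may leave as long as no *small* segment is
      # currently in flight -- this is the change from classic Nagle.
      return not small_segment_unacked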

The Classic Deadlock Problem

Now that we have seen the two components of TCP, let's see how their interaction causes a deadlock. Instead of a contrived example to demonstrate this classic problem, let's take a look at this problem in the wild. The thread has a lot of details, from WPT waterfalls to tcpdumps. The TL;DR can be read here if you are impatient.

The problem can be summarized as follows:

Nagle's algorithm kicks in and waits for its in-flight data to be acknowledged before sending more. Delayed ACK then does its part: an even number of full-sized packets triggers an ACK immediately, but an odd number of packets pays the delayed-ACK timeout tax, which is 200 ms in this case with a Windows WPT client.

To summarize the performance-killing effect (a runnable sketch follows the list):

Nagle/Delayed ack Interaction:

  • Nagle won’t send the last bit of data until it gets an ACK
  • Delayed ACK won’t send that ACK until it gets some response data
  • TCP won’t release the response to the application until it gets all the data acknowledged
    – and so on, until the delayed-ACK timer fires
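Here is a self-contained demonstration of the pattern, a sketch assuming a local loopback; whether you actually observe the stall depends on the stack (Linux starts connections in quickack mode and caps the delay around 40 ms, while Windows uses the full 200 ms timer). The client sends a request as two small writes; with Nagle on, the second write waits for the ACK of the first, which the server's delayed ACK holds back:

  import socket, threading, time

  def server(port):
      srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      srv.bind(("127.0.0.1", port))
      srv.listen(1)
      conn, _ = srv.accept()
      buf = b""
      while not buf.endswith(b"\n"):   # respond only after the full request
          buf += conn.recv(1024)
      conn.sendall(b"OK\n")
      conn.close()
      srv.close()

  def timed_request(port, nodelay):
      t = threading.Thread(target=server, args=(port,))
      t.start()
      time.sleep(0.1)                  # let the server come up
      c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      if nodelay:
          c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
      c.connect(("127.0.0.1", port))
      start = time.time()
      c.sendall(b"HEAD")               # small write #1: goes out at once
      c.sendall(b"ER\n")               # small write #2: Nagle holds this one
      c.recv(16)                       # wait for the response
      print(f"nodelay={nodelay}: {(time.time() - start) * 1000:.1f} ms")
      c.close()
      t.join()

  timed_request(5001, nodelay=False)   # expect a delayed-ACK-sized stall
  timed_request(5002, nodelay=True)    # expect sub-millisecond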

Standard Solution

This classic problem does not occur with non-persistent HTTP requests, because closing the TCP connection also immediately flushes any data waiting for transmission. For persistent connections it can be resolved by disabling Nagle's algorithm, thus disabling the aspect of SWS avoidance that interferes with performance. If traffic is predominantly HTTP based, disabling Nagle's algorithm in the TCP stack may generate a slightly larger number of packets, but throughput will usually be better. Routinely, people enable a socket option called TCP_NODELAY that effectively disables Nagle's algorithm and lets the server send data without waiting for the acknowledgement from the client.
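On the server side this is a one-line socket option. A minimal sketch (servers like nginx expose the same thing via their tcp_nodelay directive):

  import socket

  srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  srv.bind(("0.0.0.0", 8080))
  srv.listen(16)
  conn, addr = srv.accept()
  # Disable Nagle for this connection: small writes go out immediately.
  conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)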

Analytically, if the bottleneck is around 200 ms per transaction, then you can only do 5 such transactions per second (1000/200), compared to about 16 (assuming each transaction takes 62 ms), as seen in the following example.

Nagle Turned ON:

[packet trace figure: each transaction stalls for the ~200 ms delayed-ACK timeout]

Nagle Turned OFF:

[packet trace figure: transactions proceed back to back, roughly 62 ms each]
Returning to the problem posted by the original poster: disabling Nagle did not work as intended at first, and it took a bug fix from the nginx folks to make it work.

Variants of the Problem

This problem, albeit reported in October 2014, has a precedent in 1997, when Heidemann investigated performance problems of persistent connections. Specifically, an object that occupies an odd number of full MSS packets followed by a short packet on a persistent connection will incur a 200 ms delay, which doesn't occur if the object had an even number of large packets. This can strike at the beginning of a transfer or at the end, with potential for it in between:

Short initial segment problem:

  • Slow start kicks off with the initial congestion window (which was 2 in his time and is 10 currently)
  • TCP delays ACKs up to 200 ms but acknowledges at least every other full segment; an even number of segments sees no delay, but otherwise a delay hits between 1 and 3 packets in if the first packet is less than the MSS
  • The broader observation made in the paper is that delaying an ACK (hoping to piggyback it) is rarely successful in request/response traffic. (They note that delayed ACK is also bad for FTP.)

Odd/Short-Final-Segment problem

Suppose we have an odd number of full segments and a short final segment. The sender won't send this final segment while it is small (less than half the client's advertised window) because of silly-window avoidance plus Nagle, until it sees an ACK. But, for the same reasons as above, the client will delay the ACK. So again, about a 200 ms delay.
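A toy helper to flag response sizes at risk (an illustrative sketch, assuming an MSS of 1460 bytes and a receiver that acks every second full segment):

  def risks_stall(response_bytes, mss=1460):
      full, remainder = divmod(response_bytes, mss)
      # An odd number of full segments plus a short trailing segment is
      # the pattern that strands the runt behind Nagle + delayed ACK.
      return remainder > 0 and full % 2 == 1

  for size in (1460, 2920, 4000, 4400, 5000):
      print(size, risks_stall(size))  # True for 4400 and 5000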

Slow Start Restart Problem

If all data is acknowledged and no more data is sent for one retransmission timeout (RTO) period, then the congestion window is reset to its initial value and slow start is repeated, on the theory that the network information might be out of date.

The above is one reason why persistent connections are not as effective as one might think. People usually disable this behavior (on Linux it is the sysctl tcp_slow_start_after_idle, set to 0), which is important for single-connection transfers like SPDY.
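A quick way to inspect the knob on Linux (writing it back requires root; a sketch using the procfs path):

  # 1 means the stack restarts slow start after an idle period.
  with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle") as f:
      print("tcp_slow_start_after_idle =", f.read().strip())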

SSL Impact

My own brush with this problem took the form pictured below:

[WPT waterfall figure: two gray-shaded stalls during the SSL handshake]

The two gray shaded areas represent the slowness during the SSL handshake with a WPT client. We were at a loss to understand why the SSL handshake was so slow, particularly when dealing with Windows clients, and now you all know the answer.

The Solution Space

The best way to attack this problem is in the application space rather than by tweaking the global socket transport. For example, any application doing a series of writes can selectively disable Nagle ONLY on its last write, which recovers the performance while not sacrificing the benefits.

Most of the time these days the underlying framework does the writes for you (nginx, Apache, OpenSSL, etc.), but given the rising trend of microservices, the following tactics can be employed to avoid these deadlocks if you are writing to the network yourself (a sketch of some of them follows the list):

Application Write Strategies :

  • Use vectored (scatter/gather) I/O if a single module does all the writes to the network
  • Use coalesced writes if multiple modules are writing to the socket
  • Use double buffering after Nagle is turned OFF, at the cost of an extra packet
  • Investigate TCP_CORK and MSG_MORE semantics as befits your application, to make sure you never send any segment smaller than the MSS
  • Microsoft and Apple could reduce their delayed ACK timeouts by adopting an adaptive policy, since 200 ms is an arbitrary choice based on the original estimate of segment inter-arrival time
  • Linux can definitely implement the modifications to Nagle mentioned in Minshall & Mogul, though that won't help the SSL interaction issues
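A sketch of the vectored-I/O and TCP_CORK/MSG_MORE tactics on Linux (TCP_CORK and MSG_MORE are Linux-specific socket options; the peer address and buffer contents are hypothetical):

  import socket

  header = b"part-1-of-message"
  body = b"part-2-of-message"

  sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  sock.connect(("192.0.2.1", 8080))   # hypothetical peer

  # Vectored I/O: one sendmsg() hands the kernel both buffers at once,
  # so it can cut MSS-sized segments instead of seeing two small writes.
  sock.sendmsg([header, body])

  # TCP_CORK: hold partial segments until the cork is popped.
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
  sock.sendall(header)
  sock.sendall(body)
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)  # flush now

  # MSG_MORE: a per-call hint that more data follows this write.
  sock.send(header, socket.MSG_MORE)
  sock.sendall(body)                  # no flag: this write may flush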

Final Thoughts & Conclusion

None of the classic TCP modeling efforts factor in the effect of delayed acknowledgements. Please note also that most web traffic flows never leave the slow start phase, so the impact of delayed acknowledgements on short web flows is pretty severe.

Most people tend to disable Nagle rather than disable delayed acknowledgements, because they can control the server side easily and have no control over the clients. There is a control knob at the edge servers of your CDN where I have seen this being disabled with no deleterious effects.

So the last frontier is to selectively enable delayed acknowledgements at the client rather than pay the tax of these messy interactions. Personally I would like to study how much we lose (my guess is a maximum of 1-2 RTTs for long flows and nothing for short flows) if we did disable delayed acknowledgements outright. That would put a number on the cost of SWS avoidance, so we can do a cost-benefit analysis.

I leave you with a note by John Nagle on the subject:

A delayed ACK is a bet. You’re betting that there will be a reply, upon which an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That’s just wrong.
The right answer is to track wins and losses on delayed and non-delayed ACKs. Don’t turn on ACK delay unless you’re sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

I should have pushed for this in the 1980s.

Thank you for reading thus far and I hope this conveyed the messy and somewhat surprising interactions among TCP components that affect your web performance timings.

Happy Holidays, and hopefully 2015 will see fewer of these problems.