More than 99 human years are wasted every day because of uncompressed content.

Compressing your content using Gzip is a well-known best practice for achieving fast website performance. Unfortunately, a large percentage of websites are serving content uncompressed and many of the leading CDNs are part of the problem. As crazy as it sounds, most CDNs turn off Gzip by default. I decided to dig into the data we have available in the HTTP Archive to get a better look at the current state of Gzip on the web.

Background on Gzip

Gzip compression works by finding repeated strings within a text file and replacing later occurrences with short references back to the first one, making the overall file size smaller. This form of compression is particularly well suited for the web because HTML, JavaScript and CSS files usually contain plenty of repeated strings, such as whitespace, tags, keywords and style definitions.
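
The effect is easy to demonstrate with Python's standard gzip module. This quick sketch (the HTML snippet is made up for illustration) compresses a repetitive fragment of markup:

```python
import gzip

# A small HTML fragment with lots of repeated tags, attributes and
# whitespace -- exactly the redundancy gzip's back-references exploit.
html = ("<ul>\n" + '  <li class="item">item text</li>\n' * 50 + "</ul>\n").encode("utf-8")

compressed = gzip.compress(html)

print(f"original: {len(html)} bytes")
print(f"gzipped:  {len(compressed)} bytes")
print(f"savings:  {100 - 100 * len(compressed) // len(html)}%")
```

Because almost every line is identical, the savings here are extreme; real pages compress less dramatically, but 60-80% reductions on HTML, CSS and JavaScript are common.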

Browsers indicate support for compression by sending an Accept-Encoding header with each HTTP request. When a web server sees this header, it knows it can safely compress the response. It then tells the client that the content was compressed via a Content-Encoding header on the response. Gzip was developed by the GNU project, and its format is standardized in RFC 1952.
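
The server-side half of that negotiation can be sketched as a simple function (a simplified illustration that ignores q-values and `*` wildcards, which a production implementation would need to honor):

```python
def choose_encoding(request_headers):
    """Return the Content-Encoding a server should apply, or None.

    Mirrors the negotiation described above: the client advertises
    support via Accept-Encoding; the server echoes its choice back
    in the Content-Encoding response header.
    """
    accepted = request_headers.get("Accept-Encoding", "")
    # Strip any ";q=..." quality parameters and normalize each token.
    tokens = [t.split(";")[0].strip().lower() for t in accepted.split(",")]
    if "gzip" in tokens:
        return "gzip"
    return None

print(choose_encoding({"Accept-Encoding": "gzip, deflate"}))  # gzip
print(choose_encoding({}))                                    # None
```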

Since the HTTP Archive has resource-level data, it turns out to be a great way to see how many websites are serving uncompressed content. I looked at the data from the November 15th, 2012 run which crawled 292,999 websites. I then pulled out the hostnames to find the top offenders:

SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num
FROM requests
WHERE pageid >= 4147429 AND pageid <= 4463966
  AND resp_content_encoding IS NULL
GROUP BY hostname
HAVING num > 1
ORDER BY num DESC;
Original Hostname # Ungzipped Requests
www.google-analytics.com 236,628
pagead2.googlesyndication.com 161,684
www.google.com 154,596
static.ak.fbcdn.net 115,270
b.scorecardresearch.com 90,560
p.twitter.com 78,123
ib.adnxs.com 74,270
ssl.gstatic.com 64,714
googleads.g.doubleclick.net 56,887
a0.twimg.com 51,832
s.ytimg.com 51,539
s0.2mdn.net 45,946
cm.g.doubleclick.net 45,332
www.facebook.com 41,289
pixel.quantserve.com 41,110
3.bp.blogspot.com 38,302
1.bp.blogspot.com 37,926
2.bp.blogspot.com 37,908

Of course, the results in the previous table are a bit misleading. CDNs are usually implemented using a CNAME record, which allows them to be white-labeled by their customers. To get an accurate list, we need to look up each of the DNS records. Once we unroll the CNAME records, we get a very different list, as shown in the following table.
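
The unrolling itself requires live DNS queries to follow each CNAME chain, but the final grouping step, collapsing each resolved hostname down to its registrable domain, can be sketched like this (a naive heuristic with a hard-coded suffix list, not a full public-suffix implementation):

```python
def roll_up(hostname, two_part_suffixes=("co.uk", "com.cn", "com.br")):
    """Collapse a hostname to its registrable domain, e.g.
    'static.ak.fbcdn.net' -> 'fbcdn.net'.

    The suffix list here is a tiny illustrative subset; real code
    should consult the full Public Suffix List instead.
    """
    labels = hostname.lower().split(".")
    # Keep three labels when the last two form a known public suffix.
    if ".".join(labels[-2:]) in two_part_suffixes:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

for host in ("static.ak.fbcdn.net", "dnn506yrbagrg.cloudfront.net", "www.example.co.uk"):
    print(host, "->", roll_up(host))
```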

Not surprisingly, Akamai does more traffic than anyone. Interestingly, while only 40% of the traffic served from akamai.net is Gzipped, Akamai also appears in fourth place, with akamaiedge.net serving 72.5% Gzipped. From what I understand, Akamai uses the akamai.net domain for their legacy customers, while akamaiedge.net is used for their newer customers.

Unrolled Hostname Total requests # Gzipped % Gzipped
akamai.net 1,729,000 693,507 40.1%
google.com 1,160,989 738,854 63.6%
doubleclick.net 458,776 386,121 84.2%
akamaiedge.net 454,605 329,810 72.5%
facebook.com 217,870 217,462 99.8%
cloudfront.net 210,126 26,271 12.5%
amazonaws.com 183,497 37,255 20.3%
edgecastcdn.net 152,074 41,779 27.5%
gstatic.com 118,647 44,965 37.9%
v2cdn.net 113,428 79,458 70.1%
googleusercontent.com 76,043 44,022 57.9%
netdna-cdn.com 73,642 16,111 21.9%
cotcdn.net 63,677 25,440 40.0%
footprint.net 61,281 4,253 6.9%
lxdns.com 57,856 7,805 13.5%
cdngc.net 57,216 18,096 31.6%
yahoodns.net 56,840 21,985 38.7%
shifen.com 56,737 27,223 48.0%
akadns.net 55,723 34,339 61.6%
llnwd.net 54,044 7,104 13.1%

Dealing with already-compressed content

One flaw with the data so far is that we haven’t considered the type of content being served and whether it makes sense for that content to be Gzipped. While Gzip is great for compressing text formats like HTML, CSS and JavaScript, it shouldn’t necessarily be used for everything. Popular image formats used on the web, as well as videos, PDFs and other binary formats, are already compressed. This means Gzipping them won’t provide much additional benefit, and in some cases can actually make the files larger.

I ran a quick experiment using several hundred images from around the web of various sizes and types. The results show an average of 1% reduction in size when these already-compressed files are Gzipped. Considering the extra CPU overhead, it’s probably not worth doing. While the average was only 1%, I did find a handful of outlier images where using Gzip actually made a significant difference. One such example is the logo for Microsoft Azure. The image Microsoft uses is 19.29 KB. When Gzipped, the logo drops to 12.03 KB (a 37% reduction).
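
You can reproduce the general effect with Python's gzip module. Random bytes serve as a rough stand-in for already-compressed image data (real encoders leave similarly little redundancy), and they actually grow slightly when Gzipped, while repetitive CSS collapses:

```python
import gzip
import os

random_ish = os.urandom(50_000)               # stand-in for a JPEG/PNG payload
text_like = b"body { margin: 0; }\n" * 2500   # 50 KB of repetitive CSS

for label, data in [("binary-like", random_ish), ("text-like", text_like)]:
    out = gzip.compress(data)
    print(f"{label}: {len(data)} -> {len(out)} bytes")
```

The binary-like payload comes out a few bytes *larger* than it went in, since gzip adds its own header and trailer but finds nothing to compress.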

Ideally, the decision about whether to use Gzip should be made on a resource-by-resource basis. In practice, most people decide whether or not to Gzip a file based on its content-type, and for the majority of cases that's a perfectly reasonable approach.

CPU load

Compressing and decompressing content saves bandwidth, but uses additional CPU. This is almost always a worthwhile tradeoff given the speed of compression and the huge cost of doing anything over the network.
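
To put rough numbers on that tradeoff, here's a quick timing sketch with the standard library (the payload is made up for illustration). On commodity hardware, compressing roughly half a megabyte of markup typically takes single-digit milliseconds, while a single network round trip often costs tens to hundreds:

```python
import gzip
import time

# ~480 KB of repetitive markup as a stand-in for a large HTML response.
payload = b"<div class='row'><span>cell</span></div>\n" * 12_000

start = time.perf_counter()
compressed = gzip.compress(payload)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{len(payload) // 1024} KB -> {len(compressed) // 1024} KB in {elapsed_ms:.1f} ms")
```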

Size matters

Another thing my quick experiment confirmed is that Gzip isn't great when dealing with really small files. Due to the overhead of compression and decompression, you should only Gzip files when it makes sense. Opinions vary on what the minimum size should be. Google recommends a minimum of somewhere between 150 and 1,000 bytes before Gzipping a file. Akamai is more precise, claiming that the overhead of compressing an object outweighs the performance gain for anything below 860 bytes. Steve Souders uses 1KB as his lower limit, while Peter Cranstone, the co-inventor of mod_gzip, says 10KB is the lowest practical limit. In practice, it probably doesn't matter much which of these numbers you pick as long as it's less than 1KB, since a response that small will most likely fit in a single packet anyway.

Taking these factors into consideration, let's update our query to exclude images and other binary formats and to limit the results to files larger than 1KB.

SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num
FROM requests
WHERE pageid >= 4147429
  AND pageid <= 4463966
  AND resp_content_encoding IS NULL
  AND mimeType IN (
    'text/html',
    'application/x-javascript',
    'text/javascript',
    'text/css',
    'application/javascript',
    'text/plain',
    'text/xml',
    'font/eot',
    'application/xml',
    'application/json',
    'text/json',
    'text/js')
  AND respSize > 1024
GROUP BY hostname
HAVING num > 1
ORDER BY num DESC;

Here are the results when we filter down to text-based resources with a minimum size of 1KB:

Hostname # Ungzipped Requests
http://cf.addthis.com 14,188
http://optimized-by.rubiconproject.com 5,226
http://ib.adnxs.com 4,916
http://xslt.alexa.com 4,565
http://tag.admeld.com 4,331
http://a.adroll.com 4,079
http://themes.googleusercontent.com 3,938
http://s7.addthis.com 3,617
http://counter.rambler.ru 3,136
http://gslbeacon.lijit.com 3,118
http://content.adriver.ru 3,001
http://a.rfihub.com 2,832
http://js.users.51.la 2,703
http://c1.rfihub.net 2,598
http://bdv.bidvertiser.com 2,433
http://dnn506yrbagrg.cloudfront.net 2,279
http://rcm.amazon.com 2,165
http://webmedia.hrblock.com 2,030
http://server.iad.liveperson.net 1,945
http://c.cnzz.com 1,894

I talked with someone on the Google Plus team and they were surprised to see their domain at the top of this list. They’re still not sure why so many requests are being served ungzipped but they are investigating the issue. I think it’s telling that even top-notch engineering companies like Google are still trying to get this right. To be fair, the only reason they are top of the list is because they use a single domain, as we’ll see when we roll up the hostnames.

Update 2/20/13: It turns out there was a bug in WebPagetest that was impacting the accuracy of this data. It appears that some headers were being hidden from the browser when loading over HTTPS. I have updated the data above, which now shows that Google Plus isn't the worst offender after all (they don't even make the list). Sorry about that.

Hostname # Ungzipped Requests
akamai.net 41,918
cloudfront.net 30,107
amazonaws.com 23,947
akadns.net 17,715
google.com 13,546
cnzz.com 11,190
googleusercontent.com 10,895
akamaiedge.net 10,635
adriver.ru 6,425
edgecastcdn.net 5,900
liveperson.net 4,650
adnxs.com 4,008
llnwd.net 3,436
footprint.net 3,016
rambler.ru 2,970
51.la 2,968
yahoodns.net 2,391
lxdns.com 2,361
doubleclick.net 2,345
amazon.com 2,218

Takeaways

Doing this research was a great reminder of how lucky we are to have the HTTP Archive. It's a great resource, as it makes quick analyses like this easy. Both the code and the data are open source, so anyone can grab their own copy of the data to check my work or do a deeper analysis.

The results themselves are pretty shocking. Gzip is one of the simplest optimizations for websites to employ. Turning it on requires very little effort and the performance gains can be huge. So what's going on? Why aren't CDNs doing more to enable compression for their customers? Sadly, as it often turns out, to find the answer you simply need to follow the money. CDNs sell themselves as a tool for improving performance, but they also charge by the byte: the larger the files you send, the more money your CDN makes. This puts their business goals directly at odds with their marketing pitch of making your website fast. As a side note, caching headers are another place where this conflict of interest shows up. The shorter the cache life on your content, the more traffic your CDN gets to serve. Shorter TTLs increase their revenue while hurting your website's performance.

As website owners, it’s important for us to understand these business dynamics and be proactive to make sure best practices are being followed on our sites. The good news is that with Real User Measurement (RUM) it’s easier than ever to measure the actual performance that your visitors are experiencing. Less than a year ago there wasn’t a good RUM solution available on the market. Today, hundreds of sites are using Torbit Insight or a similar RUM tool to measure their site speed and correlate their website performance to their business metrics.

RUM is a great way to measure the actual results your CDN is delivering. Perhaps you’ll discover, like Wayfair, that you aren’t getting the performance gains from your CDN that you expect. As I tell people all the time, the first step to improving your speed is making sure you have accurate measurement. The second step is making sure you have covered the basics like enabling Gzip.

ABOUT THE AUTHOR

Josh Fraser (@joshfraser) is the co-founder and CEO of Torbit, a company that offers next generation web performance with a free Real User Measurement tool that allows you to correlate how your speed impacts your revenue. Torbit also offers Dynamic Content Optimization which can double the speed of an average site. Josh has been coding since he was 10 and is passionate about making the internet faster.