More than 99 human years are wasted every day because of uncompressed content.

Compressing your content using Gzip is a well-known best practice for achieving fast website performance. Unfortunately, a large percentage of websites are serving content uncompressed and many of the leading CDNs are part of the problem. As crazy as it sounds, most CDNs turn off Gzip by default. I decided to dig into the data we have available in the HTTP Archive to get a better look at the current state of Gzip on the web.

Background on Gzip

Gzip compression works by finding repeated strings within a text file and replacing later occurrences with short references back to the first one, making the overall file size smaller. This form of compression is particularly well suited for the web because HTML, JavaScript and CSS files usually contain plenty of repeated strings, such as whitespace, tags, keywords and style definitions.
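
The effect is easy to demonstrate with Python's standard gzip module. This quick sketch (the HTML snippet is made up for illustration) compresses a repetitive fragment of markup:

```python
import gzip

# A small HTML fragment with lots of repeated tags, attributes and
# whitespace -- exactly the redundancy gzip's back-references exploit.
html = ("<ul>\n" + '  <li class="item">item text</li>\n' * 50 + "</ul>\n").encode("utf-8")

compressed = gzip.compress(html)

print(f"original: {len(html)} bytes")
print(f"gzipped:  {len(compressed)} bytes")
print(f"savings:  {100 - 100 * len(compressed) // len(html)}%")
```

Because almost every line is identical, the savings here are extreme; real pages compress less dramatically, but 60-80% reductions on HTML, CSS and JavaScript are common.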

Browsers indicate support for compression by sending an Accept-Encoding header with each HTTP request. When a web server sees this header, it knows it can safely compress the response. It then tells the client that the content was compressed via a Content-Encoding header on the response. Gzip was developed by the GNU project, and its format is standardized in RFC 1952.
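
The server-side half of that negotiation can be sketched as a simple function (a simplified illustration that ignores q-values and `*` wildcards, which a production implementation would need to honor):

```python
def choose_encoding(request_headers):
    """Return the Content-Encoding a server should apply, or None.

    Mirrors the negotiation described above: the client advertises
    support via Accept-Encoding; the server echoes its choice back
    in the Content-Encoding response header.
    """
    accepted = request_headers.get("Accept-Encoding", "")
    # Strip any ";q=..." quality parameters and normalize each token.
    tokens = [t.split(";")[0].strip().lower() for t in accepted.split(",")]
    if "gzip" in tokens:
        return "gzip"
    return None

print(choose_encoding({"Accept-Encoding": "gzip, deflate"}))  # gzip
print(choose_encoding({}))                                    # None
```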

Since the HTTP Archive has resource-level data, it turns out to be a great way to see how many websites are serving uncompressed content. I looked at the data from the November 15th, 2012 run which crawled 292,999 websites. I then pulled out the hostnames to find the top offenders:

SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num
FROM requests
WHERE pageid >= 4147429 AND pageid <= 4463966
  AND resp_content_encoding IS NULL
GROUP BY hostname
HAVING num > 1
ORDER BY num DESC;
Original Hostname # Ungzipped Requests
www.google-analytics.com 236,628
pagead2.googlesyndication.com 161,684
www.google.com 154,596
static.ak.fbcdn.net 115,270
b.scorecardresearch.com 90,560
p.twitter.com 78,123
ib.adnxs.com 74,270
ssl.gstatic.com 64,714
googleads.g.doubleclick.net 56,887
a0.twimg.com 51,832
s.ytimg.com 51,539
s0.2mdn.net 45,946
cm.g.doubleclick.net 45,332
www.facebook.com 41,289
pixel.quantserve.com 41,110
3.bp.blogspot.com 38,302
1.bp.blogspot.com 37,926
2.bp.blogspot.com 37,908

Of course, the results in the previous table are a bit misleading. CDNs are usually implemented using a CNAME record, which allows them to be white-labeled by their customers. To get an accurate list, we need to look up each of the DNS records. Once we unroll the CNAME records, we get a very different list, as shown in the following table.
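
The unrolling itself requires live DNS queries to follow each CNAME chain, but the final grouping step, collapsing each resolved hostname down to its registrable domain, can be sketched like this (a naive heuristic with a hard-coded suffix list, not a full public-suffix implementation):

```python
def roll_up(hostname, two_part_suffixes=("co.uk", "com.cn", "com.br")):
    """Collapse a hostname to its registrable domain, e.g.
    'static.ak.fbcdn.net' -> 'fbcdn.net'.

    The suffix list here is a tiny illustrative subset; real code
    should consult the full Public Suffix List instead.
    """
    labels = hostname.lower().split(".")
    # Keep three labels when the last two form a known public suffix.
    if ".".join(labels[-2:]) in two_part_suffixes:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

for host in ("static.ak.fbcdn.net", "dnn506yrbagrg.cloudfront.net", "www.example.co.uk"):
    print(host, "->", roll_up(host))
```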

Not surprisingly, Akamai does more traffic than anyone. Interestingly, while only 40% of the traffic served from akamai.net is Gzipped, Akamai also appears in fourth place, with akamaiedge.net serving 72.5% Gzipped. From what I understand, Akamai uses the akamai.net domain for their legacy customers, while akamaiedge.net is used for their newer customers.

Unrolled Hostname Total requests # Gzipped % Gzipped
akamai.net 1,729,000 693,507 40.1%
google.com 1,160,989 738,854 63.6%
doubleclick.net 458,776 386,121 84.2%
akamaiedge.net 454,605 329,810 72.5%
facebook.com 217,870 217,462 99.8%
cloudfront.net 210,126 26,271 12.5%
amazonaws.com 183,497 37,255 20.3%
edgecastcdn.net 152,074 41,779 27.5%
gstatic.com 118,647 44,965 37.9%
v2cdn.net 113,428 79,458 70.1%
googleusercontent.com 76,043 44,022 57.9%
netdna-cdn.com 73,642 16,111 21.9%
cotcdn.net 63,677 25,440 40.0%
footprint.net 61,281 4,253 6.9%
lxdns.com 57,856 7,805 13.5%
cdngc.net 57,216 18,096 31.6%
yahoodns.net 56,840 21,985 38.7%
shifen.com 56,737 27,223 48.0%
akadns.net 55,723 34,339 61.6%
llnwd.net 54,044 7,104 13.1%

Dealing with already-compressed content

One flaw with the data so far is that we haven’t considered the type of content being served and whether it makes sense for that content to be Gzipped. While Gzip is great for compressing text formats like HTML, CSS and JavaScript, it shouldn’t necessarily be used for everything. Popular image formats used on the web, as well as videos, PDFs and other binary formats, are already compressed. This means Gzipping them won’t provide much additional benefit, and in some cases can actually make the files larger.

I ran a quick experiment using several hundred images from around the web of various sizes and types. The results show an average of 1% reduction in size when these already-compressed files are Gzipped. Considering the extra CPU overhead, it’s probably not worth doing. While the average was only 1%, I did find a handful of outlier images where using Gzip actually made a significant difference. One such example is the logo for Microsoft Azure. The image Microsoft uses is 19.29 KB. When Gzipped, the logo drops to 12.03 KB (a 37% reduction).
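
You can reproduce the general effect with Python's gzip module. Random bytes serve as a rough stand-in for already-compressed image data (real encoders leave similarly little redundancy), and they actually grow slightly when Gzipped, while repetitive CSS collapses:

```python
import gzip
import os

random_ish = os.urandom(50_000)               # stand-in for a JPEG/PNG payload
text_like = b"body { margin: 0; }\n" * 2500   # 50 KB of repetitive CSS

for label, data in [("binary-like", random_ish), ("text-like", text_like)]:
    out = gzip.compress(data)
    print(f"{label}: {len(data)} -> {len(out)} bytes")
```

The binary-like payload comes out a few bytes *larger* than it went in, since gzip adds its own header and trailer but finds nothing to compress.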

Ideally, the decision about whether to use Gzip should be made on a resource-by-resource basis. In practice, most people decide whether or not to Gzip a file based on its content-type, and for the majority of cases that's a perfectly reasonable approach.

CPU load

Compressing and decompressing content saves bandwidth, but uses additional CPU. This is almost always a worthwhile tradeoff given the speed of compression and the huge cost of doing anything over the network.
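
To put rough numbers on that tradeoff, here's a quick timing sketch with the standard library (the payload is made up for illustration). On commodity hardware, compressing roughly half a megabyte of markup typically takes single-digit milliseconds, while a single network round trip often costs tens to hundreds:

```python
import gzip
import time

# ~480 KB of repetitive markup as a stand-in for a large HTML response.
payload = b"<div class='row'><span>cell</span></div>\n" * 12_000

start = time.perf_counter()
compressed = gzip.compress(payload)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{len(payload) // 1024} KB -> {len(compressed) // 1024} KB in {elapsed_ms:.1f} ms")
```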

Size matters

Another thing my quick experiment confirmed is that Gzip isn't great when dealing with really small files. Due to the overhead of compression and decompression, you should only Gzip files when it makes sense. Opinions vary on what the minimum size should be. Google recommends a minimum of somewhere between 150 and 1,000 bytes before Gzipping a file. Akamai is more precise, claiming that the overhead of compressing an object outweighs the performance gain for anything below 860 bytes. Steve Souders uses 1KB as his lower limit, while Peter Cranstone, the co-inventor of mod_gzip, says 10KB is the lowest practical limit. In practice, it probably doesn't matter much which of these numbers you pick as long as it's less than 1KB, since a response that small will most likely fit in a single packet anyway.

Taking these factors into consideration, let's update our query to exclude images and other binary formats and to limit the results to files larger than 1KB.

SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num
FROM requests
WHERE pageid >= 4147429
  AND pageid <= 4463966
  AND resp_content_encoding IS NULL
  AND mimeType IN (
    'text/html',
    'application/x-javascript',
    'text/javascript',
    'text/css',
    'application/javascript',
    'text/plain',
    'text/xml',
    'font/eot',
    'application/xml',
    'application/json',
    'text/json',
    'text/js')
  AND respSize > 1024
GROUP BY hostname
HAVING num > 1
ORDER BY num DESC;

Here are the results when we filter down to text-based resources with a minimum size of 1KB:

Hostname # Ungzipped Requests
http://cf.addthis.com 14,188
http://optimized-by.rubiconproject.com 5,226
http://ib.adnxs.com 4,916
http://xslt.alexa.com 4,565
http://tag.admeld.com 4,331
http://a.adroll.com 4,079
http://themes.googleusercontent.com 3,938
http://s7.addthis.com 3,617
http://counter.rambler.ru 3,136
http://gslbeacon.lijit.com 3,118
http://content.adriver.ru 3,001
http://a.rfihub.com 2,832
http://js.users.51.la 2,703
http://c1.rfihub.net 2,598
http://bdv.bidvertiser.com 2,433
http://dnn506yrbagrg.cloudfront.net 2,279
http://rcm.amazon.com 2,165
http://webmedia.hrblock.com 2,030
http://server.iad.liveperson.net 1,945
http://c.cnzz.com 1,894

I talked with someone on the Google Plus team and they were surprised to see their domain at the top of this list. They’re still not sure why so many requests are being served ungzipped but they are investigating the issue. I think it’s telling that even top-notch engineering companies like Google are still trying to get this right. To be fair, the only reason they are top of the list is because they use a single domain, as we’ll see when we roll up the hostnames.

Update 2/20/13: It turns out there was a bug in WebPagetest that was impacting the accuracy of this data. It appears that some headers were being hidden from the browser when loading over HTTPS. I have updated the data above, which now shows that Google Plus isn't the worst offender after all (they don't even make the list). Sorry about that.

Hostname # Ungzipped Requests
akamai.net 41,918
cloudfront.net 30,107
amazonaws.com 23,947
akadns.net 17,715
google.com 13,546
cnzz.com 11,190
googleusercontent.com 10,895
akamaiedge.net 10,635
adriver.ru 6,425
edgecastcdn.net 5,900
liveperson.net 4,650
adnxs.com 4,008
llnwd.net 3,436
footprint.net 3,016
rambler.ru 2,970
51.la 2,968
yahoodns.net 2,391
lxdns.com 2,361
doubleclick.net 2,345
amazon.com 2,218

Takeaways

Doing this research was a great reminder of how lucky we are to have the HTTP Archive. It's a great resource, as it makes quick analyses like this easy. Both the code and the data are open source, so anyone can grab their own copy of the data to check my work or do a deeper analysis.

The results themselves are pretty shocking. Gzip is one of the simplest optimizations for websites to employ. Turning it on requires very little effort and the performance gains can be huge. So what's going on? Why aren't CDNs doing more to enable compression for their customers? Sadly, as it often turns out, to find the answer you simply need to follow the money. CDNs sell themselves as a tool for improving performance, but they also charge by the byte: the larger the files you send, the more money your CDN makes. This puts their business goals directly at odds with their marketing pitch of making your website fast. As a side note, caching headers are another place where this conflict of interest shows up. The shorter the cache life on your content, the more traffic your CDN gets to serve. Shorter TTLs increase their revenue while hurting your website's performance.

As website owners, it’s important for us to understand these business dynamics and be proactive to make sure best practices are being followed on our sites. The good news is that with Real User Measurement (RUM) it’s easier than ever to measure the actual performance that your visitors are experiencing. Less than a year ago there wasn’t a good RUM solution available on the market. Today, hundreds of sites are using Torbit Insight or a similar RUM tool to measure their site speed and correlate their website performance to their business metrics.

RUM is a great way to measure the actual results your CDN is delivering. Perhaps you’ll discover, like Wayfair, that you aren’t getting the performance gains from your CDN that you expect. As I tell people all the time, the first step to improving your speed is making sure you have accurate measurement. The second step is making sure you have covered the basics like enabling Gzip.

ABOUT THE AUTHOR

Josh Fraser (@joshfraser) is the co-founder and CEO of Torbit, a company that offers next generation web performance with a free Real User Measurement tool that allows you to correlate how your speed impacts your revenue. Torbit also offers Dynamic Content Optimization which can double the speed of an average site. Josh has been coding since he was 10 and is passionate about making the internet faster.