Josh Fraser (@joshfraser) is the co-founder and CEO of Torbit, a company that offers next-generation web performance, including a free Real User Measurement tool that lets you see how your site's speed impacts your revenue. Torbit also offers Dynamic Content Optimization, which can double the speed of an average site. Josh has been coding since he was 10 and is passionate about making the internet faster.
More than 99 human years are wasted every day because of uncompressed content.
Compressing your content using Gzip is a well-known best practice for achieving fast website performance. Unfortunately, a large percentage of websites are serving content uncompressed and many of the leading CDNs are part of the problem. As crazy as it sounds, most CDNs turn off Gzip by default. I decided to dig into the data we have available in the HTTP Archive to get a better look at the current state of Gzip on the web.
Background on Gzip
Gzip compression works by finding similar strings within a text file, and replacing those strings with a temporary binary representation to make the overall file size smaller. This form of compression is particularly well suited for the web because HTML, JavaScript and CSS files usually contain plenty of repeated strings, such as whitespace, tags, keywords and style definitions.
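To see why repetitive text compresses so well, here is a small sketch (assuming nothing beyond Python's standard gzip module) that compares Gzip on HTML-like text against random bytes of the same length; the text shrinks dramatically while the random data barely changes:

```python
import gzip
import os

# HTML-like text with lots of repeated tags and whitespace
html = ('<div class="item">\n    <span class="label">Hello</span>\n</div>\n' * 200).encode('utf-8')

# Random bytes contain no repeated strings for Gzip to exploit
noise = os.urandom(len(html))

print(f"HTML-like text: {len(html)} -> {len(gzip.compress(html))} bytes")
print(f"Random bytes:   {len(noise)} -> {len(gzip.compress(noise))} bytes")
```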
Browsers are able to indicate support for compression by sending an Accept-Encoding header in each HTTP request. When a web server sees this header in the request, it knows that it can compress the response. It then notifies the web client that the content was compressed via a Content-Encoding header in the response. Gzip was developed by the GNU project and was standardized by RFC 1952.
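As a rough illustration of that negotiation (a sketch using Python's urllib; the URL is only a placeholder), you can send the Accept-Encoding header yourself and check whether the server answers with Content-Encoding: gzip:

```python
import gzip
import urllib.request

# Advertise Gzip support the same way a browser would
req = urllib.request.Request(
    "https://www.example.com/",              # placeholder URL
    headers={"Accept-Encoding": "gzip"},
)

with urllib.request.urlopen(req) as resp:
    body = resp.read()
    encoding = resp.headers.get("Content-Encoding")
    print("Content-Encoding:", encoding)
    if encoding == "gzip":
        # urllib does not decompress automatically, so the body arrives compressed
        print(len(body), "bytes on the wire,", len(gzip.decompress(body)), "bytes decompressed")
```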
Since the HTTP Archive has resource-level data, it turns out to be a great way to see how many websites are serving uncompressed content. I looked at the data from the November 15, 2012 run, which crawled 292,999 websites. I then pulled out the hostnames to find the top offenders:
```sql
SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num
FROM requests
WHERE pageid >= 4147429
  AND pageid <= 4463966
  AND resp_content_encoding IS NULL
GROUP BY hostname
HAVING num > 1
ORDER BY num DESC;
```
Original Hostname | # Ungzipped Requests |
---|---|
www.google-analytics.com | 236,628 |
pagead2.googlesyndication.com | 161,684 |
www.google.com | 154,596 |
static.ak.fbcdn.net | 115,270 |
b.scorecardresearch.com | 90,560 |
p.twitter.com | 78,123 |
ib.adnxs.com | 74,270 |
ssl.gstatic.com | 64,714 |
googleads.g.doubleclick.net | 56,887 |
a0.twimg.com | 51,832 |
s.ytimg.com | 51,539 |
s0.2mdn.net | 45,946 |
cm.g.doubleclick.net | 45,332 |
www.facebook.com | 41,289 |
pixel.quantserve.com | 41,110 |
3.bp.blogspot.com | 38,302 |
1.bp.blogspot.com | 37,926 |
2.bp.blogspot.com | 37,908 |
Of course, the results in the previous table are a bit misleading. CDNs are usually implemented using CNAME records, which allow them to be white-labeled by their customers. To get an accurate list, we need to look up the DNS records for each hostname. Once we unroll the CNAME records, we get a very different list, as shown in the table below.
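For reference, here is a minimal sketch of that unrolling step (assuming Python's standard socket module; the hostnames are just examples from the table above, and real CDN chains may pass through several CNAMEs):

```python
import socket

def unroll(hostname):
    """Follow the CNAME chain and return the canonical hostname."""
    try:
        canonical, _aliases, _addresses = socket.gethostbyname_ex(hostname)
        return canonical
    except socket.gaierror:
        return hostname  # leave names that no longer resolve as-is

# static.ak.fbcdn.net, for example, resolved to an Akamai hostname at the time
for name in ("static.ak.fbcdn.net", "a0.twimg.com", "s.ytimg.com"):
    print(name, "->", unroll(name))
```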
Not surprisingly, Akamai serves more traffic than anyone. Interestingly, while only 40% of the traffic served from akamai.net is Gzipped, Akamai also appears in fourth place with akamaiedge.net, which serves 72.5% of its traffic Gzipped. From what I understand, Akamai uses the akamai.net domain for their legacy customers, while akamaiedge.net is used for their newer customers.
Unrolled Hostname | Total Requests | # Gzipped | % Gzipped |
---|---|---|---|
akamai.net | 1,729,000 | 693,507 | 40.1% |
google.com | 1,160,989 | 738,854 | 63.6% |
doubleclick.net | 458,776 | 386,121 | 84.2% |
akamaiedge.net | 454,605 | 329,810 | 72.5% |
facebook.com | 217,870 | 217,462 | 99.8% |
cloudfront.net | 210,126 | 26,271 | 12.5% |
amazonaws.com | 183,497 | 37,255 | 20.3% |
edgecastcdn.net | 152,074 | 41,779 | 27.5% |
gstatic.com | 118,647 | 44,965 | 37.9% |
v2cdn.net | 113,428 | 79,458 | 70.1% |
googleusercontent.com | 76,043 | 44,022 | 57.9% |
netdna-cdn.com | 73,642 | 16,111 | 21.9% |
cotcdn.net | 63,677 | 25,440 | 40.0% |
footprint.net | 61,281 | 4,253 | 6.9% |
lxdns.com | 57,856 | 7,805 | 13.5% |
cdngc.net | 57,216 | 18,096 | 31.6% |
yahoodns.net | 56,840 | 21,985 | 38.7% |
shifen.com | 56,737 | 27,223 | 48.0% |
akadns.net | 55,723 | 34,339 | 61.6% |
llnwd.net | 54,044 | 7,104 | 13.1% |
Dealing with already compressed content
One flaw with the data so far is that we haven’t considered the type of content being served and whether it makes sense for that content to be Gzipped. While Gzip is great for compressing text formats like HTML, CSS and JavaScript, it shouldn’t necessarily be used for everything. Popular image formats used on the web, as well as videos, PDFs and other binary formats, are already compressed. This means Gzipping them won’t provide much additional benefit, and in some cases can actually make the files larger.
I ran a quick experiment using several hundred images from around the web of various sizes and types. The results show an average of 1% reduction in size when these already-compressed files are Gzipped. Considering the extra CPU overhead, it’s probably not worth doing. While the average was only 1%, I did find a handful of outlier images where using Gzip actually made a significant difference. One such example is the logo for Microsoft Azure. The image Microsoft uses is 19.29 KB. When Gzipped, the logo drops to 12.03 KB (a 37% reduction).
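If you want to run a similar check yourself, here is a rough sketch (the file names are hypothetical) that reports how much Gzip saves on a given file:

```python
import gzip
from pathlib import Path

def gzip_savings(path):
    """Return original size, gzipped size and percent saved for a file."""
    data = Path(path).read_bytes()
    compressed = gzip.compress(data)
    return len(data), len(compressed), 100.0 * (1 - len(compressed) / len(data))

# Hypothetical local files: an already-compressed image and a text resource
for name in ("logo.png", "styles.css"):
    original, gzipped, saved = gzip_savings(name)
    print(f"{name}: {original} -> {gzipped} bytes ({saved:.1f}% saved)")
```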
Ideally, the decision about whether to use Gzip should be made on a resource-by-resource basis. In practice, most people decide whether or not to Gzip a file based on its content type, and for the majority of cases that's a perfectly reasonable approach.
CPU load
Compressing and decompressing content saves bandwidth, but uses additional CPU. This is almost always a worthwhile tradeoff given the speed of compression and the huge cost of doing anything over the network.
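To get a feel for that cost, a quick sketch like the following (timing Gzip on roughly 100 KB of repetitive text; the exact numbers depend on your hardware and compression level) shows that compression typically finishes in a few milliseconds, far cheaper than an extra network round trip:

```python
import gzip
import timeit

# Roughly 100 KB of repetitive, HTML-like text
payload = ('<li class="row">example content</li>\n' * 3000).encode('utf-8')

runs = 100
seconds = timeit.timeit(lambda: gzip.compress(payload), number=runs)
print(f"{len(payload)} bytes compressed in {1000 * seconds / runs:.2f} ms per run")
```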
Size matters
Another thing my quick experiment confirmed is that Gzip isn't great when dealing with really small files. Due to the overhead of compression and decompression, you should only Gzip files above a certain size. Opinions vary on what that minimum should be. Google recommends a minimum somewhere between 150 and 1,000 bytes. Akamai is more precise, claiming that the overhead of compressing an object outweighs the performance gain for anything below 860 bytes. Steve Souders uses 1 KB as his lower limit, while Peter Cranstone, the co-inventor of mod_gzip, says 10 KB is the lowest practical limit. In practice, it probably doesn't matter much which of these numbers you pick, as long as it's less than 1 KB, since a response that small will most likely be transmitted in a single packet anyway.
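The small-file overhead is easy to demonstrate (a sketch using the standard gzip module on an arbitrary tiny payload): the Gzip header and trailer alone can make the output larger than the input.

```python
import gzip

tiny = b'{"ok":true}'             # an 11-byte JSON response
compressed = gzip.compress(tiny)

print(len(tiny), "bytes uncompressed")
print(len(compressed), "bytes gzipped")  # larger than the input: Gzip framing adds ~18 bytes of overhead
```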
Taking these factors into consideration, let's update our query to exclude images and other binary formats and limit the results to files larger than 1 KB.
```sql
SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num
FROM requests
WHERE pageid >= 4147429
  AND pageid <= 4463966
  AND resp_content_encoding IS NULL
  AND mimeType IN ('text/html', 'application/x-javascript', 'text/javascript',
                   'text/css', 'application/javascript', 'text/plain', 'text/xml',
                   'font/eot', 'application/xml', 'application/json', 'text/json',
                   'text/js')
  AND respSize > 1024
GROUP BY hostname
HAVING num > 1
ORDER BY num DESC;
```
Here are the results when we consider only text-based resources with a minimum size of 1 KB:
Hostname | # Ungzipped Requests |
---|---|
cf.addthis.com | 14,188 |
optimized-by.rubiconproject.com | 5,226 |
ib.adnxs.com | 4,916 |
xslt.alexa.com | 4,565 |
tag.admeld.com | 4,331 |
a.adroll.com | 4,079 |
themes.googleusercontent.com | 3,938 |
s7.addthis.com | 3,617 |
counter.rambler.ru | 3,136 |
gslbeacon.lijit.com | 3,118 |
content.adriver.ru | 3,001 |
a.rfihub.com | 2,832 |
js.users.51.la | 2,703 |
c1.rfihub.net | 2,598 |
bdv.bidvertiser.com | 2,433 |
dnn506yrbagrg.cloudfront.net | 2,279 |
rcm.amazon.com | 2,165 |
webmedia.hrblock.com | 2,030 |
server.iad.liveperson.net | 1,945 |
c.cnzz.com | 1,894 |
I talked with someone on the Google Plus team and they were surprised to see their domain at the top of this list. They're still not sure why so many requests are being served ungzipped, but they are investigating the issue. I think it's telling that even top-notch engineering companies like Google are still trying to get this right. To be fair, the only reason they were at the top of the list is that they use a single domain, as we'll see when we roll up the hostnames.
Update 2/20/13: It turned out there was a bug in WebPagetest that was impacting the accuracy of this data. It appears that some headers were being hidden from the browser when loading over HTTPS. I have updated the data above, which now shows that Google Plus isn't the worst offender after all (they don't even make the list). Sorry about that.
Unrolled Hostname | # Ungzipped Requests |
---|---|
akamai.net | 41,918 |
cloudfront.net | 30,107 |
amazonaws.com | 23,947 |
akadns.net | 17,715 |
google.com | 13,546 |
cnzz.com | 11,190 |
googleusercontent.com | 10,895 |
akamaiedge.net | 10,635 |
adriver.ru | 6,425 |
edgecastcdn.net | 5,900 |
liveperson.net | 4,650 |
adnxs.com | 4,008 |
llnwd.net | 3,436 |
footprint.net | 3,016 |
rambler.ru | 2,970 |
51.la | 2,968 |
yahoodns.net | 2,391 |
lxdns.com | 2,361 |
doubleclick.net | 2,345 |
amazon.com | 2,218 |
Takeaways
Doing this research was a reminder of how lucky we are to have the HTTP Archive. It's a great resource that makes quick analyses like this easy. Both the code and the data are open source, so anyone can grab a copy of the data to check my work or do a deeper analysis.
The results themselves are pretty shocking. Gzip is one of the simplest optimizations for websites to employ: turning it on requires very little effort and the performance gains can be huge. So what's going on? Why aren't CDNs doing more to enable compression for their customers? Sadly, as it often turns out, to find the answer you simply need to follow the money. CDNs sell themselves as a tool for improving performance, but they also charge by the byte. The larger the files you send, the more money your CDN makes. This puts their business goals directly at odds with their marketing message that they want to help make your website fast. As a side note, Last-Modified headers are another place where this conflict of interest shows up: the shorter the cache life on your content, the more traffic your CDN gets to serve. Shorter TTLs increase their revenue while hurting your website's performance.
As website owners, we need to understand these business dynamics and be proactive about making sure best practices are followed on our sites. The good news is that with Real User Measurement (RUM) it's easier than ever to measure the actual performance your visitors are experiencing. Less than a year ago there wasn't a good RUM solution available on the market. Today, hundreds of sites are using Torbit Insight or a similar RUM tool to measure their site speed and correlate their website performance with their business metrics.
RUM is a great way to measure the actual results your CDN is delivering. Perhaps you'll discover, like Wayfair did, that you aren't getting the performance gains from your CDN that you expected. As I tell people all the time, the first step to improving your speed is making sure you have accurate measurements. The second step is making sure you have covered the basics, like enabling Gzip.