More than 99 human years are wasted every day because of uncompressed content.

Compressing your content using Gzip is a well-known best practice for achieving fast website performance. Unfortunately, a large percentage of websites are serving content uncompressed and many of the leading CDNs are part of the problem. As crazy as it sounds, most CDNs turn off Gzip by default. I decided to dig into the data we have available in the HTTP Archive to get a better look at the current state of Gzip on the web.

Background on Gzip

Gzip compression works by finding similar strings within a text file, and replacing those strings with a temporary binary representation to make the overall file size smaller. This form of compression is particularly well suited for the web because HTML, JavaScript and CSS files usually contain plenty of repeated strings, such as whitespace, tags, keywords and style definitions.
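
If you want to see this effect for yourself, here's a quick illustration (my own, purely for demonstration) using Python's built-in gzip module on a block of repetitive markup:

import gzip

# A block of repetitive HTML, similar to the repeated tags and
# attributes gzip finds in real markup
html = b'<div class="item"><span class="label">Example</span></div>\n' * 200

compressed = gzip.compress(html)
print(f"original:   {len(html):,} bytes")
print(f"compressed: {len(compressed):,} bytes "
      f"({100 - 100 * len(compressed) / len(html):.1f}% smaller)")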

Browsers are able to indicate support for compression by sending an Accept-Encoding header in each HTTP request. When a web server sees this header in the request, it knows that it can compress the response. It then notifies the web client that the content was compressed via a Content-Encoding header in the response. Gzip was developed by the GNU project and was standardized by RFC 1952.
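
Here's a quick way to watch that negotiation happen for any URL, using nothing but the Python standard library (the URL below is just a placeholder; urllib doesn't transparently decompress, so the Content-Encoding header comes through as-is):

import urllib.request

# Placeholder URL; swap in any resource you want to test
url = "https://www.example.com/"

req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    encoding = resp.headers.get("Content-Encoding", "(none)")
    print(f"Content-Encoding: {encoding}")
    print("Gzipped!" if encoding == "gzip" else "Served uncompressed")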

Since the HTTP Archive has resource-level data, it turns out to be a great way to see how many websites are serving uncompressed content. I looked at the data from the November 15th, 2012 run, which crawled 292,999 websites. I then pulled out the hostnames to find the top offenders:

SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num 
FROM requests 
  WHERE pageid >= 4147429 AND pageid <= 4463966 AND resp_content_encoding IS NULL 
  GROUP BY hostname 
  HAVING num > 1 
  ORDER BY num DESC;
Original Hostname # Ungzipped Requests
www.google-analytics.com 236,628
pagead2.googlesyndication.com 161,684
www.google.com 154,596
static.ak.fbcdn.net 115,270
b.scorecardresearch.com 90,560
p.twitter.com 78,123
ib.adnxs.com 74,270
ssl.gstatic.com 64,714
googleads.g.doubleclick.net 56,887
a0.twimg.com 51,832
s.ytimg.com 51,539
s0.2mdn.net 45,946
cm.g.doubleclick.net 45,332
www.facebook.com 41,289
pixel.quantserve.com 41,110
3.bp.blogspot.com 38,302
1.bp.blogspot.com 37,926
2.bp.blogspot.com 37,908

Of course, the results in the previous table are a bit misleading. CDNs are usually implemented using a CNAME record, which allows them to be white-labeled by their customers. To get an accurate list, we need to look up each of the DNS records. Once we unroll the CNAME records, we get a very different list, as shown in the following table.

Not surprisingly, Akamai serves more traffic than anyone. Interestingly, while only 40% of the traffic served from akamai.net is Gzipped, Akamai also appears in fourth place on the list with akamaiedge.net, which serves 72.5% of its traffic Gzipped. From what I understand, Akamai uses the akamai.net domain for their legacy customers while akamaiedge.net is used for their newer customers.

Unrolled Hostname Total Requests # Gzipped % Gzipped
akamai.net 1,729,000 693,507 40.1%
google.com 1,160,989 738,854 63.6%
doubleclick.net 458,776 386,121 84.2%
akamaiedge.net 454,605 329,810 72.5%
facebook.com 217,870 217,462 99.8%
cloudfront.net 210,126 26,271 12.5%
amazonaws.com 183,497 37,255 20.3%
edgecastcdn.net 152,074 41,779 27.5%
gstatic.com 118,647 44,965 37.9%
v2cdn.net 113,428 79,458 70.1%
googleusercontent.com 76,043 44,022 57.9%
netdna-cdn.com 73,642 16,111 21.9%
cotcdn.net 63,677 25,440 40.0%
footprint.net 61,281 4,253 6.9%
lxdns.com 57,856 7,805 13.5%
cdngc.net 57,216 18,096 31.6%
yahoodns.net 56,840 21,985 38.7%
shifen.com 56,737 27,223 48.0%
akadns.net 55,723 34,339 61.6%
llnwd.net 54,044 7,104 13.1%
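
If you want to do the CNAME unrolling yourself, here's a rough sketch of the idea using the dnspython library (not the exact method I used, and the roll-up to a base domain is deliberately naive):

import dns.exception
import dns.resolver  # pip install dnspython

def unroll(hostname):
    """Follow the CNAME chain and return the canonical hostname."""
    answer = dns.resolver.resolve(hostname, "A")
    return str(answer.canonical_name).rstrip(".")

def base_domain(hostname):
    # Naive roll-up: just keep the last two labels
    return ".".join(hostname.split(".")[-2:])

for host in ["www.google-analytics.com", "s.ytimg.com", "ssl.gstatic.com"]:
    try:
        canonical = unroll(host)
    except dns.exception.DNSException:
        canonical = host  # lookup failed; keep the original name
    print(f"{host} -> {canonical} ({base_domain(canonical)})")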

Dealing with already compressed content

One flaw with the data so far is that we haven’t considered the type of content being served and whether it makes sense for that content to be Gzipped. While Gzip is great for compressing text formats like HTML, CSS and JavaScript, it shouldn’t necessarily be used for everything. Popular image formats used on the web, as well as videos, PDFs and other binary formats, are already compressed. This means Gzipping them won’t provide much additional benefit, and in some cases can actually make the files larger.

I ran a quick experiment using several hundred images from around the web of various sizes and types. The results show an average of 1% reduction in size when these already-compressed files are Gzipped. Considering the extra CPU overhead, it’s probably not worth doing. While the average was only 1%, I did find a handful of outlier images where using Gzip actually made a significant difference. One such example is the logo for Microsoft Azure. The image Microsoft uses is 19.29 KB. When Gzipped, the logo drops to 12.03 KB (a 37% reduction).
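
If you want to reproduce a rough version of that experiment, here's a small Python sketch (not my exact script; the file paths are placeholders) that gzips a directory of images and reports the savings:

import glob
import gzip
import os

savings = []
# Placeholder pattern; point it at whatever images you want to test
for path in glob.glob("images/*.jpg") + glob.glob("images/*.png"):
    original = os.path.getsize(path)
    with open(path, "rb") as f:
        compressed = len(gzip.compress(f.read()))
    # Can be negative: gzipping an already-compressed image may grow it
    savings.append(100 * (original - compressed) / original)
    print(f"{path}: {original:,} -> {compressed:,} bytes")

if savings:
    print(f"average reduction: {sum(savings) / len(savings):.1f}%")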

Ideally, the decision about whether to use Gzip should be made on a resource-by-resource basis. In practice, most people decide whether or not to Gzip a file based on its content-type, and for the majority of cases that's a perfectly reasonable decision.

CPU load

Compressing and decompressing content saves bandwidth, but uses additional CPU. This is almost always a worthwhile tradeoff given the speed of compression and the huge cost of doing anything over the network.
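
To put a very rough number on the CPU side of that tradeoff, here's a quick timing sketch (the payload and iteration count are arbitrary) using Python's gzip module:

import gzip
import time

# Arbitrary ~100KB payload of repetitive HTML-ish text
payload = b'<li class="entry"><a href="/item">An example entry</a></li>\n' * 1700

start = time.perf_counter()
for _ in range(100):
    gzip.compress(payload)  # gzip.compress defaults to compresslevel=9
elapsed = (time.perf_counter() - start) / 100
print(f"~{elapsed * 1000:.2f} ms to compress {len(payload):,} bytes")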

Size matters

Another thing my quick experiment confirmed is that Gzip isn't great when dealing with really small files. Due to the overhead of compression and decompression, you should only Gzip files when it makes sense. Opinions vary on what the minimum size should be. Google recommends a minimum of somewhere between 150 and 1,000 bytes for Gzipping files. Akamai is more precise and claims that the overhead of compressing an object outweighs the performance gain for anything below 860 bytes. Steve Souders uses 1KB as his lower limit, while Peter Cranstone, the co-inventor of mod_gzip, says 10KB is the lowest practical limit. In practice, it probably doesn't matter much which of these numbers you pick, as long as it's less than 1KB, since a response that small will most likely fit in a single packet anyway.
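
To make that concrete, here's a simple sketch of what the decision logic might look like (the 1KB threshold and the MIME list mirror the query below, but both are judgment calls):

MIN_GZIP_BYTES = 1024
TEXT_MIME_TYPES = {
    "text/html", "text/css", "text/plain", "text/xml", "text/javascript",
    "application/javascript", "application/x-javascript",
    "application/xml", "application/json",
}

def should_gzip(mime_type, size_in_bytes):
    """Gzip only text-like content that is big enough to be worth it."""
    return mime_type in TEXT_MIME_TYPES and size_in_bytes >= MIN_GZIP_BYTES

print(should_gzip("text/html", 5000))   # True
print(should_gzip("image/jpeg", 5000))  # False: already compressed
print(should_gzip("text/css", 300))     # False: fits in a single packet anyway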

Taking these factors into consideration, let’s update our query and filter our results to exclude images & other binary formats and limit to files larger than 1KB.

SELECT substring_index(urlShort, '/', 3) AS hostname, COUNT(*) AS num 
FROM requests 
  WHERE 
    pageid >= 4147429 
    AND pageid <= 4463966 
    AND resp_content_encoding IS NULL 
    AND mimeType IN (
      'text/html',
      'application/x-javascript',
      'text/javascript',
      'text/css', 
      'application/javascript', 
      'text/plain', 
      'text/xml', 
      'font/eot', 
      'application/xml', 
      'application/json', 
      'text/json', 
      'text/js') 
    AND respSize > 1024 
  GROUP BY hostname 
  HAVING num > 1
  ORDER BY num DESC;

Here are the results when you only consider text-based resources with a minimum size of 1KB:

Hostname # Ungzipped Requests
http://cf.addthis.com 14188
http://optimized-by.rubiconproject.com 5226
http://ib.adnxs.com 4916
http://xslt.alexa.com 4565
http://tag.admeld.com 4331
http://a.adroll.com 4079
http://themes.googleusercontent.com 3938
http://s7.addthis.com 3617
http://counter.rambler.ru 3136
http://gslbeacon.lijit.com 3118
http://content.adriver.ru 3001
http://a.rfihub.com 2832
http://js.users.51.la 2703
http://c1.rfihub.net 2598
http://bdv.bidvertiser.com 2433
http://dnn506yrbagrg.cloudfront.net 2279
http://rcm.amazon.com 2165
http://webmedia.hrblock.com 2030
http://server.iad.liveperson.net 1945
http://c.cnzz.com 1894

I talked with someone on the Google Plus team and they were surprised to see their domain at the top of this list. They’re still not sure why so many requests are being served ungzipped but they are investigating the issue. I think it’s telling that even top-notch engineering companies like Google are still trying to get this right. To be fair, the only reason they are top of the list is because they use a single domain, as we’ll see when we roll up the hostnames.

Update 2/20/13: It turned out there was a bug in WebPagetest that was impacting the accuracy of this data. It appears that some headers were being hidden from the browser when loading over https. I have updated the data above, which now shows that Google Plus isn't the worst offender after all (they don't even make the list). Sorry about that.

Hostname # Ungzipped Requests
akamai.net 41,918
cloudfront.net 30,107
amazonaws.com 23,947
akadns.net 17,715
google.com 13,546
cnzz.com 11,190
googleusercontent.com 10,895
akamaiedge.net 10,635
adriver.ru 6,425
edgecastcdn.net 5,900
liveperson.net 4,650
adnxs.com 4,008
llnwd.net 3,436
footprint.net 3,016
rambler.ru 2,970
51.la 2,968
yahoodns.net 2,391
lxdns.com 2,361
doubleclick.net 2,345
amazon.com 2,218

Takeaways

Doing this research was a great reminder to me of how lucky we are to have the HTTP Archive. It's a great resource, as it makes it easy to do quick analysis like this. Both the code and the data are open source, so anyone can grab their own copy of the data to check my work or do a deeper analysis.

The results themselves are pretty shocking. Gzip is one of the simplest optimizations for websites to employ. Turning on Gzip requires very little effort and the performance gains can be huge. So what's going on? Why are CDNs not doing more to enable compression for their customers? Sadly, as it often turns out, to find the answer you simply need to follow the money. CDNs sell themselves as a tool for improving performance, but they also charge by the byte. The larger the files you send, the more money your CDN makes. This puts their business goals directly at odds with their marketing that says they want to help make your website fast. As a side note, Last-Modified headers are another place where this conflict of interest exhibits itself. The shorter the cache life on your content, the more traffic your CDN gets to serve. Shorter TTLs increase their revenue while hurting your website's performance.

As website owners, it’s important for us to understand these business dynamics and be proactive to make sure best practices are being followed on our sites. The good news is that with Real User Measurement (RUM) it’s easier than ever to measure the actual performance that your visitors are experiencing. Less than a year ago there wasn’t a good RUM solution available on the market. Today, hundreds of sites are using Torbit Insight or a similar RUM tool to measure their site speed and correlate their website performance to their business metrics.

RUM is a great way to measure the actual results your CDN is delivering. Perhaps you’ll discover, like Wayfair, that you aren’t getting the performance gains from your CDN that you expect. As I tell people all the time, the first step to improving your speed is making sure you have accurate measurement. The second step is making sure you have covered the basics like enabling Gzip.

ABOUT THE AUTHOR

Josh Fraser (@joshfraser) is the co-founder and CEO of Torbit, a company that offers next generation web performance with a free Real User Measurement tool that allows you to correlate how your speed impacts your revenue. Torbit also offers Dynamic Content Optimization which can double the speed of an average site. Josh has been coding since he was 10 and is passionate about making the internet faster.

18 Responses to “Is your CDN intentionally hurting your performance?”

  1. Adventskalender am 20. Dezember « F-LOG-GE

    [...] takes us today to a tool that wants to help with debugging RESTful APIs. In the Performance Calendar, yesterday's post was about CDNs. Today's UXmas calendar entry is about user surveys. At Digitpaint [...]

  2. Aaron Peters

    Hi Josh,

    It is an interesting topic, but just from what I read in this article, I find the title a bit too strong. As the co-founder of http://www.cdnplanet.com and the multi-CDN service Turbobytes, I talk to CDNs a lot, about business and tech stuff, incl. compression and caching.
    Some CDNs only send compressed content when the origin sends it compressed to the CDN, and others can do the compression on the edge too. Some CDNs will send uncompressed content to a client asking for compressed content (!), because the origin did not send the Vary: Accept-Encoding header when the CDN requested the object for the first time (this one is mindblowingly bad, but real world).

    We feed these findings back to the CDNs and make an effort to get that fixed/improved.
    We have told a few CDNs that they should turn on Gzip by default, which means that *they Gzip the content on their edge servers if the client asks for compressed content, even if the origin did not send the file compressed*. We have also seen a case where a CDN that does that gzipping on the edge had an outdated list of MIME types. Webfont files, for example, would not get gzipped on the edge.
    My opinion: the CDNs should do their best to send content compressed when it makes sense. Have intelligence in place for this (no rocket science) and don’t solely rely on the origin server.
    Is there room for improvement? Yes!
    Do CDNs *deliberately* not gzip, or keep sub-optimal logic in place, so they can serve more bytes and make more money? I don't think so.

    The CDN market is very competitive and customers shop around, negotiate hard on price etc. There is a lot of money and effort involved in customer acquisition and retention and the #1 reason to choose a CDN is still ‘better performance’. Improving performance is something CDNs are working hard on all the time. In the area of gzipping, some CDNs can do better, yes. But intentional badness … no.

    About your data …
    The last table shows the Hostname and # Ungzipped Requests.
    I would like to see percentages. What percentage of the compressible content larger than 1 KB was not sent compressed, per CDN?
    a) % of objects not sent compressed
    b) % of bytes not sent compressed

    a) could show that 2% of objects are not gzipped while they should have been, and b) could show that in the end it is only 0.1% of bytes.

    - Aaron

  3. Patrick Meenan

    FYI, WebPagetest (the engine under the HTTP Archive) recently started checking for gzip compression savings on every request regardless of MIME type and reporting the savings (and flagging any requests that can be compressed by > 10%).

    I just updated the HTTP Archive to the latest agents and I’ll see if Steve would be willing to expose the compression savings (or the optimization checks in general) in the requests table.

    Right now most CDNs are super-conservative and essentially act like caching proxies where they depend on the origin servers to "do the right thing". Most have settings to override that behavior, but it would be great to start making more and more of these basic optimizations automatic.

  4. Chris Adams

    I suspect most of the reason for Akamai's surprising gap is that they don't support HTTP 1.1's Vary header[1], so you have to enable this in the CDN configuration rather than on your backend servers. Once you do that, their support is excellent, handling things like not caching separate compressed & uncompressed versions and correctly handling antique IE/XP versions which claim to, but do not reliably, support transfer compression. But it's a "submit a support request" situation, and I suspect many clients haven't done that, particularly if they're used to thinking of the CDN as an accelerator for large media files & other content which doesn't compress.

    1. The docs claim support for Accept-Encoding but in my testing it completely disabled caching. All other Vary headers are explicitly unsupported.

  5. Josh Fraser

    @Aaron,

    Sure, the title is provocative, but I think it’s important to point out the conflict of interest. The traditional CDN business model is broken. The truth is CDNs make more money when they serve bigger files more frequently. Some CDNs are better than others at dealing with the conflict of interests, but I’ve had CDN execs openly admit to me that they were worried about WPO technology hurting their revenue. I’ve also talked to customers who have been the victims of some pretty shady practices. (Happy to share some of their stories offline)

    I definitely agree that it would be interesting to do a deeper analysis. I was a little pressed for time, but I’d love to see other people take the data and go further.

    @Patrick

    Awesome! I’d love to see the aggregate data you have about the potential savings. Totally agree that CDNs should make gzip the default not the thing you can get if you jump through enough hoops.

  6. Stephen

    I have heard from a coworker that although Akamai and other CDNs deliver zipped content, they actually bill their customers for the unzipped bytes anyway, so it does not hurt their bottom line to serve whatever the browser asks for.

    However, I believe the fundamental problem with caching and compression is that you have to store multiple copies of the same content, thus reducing the cache hit ratio for the CDNs. Suppose you have an object that is delivered with Vary: Accept-Encoding. This essentially means that you have to store the following copies in the cache: unzipped content, gzip-only content and deflate-only content. This means the storage requirements triple (which increases disk seek time), not to mention the fact that most caching proxies encounter weird bugs once this is done (like when someone makes a range request for gzipped content, which means you have to unzip the whole thing anyway and rezip just the ranges required), etc.

  7. JulienW

    For the Microsoft Azure logo, I found that merely converting it to PNG using ImageMagick makes it 13kB. Using optipng on this file makes it 12kB. So it's probably a bad idea to gzip it; they should convert it to PNG instead, which gives the same savings.

  8. Chris Adams

    @Stephen: you could also take the approach that Varnish uses and store only the gzipped content in the cache and decompress it when serving clients which don’t support gzip transfer encoding. There aren’t many clients which support deflate but not gzip – see http://zoompf.com/2012/02/lose-the-wait-http-compression – so it’s probably a waste of time to handle them separately.

  9. ericlaw (ex-msft)

    “The other compression format you might come across is deflate, but it’s less popular and generally not as effective.”

    To be precise, GZIP’s compression algorithm *is* DEFLATE. As a consequence, the two formats typically provide the same levels of compression if the compressors use the same parameters.

    The reason that the Azure logo benefits from GZIP is that it was incorrectly saved with huge blocks of metadata within. If you simply resave the logo in JPEG without this metadata, it shrinks from 19k to 13k.

  10. Josh Fraser

    @Julien

    You make a good point that a single-color logo like that should never have been saved as a JPEG to begin with. I'm used to thinking about website optimization from an automation standpoint. You probably wouldn't want an automated process to convert a JPEG to a PNG considering the differences in their compression algorithms, but you could certainly use Gzip in an automated fashion.

    @Eric

    Looks like you’re right about mod_gzip and mod_deflate being essentially the same thing. I’ve corrected the post. Thanks for pointing that out!

  11. Frédéric Kayser

    Serving gzipped files is one thing; setting on-the-fly compression to a level that will not introduce latency or make your CPU load rise considerably is another. This usually means configuring compression at a low level or serving heavily precompressed static files (using kzip+kzip2gz+defluff for instance).

    Deflate is old and slow (developed by Phillip Katz in 1993 for PKZIP 2.0), there are faster or better compression algorithms around now: LZ4 is blistering fast, Bzip2, PPMd and LZMA usually produce smaller compressed streams but are slower and can require more memory. Why does the Web still stick to Deflate (not even Deflate64), nostalgia?

  12. Simon Lyall

    The flip side of this is that I find a lot of clients don’t send “Accept-Encoding: gzip” in their requests, even when it appears to come from a modern desktop browser.

    At work I've found that while I get 80% compression on most of my HTML, about 20% of requests are getting it uncompressed, so the bandwidth from them often exceeds the other 80%!

    The feeling I get is that the uncompressed requests are coming from people behind proxy servers (especially at companies) which are stripping out those headers to simplify their code/settings.

  13. Performance Calendar » Building Faster Sites and Services with Fiddler

    [...] Meenan mentioned that WebPageTest recently added a rule to check every uncompressed response and flag if a 10% [...]

  14. Josh Fraser

    @Simon,

    That’s a great point and I’m glad you brought it up. We’ve seen the same issue at Torbit. Our solution was to send down Gzipped content based on User Agent browser detection. We have a list of browsers that support Gzip and essentially ignore the Accept-Encoding header altogether.

  15. Brian

    To comment on @Aaron's point, most CDNs just pass along whatever the origin servers are using as far as compression and headers. This means that just because you have a CDN does not mean you can stop paying attention to it. CDNs need to be managed and tuned just like any other piece of technology. In my experience, CDN providers have been happy to engage and work with me to improve my performance, but I had to initiate the engagement by talking to my account rep and opening support tickets, etc.

  16. Justin Dorfman

    @Josh great post.

    Not to turn this into a commercial, but at the CDN I work for we enable Compression when a Pull Zone is created: http://note.io/UcSdM4 (evernote screenshot)

    Here is a snippet of the nginx directives used (if you find anything questionable please let me know): https://gist.github.com/4515989

    21.9% is way too low so I am thinking we should remove the checkbox to ensure every Pull/Push Zone is optimized to the fullest. I will talk with my team Monday. =)

  17. Broken business models

    [...] will make it load faster, but CDNs also charge by the byte. This leaves them with the temptation to do things that actually hurt performance in order to make more [...]

  18. treatment gynecomastia

    All right this YouTube video is much enhanced than last one, this one has nice picture quality as well as audio.
