Automatically Detecting 3PC with ThirdPartyContent.org

30th Dec 2013 by Billy Hoffman

ABOUT THE AUTHOR

Billy Hoffman (@zoompf) is the founder and CTO of Zoompf Inc, a provider of web performance products and services. Zoompf's scanning technology helps website owners find and fix performance issues which are slowing down their sites. Previously Billy was a web security researcher at SPI Dynamics and managed a research team at HP.

He can open a Coke can without using his hands.

Everyone knows that third party content can be a real problem for your website. Third party content, or 3PC, can create single points of failure (SPOF) that can drastically slow or even completely stop your webpage from loading or rendering properly. It can also be a source of unoptimized, bloated content that increases your page load times.

Since 3PC can cause so many problems, you need to be able to detect what third party content is in use. In the past this was normally a manual process. People would say they just “knew” what third party content they were using. And if someone actually needed to detect what was being used, they had to manually comb through the source code to find out. Unfortunately, this manual approach isn’t scalable.

First of all, the amount of third party content is growing. Strangeloop’s survey from two years ago found that the top websites had, on average, more than 7 pieces of externally hosted 3PC. In fact, pages are including so much 3PC that an entire class of “Tag Management” software was created solely to manage what analytics libraries and tracking pixels a website uses. Perhaps the biggest validation that tag management is a real issue occurred last year when Google released their own tag manager. And remember, tag management software focuses on just a subset of all the 3PC that’s being used on a website. Detecting all 3PC is an unmet need.

Additionally, the typical webpage now includes nearly 100 pieces of content. This number is only increasing. Finding the third party content needles in the ever-growing haystack of your website’s content is becoming more and more difficult. Clearly we need to mature into a more sustainable approach. We need to find ways to automatically detect 3PC.

Benefits of Automated 3PC Detection

Beyond simply dealing with the growing amount of difficult-to-find 3PC, automated detection helps in other ways. If we could automate the detection of 3PC, we could replace a manual, inconsistent, and time consuming process with a dependable, scriptable, and comprehensive one. As someone focused on web performance, I see two immediate and powerful benefits:

Help manage SPOF risk. If you could automatically detect what 3PC a website was using, you could understand your risk of SPOF caused by content beyond your control. You need this information so you can make informed decisions and discover problems. Do you really need those 4 tracking pixels? Wait, why is that jQuery plug-in linked directly to the public GitHub repo URLs? Wow, you are linking to a live chat widget that is no longer used! Automated detection of 3PC makes it possible to scale this kind of analysis.
Improve efficiency by classifying content. If you could automatically detect whether content was 3PC or not, you could better filter performance or functional bugs. For example, why can’t YSlow or PageSpeed filter out those 3PC URLs for tracking widgets when showing you which JavaScript files aren’t getting cached? Or perhaps you only care about how third party content is performing and want to filter out everything except third party content. How great would it be if automated test suites could raise or lower the severity of an issue based on whether the problem is with your content or with 3PC? Automated detection of 3PC makes it possible to do this analysis in a timely, repeatable manner.
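
As a rough illustration of that second benefit, here is a minimal sketch of how a test suite might lower the severity of issues found in third party content and split a report into first party and third party findings. Everything in it is an assumption made for the example: the issue shape, the 1-to-5 severity scale, the `adjustSeverity` name, and especially the `demoDetector` stand-in, which would be replaced by a lookup against a real signature database.

```javascript
// Hypothetical issue shape: { url, rule, severity }, with severity from 1 (low) to 5 (high).
// is3PC is any function that takes a URL string and returns true for third party content.
function adjustSeverity(issues, is3PC, penalty = 2) {
  return issues.map((issue) => {
    const thirdParty = is3PC(issue.url);
    // Lower the severity of 3PC issues, but never below the minimum of 1.
    const severity = thirdParty
      ? Math.max(1, issue.severity - penalty)
      : issue.severity;
    return { ...issue, thirdParty, severity };
  });
}

// Stand-in detector for this example only; a real one would consult a signature database.
const demoDetector = (url) => /google-analytics\.com/i.test(url);

const issues = [
  { url: "http://www.example.com/js/app.js", rule: "not-cached", severity: 4 },
  { url: "http://www.google-analytics.com/ga.js", rule: "not-cached", severity: 4 },
];

const adjusted = adjustSeverity(issues, demoDetector);

// The same flag supports the "show me only my content" and "show me only 3PC" views.
const firstPartyOnly = adjusted.filter((issue) => !issue.thirdParty);
const thirdPartyOnly = adjusted.filter((issue) => issue.thirdParty);
```

Passing the detector in as a parameter keeps the reporting logic independent of how detection is actually done.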

Both of these benefits are predicated on having a mechanism to detect third party content. Detecting 3PC isn’t necessarily helpful by itself. The value is in what you build on top of that detection: the thing that takes the result of ThirdPartyContentDB.is3PC(someURL); and does something amazing with it. Powerful projects and products can be built on top of such a detection system, beyond anything I can imagine or have discussed here.

Since so many wonderful things can be built on top of a robust, up-to-date database that can detect third party content, I decided to build one.

Announcing ThirdPartyContent.org

Today I am launching ThirdPartyContent.org. We are building an open source database of third party web content, including the URL signatures to flag matching content. Imagine something like this for Google Analytics:

 {
   "name": "Google Analytics",
   "type": "analytics",
   "homepage": "http://www.google.com/analytics/",
   "patterns": ["/google-analytics\\.com\\/(urchin\\.js|ga\\.js)/i"]
 }

Now imagine thousands of entries in a database, detecting share widgets and like buttons and tracking pixels and ad servers, all available for you to use for free!
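
To make the idea concrete, here is a minimal sketch of how a program might load entries like the one above and answer the "is this URL third party content?" question. It is not the project's actual API. The `parsePattern`, `find3PC`, and `is3PC` names are hypothetical, and it assumes each pattern is stored as a string in `/body/flags` form, as in the Google Analytics example.

```javascript
// Hypothetical in-memory database; only the Google Analytics entry comes from the article.
const entries = [
  {
    name: "Google Analytics",
    type: "analytics",
    homepage: "http://www.google.com/analytics/",
    patterns: ["/google-analytics\\.com\\/(urchin\\.js|ga\\.js)/i"],
  },
];

// Turn a "/body/flags" string into a RegExp object.
function parsePattern(text) {
  const lastSlash = text.lastIndexOf("/");
  if (!text.startsWith("/") || lastSlash === 0) {
    throw new Error("Pattern is not in /body/flags form: " + text);
  }
  return new RegExp(text.slice(1, lastSlash), text.slice(lastSlash + 1));
}

// Compile every pattern once so lookups do not re-parse them on each call.
const compiled = entries.map((entry) => ({
  entry,
  regexes: entry.patterns.map(parsePattern),
}));

// Return the first entry with a pattern matching the URL, or null when nothing matches.
function find3PC(url) {
  const hit = compiled.find(({ regexes }) => regexes.some((re) => re.test(url)));
  return hit ? hit.entry : null;
}

// Boolean convenience wrapper around find3PC.
function is3PC(url) {
  return find3PC(url) !== null;
}
```

With thousands of entries a linear scan like this may become slow, so a real implementation might first index the signatures by hostname. That optimization is left out to keep the sketch short.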

Our goal is to build and maintain a simple, current, open source database of this information and these URL signatures that can detect third party content. This will be a free, open database that anyone can use for any purpose. The database will be language agnostic and human readable, and it will not depend on [insert any esoteric technology fad]. The database will be simple and easy enough that it can be downloaded and used locally by a program. Finally, we will keep this database up-to-date. URLs change. New 3PC appears. We will create APIs and methods for the community to keep the database up-to-date with reliable signatures.

Luckily, there are many existing resources we can leverage to build this database, including:

Ghostery is a browser plugin that detects third party analytics packages and tracking software entirely based on URL regexes. While Ghostery is no longer open source, its last open source release contained over 200 signatures for 3PC in a JSON file.
The HTTP Archive and its HAR files can be mined to extract all the URLs. Stripping these URLs of query strings and grouping by unique URL will show the common URLs that are referenced by multiple sites. These URLs will be, by definition, third party content.
Zoompf has huge data sets of website scans from our free performance scans and customers. This data is similar to the HTTP Archive in that we can extract and group common URLs to uncover third party content.
Your input and additions!
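
The HAR mining idea from the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the `crawls` input shape and the `urlsFromHar`, `stripQuery`, and `sharedUrls` names are made up for the example, and the sample data is invented. The one detail taken from the HAR format itself is that request URLs live at `log.entries[].request.url`.

```javascript
// Collect the request URLs from a parsed HAR object.
function urlsFromHar(har) {
  return har.log.entries.map((entry) => entry.request.url);
}

// Drop the query string and fragment so variants of one resource group together.
function stripQuery(rawUrl) {
  const url = new URL(rawUrl);
  return url.origin + url.pathname;
}

// crawls: [{ site: "a.example", urls: [...] }, ...] (hypothetical input shape).
// Returns the URLs referenced by at least minSites distinct sites, most shared first.
function sharedUrls(crawls, minSites = 2) {
  const sitesByUrl = new Map();
  for (const { site, urls } of crawls) {
    for (const rawUrl of urls) {
      let key;
      try {
        key = stripQuery(rawUrl);
      } catch {
        continue; // skip anything that does not parse as an absolute URL
      }
      if (!sitesByUrl.has(key)) sitesByUrl.set(key, new Set());
      sitesByUrl.get(key).add(site);
    }
  }
  return [...sitesByUrl.entries()]
    .map(([url, sites]) => ({ url, siteCount: sites.size }))
    .filter((row) => row.siteCount >= minSites)
    .sort((a, b) => b.siteCount - a.siteCount);
}

// Invented sample data: one crawl taken from a tiny HAR object, one from a plain list.
const demoHar = {
  log: {
    entries: [
      { request: { url: "http://www.google-analytics.com/ga.js?x=1" } },
      { request: { url: "http://a.example/js/app.js" } },
    ],
  },
};

const crawls = [
  { site: "a.example", urls: urlsFromHar(demoHar) },
  {
    site: "b.example",
    urls: ["http://www.google-analytics.com/ga.js?utm=2", "http://b.example/site.js"],
  },
];

const shared = sharedUrls(crawls);
```

Counting distinct sites, not raw requests, keeps a URL that one site loads on every page from looking like shared content. The URLs that survive are candidates to review before writing a signature, since the grouping alone does not say what a script is or who provides it.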

Come Join Us

I very much want ThirdPartyContent.org to be an open, vendor neutral collaboration. If you’d like to contribute, please visit ThirdPartyContent.org and sign up for our mailing list. We are using a temporary MailChimp list for now as we decide how best to manage our efforts. (Got an idea? Let us know!)

Web Performance Calendar