A site's robots.txt file advises the world's web crawlers which files they can and can't download. It acts as the first gatekeeper of the internet: unlike blocking a response, it lets you stop requests to your site before they happen. The interesting thing about these files is that they lay out how webmasters intend automated processes to access their websites. While it's easy for a bot to just ignore the file, it specifies an idealized version of how bots should behave.
As such, these files are kind of important. So I thought I'd download the robots.txt file from each of the top million websites on the planet and see what kind of patterns I could find.
I got the list of the top 1 million sites from Alexa and wrote a small program to download the robots.txt file from each domain. With the data all downloaded, I ran each file through Python's urllib.robotparser module and started looking at the results.
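The download-and-parse step described above can be sketched roughly as follows. This is a minimal illustration, not the actual program used for this post: the function names `fetch_robots_txt` and `parse_robots_txt`, the plain-HTTP URL, and the 10 second timeout are all assumptions made for the example.

```python
import http.client
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(domain, timeout=10):
    """Download robots.txt for a domain; return its text, or None on failure.

    Hypothetical helper: the URL scheme and timeout are illustrative choices.
    """
    url = f"http://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Covers DNS failures, timeouts, HTTP errors and malformed URLs.
        return None


def parse_robots_txt(text):
    """Feed already-downloaded robots.txt text into the standard library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    # Parse a small in-memory sample instead of hitting the network.
    sample = "User-agent: *\nDisallow: /private/\n"
    parser = parse_robots_txt(sample)
    print(parser.can_fetch("ExampleBot", "http://example.com/"))
    print(parser.can_fetch("ExampleBot", "http://example.com/private/page"))
```

Keeping the download separate from the parsing means the million files only need to be fetched once and can then be re-analysed offline as often as needed.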
One of my pet peeves is sites that allow Googlebot to index all of their content but ban everyone else. For instance, Facebook's robots.txt file starts off with:
This is a little hypocritical because Facebook itself started off by crawling Harvard students' profile pages - the exact kind of activity they are trying to ban for other people now.
Requiring a written agreement to be in place before allowing crawling of your site flies in the face of the ideals of an open internet. It discourages academic research, and provides a barrier to entry to new search engines: for instance, DuckDuckGo is banned from crawling Facebook while Google isn't.
In a quixotic effort to name and shame sites that engage in this kind of behaviour, I wrote a simple script that checks for domains that let Google index their homepage - but ban everyone else. The most popular domains found that do this are:
I've limited this to domains that are in English so that they're familiar to people reading this, but you can change the language to view international sites. I've also included whether the site lets DuckDuckGo index their homepage, in an effort to show just how much of an uphill battle new search engines have in getting started these days.
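A check like the one described above could look something like this sketch. It is an illustration under stated assumptions rather than the script behind these results: the helper names and the stand-in user-agent `ExampleResearchBot` (representing "everyone else") are hypothetical. Note also that urllib.robotparser applies rules in file order, which can differ from how Google's own parser resolves conflicting Allow and Disallow lines.

```python
from urllib.robotparser import RobotFileParser


def allows_homepage(robots_text, useragent):
    """Return True if the given user-agent may fetch "/" under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(useragent, "/")


def is_google_only(robots_text, other_agent="ExampleResearchBot"):
    """True if Googlebot may fetch the homepage but a generic bot may not.

    `other_agent` is a made-up name that matches no specific rule group,
    so it falls through to the site's "User-agent: *" rules.
    """
    return (allows_homepage(robots_text, "Googlebot")
            and not allows_homepage(robots_text, other_agent))


if __name__ == "__main__":
    sample = (
        "User-agent: Googlebot\n"
        "Disallow: /private/\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /\n"
    )
    print("Google only:", is_google_only(sample))
    print("DuckDuckGo allowed:", allows_homepage(sample, "DuckDuckBot"))
```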
Most of the top domains above - like Facebook, LinkedIn, Quora and Yelp - have one thing in common: they host user-generated content, and that content accounts for most of the value of their business. This data is one of the most valuable assets these businesses have, and they aren't just going to give it away for free. To be fair though, these bans are frequently presented in terms of protecting user privacy, like in this post from Facebook's CTO explaining the decision to ban crawlers, or deep in Quora's robots.txt where they explain why they've banned the Wayback Machine.
Farther down the list, the results aren't quite as consistent. For instance, it is not clear to me why census.gov only allows the three major search engines to access their content while banning DuckDuckGo. You would think that this data would belong to the American people instead of just being for Google/Microsoft/Yahoo.
While I'm not a fan of this kind of behaviour, I can certainly understand the impulse to only whitelist certain crawlers given all of the bad robots that exist out there.
Something else that I wanted to try out was to identify the worst web crawlers on the internet, by leveraging the collective opinion of the million robots.txt files I downloaded. To figure out which bots are the worst actors, I counted up how many different domains completely banned each user-agent - and then ranked the user-agents by how many times they were blocked:
There are a couple of distinct types of bots on this list.
The first group is crawlers that gather data for SEO and marketing analysis. These firms want to get as much data as possible to power their analytics - causing noticeable load on many servers. Ahrefs even brags that "AhrefsBot is the second most active crawler after Googlebot", so it is understandable that people would get annoyed and block them. Majestic (MJ12Bot) positions itself as a competitive analysis tool, meaning that it is crawling your site in order to give business insight to your competitors - but also claims that it has the "worlds largest link index" on their homepage.
The second group of user-agents is from tools that aim to quickly download a website for personal offline use. Tools like WebCopier, Webstripper and Teleport all let you quickly download entire websites onto your hard drive. The problem is the "quickly" bit: all of these tools have obviously hammered sites hard enough that they are frequently banned.
Finally, there are search engines like Baidu (BaiduSpider) and Yandex that can aggressively index content, while only serving languages/markets that don't necessarily bring a ton of value to certain sites. Personally, I get a non-trivial amount of traffic from both of these, so I wouldn't suggest blocking either.
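For the curious, the ranking of banned user-agents discussed in this section could be computed along these lines. This is a simplified sketch with hypothetical function names, not the code behind the actual results: it treats a user-agent as "completely banned" whenever its rule group contains a bare `Disallow: /`, ignores any `Allow` overrides, and skips the `*` wildcard so that only named bots are counted.

```python
from collections import Counter


def fully_banned_agents(robots_text):
    """Return the lowercased user-agents whose rule group has 'Disallow: /'."""
    banned = set()
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field != "sitemap":  # Sitemap lines don't belong to any group
            in_rules = True
            if field == "disallow" and value == "/":
                banned.update(current_agents)
    banned.discard("*")  # only count bots that are singled out by name
    return banned


def rank_banned_agents(robots_by_domain):
    """Count, per user-agent, how many domains ban it completely."""
    counts = Counter()
    for text in robots_by_domain.values():
        # A set per domain, so each domain counts at most once per bot.
        counts.update(fully_banned_agents(text))
    return counts.most_common()


if __name__ == "__main__":
    files = {
        "a.example": "User-agent: MJ12bot\nDisallow: /\n",
        "b.example": "User-agent: MJ12bot\nUser-agent: AhrefsBot\nDisallow: /\n",
    }
    for agent, domains in rank_banned_agents(files):
        print(agent, domains)
```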
It's a sign of the times that files that are meant for consumption by robots now frequently contain job ads looking for software engineers - especially people interested in SEO.
Given I have all this data here, I thought it would be sort of interesting to present the world's first (and probably only ever) jobs board based entirely off of descriptions scraped from robots.txt files:
# [ASCII-art logo]
#
# We thought you'd never make it!
# We hope you feel right at home in this file...unless you're a disallowed subfolder.
# And since you're here, read up on our culture and team: https://www.airbnb.com/careers/departments/engineering
# There's even a bring your robot to work day.
In a small bit of irony, Ahrefs.com, the developer of the second most banned bot I identified here, also has an ad out for an SEO person in their robots.txt file. Also, pricefalls.com prefaces the job ad in their robots.txt file with "Notice: Crawling Pricefalls is prohibited unless you have express written permission."
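One plausible way to surface job ads like these is to pull the comment lines out of each robots.txt file and keep the files whose comments mention hiring-related words. The sketch below is only a guess at such an approach: the keyword list and the function name `job_ad_comments` are assumptions for illustration, and a real run would need manual review to weed out false positives.

```python
import re

# Hypothetical keyword list; tune it after looking at real matches.
JOB_KEYWORDS = re.compile(r"\b(hiring|careers?|jobs?)\b", re.IGNORECASE)


def job_ad_comments(robots_text):
    """Return the file's non-empty comments if any of them looks like a job ad."""
    comments = []
    for line in robots_text.splitlines():
        # Everything after the first '#' on a line is a comment.
        _, hash_mark, comment = line.partition("#")
        if hash_mark:
            comments.append(comment.strip())
    if any(JOB_KEYWORDS.search(comment) for comment in comments):
        return [comment for comment in comments if comment]
    return []


if __name__ == "__main__":
    sample = (
        "# We thought you'd never make it!\n"
        "# Read up on our culture and team: "
        "https://www.airbnb.com/careers/departments/engineering\n"
        "User-agent: *\n"
        "Disallow: /account\n"
    )
    for comment in job_ad_comments(sample):
        print(comment)
```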
All the code for this post is up on GitHub.
Published on 18 October 2017