An Analysis of the World's Leading robots.txt Files

A site’s robots.txt file advises the web crawlers of the worlds what files they can and can’t download. It acts as the first gatekeeper of the internet, unlike blocking the response - it lets you stop requests to your site before it happens. The interesting thing about these files is that it lays out how webmasters intend automated processes should access their websites. While it’s easy for a bot to just ignore this file, it specifies an idealized behaviour of how they should act.

As such these files are kind of important. So I thought I’d download the robots.txt file from each of the top million websites on the planet and see what kind of patterns I could find.

I got the list of the top 1 million sites from Alexa and wrote a small program to download the robots.txt file from each domain. With the data all downloaded, I ran each file through pythons urllib.robotparser package and started looking at the results.

Found in yangteacher.ru/robots.txt

Walled Gardens: Sites that ban everyone except Google

One of my pet peeves are sites that allow GoogleBot to index all of their content but ban everyone else. For instance, Facebook’s robots.txt file starts off with:

Notice: Crawling Facebook is prohibited unless you have express written permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php

This is a little hypocritical because Facebook itself started off by crawling Harvard students’ profile pages - the exact kind of activity they are trying to ban for other people now.

Requiring a written agreement to be in place before allowing crawling of your site flies in the face of the ideals of an open internet. It discourages academic research, and provides a barrier to entry to new search engines: DuckDuckGo is banned from crawling Facebook while Google isn’t for instance.

In a quixotic effort to name and shame sites that engage in this kind of behaviour, I wrote a simple script that checks for domains that let Google index their homepage - but ban everyone else. The most popular domains found that do this are:

Domain	DuckDuckGo	All English Chinese Russian French German
facebook.com		English
instagram.com		English
linkedin.com		English
netflix.com		English
github.com		English
quora.com		English
yelp.com		English
more ...

I’ve limited to domains that are in English so that it’s familiar to people reading this, but you can change the language to view international sites. I’ve also included whether the site lets DuckDuckGo index their homepage, in an effort to show just how much of an uphill battle new search engines have in getting started these days.

Most of the top domains above - like Facebook, LinkedIn, Quora and Yelp - have one thing in common. They are hosting user-generated content that is most of the value in their own business. This data is one of the most valuable assets these businesses have, and they aren’t just going to give it away for free. To be fair though, these bans are frequently presented in terms of protecting user privacy, like in this post from Facebook’s CTO explaining the decision to ban crawlers or deep in Quora’s robots.txt where they explain why they’ve banned the wayback machine.

Farther down the list and the results aren’t quite as consistent - for instance it is not clear to me why census.gov is only allowing the 3 major search engines to access their content, but banning DuckDuckGo. You would think that this data would belong to the American people instead of just being for Google/Microsoft/Yahoo.

While I’m not a fan of this kind of behaviour, I can certainly understand the impulse to only whitelist certain crawlers given all of the bad robots that exist out there.

Bots Behaving Badly

Something else that I wanted to try out was to identify the worst web crawlers on the internet, by leveraging the collective opinion of the million robots.txt files I downloaded. To figure out which bots are the worst actors, I counted up how many different domains completely banned a useragent - and then ranked the useragents by how many times they were blocked:

user-agent	type	count
MJ12bot	SEO	15156
AhrefsBot	SEO	14561
Baiduspider	Search Engine	11473
Nutch	Search Engine	11023
ia_archiver	SEO	10477
WebCopier	Archival	9538
WebStripper	Archival	8579
Teleport	Archival	7991
Yandex	Search Engine	7910
Offline Explorer	Archival	7786
more ...

There are a couple of distinct types of bots on this list.

The first group is crawlers that gather data for SEO and marketing analysis. These firms want to get as much data possible to power their analytics - causing noticeable load on many servers. Ahrefs even brags that “AhrefsBot is the second most active crawler after Googlebot”, so it is understandable that people would get annoyed and block them. Majestic (MJ12Bot) positions itself as a competitive analysis tool, meaning that it is crawling your site in order to give business insight to your competitors - but also claims that it has the “worlds largest link index” on their homepage.

The second group of user-agents is from tools that aim to quickly download a website for personal offline use. Tools like WebCopier, Webstripper and Teleport all let you quickly download entire websites onto your hard-drive. The problem is the quickly bit, all of these tools have obviously hammered sites enough that they are frequently banned here.

Finally, there are search engines like Baidu (BaiduSpider) and Yandex that can aggressively index content, while only serving languages/markets that don’t necessarily bring a ton of value to certain sites. Personally, I get a non-trivial amount of traffic from both of these, so wouldn’t suggest blocking either.

Job Ads

It’s a sign of the times that files that are meant for consumption by robots now frequently contain job ads looking for software engineers - especially people interested in SEO.

Given I have all this data here, I thought it would be sort of interesting to present the worlds first (and probably only ever) jobs board based entirely off of descriptions scraped from robots.txt files:

< Previous Next >

#1 airbnb.com/robots.txt

#                      ///////
#                     //     //
#                    //       //
#                   //         //                          ////         ///                      ///
#                  //           //                                      ///                      ///
#                 //     ///     //               //// ///  /// (// (// /// ////     /// ////    /// ////
#                //   ///   ///   //           &//////////  /// (////// ///////////  //////////  ///////////
#               //   //       //   //          ///     ///  /// (//     ///      /// ///     /// ///     ///
#              //    (/       //    //        ///      ///  /// (//     ///      /// ///     /// ///      ///
#             //      //     //      //        ///     ///  /// (//     ////    //// ///     /// ///     ///
#            //        //   //        //        //////////  /// (//     //////////   ///     /// //////////
#            /(         /////         (/
#            //         ////#         //
#             //      ///   ///      //
#               //////         //////
#
#
#    We thought you'd never make it!
#    We hope you feel right at home in this file...unless you're a disallowed subfolder.
#    And since you're here, read up on our culture and team: https://www.airbnb.com/careers/departments/engineering
#    There's even a bring your robot to work day.

In a small bit of irony, Ahrefs.com who is the developer of the 2nd most banned bot I identified here also has an ad out for an SEO person in their robots.txt file. Also, pricefalls.com prefaces the job ad in their robots.txt file with “Notice: Crawling Pricefalls is prohibited unless you have express written permission.”

All the code for this post is up on GitHub.

Published on 18 October 2017

Get new posts by email!

Enter your email address to get an email whenever I write a new post:

Follow me on:
GitHub
Twitter
RSS

Follow me on: