I've recently become obsessed with the sheer amount of development activity happening on sites like GitHub.
As a first project on working with this data, I thought it would be fun to rank all the programming languages by counting how many people on GitHub use each language.
I'm using the GitHub Archive and GHTorrent projects as data sources for this analysis. The GitHub Archive provides a record of every public event on GitHub since early 2011. This includes an event every time someone has pushed new code, forked or starred a repository, or opened an issue on GitHub. Overall the GitHub Archive has more than 1.25 Billion events on more than 75 Million different repositories. GHTorrent goes even further and hits the GitHub API for each event - which lets me resolve the language for most repositories.
The cool thing about this is that there are usernames associated with each of those events, which means that I can count how many different people are using each language. Every time a user interacts with a repository I'm counting that user as using the language of that repository - and then aggregating this each month to calculate how many Monthly Active Users (MAU) each language has.
Since the data goes back 7 years, I can also plot how popular each programming language was over time which reveals some interesting patterns. Looking at these trend lines, we can figure out which programming languages are worth learning, and which programming languages probably should be avoided.
All the code for this post is up on GitHub, and I'm planning to keep the rankings there up to date by rerunning this analysis monthly. There are also detailed notes there on how I'm resolving which programming language each repository is written in, which turned out to be far and away the most difficult part of this entire project.
While the repo has overall rankings for each language, I'm also including it here to save you a click:
|Language||Monthly Active Users||Trend|
While the overall rankings are pretty interesting, it's worth taking a deeper look at how these languages have been performing over time.
The major programming languages have relatively stable usage, and are mostly what you'd expect:
Python is growing over time and recently overtook Java to become the second most popular language on GitHub. A lot of the Python growth seems to be coming from interest in Machine Learning. In fact, the overall popularity of Python might be undercounted here (more on that later).
C++ also seems to be gaining over C. This makes some sense; even projects like GCC have converted over from C to C++ in order to get access to some features in C++. Since C++ is mostly a superset of C, the GCC team found that using a limited subset of C++ allowed them to write cleaner code.
The best thing about looking at programming language trends like this is in identifying up-and-coming languages that have a quickly growing user base:
The languages with the fastest-growing user bases are Go, TypeScript, Kotlin, and Rust.
Kotlin seems to be mostly used for Android app development. You can almost see the slope change in the graph around when Google announced first class support for Kotlin on Android early in 2017. The language looks intriguing though, and I'm considering using it for the next project on the JVM I need to work on.
While Rust is growing slower than the other languages here, there are many amazing projects being written in it. In just the last couple of weeks, I've seen a new sampling profiler for Ruby and a new autodifferentiation framework both written in Rust - and Rust is being used by countless other interesting projects.
One interesting thing that all of these languages have in common is that they are all being sponsored by major companies: Google started Go, Microsoft with TypeScript, JetBrains with Kotlin and Mozilla with Rust. Successfully launching a new programming language requires a fair bit of effort - it's not enough to just develop an elegant language, you also have to grow the community and the ecosystem behind the language. Things like IDE support, libraries and packages for common tasks, tools and documentation all matter immensely in getting people to use a language, and this requires significant effort.
It's also interesting looking at the languages that are declining by this metric:
Ruby, PHP, Objective-C, CoffeeScript, and Perl have all had their percentage of users on GitHub decline significantly over the last 7 years. Based on this I'd be somewhat hesitant to use any of them for a new project.
Ruby has seen the steepest decline in this timeframe, going from the 2nd most popular language in 2011 with over 18% of users to the 11th most popular today with 3.2% of users. While this is a shocking decline, it's worth pointing out that these stats are given as a percentage of the GitHub user base - and GitHub is growing fast.
The total number of users on GitHub that appear in the public timeline has exploded in the last 7 years:
User growth has increased over 20x on GitHub in the last 7 years! Any day of 2018 has seen more users committing to GitHub than in any month of 2011.
This means that even languages with a declining market share can still have a growing user base. Here is the same data for these languages when we don't normalize by the number of active users:
Looking at it this way, Ruby has more than 3x the number of active users using the language than in 2011. It just hasn't grown nearly as fast as other languages, causing it to perform relatively worse on this analysis.
There are also a couple of other factors at work here.
First of which is that as GitHub has grown, it's naturally grown more mainstream. GitHub was started by prominent members of the Ruby community, and much of their code is written in Ruby. This caused them to attract a large number of Ruby programmers in their early days causing Ruby to be overrepresented then. As GitHub grew its user base it naturally reached out to programmers in other languages. You can see this in how Java users grew on GitHub from 2011 to 2014: my take is that Java didn't increase in popularity then so much as Java programmers started to use GitHub then.
The second thing to note is that certain newer languages seem to be cannibalizing user share from older languages. For instance, the decline in Objective-C usage corresponds with the rise of Swift. Also, CoffeeScript seems to have been mostly supplanted by TypeScript:
There was one other fast-growing 'language' included in the results that I purposefully left out:
Juptyer Notebooks have seen significant and steady growth in the last couple of years. However, this seems to be mostly because of the growth of Python for doing data science. While Jupyter supports many other languages than just Python, in every case I've looked at the Jupyter Notebook was written in Python. This means that the popularity of Python is potentially undercounted in this analysis.
Also, GitHub assigns a language to a repo by looking at which language has the most bytes in the repo. Since Jupyter Notebooks can include images and other non-code things in them, this means they are frequently the largest files by bytes even if not necessarily by line count. This has lead several Python repositories to be incorrectly labeled as being Juptyer Notebooks.
In the future, I might merge Juptyer and Python together when running this analysis.
Finally, I thought it would be interesting to take a look at how functional languages are doing on GitHub.
Given the relatively small numbers here, there is more noise in the rankings. I'm also not convinced that the apparent decline of things like Clojure and Haskell is anything more than a result of GitHub growing more mainstream over time.
Elixir seems to be worth keeping an eye on though, and only narrowly missed out on being in the top 25 languages.
There are a bunch of other approaches in tracking programming language popularity over time, and I thought I'd link quickly to some of the more popular ones:
While this is hardly the first attempt to track programming language usage, I prefer my approach of tracking the number of unique users that have interacted with a language. Tracking the raw number of times something occurs can bias towards things that have a small number of users with a lot of activity. For instance, there are bots on GitHub that can commit thousands of times a day and will skew results if you look raw activity counts, but this is easily handled by counting users instead of events.
All the code for this post is up on GitHub as well as the overall rankings. I'm planning on keeping the data there up to date and will update monthly as long as there is sufficient interest.
Published on 25 January 2018