I've been digging into GitHub data recently, and I thought it would be fun to use that data to figure out exactly where the world's software developers live and then to visualize the results interactively using D3.
In a previous post, I wrote about how an individual's GitHub profile is a noisy and unreliable indicator of programming talent. For this post though I'm aggregating GitHub profiles together over the population of an entire country or city, meaning that the impact of the data sparsity and noise issues I was talking about shouldn't be nearly as significant.
The results ended up being pretty interesting. While top developers live all over the world, an extraordinary amount of them seem to live in the San Francisco Bay area. Likewise, it seems that developing open source software is a luxury for the rich, India has an unusual lack of famous developers, and Eastern Europe might offer the best value on hiring remotely.
I used the GitHub Archive to get a list of all the GitHub users that have had any public activity in the last 7 years. This includes actions like forking or starring a repository, opening or commenting on an issue, and pushing commits. All told there were slightly over 15 million GitHub accounts that met this criteria. Over the course of several months, I fetched the profile information for each of these users using the GitHub /users/ API. After fetching the profiles, about 2.3 million of these users had a location listed. While the most of these accounts don't have a profile listed, and I also didn't fetch the 13+ million accounts that haven't done anything publicly - it is still a large enough of a dataset to draw some trends from.
I used the Google Maps Geocoding API to transform the messy freeform location strings into a properly normalized location. There were a couple of geocoding errors from this process that I manually excluded. For instance,
The Future was geocoded to Palo Alto,
World Wide Web was geocoded as being in Boston and
/dev/random was placed in India. Despite these few errors, Google Maps did an amazing job of taking messy location strings and returning properly normalized data.
All the code for this post is up on GitHub, and I'm planning on posting this data somewhere like Big Query so that people can play with it themselves soon.
If we restrict to just looking at the top developers ranked by the number of followers, we can take the geocoded locations and plot out each developer on a map.
Displaying the top 1K developers using a force directed layout so that they don't totally overlap, and you can see where all of these developers live:
The choice of showing only 1K developers is rather arbitrary, so you can show fewer or more developers by adjusting the slider above. You can also zoom in to get a better view by clicking on a country, and you can see details on individual developers by hovering over them.
While great developers live in every continent across the globe, it's pretty clear that they tend to cluster in a couple of key locations. The biggest cluster is unsurprisingly around the San Francisco Bay area, but there are other large clusters in and around Beijing, New York, London, Shanghai, and Seattle.
Restricting to just the top developers by followers ignores the millions of other accounts that we have data on. One easy way to aggregate the rest of these accounts is by looking at how many GitHub Accounts each country has:
It's also worth looking at the same information displayed as a choropleth. Not only does this show overall geographic trends, but you can also get a report of how each country ranks against a bunch of metrics by clicking on it:
The United States dominates the rankings when looking at the number of accounts: it has more GitHub accounts than the next 5 countries put together. However, this isn't the only way of ranking each country. I've included 4 different ways of ranking the countries here, and I just wanted to talk quickly about why.
One problem with viewing the raw number of GitHub accounts per country is that the top countries tend to be those with large populations.
XKCD summed up this problem pretty well:
While the United States is undeniably a technological superpower, it's also one of the most populous countries on the planet. As an example, the United States has more GitHub accounts than Iceland has people, so you can't really expect Iceland to be competitive when ranking by the number of GitHub accounts. However, if you switch the ranking to look at GitHub accounts per population - Iceland actually has more developers per capita.
Looking at a scatter plot shows that GitHub accounts and Population are at least somewhat correlated:
The trendline for a log-log regression is shown in orange and has an R2 of 0.5, meaning that half the variance in GitHub accounts per country can be explained away by population. I've highlighted countries on the upper and lower Pareto frontier to show certain outliers, but you can also hover over any dot to see the country name and exact values.
Countries significantly above the trendline are mostly rich western countries like Iceland, Sweden, Norway, and Denmark. Countries that are significantly below tend be poorer African nations like Ethiopia, the Congo, and Chad. If you switch the ranking above to view 'Accounts / 1M Population' and the country choropleth ends up looking suspiciously close to maps measuring how wealthy each country is.
This seems to indicate that while population is correlated with GitHub accounts, a better way to look at this is comparing GitHub accounts to GDP:
The correlation here is much stronger with an R2 of 0.84, but it's still interesting to look at which countries over and underperform when comparing GitHub accounts to GDP.
The worst performers here are mostly oil-producing nations like Saudi Arabia, Iraq, Qatar, and Kuwait. This seems to be another example of Dutch disease where natural riches cause countries to under-invest in human capital like Software Engineers.
Likewise, former Soviet states rank seem to rank highly by this measure. Ukraine, Belarus, Serbia, and Moldova all have more GitHub accounts than would be predicted by their respective GDP's. In particular, Ukraine seems like it might be the best bet in terms of getting a deal on hiring a remote developer: there is plenty of Ukrainian talent available at what I imagine must be reasonable rates.
India and China both have a similar number of GitHub accounts so you would expect both of these countries to have a similar influence on GitHub. However, looking at the map of top 1K developers and China has large clusters of developers around Beijing, Shanghai, and Shenzhen - while India only has a few developers scattered across its major cities.
While China and India both have similar populations, 22% of the top 1K developers live in China while less than 1% of the top developers live in India. Likewise, the top developer in China has 39K followers and would rank 3rd globally, while the top developer in India has only 3.1K followers and would rank around 200th of the developers we could geolocate.
To capture this relationship, I've added a simple 'Total Followers' ranking that just ranks each country by the total number of followers every account in the country has. Using this ranking India falls down from 3rd to 8th - and China has about 5.5x as many total followers as India does.
It's not clear to me why India underperforms here, and I'm also skeptical that the number of followers an account has on GitHub is especially meaningful. However, the number of followers an account has on GitHub does seem to be at least roughly correlated with how influential the user is with open source software development so I'm including this as a ranking of how influential each location is.
Looking at just the top countries for software developers is like missing the trees for the forest: the difference between rural and urban areas in most developed countries is almost certainly more pronounced than the difference between major cities in these countries.
Ranking each city in a similar way reveals similar patterns as seen on the graph of top developers:
To see where your city ranks, you can filter down to your own country.
San Francisco really stands out here - it only has around 860K people and still ranks above cities like Shanghai with a population of 25 million people. However, only a tiny fraction of people in the Bay Area live in San Francisco - and a huge amount of development takes place in the South Bay. To get a sense of that, I've included everything that geolocated to Santa Clara County, San Mateo County, San Francisco County and Alameda County under the 'San Francisco Bay Area' label.
It really can't be overstated how dominant California is when looking at all this data:
|GitHub Accounts||Total Followers|
|San Francisco Bay Area||68,833||1,449,521|
If you rank by the number of followers, San Francisco would rank above every other country on the planet except for the United States and China. A city with 860K people has more influence than countries like the UK, Germany, Canada, Brazil, and India with 10's to 100's of millions of people. Likewise if California were a country, it would have more GitHub accounts than every country except for the US, China, and India.
On a per capita basis, San Francisco has over 40,000 accounts per million people. This means that San Francisco has over 10x as many developers per capita as the top-ranked country in the world (Iceland). At least 4% of the city has a GitHub account, and the number is probably quite a bit higher given that only around 15% of accounts have a location listed in their profile.
While the Bay Area is obviously the biggest location for software development in the world, most of the world's software developers live elsewhere. Even inside the US, only about 11% of all GitHub accounts are from the Bay Area - and worldwide less than 3% of software developers live there.
Published on 24 April 2018
Enter your email address to get an email whenever I write a new post: