One of the things I'm working on right now is a project that's aggregating data found in developers GitHub profiles. Since there are a couple of problems with using GitHub profiles as a data source like this, I wanted to first list out some of the issues I have with trying to assess developers by looking only at their GitHub contributions.
One common misuse of GitHub profile data is in trying to filter out job candidates. People still seem to think that you can figure out how talented a developer is merely by looking at their open source contributions. As an example in the latest Hacker News' Who is Hiring thread, there are a bunch of different job ads asking for a Github profile as part of the job application.
There are already a bunch of great posts arguing against requiring GitHub contributions as part of the hiring process. I particularly recommend The Ethics of Unpaid Labor and the OSS Community and Why GitHub is Not Your CV. While both of those posts give excellent reasons to reconsider asking for open source contributions when hiring, my take here isn't about why it is ethically dubious to require open source contributions or why GitHub isn't great for showcasing your projects.
Instead, this post is about why GitHub profiles just aren't all that useful when looking to hire developers.
If you checked out the public GitHub profile of the best software engineer I've ever worked with, you'd see something like:
While he has written a ton of code at his work in the last year, he hasn't posted anything that can be viewed publicly: he has no public commits, he hasn't created any repositories of his own and he has an insignificant number of followers. Despite all that he's still the best developer I've ever had the pleasure of working with.
He's also not unusual in having a relatively inactive GitHub profile: the vast majority of GitHub users don't have strong profiles. To quantify this, I aggregated the public commits for each user from the GitHub Archive and found that:
This kind of distribution is roughly a power-law (or at least something akin to it). This means most users have relatively little activity to show, but a small number of accounts will have had hundreds of thousands of commits in the last year.
Take a look at this graph that plots out the percentage of developers on GitHub that have a certain number of followers for an example of how this looks plotted out:
I thought it would be fun to let people see where their own GitHub profile would fall on this ranking, so I'm showing the number of followers since getting the number of commits from the GitHub API is difficult, and both have a similar distribution. If you enter your GitHub userid there, you'll also get a mini report of how you rank.
The plus side of all this is that even with as few as 10 followers, you can claim with a straight face that you're in the top 1% of all developers.
The downside is that since the vast majority of developers don't have any data in their public GitHub profiles, you can't use them to filter out job candidates. 83% of users have no commits in the last year, just like 88% have no followers. This doesn't mean these are bad developers, it just means that these developers don't have any open source contributions to show off.
This probably goes without saying, but public GitHub profiles only really show off people that produce open source software. The vast majority of software being produced is closed source software, and not having any open source contributions means approximately nothing if you've spent your career working on proprietary technology.
I think it's kind of instructive to look at this by comparing famous programmers that work in open source software to famous programmers that don't. For instance, Linus Torvalds has the most followed profile on GitHub. This is pretty justifiable since he's the creator of several incredibly successful open source projects like Linux and Git. On the other hand, neither John Carmack nor Jeff Dean even have GitHub profiles that I can find - despite them both being well known for their work on equally successful closed source projects like Doom and Google.
I've always felt that asking for evidence of open source contributions when looking to hire developers to work on a proprietary closed source project is the height of hypocrisy. It reminds me of companies that require references but have a policy of not giving them out to former employees. If you aren't going to let the developer write open source projects for the job, it doesn't make sense requiring existing open source projects to get the job.
Even ignoring the hypocrisy, you've got to question requiring GitHub contributions when it will exclude most developers including engineers like John Carmack and Jeff Dean. I feel like there should be a 'Jeff Dean Test' for hiring developers: if your job requirements exclude someone like Jeff Dean from a software development job at your company, you are probably doing it wrong.
Even for the small fraction of developers that have a project on their GitHub profiles, most of the projects aren't all that impressive.
Many coding bootcamps and universities these days require their students to create GitHub repositories as part of the curriculum. While I'm totally behind teaching new developers solid version control skills, the projects that are created as part of this process don't tell me much more than seeing they completed the program. As an example, there are around 190,000 repositories on GitHub that are named "datasciencecoursera".
Likewise out of the more than 78 million repositories linked to in the GitHub Archive, 1.1 million repositories have a name like 'hello-world' and 1 million repositories are called 'test'.
Since so many projects on GitHub are pretty mediocre, one easy short circuit might be to only seriously consider projects that have been repeatedly starred or profiles that have lots of followers. Even ignoring that this further limits the pool of available candidates, it's still not an effective way of assessing the quality of a developer - since it only shows popularity instead of talent.
As an example, take a look at the GitHub profile for Stichpunk:
This profile has about 1560 followers, putting her in the top 0.002% of developers on GitHub. She also has multiple moderately popular repositories and appears to work at a major tech company. On the surface, this looks like a pretty respectable GitHub profile.
This isn't a real profile though and was instead created by the writers of the TV Show Silicon Valley for their 'Tabs Versus Spaces' episode:
Anytime you're seriously considering the number of GitHub followers as being in any way meaningful, just remember that a fake profile created for a character that appears only in 1 episode of a TV show has more followers than 99.998% of the world's developers and would rank like 670th or so globally.
Likewise, if you look at the most popular starred repositories on GitHub, many of them are listicles or jokes. While it may not be easy to write lists or jokes as awesome as those, it doesn't say anything about the talent of the developer.
Many popular projects are actually pretty amazing, and I'm not trying to argue that popularity and quality aren't at least weakly correlated. The problem is that getting a project to become popular is a very different task requiring a very different skillset than just writing quality code.
Not only are GitHub profiles not that helpful for hiring developers, it also seems like they aren't that much help for developers that are looking for work.
I've only had one job interview in the last decade or so, but as far as I could tell none of the people interviewing me checked out my GitHub profile before the interview. In fact, one of the interviewers claimed that he hadn't even read my resume before the interview let alone my GitHub - which I found endearingly honest.
It seems like this is pretty common from some anecdotal reports I've heard. For instance, Dan Luu posted this on twitter:
"Despite the hype about how open source helps your career and how github==resume, I've had 2/50 interviews where someone's looked at my code" Dan Luu
Dan Luu is in the top 1K most followed accounts on GitHub, which means that at least my experience wasn't because of my relatively modest GitHub portfolio.
Likewise, on hftguy.com one developer looked at GitHub analytics after doing a bunch of interviews and found only 1 person checked out his projects (which may have been himself testing):
"After a dozen of phone interviews (1 dev per call) and a couple of on-sites (4 to 7 devs per on-site), there was only 1 visit to my profile. Conclusion: Noone cares about GitHub. Noone is gonna read it."
While GitHub profiles have some issues as a data source, I'm still planning on using GitHub profiles for a future project. The idea is that while an individual GitHub users profile might be a sparse noisy signal, by aggregating multiple GitHub users in a population together there are still some interesting trends that can be pulled out.
There isn't an API to get the contribution activity in the last year from GitHub. Instead, people seem to get the timeline image as an SVG (like hitting this endpoint https://github.com/users/benfred/contributions), and then parse the SVG to get the number of contributions. I almost went with this hack for this post but hit some problems with CORS restrictions from loading this in the browser. I could have written a proxy to host the requests, but it was getting silly so went with the number of followers instead.
Published on 08 March 2018