I've recently become obsessed with the sheer amount of development activity happening on sites like GitHub.
As a first project on working with this data, I thought it would be fun to rank all the programming languages by counting how many people on GitHub use each language.
I'm using the GitHub Archive and GHTorrent projects as data sources for this analysis. The GitHub Archive provides a record of every public event on GitHub since early 2011. This includes an event every time someone has pushed new code, forked or starred a repository, or opened an issue on GitHub. Overall the GitHub Archive has more than 1.25 Billion events on more than 75 Million different repositories. GHTorrent goes even further and hits the GitHub API for each event - which lets me resolve the language for most repositories.
The cool thing about this is that there are usernames associated with each of those events, which means that I can count how many different people are using each language. Every time a user interacts with a repository I'm counting that user as using the language of that repository - and then aggregating this each month to calculate how many Monthly Active Users (MAU) each language has.
Since the data goes back 7 years, I can also plot how popular each programming language was over time which reveals some interesting patterns. Looking at these trend lines, we can figure out which programming languages are worth learning, and which programming languages probably should be avoided.
I recently bought a system that actually has a decent GPU on it, and I thought it would be cool to learn a little bit about CUDA programming to really take advantage of it.
The obvious choice of problems to get started with was extending my implicit matrix factorization code to run on the GPU. I've written a couple of posts about this recommendation algorithm already, but the task is basically to learn a weighted regularized matrix factorization given a set of positive only implicit user feedback. The nice thing about this model is that it is relatively simple while still not being possible to express efficiently on higher level frameworks like TensorFlow or PyTorch. It's also inherently embarrassingly parallel and well suited for running on the GPU.
This post aims to serve as a really basic tutorial on how to write code for the GPU using the CUDA toolkit. I found that CUDA programming was pretty interesting, but it took me a little bit to learn how to do this effectively - and I wanted to share what I learned while it is still fresh in my mind.
A site's robots.txt file advises the web crawlers of the worlds what files they can and can't download. It acts as the first gatekeeper of the internet, unlike blocking the response - it lets you stop requests to your site before it happens. The interesting thing about these files is that it lays out how webmasters intend automated processes should access their websites. While it's easy for a bot to just ignore this file, it specifies an idealized behaviour of how they should act.
As such these files are kind of important. So I thought I'd download the robots.txt file from each of the top million websites on the planet and see what kind of patterns I could find.
I got the list of the top 1 million sites from Alexa and wrote a small program to download the robots.txt file from each domain. With the data all downloaded, I ran each file through pythons urllib.robotparser package and started looking at the results.
One challenge that recommender systems face is in quickly generating a list of the best recommendations to show for the user. These days many libraries can quickly train models that can handle millions of users and millions of items, but the naive solution for evaluating these models involves ranking every single item for every single user which can be extremely expensive. As an example, my implicit recommendation library can train a model on the last.fm dataset in 24 seconds on my desktop - but takes over an hour to use that model to generate recommendations for each user.
This post is about evaluating a couple of different approximate nearest neighbours libraries to speed up making recommendations made by matrix factorization models. In particular, the libraries I'm looking at are Annoy, NMSLib and Faiss.
I've used Annoy successfully for a couple different projects now in the past - but was recently intrigued when I read that NMSLib can be up to 10x faster when using its Hierarchical Navigable Small World Graph (HNSW) index option. I also wanted to try out Faiss after reading the blog post that Facebook Research wrote about it - where they claimed that the GPU enabled version of Faiss was the fastest available option.
Both NMSLib and Faiss turn out to be extremely good at this task, and I've added code to implicit to use these libraries for generating recommendations.
If you look at the programming languages benchmarks game, Python is one of the slowest commonly used programming languages out there. Typical programs written in pure Python average around 40 times slower than the equivalent program written in C or C++.
Despite the performance penalty, Python is still probably the most popular language choice out there for doing Data Analysis and Machine Learning. Most of the recent Deep Learning frameworks target Python for development: TensorFlow, Theano, and Keras all use Python. Torch originally was written for Lua, which is substantially faster than Python when using LuaJIT - but Torch failed to gain traction until switching to Python with the release of PyTorch.
The reason for this is that the performance penalty in writing programs in Python isn't as large as the programming language benchmarks game would suggest: Most of the best Python Data libraries have their core routines written as native extensions.
This all means that to get the most out of these libraries, you need to treat Python as a Declarative Language - and push as much control flow as possible down to a native layer, and just let the Python program describe what needs done.
As part of my post on matrix factorization, I released a fast Python version of the Implicit Alternating Least Squares matrix factorization algorithm that is frequently used to recommend items. While this matrix factorization code was already extremely fast, it still wasn't implementing the fastest algorithm I know about for doing this matrix factorization.
This post is just a quick follow up, talking about why this algorithm is important, where the common solution is slow and how to massively speed up training using a paper based on using the Conjugate Gradient method.
Numerical Optimization is one of the central techniques in Machine Learning. For many problems it is hard to figure out the best solution directly, but it is relatively easy to set up a loss function that measures how good a solution is - and then minimize the parameters of that function to find the solution.
The cool thing about this post is that the code is all running in the browser, meaning you can interactively set hyper-parameters for each algorithm, change the initial location, and change what function is being called to get a better sense of how these algorithms work.
All the code for this post is up on github if you want to check it out, it has both the minimization functions as well as all of the visualizations.
In a previous post I wrote about how to build a 'People Who Like This Also Like ...' feature for displaying lists of similar musicians. My goal was to show how simple Information Retrieval techniques can do a good job calculating lists of related artists. For instance, using BM25 distance on The Beatles shows the most similar artists being John Lennon and Paul McCartney.
One interesting technique I didn't cover was using Matrix Factorization methods to reduce the dimensionality of the data before calculating the related artists. This kind of analysis can generate matches that are impossible to find with the techniques in my original post.
This post is a step by step guide on how to calculate related artists using a couple of different matrix factorization algorithms. The code is written in Python using Pandas and SciPy to do the calculations and D3.js to interactively visualize the results.
As part of writing this post, I also open sourced a high performance python version of the Implicit Alternating Least Squares matrix factorization algorithm. Most of the code here can be found in the examples directory of that project.
By specifying the sizes of each area in the diagram, the library automatically draws a Venn or Euler diagram such that areas displayed have sizes that approximately match the input.
It turned out that displaying the circles is trivial - but calculating the positions of the circles such that the diagram is area proportional is a surprisingly tricky numerical optimization problem.
The original solution I came up with worked fairly well, but there were a couple of minor cases that it broke down on. Since people kept on starring this library on GitHub, I thought I would do them all the favour of fixing these errors by implementing an idea I had for a new layout algorithm.
I tested this new algorithm, and found that it works exceedingly well. In fact it far surpasses the best published academic research on laying out area proportional Venn and Euler diagrams. I've included some interactive graphs showing the performance on this benchmark here, as well as a visualization of how this algorithm works.
Unicode is one of the greatest standards of the modern age, but its simple appearance hides an insane level of complexity.
Unicode aims to represent every possible character in every possible language in a single character encoding. It's an ambitious undertaking, the current version has mapped 113 thousand distinct characters to code points in Unicode - each one of which is given a unique name and description.
Basically every language ever written can be encoded with Unicode now. Even dead languages like Phoenician, Aramaic and the Ancient Greek 'Linear A' script all have code points assigned. Linear A hasn't even been deciphered yet, so there are characters in Unicode that no-one knows what they actually represent!
Likewise Unicode contains a huge assortment of symbols that aren't part of any language. Emoji like a Christmas Tree, a Slice of Pizza, or a Pile of Poop all can be represented with a single Unicode code point.
However, the real craziness with Unicode isn't in the sheer number of characters that have been assigned. The real fun starts when you look at how all these characters interact with one another.
A while ago a friend of mine asked me how I would go about building a 'People Who Like This Also Like ...' feature for a music startup he was working at. For each band or musician, he wanted to display a list of other artists that people might also be interested in.
At the time, I think I arrogantly responded with something like "Thats easy! I can think of like a dozen ways of calculating this!". Which of course was profoundly unhelpful and probably slightly infuriating. Once he calmed down, I sketched out how I would calculate the distance between any two artists - and use that distance as a ranking function to build this feature.
Since then he has been encouraging me to write a blog post about this, and after a totally unreasonable delay I finally got around to finishing it up. So here is my step by step guide for the non data scientist, using Python with Pandas and SciPy to compute the distances, and D3.js for building gratuitously interactive visualizations.
Pretty much every Python programmer out there has broken down at one point and and used the 'pickle' module for writing objects out to disk.
The advantage of using pickle is that it can serialize pretty much any Python object, without having to add any extra code. Its also smart in that in will only write out any single object once, making it effective to store recursive structures like graphs. For these reasons pickle is usually the default serialization mechanism in Python, used in modules likes python-memcached.
However, using pickle is still a terrible idea that should be avoided whenever possible.
One thing this library didn't do though is consider the intersection areas of 3 or more circles when placing each set in the venn diagram. Its a trickier problem than I first thought, mainly because of all the special cases that can arise when the number of circles gets large. While the 2 circle case is a simple calculus problem, I failed to extend this solution to calculate the intersection area of an arbitrary number of circles.
The research papers I read on this both avoided calculating the circle intersection by using approximation techniques. One paper approximated the circles using polygons and used polygon intersection techniques to get the area, and the other approximated by plotting each circle and using binary indexing to compute the area. I tried out the latter approach, but found it to be too slow for realtime use.
Since then, I've had some ideas on different approaches to this problem that I wanted to try out. To keep up with learning D3, I also thought I'd try visualizing each approach here.
This problem comes up all the time when doing search and recommendation type tasks. Every time you query a search engine, it has to find the couple best scored results in all matching items. For example, Google finds 15 million results when querying for 'D3.js', but only shows you the 10 best scored of these. A naive solution for finding these 10 items would be to sort everything by score, but that ends up wasting a ton of time sorting results that will be discarded.
A better solution is to use a min-heap - a tree data structure where each node in the tree has a value smaller than all of its children. Its a fantastically useful data structure, that can be used to efficiently solve this problem. By comparing each item with the root element of an appropriately sized min-heap, and pushing onto the heap when its bigger - it picks out the just the largest items:
Ever wonder why Google+ and Facebook don't let you dislike something? The Facebook like button and the Google +1 button are mostly ubiquitous around the web at this point, but they don't provide the symmetric action for disliking or -1ing anything.
This is in sharp contrast to the rest of the ratings systems around the web, which mostly fall into two camps:
Google+ and Facebook are somewhat unique in that they only offer the ability to give positive feedback. Now part of that reason is undoubtedly that they don't want people going around disliking pages from their advertisers that are the ones actually giving them money. However, I think the biggest reason is that even if they did offer a dislike button, people just wouldn't use it often enough to justify the precious screen real estate it would take up.
Say that one day you're faced with a table of distance information between a bunch of points. Just looking at the table doesn't really provide any real information about the underlying structure of the data, so you want to find a way to visualize this in a way thats more meaningful.
An example of this code in action is below. You can set the sizes of each set, and the sizes of the set intersections - and the code will dynamically compute the best position for each set, and display it with D3:
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." Jamie Zawinski
The other day at work one of our python processes stopped responding. It was on one of our servers that was responsible for fetching and analyzing the web pages that we might recommend to people.
By the time I got around to debugging this problem, the last log message was over 20 minutes old - and the server had been maxed out at 100% CPU for the whole time. And while it wasn't dead, it wasn't responding to any requests, or even sending out any heartbeats. Even worse, in the meantime this problem had repeated itself on another server. Our whole ability to ingest new web pages looked to be going down one server at a time.