A while ago a friend of mine asked me how I would go about building a 'People Who Like This Also Like ...' feature for a music startup he was working at. For each band or musician, he wanted to display a list of other artists that people might also be interested in.
At the time, I think I arrogantly responded with something like "Thats easy! I can think of like a dozen ways of calculating this!". Which of course was profoundly unhelpful and probably slightly infuriating. Once he calmed down, I sketched out how I would calculate the distance between any two artists - and use that distance as a ranking function to build this feature.
Since then he has been encouraging me to write a blog post about this, and after a totally unreasonable delay I finally got around to finishing it up. So here is my step by step guide for the non data scientist, using Python with Pandas and SciPy to compute the distances, and D3.js for building gratuitously interactive visualizations.
Pretty much every Python programmer out there has broken down at one point and and used the 'pickle' module for writing objects out to disk.
The advantage of using pickle is that it can serialize pretty much any Python object, without having to add any extra code. Its also smart in that in will only write out any single object once, making it effective to store recursive structures like graphs. For these reasons pickle is usually the default serialization mechanism in Python, used in modules likes python-memcached.
However, using pickle is still a terrible idea that should be avoided whenever possible.
One thing this library didn't do though is consider the intersection areas of 3 or more circles when placing each set in the venn diagram. Its a trickier problem than I first thought, mainly because of all the special cases that can arise when the number of circles gets large. While the 2 circle case is a simple calculus problem, I failed to extend this solution to calculate the intersection area of an arbitrary number of circles.
The research papers I read on this both avoided calculating the circle intersection by using approximation techniques. One paper approximated the circles using polygons and used polygon intersection techniques to get the area, and the other approximated by plotting each circle and using binary indexing to compute the area. I tried out the latter approach, but found it to be too slow for realtime use.
Since then, I've had some ideas on different approaches to this problem that I wanted to try out. To keep up with learning D3, I also thought I'd try visualizing each approach here.
This problem comes up all the time when doing search and recommendation type tasks. Every time you query a search engine, it has to find the couple best scored results in all matching items. For example, Google finds 15 million results when querying for 'D3.js', but only shows you the 10 best scored of these. A naive solution for finding these 10 items would be to sort everything by score, but that ends up wasting a ton of time sorting results that will be discarded.
A better solution is to use a min-heap - a tree data structure where each node in the tree has a value smaller than all of its children. Its a fantastically useful data structure, that can be used to efficiently solve this problem. By comparing each item with the root element of an appropriately sized min-heap, and pushing onto the heap when its bigger - it picks out the just the largest items:
Ever wonder why Google+ and Facebook don't let you dislike something? The Facebook like button and the Google +1 button are mostly ubiquitous around the web at this point, but they don't provide the symmetric action for disliking or -1ing anything.
This is in sharp contrast to the rest of the ratings systems around the web, which mostly fall into two camps:
Google+ and Facebook are somewhat unique in that they only offer the ability to give positive feedback. Now part of that reason is undoubtedly that they don't want people going around disliking pages from their advertisers that are the ones actually giving them money. However, I think the biggest reason is that even if they did offer a dislike button, people just wouldn't use it often enough to justify the precious screen real estate it would take up.
Say that one day you're faced with a table of distance information between a bunch of points. Just looking at the table doesn't really provide any real information about the underlying structure of the data, so you want to find a way to visualize this in a way thats more meaningful.
An example of this code in action is below. You can set the sizes of each set, and the sizes of the set intersections - and the code will dynamically compute the best position for each set, and display it with D3:
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." Jamie Zawinski
The other day at work one of our python processes stopped responding. It was on one of our servers that was responsible for fetching and analyzing the web pages that we might recommend to people.
By the time I got around to debugging this problem, the last log message was over 20 minutes old - and the server had been maxed out at 100% CPU for the whole time. And while it wasn't dead, it wasn't responding to any requests, or even sending out any heartbeats. Even worse, in the meantime this problem had repeated itself on another server. Our whole ability to ingest new web pages looked to be going down one server at a time.