Distance Metrics for Fun and Profit

A while ago a friend of mine asked me how I would go about building a 'People Who Like This Also Like ...' feature for a music startup he was working at. For each band or musician, he wanted to display a list of other artists that people might also be interested in.

At the time, I think I arrogantly responded with something like "Thats easy! I can think of like a dozen ways of calculating this!". Which of course was profoundly unhelpful and probably slightly infuriating. Once he calmed down, I sketched out how I would calculate the distance between any two artists - and use that distance as a ranking function to build this feature.

Since then he has been encouraging me to write a blog post about this, and after a totally unreasonable delay I finally got around to finishing it up. So here is my step by step guide for the non data scientist, using Python with Pandas and SciPy to compute the distances, and D3.js for building gratuitously interactive visualizations.

The first step here is getting a good source of data.

For this post, I'm using a dataset of last.fm users, that was gathered from the last.fm API way back in 2008. For 360,000 last.fm users, it contains each users top 50 most played artists as well as the number of times that artist was played. All told there are about 17 million user/artist/playcount triples here.

Loading this dataset up is a one line command using Pandas:

data = pandas.read_table("usersha1-artmbid-artname-plays.tsv", usecols=[0, 2, 3], names=['user', 'artist', 'plays'])

With the data loaded up, all that is required is to calculate the similarity between each pair of artists, and then use that as a sort order to get the most similar artists.

The easiest way of calculating similarity between two artists is to ignore the play counts, and just treat each artist as a set of the users that have played them:

# create a dictionary of artist name to the set of their users artist_sets = dict((artist, set(users)) for artist, users in data.groupby('artist')['user'])

The naive approach here is to calculate the similarity between two artists just by looking at the number of users that the two artists have in common:

def overlap(a, b): return len(a.intersection(b))

The problem with this method is that basically all artists will have Radiohead, Coldplay or The Beatles in their top related artists. These are the 3 most popular bands in the dataset we're using, and because of popularity effects will tend to dominate the results for most bands here.

You can see this problem pretty clearly by looking at the venn diagram of the top related artists for 'Kanye West' by overlap:

Its looking like you're missing javascript, or are running an old web browser that doesn't support SVG. If no diagrams show up then you might want to consider loading this up in a different browser. I've tested this with Chrome, Safari and Firefox only.

The top related artist here is Coldplay, even though Kanye West and Coldplay aren't especially similar. Jay-Z would be a better choice to display as a similar artist, but the massive audience in this dataset for Coldplay has ensured that its at the top of the list, even though only a small fraction of Coldplay users also listened to Kanye West. You can also see this problem with many other artists too - just keep in mind that the dataset is several years old now, so queries for 'Justin Bieber' are thankfully going to fail.

The most common way of dealing with this problem is Jaccard distance, which normalizes the intersection size of the two sets by dividing by the total number of users that have listened to either artist. The size of the union of the two sets can be efficiently calculated from the intersection size by using the inclusion-exclusion principle leading to code like:

def jaccard(a, b): intersection = float(len(a.intersection(b))) return intersection / (len(a) + len(b) - intersection)

There are dozens of other set based distances that could be used here instead. For instance the Dice coefficient normalizes the overlap by the mean of the set sizes, and the Ochiai distance which normalizes by the geometric mean. They all tend to produce somewhat similar looking results though.

While the results here are starting to look somewhat respectable, metrics like the Jaccard distance bias results towards having artists that have a similar number of users in their sets. For instance, 'The Beatles' doesn't have 'John Lennon' in the top related artists by Jaccard distance even though that would be a great match.

The set based methods throw away a ton of useful information: how often the user has listened to the artist. This treats the most diehard fan of an act the same as a someone that only listened to a band once, and discovered that they didn't really care for them.

The simplest way of using the play counts here is by looking at this as a geometry problem, and measure the angle between each pair of artists. Since this is a little abstract, take a look at this graph of the play counts for just 2 users:

Each artist is represented by the vector of the play counts for the two users. Since each user has listened to all three artists - the Jaccard distance will be the same between all artists. But looking at the graph, its pretty clear that Jay-Z is much more similar to Kanye West than Coldplay is because of the small angle between them.

Of course we have 360,000 users in our data set to deal with, instead of just the 2 that are displayed here. The basic idea is the same though: treat each artist as a vector in the 360K dimensional space formed by all the users, and then calculate the angle between the artists with the smallest angle indicating the most similar.

Computing this angle is pretty easy. The first step is to represent each artist as a sparse vector of the play counts for each user:

# map each username to a unique numeric value userids = defaultdict(lambda: len(userids)) data['userid'] = data['user'].map(userids.__getitem__) # map each artist to a sparse vector of their users artists = dict((artist, csr_matrix( (group['plays'], (zeros(len(group)), group['userid'])), shape=[1, len(userids)])) for artist, group in data.groupby('artist'))

Once we have each artist represented as a sparse vector of their users, we just need to calculate the angle between these sparse vectors to get how similar they are. The easiest way to do this is to calculate the cosine of the angle, which is just the dot product of the two vectors divided by the L2 norm of each vector:

def cosine(a, b): return dot(a, b.T)[0, 0] / (norm2(a) * norm2(b)) def norm2(v): return sqrt((v.data ** 2).sum())

While its trivial to convert the cosine back to an angle, we're interested in coming up with a ranking function - and this won't change the ranking. So its usual just to work with the cosine values directly.

Cosine distance succeeds in bringing up more relevant similar artists than the set based methods, but unfortunately there is also significantly more noise. A good example of this problem can be seen in the related artists for Radiohead:

The most related artist by cosine is Thom Yorke, which is an excellent result for being similar Radiohead since he's the lead singer. However, the second result is Samarah - who only has two users out of the 360,000 in this dataset, and only one of those users has also listened to Radiohead. If our goal is to reduce the number of WTF's in the results - cosine is actually a step backwards.

The problem is that this one user is a die-hard Radiohead fan, and has also listened to Radiohead songs over 100,000 times. Even though Radiohead has 77,151 other users, this single user has enough total plays to totally skew the results.

While there are a bunch of more principled methods to overcome this that I'll talk about below, one simple hack that I've seen used before is to smooth the cosine by the number of overlapping users:

SMOOTHING = 20 def smoothed_cosine(a, b): # calculate set intersection by converting to binary and taking the dot product overlap = dot(binarize(a), binarize(b).T)[0, 0] # smooth cosine by discounting by set intersection return (overlap / (SMOOTHING + overlap)) * cosine(a, b) def binarize(artist): ret = csr_matrix(artist) ret.data = ones(len(artist.data)) return ret

By setting the SMOOTHING constant to 20, the scores where there is only 1 user in common are reduced to about 5% of their original cosine. This completely eliminates the noisy results from the head of the list here.

People building search engines have developed some pretty nice models for calculating similarity between query strings and text documents. These models can be easily adapted to our purposes, by treating each artist as a document and each user as a term in those documents.

The most well known IR model is TF-IDF, which like its name suggests has two components.

The first is applying a function to the term frequency (TF) for each document. I'm using Lucene's model for TF here - and replacing each play count by its square root. While academic literature usually uses the logarithm of the term frequency, taking the square root is very similar to Bhattacharya Distance so its not totally unjustified. This effectively takes the geometric middle ground between the Cosine and Jaccard distances - and provides a nice balance between these two extremes.

The second part is Inverse Document Frequency (IDF). The problem with all the models so far is that we're ignoring the overall activity of each user: users that listen to only a handful of bands are treated the same as those that listen to everything. For purposes of calculating similarity, we want to weigh people that are selective in who they listen to more than people who aren't. IDF achieves this by multiplying in the logarithm of the inverse probability that a user will listen to an artist.

The first step is to precalculate the IDF values for each user:

# calculate IDF for each user N = len(artists) idf = [1. + log(N / (1. + p)) for p in data.groupby('userid').size()]

With the IDF values precalculated, TF-IDF distance is just the cosine of the weighted vectors:

# weights a sparse vector by tfidf def tfidf_weight(artist, idf): ret = csr_matrix(artist) ret.data = array([(sqrt(plays) * idf[userid] for plays, userid in zip(artist.data, artist.indices)]) return ret # tfidf distance is just the cosine between tfidf weighted vectors def tfidf(a, b, idf): return cosine(tfidf_weight(a, idf), tfidf_weight(b, idf))

TF-IDF is probably the sanest initial scheme to try when calculating anything like this. Its missing the noise of the cosine, but it still returns better results than Jaccard distance. However, there are information retrieval models that do a much better job at ranking.

BM25 usually produces much better results than TF-IDF. Instead of using a TF function like the square root, BM25 uses a weighting function that can be tuned based off a parameter K1:

def bm25_tf_weight(plays): return plays * (K1 + 1.0) / (K1 + plays)

By changing the value of K1 in this function, we can change the shape to go between the step function used in the Jaccard distance (K1 = 0) and the linear weighting used in the Cosine distance (K1 = +infinity). You can see the effects of changing the K1 parameter by changing the slider below:

K1 = 1.2

The only function changing in that graph is BM25 - I'm changing the scale at the same rate as the K1 value, so it only looks like the other functions are changing. This clearly shows the asymptote at (K1 + 1) in this function, and that K1 mainly changes the scale of this function rather than the shape for higher values of K1.

The other major change with BM25 is how length normalization is handled. Play counts are scaled by the ratio of the size of the document to the average size. But since sometimes it makes sense to prefer longer documents (or more popular artists in our case), BM25 also introduces a parameter B which controls how much influence the length normalization has on the results.

Putting it all together and you get code looking like:

def bm25(a, b, idf, average_plays): return dot(bm25_weight(a, idf, average_plays), bm25_weight(b, idf, average_plays).T)[0, 0] def bm25_weight(artist, idf, average_plays): ret = csr_matrix(artist) length_norm = (1.0 - B) + B * artist.sum() / average_plays ret.data = array([(plays * (K1 + 1.0) / (K1 * length_norm + plays)) * idf[userid] for plays, userid in zip(artist.data, artist.indices)]) return ret

The usual value of K1 used in text search is around 1.2, which makes sense for text queries as its more important to match documents containing all of the terms in the query instead of matching repeated terms. For our music data, I'm setting K1 to a 100 which seems to produce decent results. I'm also setting B to 0.5 here, which introduces a slight bias back towards popular artists.

Using a slopegraph we can easily compare the rankings for two metrics on any single artist. For instance, BM25 distance cleanly picks out each individual member of The Beatles, compared to Jaccard where these things are buried way down in the tail:

Jaccard versus BM25

This graph also lets us easily see how the Smoothed Cosine eliminates the noise in the Cosine distance from our Radiohead example or how Jaccard distance somewhat mitigates the effects of popularity in the Kanye West example.

I prefer either the Smoothed Cosine or BM25 in this case, but thats just my opinion. If possible run an A/B test to determine which metric to use, otherwise you'll need to come up with a good offline measure.

People who like this post also like the next blog post in this series on matrix factorization methods.

This post really only scratches the surface of whats possible - if you're interested in learning more I'd recommend picking up a book on Information Retrieval. I quite like Information Retrieval: Implementing and Evaluating Search Engines. It talks about techniques like Learning to Rank, Relevance Feedback and Search Result Fusion - which can all be used to improve the results. The Manning IR book is also decent and freely available online.

The code here is mainly meant to illustrate the techniques, but its all up on my github if you're interested in running this yourself.

Published on 27 April 2015