Venn Diagrams with D3.js

A couple of my goals for this year are to learn both Javascript and D3.js. Its gotten to the point where its embarrassing that I don't know Javascript, and I want to learn D3 since I keep on seeing so many beautiful looking visualizations being made with it. To get started with this, I thought I'd try use D3 to create the simplest possible visualization I could think of - the Venn diagram.

An example of this code in action is below. You can set the sizes of each set, and the sizes of the set intersections - and the code will dynamically compute the best position for each set, and display it with D3:

Its looking like you're missing javascript, or are running an old web browser that doesn't support SVG. If no diagrams show up then you might want to consider loading this up in a different browser. I've tested this with Chrome, Safari and Firefox only.

|A| | |

|B| | |

|C| | |

|A∩B| | |

|A∩C| | |

|B∩C| | |

|A∩CB∩C| |

The hard part here isn't actually displaying the diagram with D3 - its calculating the positions of each set such that the areas of each region are proportional to the size of the set intersections.

The problem is that even for only 3 sets, its not always possible to position everything so that everything is area proportional to the set sizes. Try changing A=B=C=8 , AB=AC=4 and BC=0 in the above example to see what I mean about impossible layouts. There is no way to exactly satisfy all those constraints. Even worse is trying to layout Euler diagrams with 4+ sets: if there are no disjoint sets, its actually impossible to use circles to represent these set relations exactly.

I still want the best possible approximation even if its not possible to be perfect all the time. So I did the sensible thing here: define a function that measures how well a layout matches the desired overlaps - and then minimize that function numerically.

The loss function I'm using is just an ordinary least squares distance, comparing the actual size of each overlapping set to the desired size.

lossFunction = ∑ (actualOverlap

_{i}- desiredOverlap_{i})^{2})

To minimize this function, I started out using the unconstrained minimization function from numeric.js - but ran into some issues with it. This was probably aggravated by occasionally using a discrete approximation to the loss function (more on that in a bit), and by numeric.js attempting to approximate the gradient using a really small step size. So I switched to the downhill simplex method, writing it out myself since I didn't see a Javascript version available.

Using a local optimization technique requires a good initial layout to optimize. Positioning the sets randomly works kind of poorly for large numbers of sets, and I prefer a deterministic solution anyways.

My solution was to use a simple greedy approach. Each set is laid out one at a time, from sets with the biggest intersections to sets with the smallest. For each set that is being added, the distance to each laid out set is computed such that the overlap area is proportional to the size of the set intersection. The distances are used to generate a list of points to evaluate, and the set is added at the point where the loss function is lowest. The points to evaluate are just 90 degrees to each single set, and the two points where the distances are exactly correct relative to each pair of sets (if possible).

This only really leaves calculating the loss function.

Coming up with the formula for the overlap area of two circles was relatively easy - it just required remembering enough calculus to set up the equation, and then getting Wolfram Alpha to do the actual integration for me.

However, I couldn't find a way to scale this up to handle 3+ circles. Quite honestly my rusting calculus skills are just not up to the job here. So I did the logical thing and googled for research on how to compute this.

These seem to be my choices here:

I found an interesting paper on laying out area proportional venn and euler diagrams by Leland Wilkinson. Since computing circle overlaps is apparently uber difficult, he talks about how to approximate this. His approach is to actually plot each set and use binary indexing to figure out the overlap areas.

The idea here is to actually draw each set out, and set the colour for the set such that only one unique bit is set. For example in the 3 set case the colours for each set would be (001, 010, 100) in binary. By drawing out each circle with that colour, and summing the colours in case of overlaps - you can get a diagram where each possible region is represented by a unique colour. Getting the size of all the regions is as simple as calculating a colour histogram on the drawn out image.

Implementing this in Javascript means using Canvas operations, since that seems to be the only way to directly manipulate images and get the pixel values. On a 200x200 canvas, calculating the loss like this takes about 6ms on my laptop. Which sounds good, until you realize that we need to calculate the loss hundreds of times during the optimization step.

For offline processing, this wouldn't be a huge deal - but a second processing delay is way too slow for interactive use. Even worse, is that my browser seems to totally lock up while this is going on.

Also the output between using 3+ overlaps and not usually isn't all that different. Getting the overlaps of 2 sets right, usually means that the overlaps of 3 or more sets are basically right - or the problem is so over-constrained that its impossible to get right. I ripped off an example of this from the Wilkinson paper to demonstrate what I mean here:

Output considering only 2 overlaps

Diagram from Wilkinson paper

While they look somewhat different, if you rotate one of them and take the mirror image - they are in fact almost identical. The only real difference is in the intersection of the 'Anti-CCP & DAS28 & SE' set overlaps - where its marginally bigger in the Wilkinson version. While the Wilkinson version is more accurate, the size of this 3 way intersection is supposed to be the same as the intersection of 'Treat & SE', so both versions are wrong here - because its impossible to lay this out perfectly.

Since its so slow, and doesn't seem to have a huge benefit - I disabled this method of calculating the loss function by default. Instead I just calculate the loss on only the intersection of 2 sets, and ignore the higher order effects. If its going to be an approximation anyways, I'd much rather have a fast approximation.

The Wilkinson paper is an interesting read if you're into that kind of thing. While our approaches are vaguely similar, his initial layout and optimization method are quite different from the approach I took.

The venn diagrams above don't really test out how well this performs on more complicated layouts. So I took some datasets I happened to have on my computer, and ran some of them through here.

Lets try visualizing if all roads really do lead to Radiohead . Each artist is represented by the set of users that played them, with the data being grabbed from a dataset of 360k last.fm users.

The goal here is to visualize the pernicious effect of popularity on collaborative filtering based recommender systems. Most of the artists here have more users in common with Radiohead than any other band, despite coming from different genres and eras of music. This effect totally dominates this graph, with Radiohead being at the centre and all other artists clustering around them.

As a final example, lets see if we can get anything interesting out of the Netflix Prize dataset. I picked 6 movies, kind of at random - and then represented them using the set of users that gave the movie a 5 star rating. While its not all that interesting, I still like it because the movies here are sort of arranged by quality, with worse movies being farther right:

I've put the code on github here for anyone that wants to play around with it. Since its the first time I've written more than 10 lines of Javascript code, its probably mostly awful - but someone may still find it useful.

While this code works reasonably well, there are a couple of areas that I think could use some improvement. The greedy layout I'm using sometimes picks up a local minima that the optimizer can't step out of. This is somewhat visible in the lastfm example, where the total loss could be reduced if John Lennon was inserted in between Elvis and Morrisey (as well as Philip Glass near Bach). I have some ideas on fixing this, but since this is really getting away from my goals here of learning Javascript and D3 I thought I'd release this as is.

In the end, this wasn't a great project to learn D3 on - since the D3 part of it was so easy. The only tricky part in using D3 was getting the labels centred in the circles, and that really wasn't that hard. It was a decent first project in Javascript though. I got along pretty well here once I started treating Javascript as being Python code with a C++ syntax, and just googling for the rare cases where that didn't just work. I got a little hung up on some of Javascripts limitations (dictionaries can only be keyed by strings etc) but its not a completely horrible language to work with.