An example of this code in action is below. You can set the sizes of each set, and the sizes of the set intersections - and the code will dynamically compute the best position for each set, and display it with D3:
The hard part here isn't actually displaying the diagram with D3 - its calculating the positions of each set such that the areas of each region are proportional to the size of the set intersections.
The problem is that even for only 3 sets, its not always possible to position everything so that everything is area proportional to the set sizes. Try changing A=B=C=8 , AB=AC=4 and BC=0 in the above example to see what I mean about impossible layouts. There is no way to exactly satisfy all those constraints. Even worse is trying to layout Euler diagrams with 4+ sets: if there are no disjoint sets, its actually impossible to use circles to represent these set relations exactly.
I still want the best possible approximation even if its not possible to be perfect all the time. So I did the sensible thing here: define a function that measures how well a layout matches the desired overlaps - and then minimize that function numerically.
The loss function I'm using is just an ordinary least squares distance, comparing the actual size of each overlapping set to the desired size.
lossFunction = ∑ (actualOverlapi - desiredOverlapi )2 )
Using a local optimization technique requires a good initial layout to optimize. Positioning the sets randomly works kind of poorly for large numbers of sets, and I prefer a deterministic solution anyways.
My solution was to use a simple greedy approach. Each set is laid out one at a time, from sets with the biggest intersections to sets with the smallest. For each set that is being added, the distance to each laid out set is computed such that the overlap area is proportional to the size of the set intersection. The distances are used to generate a list of points to evaluate, and the set is added at the point where the loss function is lowest. The points to evaluate are just 90 degrees to each single set, and the two points where the distances are exactly correct relative to each pair of sets (if possible).
This only really leaves calculating the loss function.
Coming up with the formula for the overlap area of two circles was relatively easy - it just required remembering enough calculus to set up the equation, and then getting Wolfram Alpha to do the actual integration for me.
However, I couldn't find a way to scale this up to handle 3+ circles. Quite honestly my rusting calculus skills are just not up to the job here. So I did the logical thing and googled for research on how to compute this.
I found an interesting paper on laying out area proportional venn and euler diagrams by Leland Wilkinson. Since computing circle overlaps is apparently uber difficult, he talks about how to approximate this. His approach is to actually plot each set and use binary indexing to figure out the overlap areas.
The idea here is to actually draw each set out, and set the colour for the set such that only one unique bit is set. For example in the 3 set case the colours for each set would be (001, 010, 100) in binary. By drawing out each circle with that colour, and summing the colours in case of overlaps - you can get a diagram where each possible region is represented by a unique colour. Getting the size of all the regions is as simple as calculating a colour histogram on the drawn out image.
For offline processing, this wouldn't be a huge deal - but a second processing delay is way too slow for interactive use. Even worse, is that my browser seems to totally lock up while this is going on.
Also the output between using 3+ overlaps and not usually isn't all that different. Getting the overlaps of 2 sets right, usually means that the overlaps of 3 or more sets are basically right - or the problem is so over-constrained that its impossible to get right. I ripped off an example of this from the Wilkinson paper to demonstrate what I mean here:
While they look somewhat different, if you rotate one of them and take the mirror image - they are in fact almost identical. The only real difference is in the intersection of the 'Anti-CCP & DAS28 & SE' set overlaps - where its marginally bigger in the Wilkinson version. While the Wilkinson version is more accurate, the size of this 3 way intersection is supposed to be the same as the intersection of 'Treat & SE', so both versions are wrong here - because its impossible to lay this out perfectly.
Since its so slow, and doesn't seem to have a huge benefit - I disabled this method of calculating the loss function by default. Instead I just calculate the loss on only the intersection of 2 sets, and ignore the higher order effects. If its going to be an approximation anyways, I'd much rather have a fast approximation.
The Wilkinson paper is an interesting read if you're into that kind of thing. While our approaches are vaguely similar, his initial layout and optimization method are quite different from the approach I took.
The venn diagrams above don't really test out how well this performs on more complicated layouts. So I took some datasets I happened to have on my computer, and ran some of them through here.
The goal here is to visualize the pernicious effect of popularity on collaborative filtering based recommender systems. Most of the artists here have more users in common with Radiohead than any other band, despite coming from different genres and eras of music. This effect totally dominates this graph, with Radiohead being at the centre and all other artists clustering around them.
As a final example, lets see if we can get anything interesting out of the Netflix Prize dataset. I picked 6 movies, kind of at random - and then represented them using the set of users that gave the movie a 5 star rating. While its not all that interesting, I still like it because the movies here are sort of arranged by quality, with worse movies being farther right: