A Better Algorithm for Area Proportional Venn and Euler Diagrams

A while ago I wrote a small library for displaying Venn and Euler diagrams when trying to learn Javascript.

By specifying the sizes of each area in the diagram, the library automatically draws a Venn or Euler diagram such that areas displayed have sizes that approximately match the input.

It turned out that displaying the circles is trivial - but calculating the positions of the circles such that the diagram is area proportional is a surprisingly tricky numerical optimization problem.

The original solution I came up with worked fairly well, but there were a couple of minor cases that it broke down on. Since people kept on starring this library on GitHub, I thought I would do them all the favour of fixing these errors by implementing an idea I had for a new layout algorithm.

I tested this new algorithm, and found that it works exceedingly well. In fact it far surpasses the best published academic research on laying out area proportional Venn and Euler diagrams. I’ve included some interactive graphs showing the performance on this benchmark here, as well as a visualization of how this algorithm works.

The Problem

The goal here is to position circles such that the intersection areas of those circles are proportional to the set intersection sizes that are passed in as input.

My original solution was to minimize a sum of squared errors function comparing the actual intersection sizes to the desired sizes:

$$loss = \sum_i { (desiredArea_i - actualArea_i)^2} $$

I minimized this equation using the Nelder-Mead downhill simplex method, using a greedy approach to generate the initial layout. I also tried out using Multidimensional Scaling to generate the initial layout, but found it didn’t work as well.

Testing

Its pretty hard to make improvements to an algorithm without having a test to measure those improvements against.

To test out the layout algorithms, I’m randomly laying out some circles and calculating the intersection areas from those circles. Using only the intersection areas, the layout algorithm has to produce a venn diagram that is isomorphic to the original input.

Here is what the test looks like on my original solution:

Original

Reconstructed

The reconstructed output on the right should have areas that are proportional to the randomly drawn circles on the left. Any errors will be highlighted in red - hitting the ‘Find Next Failure’ button will run trials of this test until it fails. I’m calling any region that differs in size by more than 10% an error, at which point the trial is labelled a failure.

The difficulty of this test is pretty dependent on how the circles are randomly positioned. I’m placing each circle uniformly in (0,1) and then letting you vary the radius to increase or decrease the difficulty. Setting the radii to be in (.8, 1) makes this a much easier problem, while setting in (.1, 0.3) makes it harder.

To view the overall performance of each algorithm, I ran 10,000 trials of this test and aggregated the results:

Algorithm Performance

The good news here is that the existing approach works basically perfectly for 2 and 3 circle cases. The bad news is that performance drops off steeply past that. Most of the heavy lifting seems to be done by the initial layout - the Nelder-Mead optimization method I’m using is a local optimizer, and if the circles are disjoint or contain subsets the gradient of the loss function is flat. This makes it difficult for the optimizer to improve upon the initial layouts in this case.

The Greedy initial layout works decently well, but by placing circles one at a time it can produce layouts that are suboptimal globally. The MDS initial layout doesn’t work quite as well, but does a better job of capturing global structure.

Constrained Multidimensional Scaling

My idea was to improve the Multidimensional Scaling layout to be aware of subsets and disjoint circles.

The interesting thing about MDS is that it positions the sets by optimizing the distances between the circles, instead of the intersection areas directly. We calculate the euclidean distances by bisection on the desired area sizes, and then optimize the layouts for each point on a sum of squared errors loss function on these distances.

Classic MDS works perfectly in the case where there is no disjoint or subset relationships, but fails when there is.

For disjoint sets we don’t care what the distance is as long as its big enough that the two circles are completely separate. Likewise for subsets, we don’t care what the distance is as long as its small enough that the bigger circle completely overlaps the smaller one. The reason that classic MDS fails with disjoint/subset relationships is that its trying to optimize a single distance, when a range of possible distances would work.

So the idea here is to make the MDS layout aware of the constraints with subset and disjoint sets.

This requires selecting points such that distances between each pair of points X_i and X_j approximates the desired distance D_ij between them:

$$loss = \sum_i \sum_j { {\begin{cases} 0 & disjoint(i, j) \text{ and } (X_{i} - X_{j})^T(X_{i} - X_{j}) >= D_{ij}^2 \\ 0 & subset(i, j) \text { and } (X_{i} - X_{j})^T(X_{i} - X_{j}) <= D_{ij}^2 \\ ((X_{i} - X_{j})^T(X_{i} - X_{j}) - D_{ij}^2) ^2 & \text{otherwise} \\ \end{cases}}}$$

To minimize this function I’m using the Polak–Ribière Conjugate Gradient Method. This requires the derivative of the loss function, which can be gotten for each point X_i by applying the chain rule to the loss function twice:

$$ \nabla f(X_{i}) = \sum_j {\begin{cases} \vec{0} & disjoint(i, j) \text{ and } (X_{i} - X_{j})^T(X_{i} - X_{j}) >= D_{ij}^2 \\ \vec{0} & subset(i, j) \text { and } (X_{i} - X_{j})^T(X_{i} - X_{j}) <= D_{ij}^2 \\ 4 {((X_{i} - X_{j})^T(X_{i} - X_{j}) - D_{ij}^2)} (X_{i} - X_{j}) & \text{otherwise} \\ \end{cases}} $$

You can see the progress down below here. The CG optimizer is a iterative process, starting with a random initial configuration. I’ve animated the iterations so you can see the progress on our test:

Original

Reconstructed
loss=

Iteration

This works better than any other initial layout algorithm I’ve tried, but it still does occasionally get stuck in a local minima.

However, we can overcome these local minima by running this algorithm multiple times on different random initial configurations. Hit the ‘Recalculate Solution’ button to try another random starting position on the same input.

Running this 10 times on even the 8 set case happens on my laptop in under 2 ms, and leads to almost perfect results:

Initial Layout Performance

One downside here is that the diagrams can be rotated randomly, which leads to different looking output on different runs - which isn’t ideal. To handle this I’ve added some venn diagram normalization code which should remove most of the major differences. You can see the normalization code in action on its test page.

Also, with imperfect input the Greedy layout can occasionally come up with a better starting position. So the final solution is to run both 10 trials of this algorithm and the Greedy layout and evaluate which layout has the lowest loss. This layout is passed to the Nelder-Mead optimizer for fine tuning.

Overall performance can be viewed here. This algorithm is now pushed to the latest version of my venn.js library.

Comparison to VennEuler

Leland Wilkinson wrote an excellent paper on producing area proportional Venn and Euler diagrams, and made the source code available in the VennEuler package. As far as I’m aware this is the previous best published research on laying out Venn and Euler diagrams.

VennEuler doesn’t perform nearly as well on this test as venn.js does. The reason seems to be that while VennEuler frequently gets a solution that is close to being correct, it rarely gets a solution that is close enough for this test to say it succeeded.

I’ve written up a comparison between VennEuler and venn.js that talks about reasons that VennEuler isn’t performing all that well on this test.

While VennEuler is an excellent paper and a good solution, venn.js consistently outperforms it after these latest changes.