Say that one day you're faced with a table of distance information between a bunch of points. Just looking at the table doesn't really provide any real information about the underlying structure of the data, so you want to find a way to visualize this in a way thats more meaningful.

For instance take a look of this table of distances between US cities:

Atlanta Chicago Denver Houston Los Angeles Miami New York San Francisco Seattle Washington, DC
Atlanta 05871212701193660474821392182543
Chicago 58709209401745118871318581737597
Denver 121292008798311726163194910211494
Houston 701940879013749681420164518911220
Los Angeles 1936174583113740233924513479592300
Miami 6041188172696823390109225942734923
New York 7487131631142024511092025712408205
San Francisco 2139185894916453472594257106782442
Seattle 21821737102118919592734240867802329
Washington, DC 543597149412202300923205244223290

 

While packed full of information, this doesn't really provide any context for the exact relations between these cities. A much more useful visualization would be a map of these points.

The challenge then is to produce coordinates for each item that best approximate the distances in the table. This type of problem is well solved by a set of techniques called Multidimensional Scaling (MDS). There is a good overview of MDS variants on the page where I stole this example from, but we only really need the classic version here. In this case, we just choose coordinates for each item such that they minimize the squared error between euclidean distances of the coordinates themselves and the entries in the distance matrix.

Finding these coordinates is actually a fairly simple calculation: you just square each value in the distance matrix, double centre that such that the columns and rows both have a zero mean, and then take the SVD of that matrix. The point coordinates are then in the factors returned by the SVD. A full derivation of this algorithm can be found here for anybody thats interested in why this works.

Displaying the output of this algorithm using D3.js, and a relatively sane looking output appears:

Obviously, the axes point in the opposite direction than normal in this diagram, which is because I cheated a little to get the output look a little more familiar. The raw distance matrix doesn't encode orientation information, so this process produced a set of coordinates that are the mirror image and rotated to what we're used to. However, the relative distances between all the cities here are almost exactly correct, which is the goal here.

In javascript using numeric.js, the code to produce these coordinates is fairly concise:

/// given a matrix of distances between some points, returns the
/// point coordinates that best approximate the distances
mds.classic = function(distances, dimensions) {
    dimensions = dimensions || 2;

    // square distances
    var M = numeric.mul(-.5, numeric.pow(distances, 2));

    // double centre the rows/columns
    function mean(A) { return numeric.div(numeric.add.apply(null, A), A.length); }
    var rowMeans = mean(M),
        colMeans = mean(numeric.transpose(M)),
        totalMean = mean(rowMeans);

    for (var i = 0; i < M.length; ++i) {
        for (var j =0; j < M[0].length; ++j) {
            M[i][j] += totalMean - rowMeans[i] - colMeans[j];
        }
    }

    // take the SVD of the double centred matrix, and return the
    // points from it
    var ret = numeric.svd(M),
        eigenValues = numeric.sqrt(ret.S);
    return ret.U.map(function(row) {
        return numeric.mul(row, eigenValues).splice(0, dimensions);
    });
};

Applications to laying out Venn Diagrams

I first stumbled across this family of algorithms a couple of weeks ago when I was reading a paper by Leland Wilkinson for my post on laying out area proportional venn diagrams. We both had taken the approach of generating an initial layout for the venn diagrams, and then optimizing that layout. While I just came up with a simple greedy approach to generate the initial layout, he used MDS instead.

While my greedy approach usually worked really well, there was one example in the Wilkinson paper where his approach was clearly superior:

Aside from the 'B' and 'E' sets overlapping slightly here when they should be disjoint, this diagram is basically perfect. The greedy version here is fairly flawed in comparison - with the E set dangling off and not connecting to the D set at all:

Note: you need Javascript enabled and a SVG compatible browser to view the venn diagrams below
Initial Greedy Layout
Optimized Greedy Layout

The problem here is the E set is the last set laid out by the greedy layout, since it has the least amount of overlap of any set in the diagram. While it overlaps the D and F sets, they have been placed too far apart for this set to bridge that gap, creating a layout with a local minima that the local optimization technique I used couldn't step out of.

By calculating the appropriate euclidean distance between each pair of sets numerically, and then passing that distance matrix to the MDS algorithm above, we get a layout that can be optimized to a much better solution:

Initial MDS Layout
Optimized MDS Layout

Its worth pointing out the Wilkinson paper uses MDS on the Jaccard distance between the sets, rather than Euclidean distance I used here. This doesn't seem to change the results though, both approaches lead to a very similar diagram.

Conclusion

I've put my javascript version up on github, and I've also updated the venn.js code to optionally use this as an initial layout.

I was about to use MDS by default for the venn.js library, but the greedy layout seems to do a better job on most other examples I tried. The problem seems to be that MDS spends an equal amount of effort trying to get the distances right for small intersection areas, whereas the greedy layout only looks at the largest intersections. Even in the example above the greedy version actually produces a diagram with a much lower loss, it just doesn't optimize as well as the MDS version.

Other MDS variants might overcome this problem though. I looked at using a weighted MDS calculation, with a vague notion of using the size of the set intersections as the weights to use. Unfortunately, all the information I could find about WMDS online was only talking about weighting the rows of the matrix rather than the entries - and the primary research is behind an academic paywall. Next time I'm near a university library I probably see about getting it, but I'm stopping here in the meantime.

While it didn't really solve my specific problem as well as I would have liked, its still a cool algorithm.