
“Detecting Novel Associations in Large Data Sets” — let the giants battle it out!


Computational and statistics papers usually don’t make it into glossy high-impact journals. “Your manuscript seems better suited for a more technical journal” is a standard response to submissions focusing on theory rather than data.

But sometimes these papers make it through, usually to Science, which has a much better track record for theoretical papers than Nature. An encouraging recent example is “Detecting Novel Associations in Large Data Sets” by Reshef et al. in Science:

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

The project seems to be driven by the Broad Institute, and I can’t imagine that having Eric Lander as a co-author was a disadvantage for their paper. The Broad even produced a movie to promote the new method. You can download their software, including a wrapper for R.
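To get a feel for what MIC measures, here is a toy sketch in R. It does not use the Broad’s own wrapper; it assumes the minerva package from CRAN, an independent implementation of MIC. The point is the abstract’s claim: Pearson correlation misses a noiseless but non-monotonic relationship that MIC scores highly.

```r
# Toy illustration (not the authors' own wrapper): assumes the 'minerva'
# CRAN package, an independent implementation of MIC.
library(minerva)

set.seed(1)
x <- runif(200)
y <- 4 * (x - 0.5)^2   # noiseless parabola: functional but non-monotonic

cor(x, y)        # Pearson correlation: close to 0 by symmetry
mine(x, y)$MIC   # MIC: close to 1 for a noiseless functional relationship
```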

The giants of statistics speak

Mutual information and correlation measures are hardly a new field in statistics and data analysis. So what does the establishment think about the new method?

The first responses I saw were very positive: Terry Speed from Berkeley wrote an enthusiastic commentary in the same issue of Science, calling MIC “a correlation for the 21st century”. And Andrew Gelman from Columbia University wrote a blog post, ‘Mr. Pearson, meet Mr. Mandelbrot’, saying “My quick answer is that it looks really cool!” Congratulations, Broad, you did well!

But wait, let’s not get ecstatic quite yet … a recent comment by Noah Simon and Rob Tibshirani from Stanford reads quite differently (thanks to Simply Statistics for the link). It’s short enough to quote in its entirety (emphasis mine):

The proposal of Reshef et al. (“MIC”) is an interesting new approach for discovering non-linear dependencies among pairs of measurements in exploratory data mining. However, it has a potentially serious drawback. The authors laud the fact that MIC has no preference for some alternatives over others, but as the authors know, there is no free lunch in Statistics: tests which strive to have high power against all alternatives can have low power in many important situations.

To investigate this, we ran simulations to compare the power of MIC to that of standard Pearson correlation and the distance correlation (dcor) of Székely & Rizzo (2009). We simulated pairs of variables with different relationships (most of which were considered by Reshef et al.), but with varying levels of noise added. To determine proper cutoffs for testing the independence hypothesis, we simulated independent data with the appropriate marginals. As one can see from the Figure, MIC has lower power than dcor in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome. This set of dependencies is by no means exhaustive; however, it suggests that MIC has serious power deficiencies, and hence when it is used for large-scale exploratory analysis it will produce too many false positives. The “equitability” property of MIC is not very useful if it has low power.

We believe that the recently proposed distance correlation measure of Székely & Rizzo (2009) is a more powerful technique that is simple, easy to compute and should be considered for general use.

A full R script for Simon & Tibshirani’s analysis is here. The competing (and less well publicized: no movies!) approach they refer to is Székely and Rizzo, ‘Brownian distance covariance’, Annals of Applied Statistics, 2009.
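To get a feel for what such a power comparison involves, here is a minimal sketch in R (my own simplification, not Simon & Tibshirani’s script). It assumes the energy package from CRAN for dcor and covers only the linear case; the cutoffs come from simulating independent data with matching marginals, as described above. MIC could be added as a third statistic with whichever implementation is at hand.

```r
# Minimal power-simulation sketch (a simplification, not Simon &
# Tibshirani's script); assumes the 'energy' CRAN package for dcor.
library(energy)

set.seed(1)
n <- 100; nsim <- 500; noise <- 1   # sample size, replicates, noise level

# Null cutoffs: independent data with the same marginals as the alternative
null.stats <- replicate(nsim, {
  x <- runif(n)
  y <- runif(n) + rnorm(n, sd = noise)   # same marginal, but independent of x
  c(pearson = abs(cor(x, y)), dcor = dcor(x, y))
})
cutoffs <- apply(null.stats, 1, quantile, 0.95)   # cutoffs at the 5% level

# Power under the linear alternative y = x + noise
hits <- replicate(nsim, {
  x <- runif(n)
  y <- x + rnorm(n, sd = noise)
  c(abs(cor(x, y)), dcor(x, y)) > cutoffs
})
rowMeans(hits)   # estimated power: first entry Pearson, second dcor
```

Varying the noise level and swapping in the other relationships (parabola, circle, sine waves) reproduces the gist of their figure.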

Shoot! Now I’m confused again. Trusting Speed and Gelman, I had planned to recommend MIC to everyone in my lab, but “serious power deficiencies” doesn’t sound so good!

Maybe it’s best to stay out of the way while Lander, Speed, Gelman, and Tibshirani battle it out, and to start testing the different approaches myself.

Let’s see what the final verdict is …

Florian

Update Jan 26 2012: Jeff at Simply Statistics takes the MIC paper as an occasion to ask “When should statistics papers be published in Science and Nature?” and discusses the difference between groundbreaking and definitive papers. Glossy journals tend to err on the side of groundbreaking papers, which are good for their impact factor. The same point came up in a post I wrote a while ago on scientific de-discovery and forensics.

Image source: http://en.wikipedia.org/wiki/File:Correlation_examples2.svg

18 thoughts on ““Detecting Novel Associations in Large Data Sets” — let the giants battle it out!”

  1. Please let us know the outcome. I have use for it as well and might even test the two methods myself if I find the time, but you are definitely a better statistics reference than I am… Seb


  2. I would add that I was disappointed that the code is not open source, and there is little public detail. I don’t know how much detail is behind the paywall; a good example of why this sort of thing should be open source and published in freely accessible locations.


    1. Look on the authors’ website for a link to the reprint version. If it doesn’t include the supplement (I *think* it does), then email the authors… I was happy that they directly linked the reprint; though making the paper Open Access would be nice, perhaps this saves somebody a few thousand dollars and achieves the same outcome?


  3. My first comment after a first trial run is that what they consider a “large” dataset is ridiculously small. I tried to run it on ~890 samples (columns) and ~50,000 TSS (rows; OK, maybe a bit big, but if you advertise something for “large” datasets, then this is the norm rather than the exception). What I really wanted was to run each sample vs. each sample. I gave up on the first run after the Java executable had still not finished after 2 days…

    I also vote for open source and for code that works on datasets bigger than ~30 MB…
    As a concept it seems quite nice; as a usable method I have been unlucky so far… probably code written in C and heavily parallelized could solve some of the issues… or I just plainly did something wrong…


    1. Hi Sebastian,

      I’ve been running large datasets in MINE since it debuted — 400,000 to 1,000,000 rows, but only 10 to 15 columns. Running those datasets in Java via the command line finishes in about 50 minutes to 1.5 hours. Are you running MINE with the “-allPairs” switch?

      I would suggest running the sample WHO data they provide on the MINE website to make sure your setup is running properly; that analysis should take 30-40 minutes with “-allPairs”.


    The distance correlation is also much faster.

    MIC also seems to be slower and less powerful (using a modified copy of Simon and Tibshirani’s code) than computing the mutual information on the ranks of the two variables with a fixed set of bins. In my tests using Simon and Tibshirani’s examples, the latter outperformed MIC in all but the rapidly oscillating sine case and underperformed the distance correlation for all but the circle and the two sine waves.

    The latter algorithm is simple to implement and scales as N log N (N = number of points). One transforms the two variables by replacing each value with its rank (e.g., x -> rank(x), y -> rank(y)). One then chooses a grid spacing (preferably a factor of N) and computes the mutual information from that grid; a sketch follows below.

    Using the MIC saves you from worrying about what grid spacing to use.
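    For concreteness, here is a minimal R sketch of the rank-grid mutual information just described (one reading of the recipe, not the commenter’s own code; the bin count is a free parameter, with sqrt(N) used here as one common choice):

    ```r
    # Sketch of the rank-grid MI described above; bin count is a free choice
    rank.mi <- function(x, y, bins = floor(sqrt(length(x)))) {
      rx <- cut(rank(x), bins, labels = FALSE)   # rank, then discretize
      ry <- cut(rank(y), bins, labels = FALSE)
      p  <- table(rx, ry) / length(x)            # joint cell frequencies
      px <- rowSums(p); py <- colSums(p)         # marginal frequencies
      nz <- p > 0                                # skip empty cells: avoid log(0)
      sum(p[nz] * log(p[nz] / outer(px, py)[nz]))  # mutual information in nats
    }

    set.seed(1)
    x <- runif(1000); y <- sin(10 * x) + rnorm(1000, sd = 0.1)
    rank.mi(x, y)   # clearly above the near-zero value for independent data
    ```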


    1. Jayzus suffering Chrikey: thank god someone else thought to do this. I thought I was losing my marbles. Normalized MI was the first thing I reached for, and it’s ridiculously quicker and seems to work as well or better. dcor was going to be the second thing I looked at, but I guess Tibshirani beat me to it.
      If you look at the detailed SOM, they’re basically calculating a sort of MI using an … interesting new form of discretization.
      I envy their publicity empire, but this algorithm is not what is being touted.


  5. I hate the MIC paper: it is a gigantic oversell. It would have been appropriate for Bioinformatics, not even for a statistical journal. MIC is slow and has no sound theoretical justification; its implementation relies on a crude heuristic and is flawed (there are jumps in the significance when steadily increasing the number of points; we have checked that several times); the p-values even go down with increasing sample size if you score fractal curves (use the blancmange curve and the exponent n=0.2 instead of their n=0.6 to see it quickly); and it generally performs worse than the dcor method. Watch me bash the MIC method at Florian’s workshop this week.


  6. I forgot to mention: there is a lot of arbitrariness in the choice of the bound B(n) = n^0.6; it is completely unmotivated. Most importantly, their Theorem 3, proving convergence of MIC -> 1 for continuous noiseless functions, is WRONG. The mistake is in Proposition 6.14, where they claim “There are at most kB(n) pairs of points with consecutive x-values with y-values in different bins of the equipartition, where k is a constant that depends on f.” This is easily proven wrong using the (x, sin(1/x)) curve on (0,1]: near 0 the curve oscillates so fast that the number of consecutive-x pairs whose y-values land in different bins grows linearly in n, which eventually exceeds k·n^0.6 for any constant k. Those who prefer continuity on the closed interval [0,1] can take (x, x*sin(1/x)); this takes a little more effort, but still works. Note that the mistake cannot be mended by adjusting the functional form of B(n). In conclusion, MIC does not even fulfill elementary requirements for a dependence measure.


    1. “In conclusion, MIC does not even fulfill elementary requirements for a dependence measure.” That also sounds like a good summary sentence for a paper on this topic. Achim, can you still submit a response to Science?


      1. Don’t you think I already have enough enemies? 🙂 Let’s discuss it on Thursday. Maybe some of the workshop participants would like to join me in that response.

