Computational and statistics papers usually don’t make it into glossy high-impact journals. “Your manuscript seems better suited for a more technical journal” is a standard response to submissions focusing on theory rather than data.
But sometimes these papers make it through, usually to Science, which has a much better track record for theoretical papers than Nature. An encouraging recent example is Detecting Novel Associations in Large Data Sets by Reshef et al. in Science:
Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R2) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships. *
The project seems to be driven by the Broad Institute, and I can’t imagine that having Eric Lander as a co-author was a disadvantage for their paper. The Broad even produced a movie to promote their new method. You can download their software, including a wrapper for R.
The giants of statistics speak
Mutual information and correlation measures are not what you would call a new field in statistics and data analysis. So what does the establishment think about the new method?
The first responses I saw were really good: Terry Speed from Berkeley wrote a very positive commentary in the same issue of Science calling MIC “a correlation for the 21st century”. And Andrew Gelman from Columbia University wrote a blog post ‘Mr. Pearson, meet Mr. Mandelbrot’ saying “My quick answer is that it looks really cool!” Congratulations! Broad, you did well!
But wait, let’s not get ecstatic quite yet … a recent comment by Noah Simon and Rob Tibshirani from Stanford looks different (thanks to Simply Statistics for the link). It’s short enough to quote in its entirety (emphasis is mine):
The proposal of Reshef et al. (“MIC”) is an interesting new approach for discovering non-linear dependencies among pairs of measurements in exploratory data mining. However, it has a potentially serious drawback. The authors laud the fact that MIC has no preference for some alternatives over others, but as the authors know, there is no free lunch in Statistics: tests which strive to have high power against all alternatives can have low power in many important situations.
To investigate this, we ran simulations to compare the power of MIC to that of standard Pearson correlation and distance correlation (dcor) Székely & Rizzo (2009). We simulated pairs of variables with different relationships (most of which were considered by Reshef et al.), but with varying levels of noise added. To determine proper cutoffs for testing the independence hypothesis, we simulated independent data with the appropriate marginals. As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome. This set of dependencies is by no means exhaustive, however it suggests that MIC has serious power deficiencies, and hence when it is used for large-scale exploratory analysis it will produce too many false positives. The “equitability” property of MIC is not very useful, if it has low power.
We believe that the recently proposed distance correlation measure of Székely & Rizzo (2009) is a more powerful technique that is simple, easy to compute and should be considered for general use. *
A full R language script for Simon & Tibshirani’s analysis is here. The competing (and less well popularized: no movies!) approach they refer to is Székely and Rizzo, ‘Brownian distance covariance’, Annals of Applied Statistics 2009.
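To make the comparison concrete, here is a minimal, self-contained sketch of the two competing measures, Pearson correlation and the Székely & Rizzo distance correlation, implemented in plain Python from their definitions (the authors’ own analysis is in R; this is just an illustration, not their code). The point it demonstrates is the one at issue: Pearson correlation is exactly zero for a symmetric quadratic relationship, while distance correlation still detects the dependence.

```python
import math

def pearson(x, y):
    # Classical Pearson correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def _centered_dist(v):
    # Pairwise distance matrix, double-centered as in Székely & Rizzo (2009):
    # subtract row and column means, add back the grand mean.
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def dcor(x, y):
    # Sample distance correlation: dCov / sqrt(dVar(x) * dVar(y)).
    n = len(x)
    A, B = _centered_dist(x), _centered_dist(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dvarx = sum(A[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvary = sum(B[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    return math.sqrt(dcov2 / math.sqrt(dvarx * dvary))

x = [-3, -2, -1, 0, 1, 2, 3]
print(pearson(x, [a * a for a in x]))  # exactly 0: Pearson misses y = x^2
print(dcor(x, [a * a for a in x]))     # clearly positive: dcor detects it
```

For a perfectly linear relationship both measures return 1, but on the quadratic example Pearson is blind while dcor is not, which is exactly why Simon & Tibshirani argue it is a strong general-purpose competitor to MIC.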
Shoot! Now I’m confused again. Trusting Speed and Gelman I had planned to recommend MIC to all people in my lab, but “serious power deficiencies” doesn’t sound so good!
Maybe it’s best to stay out of the way while Lander, Speed, Gelman, and Tibshirani battle it out, and to start testing the different approaches myself.
Let’s see what the final verdict is …
Update Jan 26 2012: Jeff at Simply Statistics takes the MIC paper as a reason to ask When should statistics papers be published in Science and Nature? and discusses the difference between groundbreaking and definitive papers. Glossy journals tend to err on the side of groundbreaking papers, which are good for their impact factor. A similar point came up in a post I wrote a while ago on scientific de-discovery and forensics.