The data never speak for themselves, and even Big Data doesn’t change that.
“The business of Big Data, which involves collecting large amounts of data and then searching it for patterns and new revelations, is the result of cheap storage, abundant sensors and new software. It has become a multibillion-dollar industry in less than a decade,”
Atul is enthusiastic about the wide and free availability of data, in particular genomics data. As he points out, a high-school student could easily download all microarray data ever generated by simply clicking the right buttons at GEO or ArrayExpress. Enterprises like SAGE Bionetworks push this trend forward and hope for a crowd of hackers to take on challenging biomedical problems. A review in Nature Biotech just showed that the field is abuzz with projects to make data and model sharing easier.
I share this enthusiasm! The computational data analyst in me can’t come up with a single reason why having access to more data should be a bad thing (and is happy to forget about genetic privacy for a second).
But the philosopher in me gets really nervous when he hears claims that a quantitative increase in data should qualitatively change the way we do science.
What was the question again?
Butte claims that the scientific method (ask question first, then gather data) has been made obsolete by the Big Data revolution. Asking questions first is soooo yesterday! Now we start with the data and only afterwards try to figure out what a sensible question could have been (a setup that somehow reminds me of The Ultimate Question of Life, the Universe and Everything, but let’s not get distracted just yet).
Butte’s presentation echoes an article in Wired called The End of Theory: The Data Deluge Makes the Scientific Method Obsolete:
“In short, the more we learn about biology, the further we find ourselves from a model that can explain it. There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
Ha Ha! How cute! The naivety of it all! But seriously, the next time a petabyte tries to make you say ‘correlation is enough’ — don’t obey!
The reason is simple: throwing numbers into bigger and bigger computing clusters is the easy and intellectually boring part of research; the hard and challenging part is making sense of the results!
And how do you make sense of things? With models and theories, of course, stupid!
Data without theories are useless
Butte, Wired & Co assume the Wikipedia view of how science works: formulate a question, then hypothesis, prediction, test, analysis. That looks strict and well-structured, but even casual readers of Feyerabend know that science has never followed a strict protocol or a monolithic methodology. Scientific methods are now (and have always been) flexible enough to accommodate a bit more data, even when it comes by the petabyte.
And while “more data” always sounds good, unfortunately it also means “more noise and more junk”. Not every experiment you find on GEO or ArrayExpress went well — and figuring out which data to use and which not, and which correlations are important and which ones are just artifacts is beyond the skills of even the most gifted high-school student. Sorry, Atul.
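To make the point about artifacts concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names are made up, and the numbers (10 samples, 2000 features) are arbitrary stand-ins for "few arrays, many genes", not taken from GEO or ArrayExpress. It generates pure noise, correlates every noise feature with a noise outcome, and reports the strongest correlation it finds.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_samples=10, n_features=2000, seed=0):
    """Largest |r| between a random outcome and many unrelated random features.

    Hypothetical helper: all values are independent Gaussian noise,
    so any correlation found here is an artifact by construction.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    r = strongest_chance_correlation()
    print(f"strongest |r| among pure-noise features: {r:.2f}")
```

With so few samples and so many features, some noise feature will almost certainly correlate strongly with the outcome. Nothing in the numbers themselves marks that correlation as rubbish; only knowing how the data were generated does.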
It is not enough to ‘let the data speak for themselves’, because they can’t! Treating hypotheses and models like some kind of subjective baggage biasing ‘the facts’ is the most simple-minded and silly kind of empiricism and inductivism. This position has long been found lacking; Popper even claimed to have killed it! Philosopher of science Alan Chalmers sums up the situation in his bestselling book ‘What Is This Thing Called Science?’:
“Attractive as it may have appeared, (..) the inductivist position is, at best, in need of severe qualification and, at worst, thoroughly inadequate. [T]he facts adequate for science are by no means straightforwardly given but have to be practically constructed, are in some important senses dependent on the knowledge that they presuppose, (..) and are subject to improvement and replacement.”
Data without theories are useless. Theories provide context for data. Theories tell you which data are important and which are junk. Theories are the road maps guiding research. Theories tell you which correlations are interesting enough to follow up and which are rubbish.
To sum up, the Butte & Wired argument stands on its head! Let me put it back on its feet: precisely because more and more data become accessible every day, we need more and better models, theories and hypotheses.