Duty Calls, Science

Don’t believe the petabytes! Against Big Data Empiricism


The data never speak for themselves; and even Big Data doesn’t change that.

“The business of Big Data, which involves collecting large amounts of data and then searching it for patterns and new revelations, is the result of cheap storage, abundant sensors and new software. It has become a multibillion-dollar industry in less than a decade,”

writes Quentin Hardy at NYtimes.com. Big Data is everywhere, even in medicine. Just have a look at Atul Butte's presentation at TEDMED2012:

“Who needs the scientific method? Vast stores of available data and outsourced research are simply waiting for the right questions,” claims Atul Butte.

Atul is enthusiastic about the wide and free availability of data, in particular genomics data. As he points out, a high-school student could easily download all microarray data ever generated by simply clicking the right buttons at GEO or ArrayExpress. Enterprises like SAGE Bionetworks push this trend forward and hope for a crowd of hackers to take on challenging biomedical problems. A review in Nature Biotech just showed that the field is abuzz with projects to make data and model sharing easier.
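
To get a feel for how low the barrier really is, here is a minimal sketch of such a download in Python, using the third-party GEOparse package; the accession number below is just a placeholder, not a dataset discussed in this post.

    # Minimal sketch: fetch one public expression series from GEO with GEOparse
    # (pip install GEOparse). GSE1563 is only a placeholder accession.
    import GEOparse

    gse = GEOparse.get_GEO(geo="GSE1563", destdir=".")

    print(gse.metadata.get("title"))   # study title as deposited by the authors
    print(len(gse.gsms), "samples (GSM records) in this series")

    # Peek at the expression table of the first sample: probe IDs and values.
    first_sample = next(iter(gse.gsms.values()))
    print(first_sample.table.head())

A dozen lines, and a complete study is sitting on your laptop.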

I share this enthusiasm! The computational data analyst in me can’t come up with a single reason why having access to more data should be a bad thing (and is happy to forget about genetic privacy for a second).

But the philosopher in me gets really nervous when he hears claims that a quantitative increase in data should qualitatively change the way we do science.

What was the question again?

Butte claims that the scientific method (ask question first, then gather data) has been made obsolete by the Big Data revolution. Asking questions first is soooo yesterday! Now we start with the data and only afterwards try to figure out what a sensible question could have been (a setup that somehow reminds me of The Ultimate Question of Life, the Universe and Everything, but let's not get distracted quite yet).

Butte’s presentation echoes an article in Wired called The End of Theory: The Data Deluge Makes the Scientific Method Obsolete:

“In short, the more we learn about biology, the further we find ourselves from a model that can explain it. There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”

Ha Ha! How cute! The naivety of it all! But seriously, the next time a petabyte tries to make you say ‘correlation is enough’ — don’t obey!

The reason is simple: Throwing numbers into bigger and bigger computing clusters is the easy and intellectually boring part of research; the hard and challenging part is making sense of the results!
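
To see what I mean, here is a toy simulation in Python (purely illustrative, not tied to any real study): measure a couple of thousand variables on a few dozen samples, as omics experiments routinely do, and "strong" correlations appear even when the data are nothing but random noise.

    # Toy illustration: mine pure noise for correlations and count the "hits".
    # The numbers (50 samples, 2000 features, |r| > 0.4) are arbitrary choices.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_features = 50, 2000
    data = rng.standard_normal((n_samples, n_features))   # nothing but noise

    corr = np.corrcoef(data, rowvar=False)                # all pairwise correlations
    upper = corr[np.triu_indices(n_features, k=1)]        # each pair counted once

    hits = int(np.sum(np.abs(upper) > 0.4))
    print(f"{hits} of {upper.size} feature pairs look 'correlated' (|r| > 0.4)")

The cluster happily delivers thousands of such "patterns"; only a model of the biology (or at least of the noise) can tell you that they are worthless.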

And how do you make sense of things? With models and theories, of course, stupid!

Data without theories are useless

Butte, Wired & Co assume the Wikipedia view of how science works: Formulate a question — Hypothesis — Prediction — Test — Analysis. That looks strict and well-structured, but even casual readers of Feyerabend know that science has never followed a strict protocol or a monolithic methodology. Scientific methods are now (and have always been) flexible enough to accommodate a bit more data — even when it comes by the petabyte.

And while “more data” always sounds good, unfortunately it also means “more noise and more junk”. Not every experiment you find on GEO or ArrayExpress went well — and figuring out which data to use and which to discard, and which correlations are important and which are just artifacts, is beyond the skills of even the most gifted high-school student. Sorry, Atul.

It is not enough to ‘let the data speak for themselves’, because they can’t! Treating hypotheses and models as some kind of subjective baggage biasing ‘the facts’ is the most simple-minded and silly kind of empiricism and inductivism. This position has long been found lacking; Popper even claimed to have killed it! Philosopher of science Alan Chalmers sums up the situation in his bestselling book ‘What Is This Thing Called Science?’:

“Attractive as it may have appeared, (..) the inductivist position is, at best, in need of severe qualification and, at worst, thoroughly inadequate. [T]he facts adequate for science are by no means straightforwardly given but have to be practically constructed, are in some important senses dependent on the knowledge that they presuppose, (..) and are subject to improvement and replacement.”

Data without theories are useless. Theories provide context for data. Theories tell you which data are important and which are junk. Theories are the road maps guiding research. Theories tell you which correlations are interesting enough to follow up and which are rubbish.

To sum up, the Butte & Wired argument stands on its head! Let me put it back on its feet: Precisely because more and more data become accessible every day, we need more and better models, theories and hypotheses.

Florian

Image source: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

7 thoughts on “Don’t believe the petabytes! Against Big Data Empiricism”

  1. Florian, your post is wonderful. Thank you.

    There is indeed a difference between identifying patterns, leveraging patterns, and making sense of those patterns. The first is something all human beings do naturally, although with varying degrees of success; we simply cannot help ourselves on this. The second is the domain of engineers, while the third belongs to scientists.

    Can computers be used to identify patterns? Yes. Can computers automagically leverage patterns? To some degree. Can computers make sense of those patterns? Not yet.

    One wonderful thing about the Big Data phenomenon and its access tools is that there has been an extension in the number of “human pattern matchers” who can watch for interesting things. As with the commoditization of telescopes, there are now far more amateur astronomers looking up at the sky, finding new comets and the like. Do these folks create new theories or models? Some do, but the data points and patterns they find are shared with scientists, and that is of great value to the scientific community.

    As I work in the field creating systems that enable the Big Data phenomenon, I often wonder about the impacts of Big Data capabilities. This ties into your commentary about noise and junk. The amount of synthetic data (data generated by algorithms working upon atomic elements) will grow exponentially. If the algorithms being used are “flawed”, then the synthetic data that ends up in the pool of knowledge will likewise be flawed. Models matter. Going one level further, often the data isn’t tagged as “synthetic” and is treated as fact. At a macro level, this is captured by the euphemism, “Oh, I saw it on the Web, it must be true!” This is happening at the lowest levels of the data machinery too.

    Again, thank you Florian… well done.

    -Bill


  2. Agreed. Without models and theories to test data against we have no structure for making useful inference. AI attempts at unsupervised learning in high-dimensional spaces without structure almost always give disappointing results. Our challenge is to conceive of & provide adequate high-dimensional models that can be tested and identified; ideally using a Bayesian paradigm so that coherent and sensible decisions can be made.

