The Human Cell Atlas needs a pre-registered analysis plan

The Human Cell Atlas preprint came out some days ago on bioRxiv. It describes a project to collect all the cell types in the human body in one big reference map.

Our mission: To create comprehensive reference maps of all human cells—the fundamental units of life—as a basis for both understanding human health and diagnosing, monitoring, and treating disease. [from]

The contributors to the project are a who's who of the leaders in single cell genomics, and this will be a fantastic data set when it comes out. After all, in-depth analysis of resources like this provides the foundation of all biology, as you know.

I enjoyed reading the preprint. It puts the project into a historical perspective and discusses promises as well as limitations. It even references Borges’ ‘On Rigor in Science’. (I love well-read scientists!) And even if all that means nothing to you, it is still worth reading as a comprehensive summary of the current state-of-the-single-cell-art.

But I kept wondering, with a project like this, how do you know whether it is a success or not? How do you know that your reference map is really comprehensive and covers all (most?) of what it is supposed to find?

Will the data tell me what a cell type is?

The preprint addresses this issue:

At a conceptual level, one challenge is that we lack a rigorous definition of what we mean by the intuitive terms “cell type” and “cell state.” (..) [T]he boundaries between these concepts can be blurred, because cells change over time in ways that are far from fully understood.

Ok, so that is a bit of an issue, isn’t it? If you are going to find cell types without really knowing what they are, how will you do it?

Ultimately, data-driven approaches will likely refine our concepts.

Of course, new data are always good and concepts might need to be adapted.

But … I am far from confident in data-first, think-later approaches. Why? Because the data never speak for themselves: it is theories and concepts that make data talk. Without a clear idea of what you are looking for you will find nothing or, even worse, anything.

So back to square one … what is a cell type?

Guilds, clouds and hash functions

The editors of Cell Systems must have shared my ignorance and invited 15 of the world’s leading biologists to answer the question: ‘What Is Your Conceptual Definition of “Cell Type” in the Context of a Mature Organism?’

I understand these experts were only given a couple of hundred words each, so I did not expect a deep discussion of subtleties, but I hoped to find some kind of insight, even if at a rather high level.

However, I only ended up feeling quite frustrated.

In general, all the experts seem to think that cell type is an outdated concept, and several emphasize their contempt by putting the term in inverted commas (never a sign of rigour or confidence).

Some experts believe cell types should be substituted by cell states, a term comfortably more modern and much more dynamic. To be on the safe side, some of them put cell states in inverted commas too (not fully committed either, eh?).

Finally, some experts find solace in metaphors and start talking about guilds (I know it’s a thing in ecology) and ecosystems, as well as dynamic personalities, demographies and, wow, clouds. (Now I feel tempted to put stuff in inverted commas myself.)

As if that wasn’t bad enough, an accompanying editorial even proposes to think of cells in terms of hash functions mapping states (or “states”) to phenotypes (or maybe “phenotypes”).

I am not impressed.

If this is the best the experts can do, then the Human Cell Atlas is built on shaky foundations.

How to make the data talk

I am confident the Human Cell Atlas will produce all the data they need and want, but the real science starts with the design of a computational and statistical analysis approach that brings order into all those noisy measurements and yields profound insights into the cell types of the human body.

The authors of the preprint know that too and write:

Unsupervised clustering algorithms for high-dimensional data provide an initial framework, but substantial advances will be needed in order to select the “right” features, “right” similarity metric, and the “right” level of granularity for the question at hand, control for distinct biological processes, handle technical noise, and connect novel clusters with legacy knowledge. [my emphasis]

I completely agree.

And this is why the Human Cell Atlas will need a better theory of cell type to start with. How else will they know what is right? Or even just “right”?

Unsupervised analyses are a dark art, especially if there are many non-linear knobs to turn. If the only thing we have to guide these unsupervised analyses are some fluffy ideas on cell types (or “types”) or states (or “states”) or maybe dynamic ecological guild clouds or whatever, then the project will have a rather hard time achieving its key objective.
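To make the knob problem concrete, here is a toy sketch. Everything in it is my own assumption for illustration: the expression values are made up, and the one-dimensional "gap" clustering is a stand-in for the far more elaborate methods used on real single cell data. The point is only that the same data yield one, two or four "cell types" depending on a single threshold.

```python
def cluster_by_gap(values, gap):
    """Toy 1-D clustering: sort the points and start a new cluster
    wherever two neighbours differ by more than `gap`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > gap:
            current += 1
        labels[nxt] = current
    return labels


# Hypothetical expression values of one marker gene in eight cells.
expression = [0.1, 0.3, 1.0, 1.2, 5.0, 5.2, 6.0, 6.3]


def count_clusters(gap):
    """Number of clusters found in `expression` for a given knob setting."""
    return len(set(cluster_by_gap(expression, gap)))


for gap in (0.5, 1.0, 4.0):
    print(f"gap={gap}: {count_clusters(gap)} clusters")
```

Nothing in the data says which of the three answers is "right"; that has to come from somewhere else.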

I liked the emphasis the preprint put on computational contributions and the need for new algorithms in the first half of the text, but I am a bit disappointed that these kinds of considerations did not make it to the list of important open discussion points at the end.

Fix the analysis before you touch the human data

I am sure the kind people of the Human Cell Atlas will be only too happy for someone with my track record in single cell genomics (barely any, sorry) to step in and tell them how to do their jobs.

So listen and learn, you guys were almost there:

As with the Human Genome Project, we will also need corresponding atlases for important model organisms, where conserved cell states can be identified and genetic manipulations and other approaches can be used to probe function and lineage. [my emphasis]

Aha, what a great idea!

Model organisms will be key to developing and fine-tuning the concepts and algorithms needed to build a reference map of human cells.

So here is the plan:

  1. Whatever you do to human, do it to the worm first. That is, take all the multi-omics spatial whatever profiling you want to do and test it in the worm (I would expect this is happening already).
  2. Now analyse the worm data and see how your ideas of dynamic ecological guild clouds hold up.
  3. Tune the clustering algorithms to maximise the information gained by comparing to the extensive cell-by-cell knowledge of the worm as a gold standard.
  4. Based on the worm results, fix an analysis plan BEFORE you touch the human data.
  5. Now analyse the human data and report the result without turning all those knobs again to get the most pleasing looking scatter plot.
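The steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not anybody's actual pipeline: the "worm" and "human" values are invented, the gap-based clustering is a toy, the adjusted Rand index is my choice of agreement score, and fingerprinting the frozen plan with a SHA-256 hash is just one hypothetical way to show that the knobs were not touched afterwards.

```python
import hashlib
import json
from collections import Counter
from math import comb


def cluster_by_gap(values, gap):
    """Toy 1-D clustering: new cluster wherever sorted neighbours differ by more than `gap`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > gap:
            current += 1
        labels[nxt] = current
    return labels


def adjusted_rand_index(truth, predicted):
    """Chance-corrected agreement between two partitions (1.0 means identical)."""
    pairs_both = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = pairs_truth * pairs_pred / comb(len(truth), 2)
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:  # both partitions trivial and identical
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Steps 1-3: tune the knob on the worm, where the cell types are known.
# All values and labels below are made up for illustration.
worm_values = [0.1, 0.3, 0.4, 2.0, 2.2, 2.5, 6.0, 6.3, 6.4]
worm_gold = ["neuron"] * 3 + ["muscle"] * 3 + ["gut"] * 3
candidate_gaps = [0.25, 1.0, 2.0, 4.0]
best_gap = max(
    candidate_gaps,
    key=lambda g: adjusted_rand_index(worm_gold, cluster_by_gap(worm_values, g)),
)

# Step 4: freeze the plan BEFORE touching human data and fingerprint it.
plan = {"method": "cluster_by_gap", "gap": best_gap}
fingerprint = hashlib.sha256(json.dumps(plan, sort_keys=True).encode()).hexdigest()

# Step 5: apply the frozen plan to the human data, with no re-tuning.
human_values = [0.2, 0.5, 3.0, 3.1, 7.0, 9.5]
human_labels = cluster_by_gap(human_values, plan["gap"])

print("frozen plan:", plan, "sha256:", fingerprint[:12])
print("human clusters:", human_labels)
```

The design choice that matters is the order: `best_gap` is chosen using worm data only, and the human analysis reads it from the frozen `plan` instead of a fresh search. Publishing the fingerprint before the human data arrive would let anyone check that the plan was not changed later.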

This plan will definitely help against publication bias and data dredging. (No, dear Human Cell Atlas contributors, I am not saying you are planning to dredge the data on purpose, but better safe than sorry, right?)

Also, we will end up with some pretty solid definition of what a cell type is, and all those inverted commas and metaphors will be a thing of the past.

Progress at last!




One thought on “The Human Cell Atlas needs a pre-registered analysis plan”

  1. I am pretty sure the Human Cell Atlas will be open data, like ENCODE for example. So you can download some single-cell RNA-seq data from one of the many open-access repositories, develop some ground truths, preregister your analysis and execute it yourself. What will pre-reg guard against in this instance? I am certain these data will be analyzed and re-analyzed and meta-analyzed for years to come. Because of open data this is essentially a giant exploratory garden, right? Any big breakthroughs based on the data will likely be replicated quickly, and the dataset will probably be released in freezes and phases, allowing for natural replication cycles. If the data are anything like ENCODE, the Human Genome Project, 1000 Genomes, GTEx and Roadmap Epigenomics before it, the major breakthroughs will trickle out over years and years of analysis and data reuse and repurposing, all the while obviously requiring rigor and replication. I am an ardent, if recent, supporter of pre-reg, but I wonder if it is applicable here given the nature of the dataset.
