The Human Cell Atlas needs a pre-registered analysis plan

The Human Cell Atlas preprint came out a few days ago on bioRxiv. It describes a project to collect all the cell types in the human body in one big reference map.

Our mission: To create comprehensive reference maps of all human cells—the fundamental units of life—as a basis for both understanding human health and diagnosing, monitoring, and treating disease. [from]

The contributors to the project are a who's who of the leaders in single-cell genomics, and this will be a fantastic data set when it comes out. Because in-depth analysis of resources like this provides the foundation of all biology, as you know.

I enjoyed reading the preprint. It puts the project into a historical perspective and discusses promises as well as limitations. It even references Borges’ ‘On Rigor in Science’. (I love well-read scientists!) And even if all that means nothing to you, it is still worth reading as a comprehensive summary of the current state-of-the-single-cell-art.

But I kept wondering, with a project like this, how do you know whether it is a success or not? How do you know that your reference map is really comprehensive and covers all (most?) of what it is supposed to find?

Will the data tell me what a cell type is?

The preprint addresses this issue:

At a conceptual level, one challenge is that we lack a rigorous definition of what we mean by the intuitive terms “cell type” and “cell state.” (..) [T]he boundaries between these concepts can be blurred, because cells change over time in ways that are far from fully understood.

Ok, so that is a bit of an issue, isn’t it? If you are going to find cell types without really knowing what they are, how will you do it?

Ultimately, data-driven approaches will likely refine our concepts.

Of course, new data are always good and concepts might need to be adapted.

But … I am far from confident in data-first, think-later approaches. Why? Because the data never speak for themselves; it is theories and concepts that make data talk. Without a clear idea of what you are looking for, you will find nothing or, even worse, anything.

So back to square one … what is a cell type?

Guilds, clouds and hash functions

The editors of Cell Systems must have shared my ignorance and invited 15 of the world’s leading biologists to answer the question: ‘What Is Your Conceptual Definition of “Cell Type” in the Context of a Mature Organism?’

I understand these experts were only given a couple of hundred words each, so I did not expect a deep discussion of subtleties, but I hoped to find some kind of insight, even if at a rather high level.

However, I only ended up feeling quite frustrated.

In general, all the experts seem to think that cell type is an outdated concept, and several emphasize their contempt by putting the term in inverted commas (never a sign of rigour or confidence).

Some experts believe cell types should be substituted by cell states, a term comfortably more modern and much more dynamic. To be on the safe side, some of them put cell states in inverted commas too (not fully committed either, eh?).

Finally, some experts find solace in metaphors and start talking about guilds (I know it’s a thing in ecology) and ecosystems, as well as dynamic personalities, demographies and, wow, hard-wired clouds. (Now I feel tempted to put stuff in inverted commas myself.)

As if that wasn’t bad enough, an accompanying editorial even proposes to think of cells in terms of hash functions mapping states (or “states”) to phenotypes (or maybe “phenotypes”).

I am not impressed.

If this is the best the experts can do, then the Human Cell Atlas is built on shaky foundations.

How to make the data talk

I am confident the Human Cell Atlas will produce all the data they need and want. But the real science starts with the design of a computational and statistical analysis approach that brings order to all those noisy measurements and yields profound insights into the cell types of the human body.

The authors of the preprint know that too and write:

Unsupervised clustering algorithms for high-dimensional data provide an initial framework, but substantial advances will be needed in order to select the “right” features, “right” similarity metric, and the “right” level of granularity for the question at hand, control for distinct biological processes, handle technical noise, and connect novel clusters with legacy knowledge. [my emphasis]

I completely agree.

And this is why the Human Cell Atlas will need a better theory of cell type to start with. How else will they know what is right? Or even just “right”?

Unsupervised analyses are a dark art, especially if there are many non-linear knobs to turn. If the only thing we have to guide these unsupervised analyses are some fluffy ideas on cell types (or “types”) or states (or “states”) or maybe dynamic ecological guild clouds or whatever, then the project will have a rather hard time achieving its key objective.
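To make the dark art concrete, here is a minimal sketch (toy data, nothing single-cell-specific; the numbers and the hierarchical-clustering setup are entirely made up for illustration) of how one innocent-looking knob, the granularity cutoff, changes how many "cell types" you find in exactly the same data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "expression profiles": three tight groups, two of which sit close together.
rng = np.random.default_rng(0)
a = rng.normal([0, 0], 0.05, size=(20, 2))
b = rng.normal([1, 0], 0.05, size=(20, 2))   # close neighbour of a
c = rng.normal([10, 0], 0.05, size=(20, 2))  # far away from both
X = np.vstack([a, b, c])

Z = linkage(X, method="average")

# The same data, two settings of the granularity knob:
coarse = fcluster(Z, t=5.0, criterion="distance")  # cut high: 2 "cell types"
fine = fcluster(Z, t=0.5, criterion="distance")    # cut low:  3 "cell types"

print(len(set(coarse)), len(set(fine)))  # 2 3
```

Neither answer is wrong; without a concept of what a cell type is, there is simply nothing that tells you which cut to report.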

I liked the emphasis the preprint put on computational contributions and the need for new algorithms in the first half of the text, but I am a bit disappointed that these kinds of considerations did not make it to the list of important open discussion points at the end.

Fix the analysis before you touch the human data

I am sure the kind people of the Human Cell Atlas will be only too happy for someone with my track record in single cell genomics (barely any, sorry) to step in and tell them how to do their jobs.

So listen and learn, you guys were almost there:

As with the Human Genome Project, we will also need corresponding atlases for important model organisms, where conserved cell states can be identified and genetic manipulations and other approaches can be used to probe function and lineage. [my emphasis]

Aha, what a great idea!

Model organisms will be key to developing and fine-tuning the concepts and algorithms needed to build a reference map of human cells.

So here is the plan:

  1. Whatever you do to human, do it to worm first. That is, do all the multi-omics spatial whatever profiling you want to do and test it in the worm (I would expect this is happening already).
  2. Now analyse the worm data and see how your ideas of dynamic ecological guild clouds hold up.
  3. Tune the clustering algorithms to maximise agreement with the extensive cell-by-cell knowledge of the worm, used as a gold standard.
  4. Based on the worm results, fix an analysis plan BEFORE you touch the human data.
  5. Now analyse the human data and report the result without turning all those knobs again to get the most pleasing looking scatter plot.

This plan will definitely help against publication bias and data dredging. (No, dear Human Cell Atlas contributors, I am not saying you are planning to dredge the data on purpose, but better safe than sorry, right?)
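In case anyone doubts that dredging is a real danger even for the well-intentioned, here is a toy demonstration (pure noise, no biology whatsoever) of "discovering" cell types in random data and then "confirming" them on the very same data:

```python
import numpy as np
from scipy.stats import ttest_ind

# Pure noise: 100 "cells" by 10 "genes", with no cell types in it at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# "Discover" two cell types by splitting the cells on gene 0 ...
labels = X[:, 0] > np.median(X[:, 0])

# ... then "confirm" them by testing gene 0 between the groups we just built.
p = ttest_ind(X[labels, 0], X[~labels, 0]).pvalue
print(f"p = {p:.1e}")  # absurdly significant, although there is nothing to find
```

Generating and testing the hypothesis on the same data guarantees a glowing result; a pre-registered plan is the standard vaccine against exactly this.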

Also, we will end up with some pretty solid definition of what a cell type is, and all those inverted commas and metaphors will be a thing of the past.

Progress at last!



5 thoughts on “The Human Cell Atlas needs a pre-registered analysis plan”

  1. I am pretty sure the Human Cell Atlas will be open data, like for example ENCODE. So you could yourself download some single-cell RNA-seq from many of the open-access repositories, develop some ground truths, pre-register your analysis, and execute on it yourself? What will pre-reg guard against in this instance? I am certain these data will be analyzed and re-analyzed and meta-analyzed for years to come. Because of open data this is essentially a giant exploratory garden, right? Any big breakthroughs based on the data will likely be replicated quickly; the dataset will probably be released in freezes and phases, allowing for natural replication cycles. If the data is anything like ENCODE, the Human Genome Project, 1000 Genomes, GTEx and Roadmap Epigenomics before it, the major breakthroughs will trickle out over years and years of analysis and data reuse/repurposing. All the while obviously requiring rigor and replication. I am an ardent, if recent, supporter of pre-reg, but I wonder if it is applicable here given the nature of the dataset.


    1. Dear Michel, thank you for your comment.

      I see your point. And I am afraid you are right: these data will be first analyzed, then re-analyzed, and finally meta-analyzed.

      I still think that pre-registration is necessary here (if the term sounds too technical, I am sure we can find something else). The idea behind pre-registration is, as COS writes, that “the same data cannot be used to generate and test a hypothesis”.

      What we are looking for in the data (= concepts or hypotheses) must not be derived from the data we test them on. Else we overfit.

      And I believe that the very nature of this data set makes an important point about the necessity of getting your concepts clear(er) before touching the data (= pre-registration).



      1. Florian, thanks for your thoughtful response. I fully agree when you say: “What we are looking for in the data (= concepts or hypotheses) must not be derived from the data we test them on. Else we overfit.” My response would be to use this particular data to formulate hypotheses; it is, after all, going to be the first full cell atlas. While I am sure many groups and projects will perform in-depth single-cell sequencing of specific cell types/tissues, that data could and should be used to perform confirmatory analyses.


      2. Hi Michel, that’s one way to see it. The HCA is then the discovery data set for future smaller-scale follow-up studies. This reminds me of another Borges quote:

        “All statistics, all work that is merely descriptive and informative, imply the ambitious and perhaps groundless hope that in the incalculable future [people] like us, but with clearer minds, will infer from the data that we leave them some useful conclusion or some hidden truth.”

        (From ‘An evening with Ramon Bonavena’ in The New Yorker, 1970)

        So the good thing about the Human Cell Atlas is that future scientists with clearer minds will infer some hidden truths from it – I hope the HCA folks are not too disappointed … I thought they had been hoping for profound insights themselves …


  2. Beautiful quote indeed! The profound insights may take time, and will likely come about because of the open nature of the data, not in the initial publication (or publications; they’ll likely publish a whole series of papers concurrently, like with ENCODE).

