Inferring tumour evolution 3 – Methods for single samples

Series on Tumor Evolution

Figure 1: clonal evolution tree (details in previous post)

In the first post in the series I described a simple toy example to illustrate key concepts of tumour heterogeneity and evolution. A quick summary of the population composition and evolutionary relationships is displayed in Figure 1 on the right. There are three clones present in the sample (A, ABC and ABD), characterized by four sets of somatic mutations (A, B, C and D).

Our first discovery, when discussing this simple example in the last post, was that classical phylogenetic approaches might not capture important features of cancer evolution. So, which other methods are there to understand the evolution of clones in a tumour?

Principles of inferring tumour evolution

In this post and the next I want to discuss analysis approaches proposed in the last couple of years. Figure 2 organizes research strategies along basic principles, and this post (together with the next one) will discuss examples of each strategy in more detail.

  1. The first principal question is: does your method work on data from a mix of different clones (most of them do) or does it work on single cell genomes?
  2. The second question is: do you know which mutations appear together on the paternal or maternal copy of DNA (‘phased’), or do you have information on individual mutations without knowing the connection between them (‘unphased’)?
  3. Finally, does your method infer evolution from a single tumour sample or by integrating information from many? If you can do it from a single sample, then extending it to many is often straightforward; but if you need many, you will often not be able to do anything with a single sample.
Figure 2: Strategies to infer tumour evolution organized by basic questions.

Only five entries are filled, because I don’t know a method to infer the evolutionary history of a single cell. I will start by discussing 1a and 1b in this post.

1a: Single mixed sample, unphased

Figure 3: Clustering mutation frequencies hints at population structure of tumor. Bars show how many mutations appear with a certain frequency. I colored the clusters according to which clone in the tree they are specific for. In a real application you would not have this information.

A prominent approach to sample the genomes of clones in a tumour is by sequencing deeply, such that many reads cover each mutation. In our running example this ideally should show that there are four (noisy) clusters of mutations, which are (on average) present in 25%, 40%, 65% and 80% of all cells (see Figure 3). In one of the first examples of this approach, Shah et al. (2012) used deep sequencing to measure allelic abundance for 2,414 somatic mutations in triple-negative breast cancer.

Allelic frequency vs cellular frequency. One major challenge in these data is that in genomes as complex as cancer genomes the raw frequency of a mutation among the reads (the allelic frequency) is not necessarily identical to the fraction of cells carrying the mutation (the cellular frequency, x-axis in Figure 3). This is why Sohrab Shah’s lab developed a method called PyClone. It uses mixture models to identify clusters of SNVs with the same frequency and at the same time corrects these frequencies for copy-number changes and loss of heterozygosity, to estimate the fraction of cells in the tumor carrying these mutations. For copy-number data (instead of SNVs) similar methods exist, for example TITAN (also Shah lab) and THetA (from Ben Raphael’s lab).
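To make the difference between the two frequencies concrete, here is a minimal sketch of the usual textbook correction. It is not PyClone's model (PyClone infers these quantities jointly in a mixture model); it assumes that purity, the total copy number at the locus in tumour cells and the number of copies carrying the mutation are all known, and that all tumour cells share that copy-number state. The function name and parameters are mine, chosen for illustration.

```python
def cellular_frequency(vaf, purity, total_cn, mutated_copies, normal_cn=2):
    """Fraction of tumour cells carrying a mutation, under a simple model.

    vaf            -- observed allelic frequency (mutant reads / all reads)
    purity         -- fraction of cells in the sample that are tumour cells
    total_cn       -- copy number of the locus in tumour cells (assumed uniform)
    mutated_copies -- copies per mutated cell that carry the mutation
    normal_cn      -- copy number of the locus in normal cells
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if mutated_copies < 1 or mutated_copies > total_cn:
        raise ValueError("mutated_copies must be between 1 and total_cn")
    # Average number of copies of the locus per cell in the whole sample.
    avg_copies = purity * total_cn + (1 - purity) * normal_cn
    # Each mutated tumour cell contributes `mutated_copies` mutant alleles.
    return vaf * avg_copies / (purity * mutated_copies)


# A heterozygous mutation in a pure, diploid tumour ...
print(cellular_frequency(vaf=0.4, purity=1.0, total_cn=2, mutated_copies=1))
# ... and the same cellular frequency hidden behind a four-copy locus.
print(cellular_frequency(vaf=0.2, purity=1.0, total_cn=4, mutated_copies=1))
```

Both calls describe a mutation present in about 80% of tumour cells, although the allelic frequencies differ by a factor of two. That is exactly why raw allelic frequencies cannot be clustered directly.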

Clusters are not clones. Each clonal genome is a combination of some of these mutation clusters. The number of clones is smaller than or equal to the number of clusters. For example, there are only 3 clones in the tumour of our running example (A, ABC, ABD; AB died out), characterized by 4 clusters of mutations (A, B, C, D). The number of mutations in each cluster (the size of each hill in the histogram in Figure 3) has nothing to do with the number of cells that carry these mutations.
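A small sketch shows how three clones give rise to four clusters. The clone proportions below are not stated in the post; I derived them from the cluster frequencies of the running example (25%, 40%, 65%, 80%), with the remaining 20% of cells assumed to be normal. The names are hypothetical.

```python
# Clone genome -> percentage of all cells in the sample.
# "" stands for normal cells without any of the somatic mutations.
CLONES = {"": 20, "A": 15, "ABC": 25, "ABD": 40}


def mutation_frequencies(clones):
    """Percentage of cells carrying each mutation set, summed over clones."""
    freq = {}
    for genome, percent in clones.items():
        for mutation in genome:
            freq[mutation] = freq.get(mutation, 0) + percent
    return freq


print(mutation_frequencies(CLONES))
# {'A': 80, 'B': 65, 'C': 25, 'D': 40}
```

Cluster B shows up at 65% even though no AB clone exists any more: its frequency is the sum of the ABC and ABD clones that inherited it.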

Clusters are not yet a tree. To relate the clusters to clones you need to order them in a tree. The clonal genomes are then given by the mutations that happened along the path in the tree to this node. For example, the orange triangle in Figure 1 represents clone ABC, because the path from the top (the grey normal cells) consecutively adds mutations A, B and C. To order the clusters into a tree, there are two approaches: either (1) first cluster and then build a tree in an independent second step, or (2) cluster and build the tree jointly in an integrated model.
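Reading a clonal genome off the tree is just a walk from a node back to the root. Here is a minimal sketch, assuming the tree of Figure 1 is stored as a child-to-parent map (the representation and the names are my choice for illustration):

```python
# Parent of each mutation cluster in the tree of Figure 1.
# None marks the root, i.e. the normal cells without somatic mutations.
PARENT = {"A": None, "B": "A", "C": "B", "D": "B"}


def clonal_genome(node, parent=PARENT):
    """Collect all mutation clusters on the path from the root to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    # The walk went from leaf to root, so reverse it to get the order of events.
    return "".join(reversed(path))


print(clonal_genome("C"))  # ABC
print(clonal_genome("D"))  # ABD
```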

If you were to follow the first approach you could cluster mutations with PyClone and in our example establish four clusters with frequencies 25%, 40%, 65% and 80%. In a second step you could use this vector of frequencies as input to a method like TrAp from Yuval Kluger’s lab. TrAp solves a highly constrained matrix inversion to reconstruct a tree consistent with the given frequencies. The tree in the ‘life history’ paper is also an example of the first approach, but shows limitations of consecutive clustering and tree-building: in their Figure 3D one of the clusters (called ‘cluster A’) had to be spread out over 3 positions in the tree, a discrepancy that could hopefully be avoided in an integrated approach.

Quaid Morris’ PhyloSub is an example of the second approach (actually the very first such example). They cluster the mutations with a mixture model similar to PyClone’s, but relate the parameters for each cluster in a tree structure. This uses what is called a tree-structured stick-breaking process, which is too complex to cover here and hopefully will be the topic of another post.

Clonal trees from SNV frequencies are generally not unique. Reads are usually short and almost always contain only a single SNV, so the only information we have for tree building is the allele (or better: cellular) frequency. One of the first things Quaid and his team realized is that the trees you can reconstruct from frequencies are not necessarily unique. They identified several topological constraints. Take for example the toy tumour we have been discussing. Given the mutation frequencies of A (80%), B (65%), D (40%) and C (25%), a consistent tree could just order clones linearly by mutation frequency into:

A -> AB -> ABD -> ABDC.

This linear tree (incorrectly!) postulates the existence of four clones: A (15%), AB (25%), ABD (15%), ABDC (25%). (The way to compute this is: A exists in 80% of cells, B in 65%. If there is an AB clone, that leaves only 80 - 65 = 15% of cells for the A clone.) But given only frequencies of single mutations there is no way to distinguish this tree from the true tree in Figure 1.
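The arithmetic above generalizes: the share of a clone is the frequency of its defining cluster minus the frequencies of its child clusters. The sketch below applies this to the true tree and the linear tree and checks the one constraint I use here, namely that no clone may end up with a negative share. It is an illustration of the idea, not PhyloSub's actual inference, and all names are hypothetical.

```python
# Cluster frequencies of the running example, in percent of all cells.
FREQ = {"A": 80, "B": 65, "C": 25, "D": 40}

# Trees as child -> parent maps; None marks a child of the normal root.
TRUE_TREE = {"A": None, "B": "A", "C": "B", "D": "B"}
LINEAR_TREE = {"A": None, "B": "A", "D": "B", "C": "D"}
STAR_TREE = {"A": None, "B": "A", "C": "A", "D": "A"}


def clone_fractions(parent, freq=FREQ):
    """Share of cells in each clone: own frequency minus that of its children."""
    fractions = dict(freq)
    for node, par in parent.items():
        if par is not None:
            fractions[par] -= freq[node]
    return fractions


def is_consistent(parent, freq=FREQ):
    """A tree fits the frequencies only if no clone gets a negative share."""
    return all(share >= 0 for share in clone_fractions(parent, freq).values())


print(clone_fractions(TRUE_TREE))    # {'A': 15, 'B': 0, 'C': 25, 'D': 40}
print(clone_fractions(LINEAR_TREE))  # {'A': 15, 'B': 25, 'C': 25, 'D': 15}
print(is_consistent(STAR_TREE))      # False
```

Each key names the last cluster added, so 'D' in the linear tree is clone ABD and 'C' is clone ABDC. Both the true and the linear tree pass the check, which is the non-uniqueness problem in a nutshell. The star tree fails, because B, C and D together would need more cells than carry A.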

SNV and CNA information can help each other. Many methods only look at one type of data, either SNVs or copy-number alterations (CNAs), but the combination of them can be very powerful. If you have a copy-number gain and you can find the same SNV on all copies, you can infer that the SNV occurred before the CNA. On the other hand, if the SNV is only on one copy, you can infer that the SNV occurred after the CNA (Kudos, Thomas).

Figure 4: The first SNVs (in orange) can be found on all copies of a later amplification, whereas SNVs after the amplification (blue) are only found on individual copies. These observations help to establish an order between SNV and CNA events.
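The ordering rule can be written down in a few lines. This is only a sketch of the logic in Figure 4: it assumes you already know how many copies of the amplified segment carry the SNV, which in practice has to be estimated from allelic frequencies. The function and its labels are hypothetical.

```python
def snv_timing(copies_with_snv, amplified_copies):
    """Order an SNV relative to the amplification of the segment it sits on.

    copies_with_snv  -- number of copies of the amplified segment carrying the SNV
    amplified_copies -- total number of copies of that segment after the gain
    """
    if amplified_copies < 2:
        raise ValueError("an amplified segment needs at least two copies")
    if not 1 <= copies_with_snv <= amplified_copies:
        raise ValueError("copies_with_snv must be between 1 and amplified_copies")
    if copies_with_snv == amplified_copies:
        # The SNV was copied along with the segment.
        return "before amplification"
    if copies_with_snv == 1:
        # The SNV hit a single copy once the copies already existed.
        return "after amplification"
    # Intermediate counts point to several rounds of gain; no simple call.
    return "ambiguous"


print(snv_timing(copies_with_snv=3, amplified_copies=3))  # before amplification
print(snv_timing(copies_with_snv=1, amplified_copies=3))  # after amplification
```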

1b: Single mixed sample, phased

Depending on the technology, reads can be longer and span more than one mutation. The long-read example I know best is Sottoriva et al. (2013), who present methylation data from 454 sequencing of the IRX2 molecular clock, a 201bp locus on chromosome 5, which spans 8 CpG regions (potential methylation sites). Every read can be represented as a binary pattern of length 8 (where 1 is methylated and 0 is unmethylated). That is much more information than in the examples above, where every read only carries the information ‘there is an SNV’ or ‘there is no SNV’.
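To illustrate what 'a binary pattern of length 8' buys you, here is a small sketch with made-up reads (they are not data from Sottoriva et al.). Each read becomes a tuple of eight 0/1 values, and two reads can be compared by counting the sites at which they differ. The helper names are hypothetical.

```python
from collections import Counter

N_SITES = 8  # CpG sites covered by one read of the locus


def to_pattern(methylated_sites, n_sites=N_SITES):
    """Encode a read as a tuple of 0/1: 1 = methylated, 0 = unmethylated."""
    return tuple(1 if i in methylated_sites else 0 for i in range(n_sites))


def hamming(p, q):
    """Number of CpG sites at which two reads differ."""
    return sum(a != b for a, b in zip(p, q))


# Made-up reads, given as the sets of methylated site indices.
reads = [to_pattern(s) for s in ({0, 3}, {0, 3}, {0}, {0, 3, 7})]

print(Counter(reads).most_common(1))  # the most frequent pattern and its count
print(hamming(reads[0], reads[2]))    # 1
print(2 ** N_SITES)                   # 256 possible patterns per read
```

A single SNV read can only take two states, whereas one of these reads can take any of 256. Distances between patterns are what a phylogenetic method can then work with.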

Methylation data has the added advantage that its error rate is 10,000-fold higher than that observed for nucleotide substitutions, which gives it a much higher resolution as a marker of cell fate.

The major drawback, if you can call it that, is that methylation is a reversible process, whereas for SNVs you can safely assume that they don’t back-mutate. Thus, for SNVs you see an accumulation of events during tumor development, which makes it easier to infer a direction of the process. For methylation this is much harder and needs further, often artificial, assumptions, like normal tissue being completely unmethylated.

Our own -still unpublished- approach to infer tumor evolution from methylation data is called BitPhylogeny (for Bayesian intra-tumor phylogeny) and just like Quaid Morris’ PhyloSub it uses a nested stick-break process to sample trees for a mixture model. The code is available at — feel free to try it out, we are happy about any feedback. I will post more details in a future post.

Next post: inferring a tree from multiple samples.

