Here is the table again that I introduced last time to organise tumor phylogeny approaches along some basic principles:
In the last post we discussed 1a and 1b; now we are off to 2a and 2b.
2a: Multiple mixed samples, unphased
This section conceptually focuses on how to infer cancer phylogenies from single nucleotide variants (SNVs) identified by deep-sequencing a cancer genome.
Modeling multiple samples. From a statistical perspective you need to specify a probability for the state of each SNV in the sample given its cellular frequency: Pr( sample | cf ). For multiple samples, the simplest assumption is that the data (read counts) from different samples are conditionally independent given their cellular frequencies:
Pr ( sample1, sample2, cf1, cf2 ) = Pr( sample1 | cf1 ) Pr( sample2 | cf2 ) Pr( cf1, cf2 ).
Now how to specify Pr(cf1,cf2)? Methods like PyClone assume independence. But in the case of serial blood samples or circulating tumor DNA from plasma there will be temporal dependencies, which you might want to model. And different tumor sites might exhibit spatial dependencies. So if you are really into modelling you might want to put everything together in a big spatio-temporal model. Kriged Kalman-filters anybody?
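As a minimal sketch of the factorisation above, the following Python snippet evaluates the log of the joint probability for one SNV observed in two samples. It assumes a binomial read-count model and a heterozygous SNV in a diploid, pure sample (so the expected variant allele fraction is cf/2). Neither assumption comes from this post, and the function names and the uniform independent prior are purely illustrative.

```python
import math

def log_binom_pmf(k, n, p):
    """log Pr(k variant reads out of n total | variant allele fraction p)."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def independent_uniform_log_prior(cfs):
    """Illustrative prior: each cf independent and uniform on [0, 1]."""
    return 0.0 if all(0.0 <= cf <= 1.0 for cf in cfs) else -math.inf

def log_joint(samples, cfs, log_prior):
    """log of Pr(sample1 | cf1) Pr(sample2 | cf2) ... Pr(cf1, cf2, ...).

    samples: list of (variant_reads, depth), one entry per sample
    cfs:     list of cellular frequencies, one entry per sample
    """
    total = log_prior(cfs)
    for (k, n), cf in zip(samples, cfs):
        # Assumption: heterozygous SNV, diploid locus, no normal contamination.
        total += log_binom_pmf(k, n, cf / 2.0)
    return total

# Hypothetical read counts: 20/100 reads in sample 1, 45/100 in sample 2.
obs = [(20, 100), (45, 100)]
print(log_joint(obs, [0.4, 0.9], independent_uniform_log_prior))
print(log_joint(obs, [0.9, 0.4], independent_uniform_log_prior))
```

The prior is passed in as a function so that Pr(cf1, cf2) can be replaced by a dependent model without touching the likelihood terms.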
I plan a future post specifically about details of statistical modeling; the remainder of this one will be more about basic ideas and concepts.
Multiple samples can help identify SNV clusters. Last time we discussed that one of the first steps in analyzing deep sequencing data from a tumour is to cluster SNVs by frequency, ideally correcting for copy-number changes and LOH, to get clusters which (on average) appear in the same number of cells (i.e. have the same cellular frequency).
Now what if you are unlucky and there are two groups of SNVs which both happen to appear with a frequency of 40%, but in different cells? In the frequency distribution these two sets would sit right on top of each other and you would not be able to distinguish between them.
This is where a second sample can help. If you were unlucky with the first sample, you might be lucky in the second one and the two clusters might appear with different frequencies. Figure 2 shows an example.
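To make this concrete, here is a small sketch with made-up numbers: two groups of three SNVs that both sit at about 40% in sample 1 but at roughly 70% and 20% in sample 2. The clustering routine is a deliberately naive single-linkage grouping with a fixed distance threshold, chosen for brevity; it is not how PyClone or any other published tool clusters SNVs.

```python
import math

def single_linkage(points, eps):
    """Label connected components; two points are linked if closer than eps."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            a = stack.pop()
            for b in range(len(points)):
                if labels[b] is None and math.dist(points[a], points[b]) < eps:
                    labels[b] = current
                    stack.append(b)
        current += 1
    return labels

# Hypothetical cellular frequencies; the first three SNVs form one group,
# the last three the other.
sample1 = [0.39, 0.40, 0.41, 0.40, 0.41, 0.39]
sample2 = [0.69, 0.70, 0.71, 0.19, 0.20, 0.21]

# Clustering on sample 1 alone: both groups collapse into one cluster.
labels_1d = single_linkage([(x,) for x in sample1], eps=0.05)

# Clustering on both samples jointly: the groups separate.
labels_2d = single_linkage(list(zip(sample1, sample2)), eps=0.05)

print(labels_1d)
print(labels_2d)
```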
Multiple samples can show the progression of disease. Here I am considering a toy example where a cell from a minor subclone in one tumor (think of the primary tumor or a pre-treatment tumor) seeds a second tumor (a metastasis or a post-treatment tumor). Figure 3 has hypothetical clonal evolution trees. You see that the one cell carries its evolutionary history with it. The second tumor doesn’t start ‘from scratch’ (from the normal tissue) but from an already mutated cancer genome, to which it adds even more mutations.
The right-most panel in Figure 3 compares the SNV frequencies between samples from the two tumors (given the cellular frequencies of clones annotated in the trees).
- Early SNVs (A and B) sit in the top right because they appear in almost all cells of both samples.
- SNV D appears in only 1% of cells in tumor 1 and is thus below the detection limit of most current sequencing technologies, whereas in tumor 2 it appears in all cells. If you don’t know the evolutionary history of the samples you could explain these SNV frequencies in two ways: (i) predict a minor sub-clone in tumor 1 or (ii) assume that an AB cell seeded tumor 2 and D was a very early event in tumor 2 development. In both cases you would have evidence to conclude (correctly) that D (but not necessarily A and B) is important for the transition that happened between tumor 1 and 2. You need data from both tumors for this conclusion. Had you only seen data from tumor 2, then A, B, and D would be indistinguishable because they all appear with the same frequency.
- If you don’t know the temporal order between the two samples, you could use the SNV data to infer it. The fact that D, E and F were found in tumor 2 but not tumor 1 would indicate that tumor 2 developed out of tumor 1 (cancer accumulates mutations). However, alternative branches in tumor 1 development (like C) can complicate this approach.
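The step from clone frequencies in a tree to SNV frequencies in a sample can be sketched in a few lines. Because every cell carries the SNVs of all its ancestors, the cellular frequency of an SNV is the summed fraction of the clone that gained it plus all of that clone's descendants. The tree shape and all clone fractions below are hypothetical stand-ins (only D at 1% in tumor 1 and at 100% in tumor 2 is taken from the text), and each clone is named after the SNV it gains.

```python
def snv_cellular_frequencies(parent, clone_fraction):
    """Cellular frequency of each SNV, given a clone tree and clone fractions.

    parent:         clone -> parent clone (None for the root clone)
    clone_fraction: clone -> fraction of cells in the sample belonging to it
    """
    freq = {snv: 0.0 for snv in parent}
    for clone, fraction in clone_fraction.items():
        node = clone
        while node is not None:      # walk up: the cell carries all ancestral SNVs
            freq[node] += fraction
            node = parent[node]
    return freq

# Hypothetical tree: A -> B, then B forks into C and D, and D -> E -> F.
parent = {"A": None, "B": "A", "C": "B", "D": "B", "E": "D", "F": "E"}

# Hypothetical clone fractions per tumor.
tumor1 = {"A": 0.10, "B": 0.49, "C": 0.40, "D": 0.01}
tumor2 = {"D": 0.30, "E": 0.50, "F": 0.20}

f1 = snv_cellular_frequencies(parent, tumor1)
f2 = snv_cellular_frequencies(parent, tumor2)
for snv in parent:
    print(snv, round(f1[snv], 2), round(f2[snv], 2))
```

In tumor 2 the SNVs A, B and D all end up with the same frequency, which is why data from tumor 2 alone cannot separate them.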
Multiple samples can help to find forks in trees. In Jiao et al 2014, Quaid Morris and his team describe topological constraints for evolutionary trees. I have summarized their examples in this table (using the numbers from their Figure 1):
The table illustrates three principles of building phylogenetic trees from SNV frequencies:
- With only a single sample, clones can always be ordered linearly (Sample 1).
- They can also be arranged as a fork, unless the sum of the frequencies of the child nodes is larger than the frequency of the parent node (Sample 1′; B+C = 60%+40% > 80% = A). Quaid calls this the ‘sum rule’. The same idea is called the pigeonhole principle in Nik-Zainal et al (2012) and is also implied in the constraints used by Strino et al (2013). Although this constraint already applies to a single sample, each additional sample is another opportunity for it to rule out a fork.
- The last constraint is specific to multiple samples: Quaid and his team describe the ‘crossing rule’, where a reversal in the frequency order of two clones between samples can only be explained by the clones sitting in independent branches (Sample 1+2). In this example there would be two linear chains, A -> B -> C in sample 1 and A -> C -> B in sample 2, which contradicts the assumption that we observe the same evolutionary process in both samples. The only possible solution is to conclude that B and C live in separate branches of a fork below A.
The pigeonhole principle also entails that if two separate clusters indeed sit on top of each other (like in Figure 2), their cellular frequency cannot exceed 50%. Otherwise at least one cell would have to carry both sets of SNVs; the clusters would then be nested, and nested clusters with identical frequencies are in fact a single cluster.
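The sum rule and the crossing rule can both be written as one-line checks. The sketch below is my own paraphrase of the two rules, not code from Jiao et al; the function names are invented, and apart from the 80%/60%/40% example the frequencies are hypothetical because the table values are not reproduced here.

```python
def fork_allowed(parent_cf, child_cfs, tol=1e-9):
    """Sum rule: children can sit in separate branches below a parent only if
    their cellular frequencies together do not exceed the parent's."""
    return sum(child_cfs) <= parent_cf + tol

def must_branch(cf_x, cf_y):
    """Crossing rule: if X is more frequent than Y in one sample and less
    frequent in another, neither can be an ancestor of the other.

    cf_x, cf_y: cellular frequencies of clusters X and Y, one value per sample.
    """
    x_above = any(x > y for x, y in zip(cf_x, cf_y))
    y_above = any(y > x for x, y in zip(cf_x, cf_y))
    return x_above and y_above

# Sample 1' from the text: B + C = 60% + 40% > 80% = A, so no fork.
print(fork_allowed(0.8, [0.6, 0.4]))

# Hypothetical two-sample frequencies: B > C in sample 1, C > B in sample 2.
print(must_branch([0.6, 0.2], [0.4, 0.3]))
```

Note that `must_branch` can only ever return True when at least two samples are supplied, which mirrors the point that the crossing rule is specific to multi-sample data.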
Multiple samples can be the nodes of a tree. Another advantage of having several samples per tumour/patient is that you can use them directly for tree-building, even if the resolution of the data is not high enough to reveal the clonal composition of each sample. In Schwarz et al (2014) we develop a distance-based approach to infer a phylogenetic tree from copy-number profiles of multiple tumour samples.
Comparing copy-number profiles is challenging (and in particular much more challenging than counting SNVs) because
- copy-number aberrations come in all sizes. Comparing two genomes base-by-base without taking the size of aberrations into account (in technical jargon: the horizontal dependencies) can give you a completely wrong estimate of how many changes happened between two genomes.
- copy-number aberrations can show complex cascading and overlapping patterns, which makes counting the number of changes even harder.
In our paper we show how to use finite-state transducers to tackle these two problems. A transducer is a machine that runs along two genomes to ‘translate’ one into the other and in the process counts the minimal number of changes required to do so. The method is called MEDICC, for Minimum Event Distance for Intra-tumour Copy-number Comparisons.
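To illustrate why event size matters, here is a toy comparison of a position-by-position count with a count of minimal events. This is not the MEDICC algorithm and uses no transducers: it assumes a much simpler model in which one event adds or removes exactly one copy across a contiguous run of segments, and it ignores phasing and any further constraints. Under that assumption the minimum number of events equals the total of the upward steps in the padded difference profile. The function names and profiles are made up for illustration.

```python
def positions_changed(a, b):
    """Naive distance: number of segments whose copy number differs."""
    return sum(x != y for x, y in zip(a, b))

def min_interval_events(a, b):
    """Minimum number of +1/-1 events on contiguous runs turning profile a into b.

    Simplified model (an assumption, not MEDICC): each event changes the copy
    number of one contiguous run of segments by exactly one.
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of segments")
    diff = [y - x for x, y in zip(a, b)]
    padded = [0] + diff + [0]
    # Every event creates one upward and one downward step in the difference
    # profile, so counting the upward steps counts the events.
    return sum(max(0, padded[i] - padded[i - 1]) for i in range(1, len(padded)))

# One gain spanning four segments: four positions differ, but it is one event.
normal = [2, 2, 2, 2, 2, 2]
gained = [2, 3, 3, 3, 3, 2]
print(positions_changed(normal, gained), min_interval_events(normal, gained))
```

The gap between the two numbers grows with the size of the aberration, which is the horizontal dependency described in the list above.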
Our approach also phases copy-number variants by assigning them to one of the two physical alleles such that the overall evolutionary distance is minimal (this is a heuristic, but it works). We also introduce summary statistics of tumor evolution that can be used (and are being used in a so far unpublished follow-up paper) to link tumor evolution to patient outcome.
The following figure shows an example of a tumor evolution tree for a patient with endometrioid cancer.
2b: Multiple mixed samples, phased
Section 2a turned out to be longer than expected, so it might be just as well that the only paper I can think of right now to discuss here is Sottoriva et al (2013), which was already covered in the last post.
I would welcome suggestions for other papers and approaches that fit into this section.
That’s all for now.
Acknowledgements: Thanks to Ke Yuan and Geoff Macintyre for feedback on drafts of this post. I have reused several of their ideas in the final version.