Do you remember the first post in this series, where we stated the intra-tumor phylogeny problem? No worries, if not – here it is again: Given a sample of the genomes of a heterogeneous tumor, identify the genetic clones and infer their evolutionary relationships.
Finally it’s time to announce our own approach to this problem, which has just come out in Genome Biology:
Yuan*, Sakoparnig*, Markowetz^, Beerenwinkel^. (2015). BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biology 2015, 16:36
BitPhylogeny stands for `Bayesian inference for intra-tumor phylogenies’ – I am sure we mention this in the paper somewhere but am hard pressed to put my finger on where exactly this is. Well, now you know. Links to the code are for example here on my webpage.
How does BitPhylogeny work?
Infinite mixture models
The starting point is the general and well-known framework of mixture models, in which we assume that the sample of genomic markers is a mixture of N clones, which appear with different frequencies (see Fig. 1).
But in contrast to the finite mixture model of Figure 1, BitPhylogeny uses a non-parametric Bayesian formulation (called a Dirichlet process) that automatically infers the number of clones (so we don’t have to specify this in advance) and -even better- connects the clones in a tree (so we also know their evolutionary relationships). And all of that in a single, joint approach!
The secret to success is called a ‘tree structured stick-breaking process’ and was introduced a couple of years ago in this paper: Adams, Ghahramani, Jordan, Tree-Structured Stick Breaking for Hierarchical Data, NIPS 2010 (or just Adams 2010 for short).
Trees are just broken sticks
Let’s do this step by step and first understand how to generate infinite mixture models. The idea is very simple. Think of all your data as a stick of length 1 (see upper panel in Figure 2). Now start breaking random pieces off (using a Beta distributed random variable). Each piece represents one mixture component (aka clone) and its length is the $pi$ in the mixture model equation in Figure 1. Since we break of pieces randomly, each execution of the process will generate a different number of differently sized mixture components/clones. With this generative process we can now use methods like MCMC to infer the number and sizes of components best supported by the data.
Tree structured stick breaking is an extension of this scheme. It introduces a second step. Every time you have broken off a piece (step 1), you break the remaining bit into several other pieces (step 2). You then break a piece off each of them and iterate the two steps. Again, MCMC is your friend to estimate the best tree.
The following figure (taken from Adams 2010) gives an overview of stick breaking and tree-structured stick breaking.
And if you want to have the whole thing explained again by a master, here is a link to Ryan Adams original talk at NIPS 2010:
|Tree-Structured Stick Breaking for Hierarchical Data presented by Ryan Prescott Adams at NIPS 2010. Watch out for the lawnmowers!|
I hope you watched the video until 10:35, where Adams has applied his method to image data and asks ‘What is at the top of the tree? Maybe something deeply philosophically interesting?’ Or maybe not, since in his case it’s lawnmowers — which gives him a big laugh from the audience.
Evolutionary models live on the edges of the tree
The problem is that Adams lacks a clear idea what the tree represents, except that there somehow is a hierarchy of some kind in the data. For us this is different. We model cancer evolution and have a pretty good idea what the trees should mean (as you can see from all the previous posts in this series).
In a classical Dirichlet mixture model you would sample a parameter $theta$ for every $pi$ independently. In a tree-structured mixture model, however, the parameters depend on each other via a `transition kernel’ (technical jargon, I know). A transition kernel is a conditional distribution that tells you how the parameters for each node look like given the parameters you have in the parent node. This is where you can encode all the information/assumptions you have about the process underlying the tree.
For example you can encode that SNVs that were present in the parent node will not be lost again. Our paper describes customized transition kernels for SNVs and for methylation changes, but many others are possible and we actively work on a version for copy number data.
The importance of phased data
What I have described so far looks pretty close to Quaid Morris’ PhyloSub paper. We had been working along the same lines for a while, but Quaid was quite a bit faster than we were. If you put the two approaches side by side, you will find lots of technical differences in how we set up the model and the evolutionary part (improvements we call them).
The biggest difference however is our choice of data. The frequencies of individual SNVs do only contain limited information about evolutionary history (a fact extensively discussed in Quaid’s paper), which is why we chose different types of data in our approach. We used methylation data and single cell SNV profiles as molecular markers. Both are examples of phased genomic markers and give much more information than unphased SNV frequencies.
Enough for today.
Readers who found this teaser interesting also read the original paper.
Yuan, K., Sakoparnig, T., Markowetz, F., & Beerenwinkel, N. (2015). BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies Genome Biology, 16 (1) DOI: 10.1186/s13059-015-0592-6
Ryan Prescott Adams, Zoubin Ghahramani, & Michael I. Jordan (2010). Tree-Structured Stick Breaking Processes for Hierarchical Data NIPS DOI: http://arxiv.org/abs/1006.1062