While working on the BitPhylogeny paper, we stumbled on the problem of how to compare trees of clonal evolution.
Clonal evolution trees combine a clustering of molecular markers with tree inference. There are methods to compare clusterings and methods to compare trees, but how do you compare both at the same time?
Here is how we did it:
Idea 1: Consensus node-based shortest-path distance
The idea we published in the BitPhylogeny paper is to restrict the comparison to the nodes that appear in all trees under consideration. For example in the figure below, the node defined by marker ‘C’ only appears in two out of three trees and is thus not taken into account. On the consensus nodes, compare the pairwise shortest-path distances in each tree. Finally compare the distance matrices (or upper-/lower-triangular parts, because they are symmetric) and compute the distance by the sum of absolute values of differences between matrix entries. The following figure gives an overview:
Idea 2: pairwise marker shortest-path distance
While the paper was in review, we came up with other ideas. Here is one that avoids using consensus nodes and uses pairs of markers instead:
- Start by only considering markers which are present in both trees you want to compare (if they both were computed from the same data, then this is generally all of them).
- For every pair of markers, compute their shortest-path distance in each tree: If two markers live in the same node, they have pairwise distance 0. If they live in separate nodes, count the number of edges (regardless of direction) between these nodes. This results in a symmetric marker-times-marker distance matrix for each tree.
- Finally, compare the marker-pair distance matrices by the sum of absolute values of all differences between matrix entries.
The following figure gives an overview of the procedure:
This definition still has some shortcomings, which we plan to address in the future. In particular, it doesn’t take the length of edges (= evolutionary time) into account, which should be easily fixed by computing the pairwise distances on a weighted graph. Additionally, the distance is defined on the undirected graph, but it might be very important whether or not both trees represent the same order of events, so somehow the direction of edges needs to be included.
Distances are badly needed
I think a discussion on how to compare clonal evolution trees is important for everybody who works on how to infer them. How else will we ever be able to compare them with each other or to a `gold standard’ (if such a thing even exists for our problem).
The folks working on the DREAM challenge on tumor heterogeneity must face similar problems. I wonder what they have come up with …
Yuan, K., Sakoparnig, T., Markowetz, F., & Beerenwinkel, N. (2015). BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies Genome Biology, 16 (1) DOI: 10.1186/s13059-015-0592-6