Project A4: Discrete topological optimization techniques for the statistical analysis of tree structures


Inference on tree structures is a fundamental task in many areas of biology, for example in the analysis of respiratory branching structures and brain artery trees, or in modeling evolutionary distances between genetic mutations. Comparing and synthesizing different models requires a notion of distance between trees and a way to select the best (in some sense) average tree from a given set of sample trees. One can work in the so-called BHV tree-space, which is composed of locally flat strata and is globally of non-positive curvature. This space offers many features benign to statistical analysis, e.g., unique means based on L2-distances between trees. In many realistic scenarios, however, the phenomenon of "stickiness" implies that these means may have no asymptotic limiting distribution, which blocks statistical inference and testing.

In the first cohort, we studied means based on L1-distances as an alternative and developed and analyzed fast algorithms for them, taking advantage of location theory. In the cohorts to follow, we investigate a biologically more realistic and mathematically more challenging representation of tree-space that, we anticipate, features less stickiness and is therefore statistically much more rewarding. To this end, we combine statistical tools with methods for optimization on manifolds to develop efficient algorithms for computing L2 and L1 mean trees. For instance, these will be based on suitable combinations of distances and objective functionals, yielding a tree that minimizes the sum of geodesic distances (L1) or of squared geodesic distances (L2) to a given sample (distribution) of trees.
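
To make the difference between L2 and L1 mean trees concrete, the following is a minimal sketch under a strong simplifying assumption: all sample trees share one topology, so they lie in a single flat stratum of BHV tree-space and the geodesic distance reduces to the Euclidean distance between edge-length vectors. In that case the L2 mean is the coordinate-wise average, and the L1 mean is the geometric median, approximated here with the classical Weiszfeld iteration from location theory. The function names `l2_mean`, `l1_mean`, and `dist` are illustrative and not taken from the project; means across several strata require genuinely different algorithms and are not covered.

```python
import math


def dist(a, b):
    """Euclidean distance between two edge-length vectors (same topology assumed)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_mean(trees):
    """Minimizer of the sum of squared distances: the coordinate-wise average."""
    n = len(trees)
    return [sum(t[i] for t in trees) / n for i in range(len(trees[0]))]


def l1_mean(trees, iters=200, eps=1e-12):
    """Approximate minimizer of the sum of distances via Weiszfeld iteration."""
    y = l2_mean(trees)  # start from the average
    for _ in range(iters):
        # Inverse-distance weights; distances are clamped at eps so that an
        # iterate landing on a sample point does not cause division by zero.
        weights = [1.0 / max(dist(y, t), eps) for t in trees]
        total = sum(weights)
        y = [
            sum(w * t[i] for w, t in zip(weights, trees)) / total
            for i in range(len(y))
        ]
    return y


def cost(center, trees, power):
    """Objective value: sum of distances raised to the given power."""
    return sum(dist(center, t) ** power for t in trees)


if __name__ == "__main__":
    # Three hypothetical trees with two interior edges each; two coincide.
    sample = [[1.0, 2.0], [1.0, 2.0], [4.0, 6.0]]
    print("L2 mean:", l2_mean(sample))
    print("L1 mean:", l1_mean(sample))
```

In this toy sample the L1 mean is pulled toward the two coinciding trees, while the L2 mean is shifted toward the outlying third tree, which illustrates why the two notions of an average tree can differ.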

Methods: variational analysis in metric spaces, Bayesian methods, dimensionality reduction
Applications: phylogenetic trees for species families and microbiomes