

Abstract

In this work, we present two distinct applications of predictive modeling within the domain of phylogenetic inference and placement. Phylogenetic placements aim to place new entities into a given phylogenetic tree. While there exist efficient implementations for producing phylogenetic placements, the underlying reasons why particular placements are more difficult to perform than others are unknown. In the first use case, we focus on the prediction of the difficulty of those phylogenetic placements. We developed Bold Assertor of Difficulty (BAD). BAD can predict the placement difficulty between 0 (easy) and 1 (hard) with high accuracy. On a set of 3000 metagenomic placements, we obtain a mean absolute error of 0.13. Using SHapley Additive exPlanations (SHAP), BAD can help biologists understand the challenges associated with placing specific sequences into a reference phylogeny during metagenomic studies. Estimating the statistical robustness of the inferred phylogenetic tree constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The most widely used method for calculating branch support on trees inferred under maximum likelihood is the Standard, non-parametric Felsenstein Bootstrap Support (SBS). The SBS method is computationally costly, leading to the development of alternative approaches such as Rapid Bootstrap and UltraFast Bootstrap 2 (UFBoot2). The second use case of this work is concerned with the fast machine learning-based approximation of those SBS values. Our SBS predictor, Educated Bootstrap Guesser (EBG), is on average 9.4 (σ = 5.5) times faster than the major competitor UFBoot2 and provides an SBS estimate with a median absolute error of 5 when predicting SBS values between 0 and 100.

Authors: Julius Wiegert, Julia Haag, Dimitri Höhler, Alexandros Stamatakis

Date Published: 7th Apr 2024

Publication Type: Master's Thesis
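
The abstract above reports a mean absolute error (for BAD) and a median absolute error (for EBG). As a minimal sketch of how these two summaries differ, the following uses only the Python standard library. The function names and the toy support values are illustrative assumptions and are not taken from the thesis.

```python
from statistics import mean, median


def absolute_errors(true_values, predicted_values):
    """Return the per-item absolute differences |true - predicted|."""
    if len(true_values) != len(predicted_values):
        raise ValueError("inputs must have the same length")
    return [abs(t - p) for t, p in zip(true_values, predicted_values)]


def mean_absolute_error(true_values, predicted_values):
    """Average of the absolute errors; sensitive to a few large misses."""
    return mean(absolute_errors(true_values, predicted_values))


def median_absolute_error(true_values, predicted_values):
    """Middle absolute error; robust against a few large misses."""
    return median(absolute_errors(true_values, predicted_values))


# Hypothetical toy data: four branch support values on a 0..100 scale.
toy_true = [100, 80, 45, 10]
toy_predicted = [95, 88, 45, 20]

toy_mean_error = mean_absolute_error(toy_true, toy_predicted)
toy_median_error = median_absolute_error(toy_true, toy_predicted)
```

The median is less affected by a handful of badly predicted items than the mean, so the two numbers need not be close for the same set of predictions.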

Abstract

Estimating the statistical robustness of the inferred tree(s) constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The most widely used method for calculating branch support on trees inferred under Maximum Likelihood (ML) is the Standard, non-parametric Felsenstein Bootstrap Support (SBS). Due to the high computational cost of the SBS, a plethora of methods has been developed to approximate it, for instance, via the Rapid Bootstrap (RB) algorithm. There have also been attempts to devise faster, alternative support measures, such as the SH-aLRT (Shimodaira–Hasegawa-like approximate Likelihood Ratio Test) or the UltraFast Bootstrap 2 (UFBoot2) method. Those faster alternatives exhibit some limitations, such as the need to assess model violations (UFBoot2) or meaningless low branch support intervals (SH-aLRT). Here, we present the Educated Bootstrap Guesser (EBG), a machine learning-based tool that predicts SBS branch support values for a given input phylogeny. EBG is on average 9.4 (σ = 5.5) times faster than UFBoot2. EBG-based SBS estimates exhibit a median absolute error of 5 when predicting SBS values between 0 and 100. Furthermore, EBG also provides uncertainty measures for all per-branch SBS predictions and thereby allows for a more rigorous and careful interpretation. EBG can predict SBS support values on a phylogeny comprising 1654 SARS-CoV2 genome sequences within 3 hours on a mid-class laptop. EBG is available under GNU GPL3.

Authors: Julius Wiegert, Dimitri Höhler, Julia Haag, Alexandros Stamatakis

Date Published: 6th Mar 2024

Publication Type: Journal
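
For readers unfamiliar with the quantity that EBG predicts, the standard Felsenstein bootstrap mentioned in the abstract above can be sketched roughly as follows: alignment columns are resampled with replacement, a tree is inferred for each replicate, and the support of a branch is the percentage of replicate trees that contain it. This is a simplified illustration, not EBG's or any tool's implementation. Tree inference itself is left out, and the representation of a tree as a set of bipartitions (each a `frozenset` of taxon names, canonicalised consistently) as well as all function names are assumptions made for this sketch.

```python
import random


def bootstrap_columns(alignment, rng):
    """Build one bootstrap replicate by sampling columns with replacement.

    `alignment` maps taxon name -> aligned sequence (all of equal length).
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def branch_support(reference_splits, replicate_split_sets):
    """Percentage (0..100) of replicate trees containing each reference split."""
    n_replicates = len(replicate_split_sets)
    if n_replicates == 0:
        raise ValueError("at least one replicate tree is required")
    return {
        split: 100.0 * sum(split in rep for rep in replicate_split_sets) / n_replicates
        for split in reference_splits
    }


# Hypothetical example: a 5-taxon tree has two inner branches (splits).
ab = frozenset({"A", "B"})
cd = frozenset({"C", "D"})
ce = frozenset({"C", "E"})

reference = {ab, cd}
replicates = [{ab, cd}, {ab, cd}, {ab, ce}, {ab, ce}]  # trees from 4 replicates

support = branch_support(reference, replicates)
```

In this toy case the split `ab` appears in all four replicate trees and `cd` in two of them. In practice each replicate requires a full tree inference, which is where the computational cost described in the abstract comes from.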

Abstract

Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better.
Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).

Authors: Anastasis Togkousidis, Oleksiy M Kozlov, Julia Haag, Dimitri Höhler, Alexandros Stamatakis

Date Published: 1st Oct 2023

Publication Type: Journal
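
The three observations in the abstract above suggest a simple dispatch on the predicted difficulty. The following is only a schematic sketch of that idea and does not reproduce the heuristic implemented in RAxML-NG. The assumption that difficulty lies between 0 and 1, the two thresholds, the number of starting trees, and all names (`SearchPlan`, `plan_search`) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchPlan:
    """Hypothetical description of how thorough a tree search should be."""
    category: str
    n_starting_trees: int
    stop_early: bool


def plan_search(difficulty, easy_below=0.3, hard_above=0.7):
    """Map a predicted difficulty to a search plan.

    Thresholds and tree counts are made-up placeholders, not values
    from the publication.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty is assumed to lie in [0, 1]")
    if difficulty < easy_below:
        # Easy: searches converge rapidly, so they can be terminated early.
        return SearchPlan("easy", n_starting_trees=2, stop_early=True)
    if difficulty > hard_above:
        # Hard: quickly infer just one of many almost equally likely trees.
        return SearchPlan("hard", n_starting_trees=1, stop_early=True)
    # Intermediate: a more extensive search is justified.
    return SearchPlan("intermediate", n_starting_trees=10, stop_early=False)


plans = [plan_search(d) for d in (0.1, 0.5, 0.9)]
```

The point of the sketch is the shape of the decision: thoroughness is highest in the middle of the difficulty range and reduced at both ends, which is where the abstract reports the largest speedups.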

Abstract

Motivation: Simulating Multiple Sequence Alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools, and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical. Results: Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition. Data and Code Availability: All simulated and empirical MSAs, as well as all analysis results, are available at . All scripts required to reproduce our results are available at and .

Authors: Johanna Trost, Julia Haag, Dimitri Höhler, Laurent Jacob, Alexandros Stamatakis, Bastien Boussau

Date Published: 12th Jul 2023

Publication Type: Journal

Abstract

One of the most fundamental unanswered questions that has been bothering mankind during the Anthropocene is whether the use of swearwords in open source code is positively or negatively correlated with source code quality. To investigate this profound matter, we crawled and analysed over 3800 C open source code containing English swearwords and over 7600 C open source code not containing swearwords from GitHub. Subsequently, we quantified the adherence of these two distinct sets of source code to coding standards, which we deploy as a proxy for source code quality, via the SoftWipe tool developed in our group. We find that open source code containing swearwords exhibits significantly better code quality than code not containing swearwords under several statistical tests. We hypothesise that the use of swearwords constitutes an indicator of a profound emotional involvement of the programmer with the code and its inherent complexities, thus yielding better code based on a thorough, critical, and dialectic code analysis process.

Authors: Jan Strehmel, Ben Bettisworth, Dimitri Höhler, Alexandros Stamatakis

Date Published: 1st Feb 2023

Publication Type: Bachelor's Thesis

Abstract

Phylogenetic analyses under the Maximum-Likelihood (ML) model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating ML-based tree inferences. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.

Authors: Julia Haag, Dimitri Höhler, Ben Bettisworth, Alexandros Stamatakis

Date Published: 1st Dec 2022

Publication Type: Journal

Abstract

Summary: The evaluation of phylogenetic inference tools is commonly conducted on simulated and empirical sequence data alignments. An open question is how representative these alignments are with respect to those commonly analyzed by users. Based upon the RAxMLGrove database, it is now possible to simulate DNA sequences based on more than 70,000 representative RAxML and RAxML-NG tree inferences on empirical datasets conducted on the RAxML web servers. This allows assessing the phylogenetic tree inference accuracy of various inference tools based on realistic and representative simulated DNA alignments. We simulated 20,000 MSAs based on representative datasets (in terms of signal strength) from RAxMLGrove, and used 5,000 datasets from the TreeBASE database, to assess the inference accuracy of FastTree2, IQ-TREE2, and RAxML-NG. We find that on quantifiably difficult-to-analyze MSAs all of the analysed tools perform poorly, such that the quicker FastTree2 can constitute a viable alternative to infer trees. We also find that there are substantial differences between accuracy results on simulated and empirical data, despite the fact that a substantial effort was undertaken to simulate sequences under settings that are as realistic as possible. Contact: Dimitri Höhler

Authors: Dimitri Höhler, Julia Haag, Alexey M. Kozlov, Alexandros Stamatakis

Date Published: 1st Nov 2022

Publication Type: Journal
