Publications

153 Publications visible to you, out of a total of 153

Abstract

In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as the major data source for phylogenetic reconstruction in linguistics, although a few studies exist in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer to the gold standard phylogenies than phylogenies reconstructed from sound correspondences, by approximately one third on average with respect to the generalized quartet distance.

Authors: Luise Häuser, Gerhard Jäger, Johann-Mattis List, Taraka Rama, Alexandros Stamatakis

Date Published: 22nd Mar 2024

Publication Type: Proceedings

Abstract

Motivation: Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership in a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower dimensional representation of genotype data, and the uncertainty of such an analysis should be reported in all studies. To date, however, there exists no stability estimation technique for genotype data that can estimate this uncertainty. Results: Here, we present Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, computes per-individual support values, and deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. In addition to this bootstrap-based stability estimation, Pandora offers a sliding-window stability estimation for whole-genome data. Using published empirical and simulated datasets, we demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques. Data and Code Availability: Pandora is available on GitHub at https://github.com/tschuelia/Pandora. All Python scripts and data to reproduce our results are available on GitHub at https://github.com/tschuelia/PandoraPaper.

Authors: Julia Haag, Alexander I. Jordan, Alexandros Stamatakis

Date Published: 15th Mar 2024

Publication Type: Journal

Abstract

Estimating the statistical robustness of the inferred tree(s) constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The most widely used method for calculating branch support on trees inferred under Maximum Likelihood (ML) is the Standard, non-parametric Felsenstein Bootstrap Support (SBS). Due to the high computational cost of the SBS, a plethora of methods has been developed to approximate it, for instance, via the Rapid Bootstrap (RB) algorithm. There have also been attempts to devise faster, alternative support measures, such as the SH-aLRT (Shimodaira–Hasegawa-like approximate Likelihood Ratio Test) or the UltraFast Bootstrap 2 (UFBoot2) method. Those faster alternatives exhibit some limitations, such as the need to assess model violations (UFBoot2) or meaningless low branch support intervals (SH-aLRT). Here, we present the Educated Bootstrap Guesser (EBG), a machine learning-based tool that predicts SBS branch support values for a given input phylogeny. EBG is on average 9.4 (σ = 5.5) times faster than UFBoot2. EBG-based SBS estimates exhibit a median absolute error of 5 when predicting SBS values between 0 and 100. Furthermore, EBG also provides uncertainty measures for all per-branch SBS predictions and thereby allows for a more rigorous and careful interpretation. EBG can predict SBS support values on a phylogeny comprising 1654 SARS-CoV2 genome sequences within 3 hours on a mid-class laptop. EBG is available under GNU GPL3.

Authors: Julius Wiegert, Dimitri Höhler, Julia Haag, Alexandros Stamatakis

Date Published: 6th Mar 2024

Publication Type: Journal
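
The error metric quoted in the EBG abstract above (a median absolute error on SBS values between 0 and 100) can be illustrated with a minimal sketch. The function name and the support values below are hypothetical and are not taken from the paper or from the EBG code; the sketch only shows how such a metric is computed.

```python
from statistics import median


def median_absolute_error(true_support, predicted_support):
    """Return the median of |true - predicted| over all inner branches."""
    if len(true_support) != len(predicted_support):
        raise ValueError("both lists must hold one value per branch")
    if not true_support:
        raise ValueError("need at least one branch")
    return median(abs(t - p) for t, p in zip(true_support, predicted_support))


# Hypothetical SBS values (0 to 100) for five inner branches; illustrative only.
true_sbs = [100, 87, 42, 65, 10]
predicted_sbs = [96, 90, 51, 60, 12]

# Absolute errors are 4, 3, 9, 5, 2, so the median is 4.
print(median_absolute_error(true_sbs, predicted_sbs))
```

Because the median is used rather than the mean, a single badly predicted branch (the error of 9 here) does not dominate the score.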

Abstract

Not specified

Authors: Luc Mercatoris, Alexandros Stamatakis

Date Published: 1st Feb 2024

Publication Type: Master's Thesis

Abstract

Accurately reconstructing the evolutionary history of a group of organisms is a complex task. Current state-of-the-art tools produce phylogenetic tree distributions with Markov chain Monte Carlo (MCMC) methods by sampling the posterior tree distribution under a given model to reflect uncertainties in the underlying models and data. While these distributions offer very good insight into the phylogenetic history, they are very compute-intensive. In this thesis, we present and evaluate multiple heuristics to approximate these distributions with distance-based methods. To judge the quality of our heuristics, we compare our distribution against a reference MCMC-based distribution with split and frequency-based metrics. We show that our method works well for some types of data, but not all, compared to other tools, and that further information about the data needs to be incorporated to make this viable in practice. Our most successful method is characterized by the use of pair-wise distance distributions to apply likelihood-supported perturbation to the input distances for the Neighbor Joining algorithm. Because this ignores the interdependencies between distances, we need to add parsimony filtering as a post-processing step to eliminate unlikely trees from our distributions, which significantly improves the results. Finally, we also discuss the shortcomings and future potential of our heuristics to more accurately estimate pair-wise distances and their interdependencies, which should lead to more competitive results.

Authors: Noah Wahl, Benoit Morel, Alexandros Stamatakis

Date Published: 1st Dec 2023

Publication Type: Master's Thesis

Abstract

Motivation: Genomes are a rich source of information on the pattern and process of evolution across biological scales. How best to make use of that information is an active area of research in phylogenetics. Ideally, phylogenetic methods should not only model substitutions along gene trees, which explain differences between homologous gene sequences, but also the processes that generate the gene trees themselves along a shared species tree. To conduct accurate inferences, one needs to account for uncertainty at both levels, that is, in gene trees estimated from inherently short sequences and in their diverse evolutionary histories along a shared species tree. Results: We present AleRax, a software that can infer reconciled gene trees together with a shared species tree using a simple, yet powerful, probabilistic model of gene duplication, transfer, and loss. A key feature of AleRax is its ability to account for uncertainty in the gene tree and its reconciliation by using an efficient approximation to calculate the joint phylogenetic-reconciliation likelihood and sample reconciled gene trees accordingly. Simulations and analyses of empirical data show that AleRax is one order of magnitude faster than competing gene tree inference tools while attaining the same accuracy. It is consistently more robust than species tree inference methods such as SpeciesRax and ASTRAL-Pro 2 under gene tree uncertainty. Finally, AleRax can process multiple gene families in parallel, thereby allowing users to compare competing phylogenetic hypotheses and estimate model parameters, such as DTL probabilities, for genome-scale datasets with hundreds of taxa. Availability and Implementation: GNU GPL at https://github.com/BenoitMorel/AleRax and data are made available at https://cme.h-its.org/exelixis/material/alerax_data.tar.gz. Contact: Benoit.Morel@h-its.org. Supplementary information: Supplementary material is available.

Authors: Benoit Morel, Tom A. Williams, Alexandros Stamatakis, Gergely J. Szöllősi

Date Published: 7th Oct 2023

Publication Type: Journal

Abstract

Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better.
Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).

Authors: Anastasis Togkousidis, Oleksiy M Kozlov, Julia Haag, Dimitri Höhler, Alexandros Stamatakis

Date Published: 1st Oct 2023

Publication Type: Journal
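
The three observations behind the adaptive heuristic in the abstract above can be sketched as a simple decision rule. This is an illustration only, not RAxML-NG code: the difficulty scale from 0 to 1, the two cut-off values, and all names are assumptions, since the abstract states no thresholds.

```python
from enum import Enum


class SearchMode(Enum):
    FAST = "fast"          # stop early, or settle for one of many similar trees
    THOROUGH = "thorough"  # spend more effort exploring the likelihood surface


# Hypothetical cut-offs on an assumed difficulty score in [0, 1].
EASY_BELOW = 0.3
DIFFICULT_ABOVE = 0.7


def choose_search_mode(difficulty: float) -> SearchMode:
    """Map a predicted difficulty to a search thoroughness."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty is assumed to lie in [0, 1]")
    if difficulty < EASY_BELOW:
        # Observation 1: easy datasets converge rapidly.
        return SearchMode.FAST
    if difficulty > DIFFICULT_ABOVE:
        # Observation 2: overanalyzing difficult datasets does not pay off.
        return SearchMode.FAST
    # Observation 3: intermediate datasets justify more extensive searches.
    return SearchMode.THOROUGH


for score in (0.1, 0.5, 0.9):
    print(score, choose_search_mode(score).value)
```

The point of the sketch is that search effort is not monotonic in difficulty: both ends of the scale get the cheaper search, and only the middle range gets the expensive one.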

Copyright © 2008 - 2024 The University of Manchester and HITS gGmbH