Publications


Abstract

In this thesis, Spearfish, a new method for distance-based inference of gene trees, is developed and tested. Spearfish uses the pairwise distances between the gene sequences, together with the distances between the corresponding species in the species tree, in a clustering procedure to reconstruct 10 gene trees. The best one is then selected by means of a statistical evaluation procedure. On all simulated datasets tested, the trees inferred by Spearfish have an average distance of 0.213 to the true gene tree. This makes it 2.18 times more accurate than methods such as RAxML-NG, which do not take the species tree into account. Spearfish is 25.85% less accurate but 49.63% faster than GeneRax, one of the leading methods that correct gene trees using their species tree. Spearfish can therefore be used to reconstruct starting trees for GeneRax or, for large datasets, even to replace it.

Authors: Lukas Knirsch, Benoit Morel, Alexandros Stamatakis

Date Published: 2nd Oct 2025

Publication Type: Bachelor's Thesis

Abstract

Not specified

Authors: Alexander Suhrkamp, Alexandros Stamatakis

Date Published: 1st Dec 2024

Publication Type: Master's Thesis

Abstract

Not specified

Authors: Eric Laudemann, Alexandros Stamatakis

Date Published: 1st Oct 2024

Publication Type: Master's Thesis

Abstract

Not specified

Authors: Erik Borker, Alexandros Stamatakis

Date Published: 1st Sep 2024

Publication Type: Master's Thesis

Abstract

In the field of population genetics, the driving forces of evolution within species can be studied with trees. Along a genome, each tree describes the local ancestries of a small genomic region. Together, those trees form a tree sequence that describes the ancestry of a population at every site of the sequence. Inferring tree sequences for whole genomes with many haplotype samples is a computationally expensive task, however. The state-of-the-art tool to infer tree sequences is tsinfer, which infers ancestries for human chromosomes from 5000 samples within a few hours. The tool has the capability to parallelize the computation, but we identify a structure in the input data that limits its parallelizability. We propose a novel parallelization scheme aiming to improve scaling at high thread counts, independently of this structure. Furthermore, we propose several optimizations for the inference algorithm, improving cache efficiency and reducing the number of operations per iteration. We provide a proof-of-concept implementation, and compare the computation speed of our implementation and tsinfer. When inferring ancestries for the 1000 Genomes Project, our implementation is consistently faster by a factor of 1.9 to 2.4. Additionally, depending on the choice of parameters, our parallelization scheme scales better between 32 and 96 cores, improving its speed advantage, especially at higher core counts. In phases where our novel parallelization scheme does not apply, our optimizations still improve the runtime by a factor of 2.2. As available genomic data sets are growing rapidly in size, our contribution decreases the computation time and enables better parallelization, allowing the processing of larger data sets in reasonable time frames.

Authors: Johannes Hengstler, Lukas Hübner, Alexandros Stamatakis

Date Published: 1st Aug 2024

Publication Type: Journal

Abstract

Maximum Likelihood (ML) based phylogenetic inference constitutes a challenging optimization problem. Given a set of aligned input sequences, phylogenetic inference tools strive to determine the tree topology, the branch-lengths, and the evolutionary parameters that maximize the phylogenetic likelihood function. However, there exist compelling reasons to not push optimization to its limits, by means of early, yet adequate stopping criteria. Since input sequences are typically subject to stochastic and systematic noise, one should exhibit caution regarding (over-)optimization and the inherent risk of overfitting the model to noisy input data. To this end, we propose, implement, and evaluate four statistical early stopping criteria in RAxML-NG that evade excessive and compute-intensive (over-)optimization. These generic criteria can seamlessly be integrated into other phylogenetic inference tools while not decreasing tree accuracy. The first two criteria quantify input data-specific sampling noise to derive a stopping threshold. The third employs the Kishino-Hasegawa (KH) test to statistically assess the significance of differences between intermediate trees before and after major optimization steps in RAxML-NG. The optimization terminates early when improvements are insignificant. The fourth method utilizes multiple testing correction in the KH test. We show that all early stopping criteria infer trees that are statistically equivalent compared to inferences without early stopping. In conjunction with a necessary simplification of the standard RAxML-NG tree search heuristic, the average inference times on empirical and simulated datasets are ∼3.5 and ∼1.8 times faster, respectively, than for standard RAxML-NG v.1.2. The four stopping criteria have been implemented in RAxML-NG and are available as open source code under GNU GPL at https://github.com/togkousa/raxml-ng.

Authors: Anastasis Togkousidis, Alexandros Stamatakis, Olivier Gascuel

Date Published: 8th Jul 2024

Publication Type: Journal

Abstract

Working with cognate data involves handling synonyms, that is, multiple words that describe the same concept in a language. In the early days of language phylogenetics it was recommended to select one synonym only. However, as we show here, binary character matrices, which are used as input for computational methods, do allow for representing the entire dataset including all synonyms. Here we address the question of how one can, and whether one should, include all synonyms, or whether it is preferable to select synonyms a priori. To this end, we perform maximum likelihood tree inferences with the widely used RAxML-NG tool and show that it yields plausible trees when all synonyms are used as input. Furthermore, we show that a priori synonym selection can yield topologically substantially different trees and we therefore advise against doing so. To represent cognate data including all synonyms, we introduce two types of character matrices beyond the standard binary ones: probabilistic binary and probabilistic multi-valued character matrices. We further show that which character matrix type yields the inferred RAxML-NG tree topologically closest to the gold standard is dataset-dependent. We also make available a Python interface for generating all of the above character matrix types for cognate data provided in CLDF format.

Authors: Luise Häuser, Gerhard Jäger, Alexandros Stamatakis

Date Published: 28th Jun 2024

Publication Type: Proceedings
