Publications


Abstract

Working with cognate data involves handling synonyms, that is, multiple words that describe the same concept in a language. In the early days of language phylogenetics it was recommended to select one synonym only. However, as we show here, binary character matrices, which are used as input for computational methods, do allow for representing the entire dataset, including all synonyms. Here we address the question of how one can, and whether one should, include all synonyms, or whether it is preferable to select synonyms a priori. To this end, we perform maximum likelihood tree inferences with the widely used RAxML-NG tool and show that it yields plausible trees when all synonyms are used as input. Furthermore, we show that a priori synonym selection can yield topologically substantially different trees, and we therefore advise against doing so. To represent cognate data including all synonyms, we introduce two types of character matrices beyond the standard binary ones: probabilistic binary and probabilistic multi-valued character matrices. We further show that it is dataset-dependent for which character matrix type the inferred RAxML-NG tree is topologically closest to the gold standard. We also make available a Python interface for generating all of the above character matrix types for cognate data provided in CLDF format.
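
The encoding described above is easy to picture with a small sketch: the snippet below builds a plain binary character matrix from toy cognate data while keeping all synonyms, so a language with two synonyms for a concept scores 1 in two cognate-class columns. This is only an illustration; it is not the paper's Python interface, and the languages, concepts, and cognate-class IDs are invented.

```python
# Illustrative sketch only (not the paper's Python interface): build a binary
# character matrix from toy cognate data, keeping all synonyms.
# Each cognate class is one binary character; a language scores 1 if any of
# its words (including synonyms) belongs to that class.

# Toy data: language -> concept -> set of cognate-class IDs (all invented).
cognates = {
    "LangA": {"hand": {"hand-1"}, "water": {"water-1"}},
    "LangB": {"hand": {"hand-1", "hand-2"}, "water": {"water-2"}},  # two synonyms for "hand"
    "LangC": {"hand": {"hand-2"}, "water": {"water-1"}},
}

# All cognate classes, i.e. the binary characters, in a stable column order.
classes = sorted({c for concepts in cognates.values()
                    for cs in concepts.values() for c in cs})

# One row per language, one column per cognate class.
for lang, concepts in cognates.items():
    present = set().union(*concepts.values())
    row = "".join("1" if c in present else "0" for c in classes)
    print(lang, row)   # LangB gets two 1s in the "hand" columns
```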

Authors: Luise Häuser, Gerhard Jäger, Alexandros Stamatakis

Date Published: 28th Jun 2024

Publication Type: Proceedings

Abstract

The Message-Passing Interface (MPI) and C++ form the backbone of high-performance computing and algorithmic research in the field of distributed-memory computing, but MPI only provides C and Fortran bindings. These ensure good language interoperability, but higher-level programming languages make development quicker and less error-prone. We propose novel C++ language bindings designed to cover the whole range of abstraction levels from low-level MPI calls to convenient STL-style bindings, where most parameters are inferred from a small subset of the full parameter set. This allows for both rapid prototyping and fine-tuning of distributed code with predictable runtime behavior and memory management. Using template metaprogramming, only the code paths required for computing missing parameters are generated at compile time, which results in (near) zero-overhead bindings.
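
The contrast between abstraction levels can be pictured with a loose Python analogy: mpi4py likewise offers buffer-based calls with explicit parameters next to high-level calls whose parameters are inferred from the passed objects. The sketch below only illustrates that contrast; the publication's bindings are C++, and this is not their API.

```python
# Loose analogy in Python via mpi4py (not the proposed C++ bindings):
# a low-level, buffer-based collective versus a high-level call with
# parameters inferred from the Python objects.
# Run with e.g.: mpirun -n 4 python this_script.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Low-level style: buffers, element type, and root are given explicitly.
sendbuf = np.full(4, rank, dtype="i")
recvbuf = np.empty(4, dtype="i") if rank == 0 else None
comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)

# High-level style: count and type are inferred from the object being reduced.
total = comm.reduce(rank, op=MPI.SUM, root=0)

if rank == 0:
    print(recvbuf, total)
```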

Authors: Kunal Agrawal, Erez Petrank, Demian Hespe, Lukas Hübner, Florian Kurpicz, Peter Sanders, Matthias Schimek, Daniel Seemaier, Tim Niklas Uhl

Date Published: 17th Jun 2024

Publication Type: Proceedings

Abstract

In the analysis of DNA sequencing data for finding disease-causing mutations, understanding evolutionary relationships between species, and finding variants, DNA reads are compared to a reference genome. A reference genome is a representative example of the gene set of a species. Sorting these aligned DNA reads by their position within the reference sequence is a crucial step in many of these downstream analyses. SAMtools sort, a widely used tool, performs external memory sorting of aligned DNA reads stored in the BAM format (Binary Alignment Map). This format allows for compressed storage of alignment data. SAMtools sort provides the most comprehensive set of features while exhibiting demonstrably faster execution times than its open source alternatives. In this work, we analyze SAMtools sort for sorting BAM files and propose methods to reduce its runtime. We divide the analysis into three parts: management of temporary files, compression, and input/output (IO). For the management of temporary files, we find that the maximum number of temporary files SAMtools sort can open concurrently is lower than the maximum number of open files permitted by the operating system. This results in an unnecessarily high number of merges of temporary files into larger temporary files, introducing overhead as SAMtools sort performs extra write and compression operations. To overcome this, we propose a dynamic limit for the number of temporary files, adapting to the operating system's soft limit for open files. For compression, we test seven different libraries for compatible compression and a range of compression levels, identifying options that offer faster compression and result in a speedup of up to five times in single-threaded execution of SAMtools sort. For IO, we demonstrate that a minimal level of compression avoids IO overhead, thereby reducing the runtime of SAMtools sort compared to uncompressed output. However, we also show that uncompressed output can be used in the pipelining of SAMtools commands to reduce the runtime of subsequent SAMtools commands. Our proposed modifications to SAMtools sort and user behavior have the potential to achieve speedups of up to 6×. This represents an important contribution to the field of bioinformatics, considering the widespread adoption of SAMtools sort, evidenced by its over 5,000 citations and over 5.1 million downloads through Bioconda.
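
The dynamic temporary-file limit proposed above boils down to querying the operating system's soft limit for open file descriptors and deriving the number of concurrently open temporary files from it, as in the sketch below. This is not SAMtools code, and the reserve of 32 descriptors for other I/O is an assumed safety margin.

```python
# Sketch of a dynamic limit for concurrently open temporary sort files,
# adapted to the operating system's soft limit for open files.
# Not SAMtools code; the reserved descriptor count is an assumption.

import resource

def max_temp_files(reserved_fds: int = 32) -> int:
    """Temporary files that may be open at once, leaving headroom for
    input, output, and other descriptors the process needs."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return max(soft_limit - reserved_fds, 1)

print(max_temp_files())  # e.g. 992 on a system with a soft limit of 1024
```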

Authors: Dominik Siebelt, Lukas Hübner, Alexandros Stamatakis

Date Published: 3rd Jun 2024

Publication Type: Bachelor's Thesis

Abstract

The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and – in conjunction with recent advances in ancient DNA sequencing technology – studying human migration patterns over the past few thousand years. The basic toolbox of population genetics includes genealogical trees, which describe the shared evolutionary history among individuals of the same species. They are calculated on the basis of genetic variations. However, in recombining organisms, a single tree is insufficient to describe the evolutionary history of the whole genome. Instead, a collection of correlated trees can be used, where each describes the evolutionary history of a consecutive region of the genome. The current state-of-the-art data structure, tree sequences, compresses these genealogical trees via edit operations when moving from one tree to the next along the genome instead of storing the full, often redundant, description for each tree. We propose a new data structure, genealogical forests, which compresses the set of genealogical trees into a DAG. In this DAG, identical subtrees that are shared across the input trees are encoded only once, thereby allowing for straightforward memoization of intermediate results. Additionally, we provide a C++ implementation of our proposed data structure, called gfkit, which is 2.1 to 11.2 (median 4.0) times faster than the state-of-the-art tool on empirical and simulated datasets at computing important population genetics statistics such as the Allele Frequency Spectrum, Patterson's f, the Fixation Index, Tajima's D, pairwise Lowest Common Ancestors, and others. On Lowest Common Ancestor queries with more than two samples as input, gfkit scales asymptotically better than the state-of-the-art, and is thus up to 990 times faster. In conclusion, our proposed data structure compresses genealogical trees by storing shared subtrees only once, thereby enabling straightforward memoization of intermediate results, yielding a substantial runtime reduction and a potentially more intuitive data representation compared to the state-of-the-art. Our improvements will boost the development of novel analyses and models in the field of population genetics and increase scalability to ever-growing genomic datasets.

2012 ACM Subject Classification: Applied computing → Computational genomics; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics; Applied computing → Population genetics
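
The central idea, storing each subtree that several trees share only once and memoizing per-subtree results on the resulting DAG, can be sketched in a few lines. The snippet below is an illustration of that principle, not the gfkit implementation; the hash-consing scheme and the samples_below statistic are chosen purely for simplicity.

```python
# Minimal sketch (not gfkit): identical subtrees occurring in several trees
# are stored only once in a DAG, so per-subtree results are computed a single
# time and reused via memoization.

from functools import lru_cache

_interned = {}   # structure key -> node ID (hash-consing)
_children = {}   # node ID -> tuple of child node IDs (empty for sample leaves)

def _intern(key, children=()):
    if key not in _interned:
        _interned[key] = len(_interned)
        _children[_interned[key]] = children
    return _interned[key]

def leaf(sample):
    return _intern(("leaf", sample))

def inner(*kids):
    kids = tuple(sorted(kids))
    return _intern(("inner", kids), kids)

@lru_cache(maxsize=None)
def samples_below(node_id):
    """Number of samples below a node; each shared subtree is evaluated once."""
    kids = _children[node_id]
    return 1 if not kids else sum(samples_below(k) for k in kids)

# Two genealogical trees sharing the subtree (A, B): it is encoded only once.
a, b, c, d = (leaf(s) for s in "ABCD")
tree1 = inner(inner(a, b), c)
tree2 = inner(inner(a, b), d)
print(samples_below(tree1), samples_below(tree2))  # -> 3 3
```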

Authors: Lukas Hübner, Alexandros Stamatakis

Date Published: 27th May 2024

Publication Type: Proceedings

Abstract

High-performance computing (HPC) constitutes an energy-hungry endeavor, and any efficiency gains via hardware and software advances are quickly (over-)compensated by increased consumption (rebound effect). A large proportion of electricity is still generated by burning fossil fuels (mostly coal and gas), which is the cause of climate change and has other potentially dangerous ecological and political consequences. Moving to a 100% renewable grid requires a plethora of solutions for electricity generation, distribution, storage, and consumption. In particular, dynamic load shifting can better align electricity consumption with the variable availability of solar and wind power. We introduce EcoFreq, a tool for dynamic power scaling on CPUs and GPUs that allows maximizing the usage of renewable, low-carbon energy. We benchmark EcoFreq via 14 HPC workloads using historical electricity market data from Germany, the UK, and the US. We show that carbon-aware power scaling leads to an over-proportional reduction in both CO2 emissions and energy costs (e.g., 15% to 19% savings with an induced throughput decrease of only 10%). Furthermore, we observe that simple, static power capping at 70% to 80% results in considerably improved energy efficiency, and we hence recommend it as the default setting. EcoFreq is freely available at: https://github.com/amkozlov/eco-freq.
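
The kind of policy described above can be pictured as a mapping from the current grid carbon intensity to a power cap, as sketched below. The thresholds and cap levels are illustrative assumptions, not EcoFreq's actual scaling policy.

```python
# Illustrative carbon-aware power capping (not EcoFreq's actual policy;
# thresholds and cap levels are made-up examples).

def choose_power_cap(carbon_intensity_g_per_kwh: float) -> float:
    """Map grid carbon intensity (gCO2/kWh) to a CPU power cap as a fraction of TDP."""
    if carbon_intensity_g_per_kwh < 150:   # plenty of wind/solar on the grid
        return 1.00                        # run at full power
    if carbon_intensity_g_per_kwh < 350:   # mixed generation
        return 0.80                        # moderate cap, small slowdown
    return 0.70                            # carbon-intensive grid: cap harder

for intensity in (100, 250, 500):
    print(f"{intensity} gCO2/kWh -> cap at {choose_power_cap(intensity):.2f} x TDP")
```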

Authors: Oleksiy M. Kozlov, Alexandros Stamatakis

Date Published: 1st May 2024

Publication Type: Proceedings

Abstract

In this work, we present two distinct applications of predictive modeling within the domain of phylogenetic inference and placement. Phylogenetic placement aims to place new entities into a given phylogenetic tree. While there exist efficient implementations for producing phylogenetic placements, the underlying reasons why particular placements are more difficult to perform than others are unknown. In the first use case, we focus on the prediction of the difficulty of those phylogenetic placements. We developed the Bold Assertor of Difficulty (BAD). BAD can predict the placement difficulty between 0 (easy) and 1 (hard) with high accuracy. On a set of 3000 metagenomic placements, we obtain a mean absolute error of 0.13. Based on SHapley Additive exPlanations (SHAP), BAD can help biologists understand the challenges associated with placing specific sequences into a reference phylogeny during metagenomic studies. Estimating the statistical robustness of the inferred phylogenetic tree constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The most widely used method for calculating branch support on trees inferred under maximum likelihood is the standard, non-parametric Felsenstein Bootstrap Support (SBS). The SBS method is computationally costly, leading to the development of alternative approaches such as Rapid Bootstrap and UltraFast Bootstrap 2 (UFBoot2). The second use case of this work is concerned with the fast machine learning-based approximation of those SBS values. Our SBS predictor, Educated Bootstrap Guesser (EBG), is on average 9.4 (𝜎 = 5.5) times faster than its major competitor UFBoot2 and provides an SBS estimate with a median absolute error of 5 when predicting SBS values between 0 and 10
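
Both use cases follow the same pattern: train a regressor on per-instance features and judge it by its absolute prediction error. The sketch below shows that pattern with scikit-learn on synthetic data; the features, targets, and model choice are illustrative assumptions and not the actual BAD or EBG feature sets or models.

```python
# Generic sketch of the predictive-modeling pattern used in both use cases
# (synthetic data; not the actual BAD or EBG features, targets, or models).

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 4))          # made-up per-instance features
y = np.clip(0.6 * X[:, 1] + 0.3 * X[:, 3] + 0.1 * rng.random(1000), 0.0, 1.0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```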

Authors: Julius Wiegert, Julia Haag, Dimitri Höhler, Alexandros Stamatakis

Date Published: 7th Apr 2024

Publication Type: Master's Thesis

Abstract

Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method, and the choice of genomic regions 1–3. Here, we address these issues by analyzing genomes of 363 bird species 4 (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a remarkable degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous–Paleogene (K–Pg) boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that challenge modeling due to extreme GC content, variable substitution rates, incomplete lineage sorting, or complex evolutionary events such as ancient hybridization. Assessment of the impacts of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates, and relative brain size following the K–Pg extinction event, supporting the hypothesis that emerging ecological opportunities catalyzed the diversification of modern birds. The resulting phylogenetic estimate offers novel insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.

Authors: Josefin Stiller, Shaohong Feng, Al-Aabid Chowdhury, Iker Rivas-González, David A. Duchêne, Qi Fang, Yuan Deng, Alexey Kozlov, Alexandros Stamatakis, Santiago Claramunt, Jacqueline M. T. Nguyen, Simon Y. W. Ho, Brant C. Faircloth, Julia Haag, Peter Houde, Joel Cracraft, Metin Balaban, Uyen Mai, Guangji Chen, Rongsheng Gao, Chengran Zhou, Yulong Xie, Zijian Huang, Zhen Cao, Zhi Yan, Huw A. Ogilvie, Luay Nakhleh, Bent Lindow, Benoit Morel, Jon Fjeldså, Peter A. Hosner, Rute R. da Fonseca, Bent Petersen, Joseph A. Tobias, Tamás Székely, Jonathan David Kennedy, Andrew Hart Reeve, Andras Liker, Martin Stervander, Agostinho Antunes, Dieter Thomas Tietze, Mads Bertelsen, Fumin Lei, Carsten Rahbek, Gary R. Graves, Mikkel H. Schierup, Tandy Warnow, Edward L. Braun, M. Thomas P. Gilbert, Erich D. Jarvis, Siavash Mirarab, Guojie Zhang

Date Published: 1st Apr 2024

Publication Type: Journal
