Publications

Faster Sorting of Aligned DNA-Read Files

Computational Molecular Evolution

Abstract

In the analysis of DNA sequencing data for finding disease-causing mutations, understanding evolutionary relationships between species, and finding variants, DNA reads are compared to a reference genome. A reference genome is a representative example of the set of genes of a species. Sorting these aligned DNA reads by their position within the reference sequence is a crucial step in many of these downstream analyses. SAMtools sort, a widely used tool, performs external memory sorting of aligned DNA reads stored in the BAM format (Binary Alignment Map). This format allows for compressed storage of alignment data. SAMtools sort provides the most comprehensive set of features while exhibiting demonstrably faster execution times than its open-source alternatives. In this work, we analyze SAMtools sort for sorting BAM files and propose methods to reduce its runtime. We divide the analysis into three parts: management of temporary files, compression, and input/output (IO). For the management of temporary files, we find that the maximum number of temporary files SAMtools sort can open concurrently is lower than the maximum number of open files permitted by the operating system. This results in an unnecessarily high number of merges of temporary files into larger temporary files, introducing overhead as SAMtools sort performs extra write and compression operations. To overcome this, we propose a dynamic limit for the number of temporary files that adapts to the operating system’s soft limit for open files. For compression, we test seven different libraries for compatible compression and a range of compression levels, identifying options that offer faster compression and result in a speedup of up to five times in single-threaded execution of SAMtools sort. For IO, we demonstrate that a minimal level of compression avoids IO overhead, thereby reducing the runtime of SAMtools sort compared to uncompressed output. However, we also show that uncompressed output can be used when pipelining SAMtools commands to reduce the runtime of subsequent SAMtools commands. Our proposed modifications to SAMtools sort and to user behavior have the potential to achieve speedups of up to a factor of 6. This represents an important contribution to the field of bioinformatics, considering the widespread adoption of SAMtools sort evidenced by its over 5,000 citations and over 5.1 million downloads through Bioconda.
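
The idea of a dynamic temporary-file limit described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation and not SAMtools code: the helper names `dynamic_temp_file_limit` and `merge_rounds`, the `reserved` headroom of 16 descriptors, and the simple multi-pass merge model are all assumptions made for illustration. The sketch reads the soft limit for open files on a Unix-like system and shows how a larger merge fan-in reduces the number of merge passes.

```python
import math
import resource  # Unix-only standard library module


def dynamic_temp_file_limit(reserved=16, floor=2):
    """Hypothetical helper: derive a cap on concurrently open temporary
    files from the soft limit for open files.

    `reserved` (an assumed value) leaves headroom for stdin/stdout,
    the input file, the output file, and similar descriptors.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        soft = 1 << 20  # assumption: treat "unlimited" as a large finite cap
    return max(floor, soft - reserved)


def merge_rounds(num_runs, fan_in):
    """Simplified model: number of merge passes needed to reduce
    `num_runs` sorted runs to one when at most `fan_in` runs can be
    merged at a time."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    rounds = 0
    while num_runs > 1:
        num_runs = math.ceil(num_runs / fan_in)
        rounds += 1
    return rounds


if __name__ == "__main__":
    limit = dynamic_temp_file_limit()
    print("temporary-file cap derived from soft limit:", limit)
    # With 1000 sorted runs, a fan-in of 64 needs two passes,
    # while a fan-in of 1024 needs only one.
    print(merge_rounds(1000, 64), merge_rounds(1000, 1024))
```

In this model every extra pass rewrites (and, for BAM, recompresses) the data once more, which is the kind of overhead the abstract attributes to a fixed, low limit.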

Authors: Dominik Siebelt, Lukas Hübner, Alexandros Stamatakis

Date Published: 3rd Jun 2024

Publication Type: Bachelor's Thesis

Citation:

Created: 9th Jan 2025 at 13:05

Memoization on Shared Subtrees Accelerates Computations on Genealogical Forests

Computational Molecular Evolution

Abstract

The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and, in conjunction with recent advances in ancient DNA sequencing technology, studying human migration patterns over the past few thousand years. The basic toolbox of population genetics includes genealogical trees, which describe the shared evolutionary history among individuals of the same species. They are calculated on the basis of genetic variations. However, in recombining organisms, a single tree is insufficient to describe the evolutionary history of the whole genome. Instead, a collection of correlated trees can be used, where each describes the evolutionary history of a consecutive region of the genome. The current corresponding state-of-the-art data structure, tree sequences, compresses these genealogical trees via edit operations when moving from one tree to the next along the genome instead of storing the full, often redundant, description for each tree. We propose a new data structure, genealogical forests, which compresses the set of genealogical trees into a DAG. In this DAG, identical subtrees that are shared across the input trees are encoded only once, thereby allowing for straightforward memoization of intermediate results. Additionally, we provide a C++ implementation of our proposed data structure, called gfkit, which is 2.1 to 11.2 (median 4.0) times faster than the state-of-the-art tool on empirical and simulated datasets at computing important population genetics statistics such as the Allele Frequency Spectrum, Patterson’s f, the Fixation Index, Tajima’s D, pairwise Lowest Common Ancestors, and others. On Lowest Common Ancestor queries with more than two samples as input, gfkit scales asymptotically better than the state-of-the-art, and is thus up to 990 times faster. In conclusion, our proposed data structure compresses genealogical trees by storing shared subtrees only once, thereby enabling straightforward memoization of intermediate results, yielding a substantial runtime reduction and a potentially more intuitive data representation over the state-of-the-art. Our improvements will boost the development of novel analyses and models in the field of population genetics and increase scalability to ever-growing genomic datasets.

2012 ACM Subject Classification: Applied computing → Computational genomics; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics; Applied computing → Population genetics
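
The core idea of the abstract, storing shared subtrees once in a DAG and reusing intermediate results, can be sketched in a few lines. This is an illustrative toy, not gfkit's data structure or API: the `Forest` class, the nested-tuple tree encoding, the treatment of children as unordered, and the `leaf_counts` function are all assumptions made for this example.

```python
class Forest:
    """Toy DAG in which identical subtrees are stored only once."""

    def __init__(self):
        self._ids = {}      # canonical subtree key -> node id
        self.children = []  # node id -> tuple of child ids (empty for leaves)

    def _intern(self, key, kids):
        # Reuse the existing node if this exact subtree was seen before.
        if key not in self._ids:
            self._ids[key] = len(self.children)
            self.children.append(kids)
        return self._ids[key]

    def add_tree(self, tree):
        """Insert a tree given as nested tuples; a leaf is a sample label
        (str). Returns the id of the tree's root node."""
        if isinstance(tree, str):
            return self._intern(("leaf", tree), ())
        # Assumption: child order does not matter, so child ids are sorted.
        kids = tuple(sorted(self.add_tree(child) for child in tree))
        return self._intern(("inner", kids), kids)


def leaf_counts(forest):
    """Number of samples below every node, computed once per DAG node.

    Children are always interned before their parent, so a single pass
    in id order sees every child before it is needed (memoization)."""
    counts = [0] * len(forest.children)
    for node, kids in enumerate(forest.children):
        counts[node] = 1 if not kids else sum(counts[k] for k in kids)
    return counts


if __name__ == "__main__":
    forest = Forest()
    root1 = forest.add_tree((("a", "b"), "c"))
    root2 = forest.add_tree((("a", "b"), ("c", "d")))
    counts = leaf_counts(forest)
    # The subtree ("a", "b") and the leaf "c" are shared between both trees.
    print(len(forest.children), counts[root1], counts[root2])
```

Stored separately, the two example trees would need 5 + 7 = 12 nodes; the shared representation needs 8, and the count for the shared subtree is computed only once.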

Authors: Lukas Hübner, Alexandros Stamatakis

Date Published: 27th May 2024

Publication Type: Proceedings

DOI: 10.1101/2024.05.23.595533

Citation: biorxiv;2024.05.23.595533v1,[Preprint]

Created: 9th Jan 2025 at 10:18, Last updated: 9th Jan 2025 at 10:18

EcoFreq: Compute with Cheaper, Cleaner Energy via Carbon-Aware Power Scaling

Computational Molecular Evolution

Abstract

High-performance computing (HPC) constitutes an energy-hungry endeavor, and any efficiency gains via hardware and software advances are quickly (over-)compensated by increased consumption (rebound …

Authors: Oleksiy M. Kozlov, Alexandros Stamatakis

Date Published: 1st May 2024

Publication Type: Proceedings

DOI: 10.23919/ISC.2024.10528928

Citation: ISC High Performance 2024 Research Paper Proceedings (39th International Conference),pp.1-12,IEEE

Created: 9th Jan 2025 at 10:21, Last updated: 9th Jan 2025 at 10:22

Use Cases of Predictive Modeling for Phylogenetic Inference and Placements

Computational Molecular Evolution

Abstract

In this work, we present two distinct applications of predictive modeling within the domain of phylogenetic inference and placement. Phylogenetic placements aim to place new entities into a given …

Authors: Julius Wiegert, Julia Haag, Dimitri Höhler, Alexandros Stamatakis

Date Published: 7th Apr 2024

Publication Type: Master's Thesis

Citation:

Created: 9th Jan 2025 at 13:09, Last updated: 9th Jan 2025 at 13:09

Complexity of avian evolution revealed by family-level genomes

Computational Molecular Evolution

Abstract

Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species …

Authors: Josefin Stiller, Shaohong Feng, Al-Aabid Chowdhury, Iker Rivas-González, David A. Duchêne, Qi Fang, Yuan Deng, Alexey Kozlov, Alexandros Stamatakis, Santiago Claramunt, Jacqueline M. T. Nguyen, Simon Y. W. Ho, Brant C. Faircloth, Julia Haag, Peter Houde, Joel Cracraft, Metin Balaban, Uyen Mai, Guangji Chen, Rongsheng Gao, Chengran Zhou, Yulong Xie, Zijian Huang, Zhen Cao, Zhi Yan, Huw A. Ogilvie, Luay Nakhleh, Bent Lindow, Benoit Morel, Jon Fjeldså, Peter A. Hosner, Rute R. da Fonseca, Bent Petersen, Joseph A. Tobias, Tamás Székely, Jonathan David Kennedy, Andrew Hart Reeve, Andras Liker, Martin Stervander, Agostinho Antunes, Dieter Thomas Tietze, Mads Bertelsen, Fumin Lei, Carsten Rahbek, Gary R. Graves, Mikkel H. Schierup, Tandy Warnow, Edward L. Braun, M. Thomas P. Gilbert, Erich D. Jarvis, Siavash Mirarab, Guojie Zhang

Date Published: 1st Apr 2024

Publication Type: Journal

DOI: 10.1038/s41586-024-07323-1

Citation: Nature

Created: 23rd Apr 2024 at 11:12, Last updated: 23rd Apr 2024 at 11:13

Are Sounds Sound for Phylogenetic Reconstruction?

Computational Molecular Evolution

Abstract

In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, …

Authors: Luise Häuser, Gerhard Jäger, Johann-Mattis List, Taraka Rama, Alexandros Stamatakis

Date Published: 22nd Mar 2024

Publication Type: Proceedings

Citation: Luise Häuser, Gerhard Jäger, Johann-Mattis List, Taraka Rama, and Alexandros Stamatakis. 2024. Are Sounds Sound for Phylogenetic Reconstruction?. In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 78–87, St. Julian's, Malta. Association for Computational Linguistics.

Created: 9th Jan 2025 at 10:29, Last updated: 9th Jan 2025 at 10:29

Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data

Computational Molecular Evolution

Abstract

Motivation: Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to …

Authors: Julia Haag, Alexander I. Jordan, Alexandros Stamatakis

Date Published: 15th Mar 2024

Publication Type: Journal

DOI: 10.1101/2024.03.14.584962

Citation: biorxiv;2024.03.14.584962v1,[Preprint]

Created: 23rd Apr 2024 at 11:14, Last updated: 23rd Apr 2024 at 11:15
