Publications

Abstract

Not specified

Authors: Diego Darriba, Tomas Flouri, Alexandros Stamatakis

Date Published: 2015

Publication Type: Journal

Abstract

In the context of a master's-level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed and make available two independent and highly optimized open-source implementations of the pair-wise statistical alignment model, also known as TKF91, that was developed by Thorne, Kishino, and Felsenstein in 1991. This paper has two parts. In the educational part, we cover teaching issues regarding the setup of the course and the practical, and summarize student and teacher experiences. In the scientific part, the two student teams (Team I: Nikolai, Sebastian, Daniel; Team II: Sarah, Pierre) present their solutions for efficient and numerically stable implementations of the TKF91 algorithm. The two teams worked independently on implementing the same algorithm. Hence, since the implementations yield identical results (with slight numerical deviations), we are confident that the implementations are correct. We describe the optimizations applied and make them available as open-source code in the hope that our findings and software will be useful to the community as well as for similar programming practicals at other universities.

Authors: Nikolai Baudis, Pierre Barbera, Sebastian Graf, Sarah Lutteropp, Daniel Opitz, Tomas Flouri, Alexandros Stamatakis

Date Published: 2015

Publication Type: Journal

Abstract

Not specified

Authors: Lucas Czech, Alexandros Stamatakis

Date Published: 2015

Publication Type: Journal

Abstract

Not specified

Authors: A. Schmoldt, H. F. Benthe, G. Haberland

Date Published: 1st Sep 1975

Publication Type: Journal

Abstract

In recent years, advances in high-throughput genetic sequencing, coupled with the ongoing exponential growth and availability of computational resources, have enabled entirely new approaches in the biological sciences. It is now possible to perform broad sequencing of the genetic content of entire communities of organisms from individual environmental samples. Such methods are particularly relevant to microbiology. The field was previously largely constrained to the study of those microbes that could be cultured in the laboratory (i.e., in vitro), which represent a small fraction of the diversity observed in nature. In contrast, high-throughput sequencing now enables the collection of genetic sequences directly from a microbiome in its natural environment (i.e., in situ). A typical goal of microbiome studies is the taxonomic classification of the sequences contained in a sample (the queries). Phylogenetic methods are commonly used to determine detailed taxonomic relationships between queries and well-trusted reference sequences from previously classified organisms. However, due to the high volume (10^6 to 10^9) of query sequences produced by microbiome sampling based on high-throughput sequencing, accurate phylogenetic tree reconstruction is computationally infeasible. Moreover, currently used sequencing technologies typically produce short query sequences that have limited phylogenetic signal, causing instability in the inference of comprehensive phylogenies. Another common goal of microbiome studies is to quantify the diversity within a sample, as well as between multiple samples. Phylogenetic methods are commonly used for this task as well, typically involving the inference of a phylogenetic tree comprising all query sequences, or a clustered subset thereof. Again, as with taxonomic identification, analyses based on this kind of tree inference may yield inaccurate results and/or be computationally prohibitive.
In contrast to comprehensive phylogenetic inference, phylogenetic placement is a method that identifies the phylogenetic context of a query sequence within a trusted reference tree. Such methods typically regard the reference tree as immutable, that is, the reference tree is not altered before, during, or after the placement of a query. This allows for the phylogenetic placement of a query sequence in linear time with respect to the size of the reference tree. When combined with taxonomic information for the reference sequences, phylogenetic placement therefore allows a query to be identified. Further, phylogenetic placement enables a wealth of additional post-analysis procedures, allowing, for example, the association of microbiome characteristics with clinical diagnostic properties. In this thesis I present my work on designing, implementing, and improving EPA-ng, a high-performance implementation of maximum likelihood phylogenetic placement. EPA-ng is designed to scale to billions of input query sequences and to parallelize across thousands of cores in both shared- and distributed-memory environments. It also improves the single-core processing speed by up to 30 times compared to its closest direct competitors. Recently, we have introduced an optional feature to EPA-ng that allows for placement into substantially larger reference trees, using an active memory management approach that trades memory for execution time. Additionally, I present a massively parallel approach to quantify the diversity of a sample, based on phylogenetic placement results. The resulting tool, called SCRAPP, combines state-of-the-art methods for maximum likelihood phylogenetic tree inference and molecular species delimitation to infer a species count distribution on a reference tree for a given sample. Furthermore, it employs a novel approach for clustering placement results, allowing the user to reduce the computational effort.

Author: Pierre Barbera

Date Published: No date defined

Publication Type: Doctoral Thesis

Abstract

Phylogenetics, the study of evolutionary relationships among biological entities, plays an essential role in biological and medical research. Its applications range from answering fundamental questions, such as understanding the origin of life, to solving more practical problems, such as tracking pandemics in real time. Nowadays, phylogenetic trees are typically inferred from molecular data via likelihood-based methods. Those methods strive to find the tree that maximizes a likelihood score under a given stochastic model of sequence evolution. This work focuses on the inference of species as well as gene phylogenetic trees. Species evolve through speciation and extinction events. Genes evolve through events such as gene duplication, gene loss, and horizontal gene transfer. Both processes are strongly correlated, because genes belong to species and evolve within their genomes. One can deploy models of gene evolution to exploit this correlation between species and gene evolutionary histories, in order to improve the accuracy of phylogenetic tree inference methods. However, the most widely used phylogenetic tree inference methods disregard these phenomena and focus on models of sequence evolution only. In addition, current maximum likelihood methods are computationally expensive. This is particularly challenging as the community faces a dramatically growing amount of available molecular data, due to recent advances in sequencing technologies. To handle this data avalanche, we urgently need tools that offer faster algorithms, as well as efficient parallel implementations. In this thesis, I develop new maximum likelihood methods that explicitly model the relationships between species and gene histories, in order to infer more accurate phylogenetic trees. Those methods employ both new heuristics and dedicated parallelization schemes in order to accelerate the inference process.
My first project, ParGenes, is a parallel software pipeline for inferring gene family trees from a set of per-gene multiple sequence alignments. For each input alignment, it determines the best-fit model of sequence evolution, and subsequently searches for the gene family tree with the highest likelihood under this model. To this end, ParGenes uses several state-of-the-art tools, and runs them in parallel using a novel scheduling strategy. My second project, SpeciesRax, is a method for inferring a rooted species tree from a set of unrooted gene family trees. SpeciesRax strives to find the rooted species tree that maximizes the likelihood score under a dedicated model of gene evolution that accounts for gene duplication, gene loss, and horizontal gene transfer. In addition, I introduce a new method for assessing the confidence in the resulting species tree, as well as a novel method for estimating its branch lengths. My third project, GeneRax, is a novel maximum likelihood method for gene family tree inference. GeneRax takes as input a rooted species tree as well as a set of (per-gene) multiple sequence alignments, and outputs one gene family tree per input alignment. To this end, I introduce the so-called joint likelihood function, which combines a model of sequence evolution with a model of gene evolution. In addition, GeneRax can estimate the pattern of gene duplication, gene loss, and horizontal gene transfer events that occurred along the input species tree.

Author: Benoit Morel

Date Published: No date defined

Publication Type: Doctoral Thesis
