Publications


Abstract

The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, has turned single cell sequencing into an empowering technology; analyzing thousands, or even millions, of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing a data revolution in single cell biology. Although some issues are similar in spirit to those experienced in bulk sequencing, many of the emerging data science problems are unique to single cell analysis; together, they give rise to the new realm of 'Single-Cell Data Science'. Here, we outline twelve challenges that will be central in bringing this new field forward. For each challenge, the current state of the art in terms of prior work is reviewed, and open problems are formulated, with an emphasis on the research goals that motivate them. This compendium is meant to serve as a guideline for established researchers, newcomers and students alike, highlighting interesting and rewarding problems in 'Single-Cell Data Science' for the coming years.

Authors: David Laehnemann, Johannes Köster, Ewa Szczurek, Davis J McCarthy, Stephanie C Hicks, Mark D Robinson, Catalina A Vallejos, Niko Beerenwinkel, Kieran R Campbell, Ahmed Mahfouz, Luca Pinello, Pavel Skums, Alexandros Stamatakis, Camille Stephan-Otto Attolini, Samuel Aparicio, Jasmijn Baaijens, Marleen Balvert, Buys de Barbanson, Antonio Cappuccio, Giacomo Corleone, Bas E Dutilh, Maria Florescu, Victor Guryev, Rens Holmer, Katharina Jahn, Thamar Jessurun Lobo, Emma M Keizer, Indu Khatri, Szymon M Kiełbasa, Jan O Korbel, Alexey M Kozlov, Tzu-Hao Kuo, Boudewijn PF Lelieveldt, Ion I Mandoiu, John C Marioni, Tobias Marschall, Felix Mölder, Amir Niknejad, Łukasz Rączkowski, Marcel Reinders, Jeroen de Ridder, Antoine-Emmanuel Saliba, Antonios Somarakis, Oliver Stegle, Fabian J Theis, Huan Yang, Alex Zelikovsky, Alice C McHardy, Benjamin J Raphael, Sohrab P Shah, Alexander Schönhuth

Date Published: 23rd Aug 2019

Publication Type: Journal

Abstract

ModelTest-NG is a reimplementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively. ModelTest-NG is one to two orders of magnitude faster than jModelTest and ProtTest but equally accurate and introduces several new features, such as ascertainment bias correction, mixture, and free-rate models, or the automatic processing of single partitions. ModelTest-NG is available under a GNU GPL3 license at https://github.com/ddarriba/modeltest (last accessed September 2, 2019).

Authors: Diego Darriba, David Posada, Alexey M Kozlov, Alexandros Stamatakis, Benoit Morel, Tomas Flouri

Date Published: 21st Aug 2019

Publication Type: Journal

Abstract

The ever-increasing amount of genomic and meta-genomic sequence data has transformed biology into a data-driven and compute-intensive discipline. Hence, there is a need for efficient algorithms and scalable implementations thereof for analysing such data. We present GENESIS, a library for working with phylogenetic data, and GAPPA, an accompanying command line tool for conducting typical analyses on such data. While our tools primarily target phylogenetic trees and phylogenetic placements, they also offer a plethora of functions for handling genetic sequences, taxonomies, and other relevant data types. The tools aim at improved usability at the production stage (conducting data analyses) as well as the development stage (rapid prototyping): The modular interface of GENESIS simplifies numerous standard high-level tasks and analyses, while allowing for low-level customization at the same time. Our implementation relies on modern, multi-threaded C++11, and is substantially more computationally efficient than analogous tools. We have already employed the core GENESIS library in several of our tools and publications, thereby proving its flexibility and utility. GENESIS and GAPPA are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa.

Authors: Lucas Czech, Pierre Barbera, Alexandros Stamatakis

Date Published: 28th May 2019

Publication Type: Journal

Abstract

High-throughput environmental DNA metabarcoding has revolutionized the analysis of microbial diversity, but this approach is generally restricted to amplicon sizes below 500 base pairs. These short regions contain limited phylogenetic signal, which makes it impractical to use environmental DNA in full phylogenetic inferences. However, new long-read sequencing technologies such as the Pacific Biosciences platform may provide sufficiently large sequence lengths to overcome the poor phylogenetic resolution of short amplicons. To test this idea, we amplified soil DNA and used PacBio Circular Consensus Sequencing (CCS) to obtain a ~4500 bp region of the eukaryotic rDNA operon spanning most of the small (18S) and large subunit (28S) ribosomal RNA genes. The CCS reads were first treated with a novel curation workflow that generated 650 high-quality OTUs containing the physically linked 18S and 28S regions of the long amplicons. In order to assign taxonomy to these OTUs, we developed a phylogeny-aware approach based on the 18S region that showed greater accuracy and sensitivity than similarity-based and phylogenetic placement-based methods using shorter reads. The taxonomically annotated OTUs were then combined with available 18S and 28S reference sequences to infer a well-resolved phylogeny spanning all major groups of eukaryotes, allowing us to accurately derive the evolutionary origin of environmental diversity. A total of 1019 sequences were included, of which a majority (58%) corresponded to the new long environmental CCS reads. Comparisons to the 18S-only region of our amplicons revealed that the combined 18S-28S genes globally increased the phylogenetic resolution, recovering specific groupings otherwise missing. The long reads also allowed us to directly investigate the relationships among environmental sequences themselves, which represents a key advantage over the placement of short reads on a reference phylogeny.
Altogether, our results show that long amplicons can be treated in a full phylogenetic framework to provide greater taxonomic resolution and a robust evolutionary perspective to environmental DNA.

Authors: Mahwash Jamy, Rachel Foster, Pierre Barbera, Lucas Czech, Alexey Kozlov, Alexandros Stamatakis, David Baß, Fabien Burki

Date Published: 5th May 2019

Publication Type: Journal

Abstract

Not specified

Authors: A. Schmoldt, H. F. Benthe, G. Haberland

Date Published: 1st Sep 1975

Publication Type: Journal

Abstract

In recent years, advances in high-throughput genetic sequencing, coupled with the ongoing exponential growth and availability of computational resources, have enabled entirely new approaches in the biological sciences. It is now possible to perform broad sequencing of the genetic content of entire communities of organisms from individual environmental samples. Such methods are particularly relevant to microbiology. The field was previously largely constrained to the study of those microbes that could be cultured in the laboratory (i.e., in vitro), which represent a small fraction of the diversity observed in nature. In contrast to this, high-throughput sequencing now enables the collection of genetic sequences directly from a microbiome in its natural environment (i.e., in situ). A typical goal of microbiome studies is the taxonomic classification of the sequences contained in a sample (the queries). Phylogenetic methods are commonly used to determine detailed taxonomic relationships between queries and well-trusted reference sequences from previously classified organisms. However, due to the high volume (10^6 to 10^9) of query sequences produced by high-throughput sequencing-based microbiome sampling, accurate phylogenetic tree reconstruction is computationally infeasible. Moreover, currently used sequencing technologies typically produce short query sequences that have limited phylogenetic signal, causing instability in the inference of comprehensive phylogenies. Another common goal of microbiome studies is to quantify the diversity within a sample, as well as between multiple samples. Phylogenetic methods are commonly used for this task as well, typically involving the inference of a phylogenetic tree comprising all query sequences, or a clustered subset thereof. Again, as with taxonomic identification, analyses based on this kind of tree inference may yield inaccurate results, and/or be computationally prohibitive.
In contrast to comprehensive phylogenetic inference, phylogenetic placement is a method that identifies the phylogenetic context of a query sequence within a trusted reference tree. Such methods typically regard the reference tree as immutable, that is, the reference tree is not altered before, during, or after the placement of a query. This allows for the phylogenetic placement of a query sequence in linear time with respect to the size of the reference tree. When combined with taxonomic information for the reference sequences, phylogenetic placement therefore allows a query to be identified. Further, phylogenetic placement enables a wealth of additional post-analysis procedures, allowing, for example, the association of microbiome characteristics with clinical diagnostic properties. In this thesis I present my work on designing, implementing, and improving EPA-ng, a high-performance implementation of maximum likelihood phylogenetic placement. EPA-ng is designed to scale to billions of input query sequences and to parallelize across thousands of cores in both shared- and distributed-memory environments. It also improves the single-core processing speed by up to 30 times compared to its closest direct competitors. Recently, we have introduced an optional feature to EPA-ng that allows for placement into substantially larger reference trees, using an active memory management approach that trades memory for execution time. Additionally, I present a massively parallel approach to quantify the diversity of a sample, based on phylogenetic placement results. The resulting tool, called SCRAPP, combines state-of-the-art methods for maximum likelihood phylogenetic tree inference and molecular species delimitation to infer a species count distribution on a reference tree for a given sample. Furthermore, it employs a novel approach for clustering placement results, allowing the user to reduce the computational effort.

Author: Pierre Barbera

Date Published: No date defined

Publication Type: Doctoral Thesis

Abstract

Phylogenetic trees represent hypothetical evolutionary relationships between organisms. Approaches for inferring phylogenetic trees include the Maximum Likelihood (ML) method. This method relies on numerical optimization routines that use internal numerical thresholds. We analyze the influence of these thresholds on the likelihood scores and runtimes of tree inferences for the ML inference tools RAxML-NG, IQ-Tree, and FastTree. We analyze 22 empirical datasets and show that we can speed up the tree inference in RAxML-NG and IQ-Tree by changing the default values of two such numerical thresholds. Using 15 additional simulated datasets, we show that these changes do not affect the accuracy of the inferred phylogenetic trees. For RAxML-NG, increasing the likelihood thresholds lh_epsilon and spr_lh_epsilon to 10 and 10^3, respectively, results in an average speedup of 1.9 ± 0.6. Increasing the likelihood threshold lh_epsilon in IQ-Tree results in an average speedup of 1.3 ± 0.4. In addition to the numerical analysis, we attempt to predict the difficulty of datasets, with the aim of preventing an unnecessarily large number of tree inferences for datasets that are easy to analyze. We present our prediction experiments and discuss why this task proved to be more challenging than anticipated.

Author: Julia Haag

Date Published: No date defined

Publication Type: Master's Thesis
