Simulations of sequence evolution: how (un)realistic they are and why

Abstract:
      
        Motivation
        Simulating Multiple Sequence Alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools, and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, such as neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal of representing the sequence evolution process as faithfully as possible and thus simulating empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution, with and without insertion/deletion (indel) events, using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods can predict whether a given MSA is simulated or empirical.
      
      
        Results
        Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.
      
      
        Data and Code Availability
        
          All simulated and empirical MSAs, as well as all analysis results, are available at https://cme.h-its.org/exelixis/material/simulation_study.tar.gz. All scripts required to reproduce our results are available at https://github.com/tschuelia/SimulationStudy and https://github.com/JohannaTrost/seqsharp.
        
      
      
        Contact
        
          julia.haag@h-its.org

Citation: biorxiv;2023.07.11.548509v2,[Preprint]

Date Published: 12th Jul 2023

Registered Mode: by DOI

Authors: Johanna Trost, Julia Haag, Dimitri Höhler, Laurent Jacob, Alexandros Stamatakis, Bastien Boussau

Citation
Trost, J., Haag, J., Höhler, D., Jacob, L., Stamatakis, A., & Boussau, B. (2023). Simulations of sequence evolution: how (un)realistic they are and why. bioRxiv [Preprint]. Cold Spring Harbor Laboratory. https://doi.org/10.1101/2023.07.11.548509
Activity

Views: 1579

Created: 2nd Jan 2024 at 18:25

Last updated: 5th Mar 2024 at 21:25
