Motivation: Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership in a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower-dimensional representation of genotype data, and the uncertainty of such an analysis should be reported in all studies. To date, no stability estimation technique exists for genotype data that can estimate this uncertainty.
Results: Here, we present Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding as well as per-individual support values, and deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. In addition to this bootstrap-based stability estimation, Pandora offers a sliding-window stability estimation for whole-genome data. Using published empirical and simulated datasets, we demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques.
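The bootstrap idea summarized above can be illustrated with a minimal sketch. This is not Pandora's implementation or API: the function names (`pca_embedding`, `procrustes_similarity`, `bootstrap_stability`, `toy_genotypes`), the use of plain PCA, the Procrustes-based comparison of embeddings, and the toy data are all assumptions made for illustration only. The sketch resamples SNPs with replacement, embeds each replicate, and averages the pairwise similarity of the resulting embeddings into one overall score.

```python
import numpy as np


def pca_embedding(genotypes, n_components=2):
    """Project individuals (rows) onto the top principal components."""
    centered = genotypes - genotypes.mean(axis=0)
    # SVD of the centered matrix; u * s gives the PC scores per individual.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def procrustes_similarity(a, b):
    """Similarity in [0, 1] after optimal translation, scaling and rotation."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # The sum of singular values of a^T b is the best achievable alignment.
    singular_values = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(min(1.0, singular_values.sum() ** 2))


def bootstrap_stability(genotypes, n_bootstraps=20, n_components=2, seed=0):
    """Mean pairwise similarity of embeddings computed on bootstrapped SNPs."""
    rng = np.random.default_rng(seed)
    n_snps = genotypes.shape[1]
    embeddings = []
    for _ in range(n_bootstraps):
        # Resample SNP columns with replacement (one bootstrap replicate).
        columns = rng.integers(0, n_snps, size=n_snps)
        embeddings.append(pca_embedding(genotypes[:, columns], n_components))
    similarities = [
        procrustes_similarity(embeddings[i], embeddings[j])
        for i in range(n_bootstraps)
        for j in range(i + 1, n_bootstraps)
    ]
    return float(np.mean(similarities))


def toy_genotypes(seed=1, n_per_group=20, n_snps=300):
    """Hypothetical toy data: two groups with different allele frequencies."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, size=n_snps)
    freq_b = rng.uniform(0.1, 0.9, size=n_snps)
    group_a = rng.binomial(2, freq_a, size=(n_per_group, n_snps))
    group_b = rng.binomial(2, freq_b, size=(n_per_group, n_snps))
    return np.vstack([group_a, group_b])


if __name__ == "__main__":
    score = bootstrap_stability(toy_genotypes())
    print(f"overall stability score: {score:.3f}")
```

A score close to 1 means the bootstrap replicates yield nearly identical embeddings, while lower scores suggest that conclusions drawn from a single embedding may not be robust. Per-individual support values and cluster-level comparisons would need further steps that this sketch does not attempt.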
Data and Code Availability: Pandora is available on GitHub at https://github.com/tschuelia/Pandora. All Python scripts and data to reproduce our results are available on GitHub at https://github.com/tschuelia/PandoraPaper.
SEEK ID: https://publications.h-its.org/publications/1833
DOI: 10.1101/2024.03.14.584962
Research Groups: Computational Molecular Evolution
Publication type: Journal
Citation: biorxiv;2024.03.14.584962v1,[Preprint]
Date Published: 15th Mar 2024
Registered Mode: by DOI
Views: 1441
Created: 23rd Apr 2024 at 11:14
Last updated: 23rd Apr 2024 at 11:15