snputils: A Python library for processing diverse genomes

snputils is a Python package designed to ease the processing and analysis of common and diverse genomic datasets, efficiently handling the complexities of diverse genome formats and operations. The library provides robust tools for handling sequencing and ancestry data, with a focus on performance, ease of use, and advanced visualization capabilities.

snputils is developed as a collaboration between Stanford University's Department of Biomedical Data Science, the UC Santa Cruz Genomics Institute, and other collaborators worldwide.

This is an early-access release; parts of the code are likely to change significantly in the upcoming weeks.

Installation

Basic installation using pip:

pip install snputils

Optionally, for GPU-accelerated functionalities, install the package with the [gpu] extra:

pip install 'snputils[gpu]'
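
To check whether a GPU backend is usable after installing, a small standalone check can help. This is a minimal sketch, assuming the [gpu] extra installs PyTorch (the Performance section names PyTorch as the GPU backend). The helper name gpu_backend_available is hypothetical and not part of snputils.

```python
import importlib.util


def gpu_backend_available() -> bool:
    """Return True if PyTorch is importable and reports a CUDA device.

    Hypothetical helper for illustration; it is not part of snputils.
    """
    # Avoid importing torch when it is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("GPU backend available:", gpu_backend_available())
```

If this prints False, the CPU code paths are the ones to expect.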

Key Features

Ease of Use

snputils is designed to be user-friendly and intuitive, with a simple API that allows you to quickly load, process, and visualize genomic data. For example, reading a whole-genome VCF file is as simple as:

import snputils as su
snpobj = su.read_snp("path/to/file.vcf.gz")

Similarly, reading BED or PGEN filesets is straightforward:

snpobj = su.read_snp("path/to/file.pgen")

Working with ancestry files, performing processing operations, and creating visualizations is just as straightforward. See the demos directory for examples.

File Format Support

snputils aims to provide the fastest available readers and writers for various genomic data formats:

VCF: Support for .vcf and .vcf.gz files
PLINK1: Support for .bed, .bim, .fam filesets
PLINK2: Support for .pgen, .pvar, .psam filesets
Local Ancestry: Support for the .msp local ancestry format
Admixture: Read and write .Q and .P files
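
The mapping from file extension to format family listed above can be sketched as a small lookup. This is an illustration only: guess_format and SUFFIX_TO_FORMAT are hypothetical names, and this is not how snputils itself dispatches between readers.

```python
from pathlib import Path

# Extensions taken from the list above; the mapping itself is illustrative.
SUFFIX_TO_FORMAT = {
    ".vcf": "VCF",
    ".bed": "PLINK1",
    ".bim": "PLINK1",
    ".fam": "PLINK1",
    ".pgen": "PLINK2",
    ".pvar": "PLINK2",
    ".psam": "PLINK2",
    ".msp": "Local Ancestry",
    ".Q": "Admixture",
    ".P": "Admixture",
}


def guess_format(path):
    """Return the format family for a path, or None if it is not recognized."""
    name = Path(path).name
    # Gzip-compressed VCF is the only compressed variant listed above.
    if name.endswith(".vcf.gz"):
        return "VCF"
    return SUFFIX_TO_FORMAT.get(Path(name).suffix)


if __name__ == "__main__":
    for example in ("path/to/file.vcf.gz", "path/to/file.pgen", "cohort.3.Q"):
        print(example, "->", guess_format(example))
```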

Processing Tools

Basic Data Manipulation
- Filter variants and samples
- Correct SNP flips
- Filter out ambiguous SNPs
Dimensionality Reduction
- Standard PCA with optional GPU acceleration
- Missing-DNA PCA (mdPCA)
- Multi-array ancestry-specific MDS (maasMDS)
Admixture Mapping
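
To make "ambiguous SNPs" concrete: a SNP is strand-ambiguous when its two alleles are complementary bases (A/T or C/G), so the strand cannot be inferred from the alleles alone. The following is a minimal pure-Python sketch of that filter. It does not use the snputils API, and the function name and the sample variants are made up for illustration.

```python
# Watson-Crick base complements.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(ref, alt):
    """Return True if ref/alt form a strand-ambiguous pair (A/T or C/G)."""
    return COMPLEMENT.get(ref.upper()) == alt.upper()


# Hypothetical variants as (id, ref, alt) tuples.
variants = [
    ("rs1", "A", "G"),
    ("rs2", "A", "T"),  # ambiguous
    ("rs3", "C", "G"),  # ambiguous
    ("rs4", "T", "C"),
]

# Keep only the variants whose strand can be resolved from the alleles.
kept = [v for v in variants if not is_ambiguous(v[1], v[2])]

if __name__ == "__main__":
    print([v[0] for v in kept])
```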

Visualization

Interactive global ancestry bar plots
Detailed scatter plots of PCA, mdPCA, and maasMDS
Admixture mapping Manhattan plots
Local ancestry visualization
- Chromosome painting (with Tagore)
- Dataset-level

Performance

Fast file I/O through built-in methods or optimized wrappers (e.g., Pgenlib for PLINK files)
Memory-efficient operations using NumPy and Polars
Optional GPU acceleration via PyTorch for computationally intensive tasks
Support for large-scale genomic datasets through efficient memory management

Our benchmark compares reading performance on chromosome 22 data across different tools, with snputils outperforming the existing tools. See the benchmark directory for detailed methodology and results.

The snputils package is continuously updated with new features and improvements. Future releases will include support for statistical computations, admixture simulations, command-line tools, and more.

Documentation & Support

API Reference: Visit our comprehensive documentation at docs.snputils.org.
Tutorials & Examples: Check out our demos in the demos directory.
Issues & Support: GitHub Issues.

Acknowledgments

We would like to thank the open-source Python packages that make snputils possible: matplotlib, NumPy, pandas, Pgenlib, polars, pong, PyTorch, scikit-allel, scikit-learn, Tagore.

Citation

If you use snputils in your research, please cite:

Bonet, D.*, Comajoan Cara, M.*, Barrabés, M.*, Smeriglio, R., Agrawal, D., Dominguez Mantes, A., López, C., Thomassin, C., Calafell, A., Luis, A., Saurina, J., Franquesa, M., Perera, M., Geleta, M., Jaras, A., Sabat, B. O., Abante, J., Moreno-Grau, S., Mas Montserrat, D., Ioannidis, A. G., snputils: A Python library for processing diverse genomes. Annual Meeting of The American Society of Human Genetics, November 2024, Denver, Colorado, USA. *Equal contribution.

Journal paper coming soon!
