grp-bork/samestr_flow

SameStr workflow

Developed by the Bork Group
Raise an issue or contact us

See our other Software & Services
The development of this workflow was supported by NFDI4Microbiota.

Description

SameStr is a tool for strain-level analysis and microbial tracking with genomic and metagenomic data, originally developed in the Fricke Lab at the University of Hohenheim. The SameStr workflow is a Nextflow workflow that runs SameStr on MetaPhlAn4 marker alignments. It includes optional read preprocessing and host/human decontamination steps provided by the nevermore workflow library.

Citation

This workflow: DOI

Also cite:

Podlesny D, Arze C, Dörner E, et al. Metagenomic strain detection with SameStr: identification of a persisting core gut microbiota transferable by fecal transplantation. Microbiome. 2022;10(1):53. Published 2022 Mar 25. doi:10.1186/s40168-022-01251-w

Overview

[Workflow diagrams: nevermore workflow and SameStr subworkflow]


Requirements

The easiest way to handle dependencies is via Singularity/Docker containers. Alternatively, conda environments, software module systems or native installations can be used.

Preprocessing

Preprocessing and QA are done with bbmap, fastqc, and multiqc.

Decontamination/Host removal

Decontamination is done with kraken2 and additionally requires seqtk.

Kraken2 database

Host removal requires a kraken2 host database.

MetaPhlAn Profiling

The default supported MetaPhlAn version is 4.

CHOCOPhlAn database for MetaPhlAn4

Get an SGB-based CHOCOPhlAn database from the official Biobakery site. At the time of writing, the following databases are available:

  • mpa_vJan21_CHOCOPhlAnSGB_202103 (has SameStr db)
  • mpa_vOct22_CHOCOPhlAnSGB_202212 (has SameStr db)
  • mpa_vJun23_CHOCOPhlAnSGB_202307 (has SameStr db)
  • mpa_vOct22_CHOCOPhlAnSGB_202403 (not tested)
  • mpa_vJun23_CHOCOPhlAnSGB_202403 (not tested)

To install the database, unpack the tarball and point the --mp4_db parameter to the database's root directory.

In params.yml:

mp4_db: "/path/to/mpa_vOct22_CHOCOPhlAnSGB_202212/"

On the command line:

--mp4_db "/path/to/mpa_vOct22_CHOCOPhlAnSGB_202212/"

SameStr Profiling

Shared strains are detected with SameStr.

SameStr databases

Obtain the SameStr database corresponding to your CHOCOPhlAn database from the Zenodo repository.
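
Because the SameStr database has to match the CHOCOPhlAn release, it can help to check the release name before starting a run. The following is a minimal sketch, not part of the workflow: the `samestr_db_status` helper is hypothetical, the table simply restates the list above, and it assumes the last component of the `mp4_db` path is the release name (as in the example paths shown earlier).

```python
from pathlib import Path

# Status of each CHOCOPhlAn release, copied from the list in this README.
CHOCOPHLAN_STATUS = {
    "mpa_vJan21_CHOCOPhlAnSGB_202103": "has SameStr db",
    "mpa_vOct22_CHOCOPhlAnSGB_202212": "has SameStr db",
    "mpa_vJun23_CHOCOPhlAnSGB_202307": "has SameStr db",
    "mpa_vOct22_CHOCOPhlAnSGB_202403": "not tested",
    "mpa_vJun23_CHOCOPhlAnSGB_202403": "not tested",
}


def samestr_db_status(mp4_db):
    """Return (release_name, status) for an mp4_db path (hypothetical helper).

    Assumption: the database root directory is named after the release.
    Releases not listed in the README are reported as "unknown".
    """
    release = Path(mp4_db).name  # pathlib ignores a trailing slash
    return release, CHOCOPHLAN_STATUS.get(release, "unknown")
```

For example, `samestr_db_status("/path/to/mpa_vOct22_CHOCOPhlAnSGB_202212/")` reports that a SameStr database exists for that release, so the matching one should be fetched from Zenodo.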


Usage

Cloud-based Workflow Manager (CloWM)

This workflow will be available on the CloWM platform (coming soon).

Command-Line Interface (CLI)

A workflow run is controlled by environment-specific parameters (see run.config) and study-specific parameters (see params.yml). Parameters set in params.yml can also be specified on the command line.

You can either clone this repository from GitHub and run it as follows:

git clone https://github.com/grp-bork/samestr_flow.git
nextflow run /path/to/samestr_flow [-resume] -c /path/to/run.config -params-file /path/to/params.yml

Or you can have Nextflow pull it from GitHub and run it from the $HOME/.nextflow directory:

nextflow run grp-bork/samestr_flow [-resume] -c /path/to/run.config -params-file /path/to/params.yml
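
When runs are launched from a script, the two invocations above differ only in the workflow source, and `-resume` is optional. A small sketch of assembling the command as an argument list follows; the `build_command` helper is hypothetical and uses only the options shown above.

```python
import shlex


def build_command(source, run_config, params_file, resume=False):
    """Assemble the nextflow invocation as an argument list (hypothetical helper).

    `source` is either a local clone (/path/to/samestr_flow) or the
    GitHub shorthand (grp-bork/samestr_flow).
    """
    command = ["nextflow", "run", source]
    if resume:
        command.append("-resume")  # optional, shown as [-resume] above
    command += ["-c", run_config, "-params-file", params_file]
    return command


def as_shell_string(command):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(command)
```

The returned list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems with paths that contain spaces.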

Input files

Input files must be in FASTQ format. They can be uncompressed (not recommended) or compressed with gzip or bzip2. Sample data must be arranged in one directory per sample.
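
To check that an input directory follows the one-directory-per-sample layout before starting a run, a sketch like the following can be used. It is not part of the workflow: `collect_samples` is a hypothetical helper, and the list of accepted file extensions is an assumption, since the workflow's own list may differ.

```python
from pathlib import Path

# Assumed FASTQ extensions: plain, gzip-compressed, or bzip2-compressed.
FASTQ_SUFFIXES = (
    ".fastq", ".fq",
    ".fastq.gz", ".fq.gz",
    ".fastq.bz2", ".fq.bz2",
)


def collect_samples(input_dir):
    """Map each sample directory name to its sorted FASTQ file names.

    Files placed directly in `input_dir` are ignored, as are
    subdirectories that contain no FASTQ files.
    """
    samples = {}
    for sample_dir in sorted(Path(input_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        files = sorted(
            p.name
            for p in sample_dir.iterdir()
            if p.is_file() and p.name.endswith(FASTQ_SUFFIXES)
        )
        if files:
            samples[sample_dir.name] = files
    return samples
```

An empty result, or a sample missing from the returned dictionary, usually means the files sit one level too high or too deep in the directory tree.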

Per-sample input directories

All files in a sample directory are associated with the name of the sample folder. Paired-end mate files need to have matching prefixes. Mates 1 and 2 can be specified with the suffixes _[12], _R[12], .[12], or .R[12]. Lane IDs and other read ID modifiers have to precede the mate identifier. Files whose names contain none of these patterns are treated as single-end. Samples that contain both single-end and paired-end files are assumed to be paired-end, with all single-end files treated as orphans (reads whose mate did not survive quality control).
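
The naming rules above can be sketched as a small classifier. This is one reading of the rules, not the workflow's actual implementation: the regular expressions, the accepted file extensions, and the `classify` and `summarize_sample` helpers are all assumptions made for illustration.

```python
import re

# Assumed FASTQ extensions; stripped before looking for the mate identifier.
FASTQ_EXT = re.compile(r"\.(fastq|fq)(\.(gz|bz2))?$")
# Mate identifier at the very end of the stem: _1, _2, _R1, _R2, .1, .2, .R1, .R2
MATE_SUFFIX = re.compile(r"^(?P<prefix>.+)[._]R?(?P<mate>[12])$")


def classify(filename):
    """Return (prefix, mate); mate is "1", "2", or None for single-end."""
    stem = FASTQ_EXT.sub("", filename)
    match = MATE_SUFFIX.match(stem)
    if match:
        return match.group("prefix"), match.group("mate")
    return stem, None


def summarize_sample(filenames):
    """Group one sample's files into mates, orphans, or single-end reads."""
    r1, r2, unpaired = [], [], []
    for name in sorted(filenames):
        _, mate = classify(name)
        {"1": r1, "2": r2}.get(mate, unpaired).append(name)
    paired = bool(r1 or r2)
    return {
        "layout": "paired" if paired else "single",
        "r1": r1,
        "r2": r2,
        # In a paired-end sample, files without a mate identifier are orphans.
        "orphans": unpaired if paired else [],
        "single": [] if paired else unpaired,
    }
```

Under this reading, `sampleA_L001_R1.fastq.gz` is mate 1 with prefix `sampleA_L001`, while `sampleA_R1_001.fastq.gz` is treated as single-end because the `_001` modifier follows the mate identifier instead of preceding it.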