This beta version of the pipeline is under active development. Please only use it for testing until validation is complete!

ncov2019-artic-nf

A Nextflow pipeline for bioinformatic processing of SARS-CoV-2 sequencing data. It uses the ARTIC network's fieldbioinformatics tools (https://github.com/artic-network/fieldbioinformatics) for ONT data and a separate workflow for Illumina data, which includes the improved consensus-generation workflow designed by Jared Simpson and align_trim from the ARTIC fieldbioinformatics pipeline (written by Nick Loman and Will Rowe, and adapted for use with Illumina data by Sam Wilkinson). The pipeline was originally designed and written by Matt Bull; it is now maintained by Sam Wilkinson (@BioWilko).

Introduction

This Nextflow pipeline originally aimed to automate the ARTIC network nCoV-2019 novel coronavirus bioinformatics protocol; its scope has since expanded to include the processing of Illumina data to assist our short-read-centric colleagues. It is being developed to aid the harmonisation of the analysis of sequencing data generated by the COG-UK project. It turns SARS-CoV-2 sequencing data (Illumina or Nanopore) into consensus sequences and provides other helpful outputs to assist the project's sequencing centres with submitting data.

Quick-start

As the conda environments involved in this pipeline have become more complicated, conda has become dramatically slower at resolving them, so we have implemented a Mamba profile. Mamba is a re-implementation of Conda written in C++; it is available at the Mamba GitHub and can be installed into your base conda environment with conda install mamba -n base -c conda-forge, after which you can use mamba as a drop-in replacement for conda (e.g. mamba install -c bioconda artic). To make use of mamba support in this pipeline, Nextflow version >=21.10.0 should be installed; on versions prior to 21.10.0, -profile mamba will default to conda.
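
For example, a typical one-time setup and launch might look like this (the Illumina flags shown are just the ones from the Quick-start below):

# One-time: install Mamba into the base conda environment
conda install mamba -n base -c conda-forge
# Launch the pipeline with the mamba profile (requires Nextflow >= 21.10.0)
nextflow run BioWilko/ncov2019-artic-nf -profile mamba --illumina --prefix "output_file_prefix" --directory /path/to/reads --schemeVersion "ARTIC primer scheme version"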

Illumina

nextflow run BioWilko/ncov2019-artic-nf [-profile mamba,conda,singularity,docker,slurm,lsf] --illumina --prefix "output_file_prefix" --directory /path/to/reads --schemeVersion "ARTIC primer scheme version"

You can use CRAM file input by passing the --cram flag, and request CRAM file output by passing the --outCram flag.
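
For example (a sketch; this assumes --directory is also used to locate the input CRAM files):

# Read CRAM input and write CRAM output alongside the usual results
nextflow run BioWilko/ncov2019-artic-nf -profile conda --illumina --cram --outCram --prefix "output_file_prefix" --directory /path/to/input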

To remain on a fixed revision of the primer scheme repository over time, you can skip the cloning step by passing --schemeRepoURL file:///path/to/own/clone/of/github.com/artic-network/primer-schemes.
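
For example, to pin the schemes to a local clone (a sketch; the local path is a placeholder):

# Clone the primer scheme repository once, at a revision of your choosing
git clone https://github.com/artic-network/primer-schemes.git /data/primer-schemes
# Point the pipeline at the local clone instead of letting it clone afresh
nextflow run BioWilko/ncov2019-artic-nf -profile conda --illumina --prefix "output_file_prefix" --directory /path/to/reads --schemeRepoURL file:///data/primer-schemes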

If you wish to use a custom primer scheme, you can use the --bed and --ref flags (which must be provided together) to supply a custom primer scheme bed file and reference fasta.
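
For example (file paths are placeholders):

# Use a custom primer scheme: --bed and --ref must be provided together
nextflow run BioWilko/ncov2019-artic-nf -profile conda --illumina --prefix "output_file_prefix" --directory /path/to/reads --bed /path/to/scheme.bed --ref /path/to/reference.fasta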

Nanopore

Nanopolish

nextflow run BioWilko/ncov2019-artic-nf [-profile mamba,conda,singularity,docker,slurm,lsf] --nanopolish --prefix "output_file_prefix" --basecalled_fastq /path/to/directory --fast5_pass /path/to/directory --sequencing_summary /path/to/sequencing_summary.txt

Medaka

nextflow run BioWilko/ncov2019-artic-nf [-profile mamba,conda,singularity,docker,slurm,lsf] --medaka --medaka_model "medaka-model" --prefix "output_file_prefix" --basecalled_fastq /path/to/directory

Installation

An up-to-date version of Nextflow is required because the pipeline is written in DSL2 and makes use of Mamba compatibility. Following the instructions at https://www.nextflow.io/ to download and install Nextflow should get you a recent enough version.
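
For reference, the standard installation from nextflow.io is:

# Download the nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash
# Make it executable and put it somewhere on your PATH
chmod +x nextflow
mv nextflow ~/bin/  # or another directory on your PATH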

Containers

This repo contains both Singularity and Docker build files. You can build the Singularity containers locally by running scripts/build_singularity_containers.sh and use them with -profile singularity.
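
For example:

# Build the Singularity containers locally, then run with the singularity profile
bash scripts/build_singularity_containers.sh
nextflow run BioWilko/ncov2019-artic-nf -profile singularity --illumina --prefix "output_file_prefix" --directory /path/to/reads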

Conda

The repo contains environment.yml files which automatically build the correct conda environment if -profile mamba or -profile conda is specified on the command line. Although you'll need conda or mamba installed, this is probably the easiest way to run this pipeline.

--cache /some/dir can be specified to give a fixed, shared location to store the conda environments for use by multiple runs of the workflow.
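
For example, to share one environment cache across runs (the cache path is a placeholder):

# Build or reuse conda environments in a fixed, shared location
nextflow run BioWilko/ncov2019-artic-nf -profile conda --cache /shared/conda-cache --illumina --prefix "output_file_prefix" --directory /path/to/reads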

Executors

By default, the pipeline runs on the local machine. You can specify -profile slurm to use a SLURM cluster, or -profile lsf to use an LSF cluster. In either case you may also need to use one of the COG-UK institutional config profiles (phw or sanger), or provide queue names in your own config file.
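
If you need to supply your own queue names, a minimal custom config might look like this (a sketch, not one of the shipped profiles; the queue name is a placeholder, and the file is passed to Nextflow with -c):

// custom.config: submit all processes to a SLURM queue of your choosing
process {
    executor = 'slurm'
    queue    = 'compute'  // replace with a partition available on your cluster
}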

Profiles

You can use multiple profiles at once, separating them with a comma; this is described in the Nextflow documentation.
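
For example, combining the mamba software profile with the SLURM executor profile:

nextflow run BioWilko/ncov2019-artic-nf -profile mamba,slurm --illumina --prefix "output_file_prefix" --directory /path/to/reads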

Config

Common configuration options are set in conf/base.config. Workflow-specific configuration options are set in conf/nanopore.config and conf/illumina.config. They are described there and set to sensible defaults (as suggested in the nCoV-2019 novel coronavirus bioinformatics protocol).

Options

- --outdir sets the output directory.
- --bwa swaps to bwa for mapping (Nanopore workflow only); see the example below.
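
For example, a Medaka run writing to a custom output directory and mapping with bwa:

nextflow run BioWilko/ncov2019-artic-nf -profile conda --medaka --medaka_model "medaka-model" --prefix "output_file_prefix" --basecalled_fastq /path/to/directory --outdir /path/to/results --bwa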

Workflows

Nanopore

Use --nanopolish or --medaka to run these workflows. --basecalled_fastq should point to a directory created by guppy_basecaller (if you ran with no barcodes), or guppy_barcoder (if you ran with barcodes). It is imperative that the following guppy_barcoder command be used for demultiplexing:

guppy_barcoder --require_barcodes_both_ends -i run_name -s output_directory --arrangements_files "barcode_arrs_nb12.cfg barcode_arrs_nb24.cfg"

If basecalled reads have already been quality filtered, the --skip_quality_check flag may be provided so that artic guppyplex does not filter them again.

If your fast5 files are compressed with vbz compression (now the default in MinKNOW), the environment variable HDF5_PLUGIN_PATH should be set within the environments/nanopore/environment.yml file to the directory containing your copy of the plugin libvbz_hdf_plugin.so.
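
A sketch of the relevant fragment of environments/nanopore/environment.yml (the plugin directory is a placeholder; the variables: section requires a reasonably recent conda, >= 4.9):

# environments/nanopore/environment.yml (fragment)
variables:
  # directory containing libvbz_hdf_plugin.so
  HDF5_PLUGIN_PATH: /path/to/plugin/directory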

Illumina

Briefly, the Illumina workflow can be summarised as: Minimap2 -> align_trim -> FreeBayes -> Bcftools. Use --illumina to run the Illumina workflow, and --directory to point to an Illumina output directory, usually named something like <date>_<machine_id>_<run_no>_<some_zeros>_<flowcell>. The workflow will recursively grab all fastq files under this directory, so be sure that what you want is in there, and what you don't, isn't!
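
Because the workflow collects fastq files recursively, it can be worth checking what it will pick up before launching:

# List every fastq under the run directory before handing it to --directory
find /path/to/run_directory -name "*.fastq*"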

Important config options are:

| Option | Description |
| --- | --- |
| MinReadLen | Minimum read length (before primer trimming) to keep (default: 105) |
| MaxReadLen | Maximum read length (before primer trimming) to keep (default: 500) |
| MinDepth | Minimum coverage depth for FreeBayes (default: 10) |
| FreqThreshold | Minimum base proportion at a position for it to be called unambiguously (default: 0.75) |
| MinFreqThreshold | Variant bases with proportions lower than this will be discarded (default: 0.25) |
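
These can be overridden without editing the shipped configs by passing your own file with -c (a minimal sketch; this assumes the options are exposed as Nextflow params under the names in the table above, and the values shown are arbitrary):

// thresholds.config
params {
    MinReadLen = 120  // drop reads shorter than 120 bp before primer trimming
    MinDepth   = 20   // require 20x depth for FreeBayes calls
}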

QC

A script to do some basic COG-UK QC is provided in bin/qc.py. This currently tests whether >50% of reference bases are covered by >10 reads (Illumina) or >20 reads (Nanopore), OR whether there is a stretch of more than 10 kb of sequence without an N, setting qc_pass in <outdir>/<prefix>.qc.csv to TRUE. bin/qc.py can be extended to incorporate any QC test, as long as the script outputs a csv file with a "qc_pass" final column marking samples TRUE or FALSE.
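
For illustration, an extended qc.csv might look like this (every column except the final qc_pass is hypothetical):

sample_name,pct_covered_bases,longest_no_N_run,qc_pass
sample01,96.4,24812,TRUE
sample02,35.7,4821,FALSE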

Output

A subdirectory for each process in the workflow is created in --outdir. A qc_pass_climb_upload subdirectory containing files important for COG-UK is created.