Nextflow pipeline to download and process microbial RNA-seq data from NCBI SRA
- Create the environment with all the requirements using the `nextflow_environment.yaml` file: `conda env create -f nextflow_environment.yml --name nextflow`
- Install Docker
- Prepare the metadata file for your dataset. Use the download metadata script to get all metadata for a specified organism. To append local data, add new rows to the TSV file and fill out the following columns:
  - `Experiment`: For public data, this is your SRX ID. For local data, use a standardized ID (e.g. `ecoli_0001`).
  - `LibraryLayout`: Either `PAIRED` or `SINGLE`.
  - `Platform`: Usually `ILLUMINA`, `ABI_SOLID`, `BGISEQ`, or `PACBIO_SMRT`.
  - `Run`: One or more SRR numbers referring to individual lanes from a sequencer. This field is empty for local data.
  - `R1`: For local data, the complete path to the R1 file. If files are stored on AWS S3, filenames should look like `s3://<bucket/path/to>.fastq.gz`. The `R1` and `R2` columns are empty for public SRA data.
  - `R2`: Same as `R1`. This is empty for `SINGLE`-end sequences.
- Convert the tab-separated metadata file (.tsv) to a .txt file
- Download your sequence files:
  - Download FASTA and GFF3 files for your genome and plasmids (if relevant) from NCBI.
  - Put these in a folder named `sequence_files`, and make sure that this folder only contains files for one organism.
  - Rename the genome files to `genome.fasta` and `genome.gff3`.
  - Rename plasmid files to `plasmid_<name>.fasta` and `plasmid_<name>.gff3`.
- [Optional] Update the following fields in `conf/user.conf`. These can also be entered on the command line:
  - `params.organism`: Name of your organism, including strain information if relevant
  - `params.metadata`: File path for your metadata file
  - `params.sequence_dir`: Location of FASTA/GFF3 files
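The column rules for the metadata file can be checked before launching the pipeline. The following is a minimal sketch and not part of the pipeline itself: the helper name `check_metadata_rows`, the treatment of a row with an empty `Run` field as local data, and the placeholder IDs and paths in the example are all assumptions based on the column descriptions above.

```python
import csv
import io

# Columns described in the setup steps; real metadata files may contain more.
REQUIRED_COLUMNS = ["Experiment", "LibraryLayout", "Platform", "Run", "R1", "R2"]


def check_metadata_rows(handle):
    """Return a list of problems found in a tab-separated metadata file.

    Assumption: a row whose Run field is empty describes local data.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = row["Experiment"] or f"line {line_no}"
        layout = (row["LibraryLayout"] or "").strip()
        run = (row["Run"] or "").strip()
        r1 = (row["R1"] or "").strip()
        r2 = (row["R2"] or "").strip()

        if layout not in ("PAIRED", "SINGLE"):
            problems.append(f"{name}: LibraryLayout must be PAIRED or SINGLE")
        if run:
            # Public SRA data: R1 and R2 stay empty.
            if r1 or r2:
                problems.append(f"{name}: R1/R2 should be empty when Run is set")
        else:
            # Local data: R1 is required, R2 only for paired-end reads.
            if not r1:
                problems.append(f"{name}: local data needs a path in R1")
            if layout == "PAIRED" and not r2:
                problems.append(f"{name}: PAIRED local data needs a path in R2")
            if layout == "SINGLE" and r2:
                problems.append(f"{name}: R2 should be empty for SINGLE layout")
    return problems


# Placeholder rows for illustration only (IDs and paths are made up).
example = (
    "Experiment\tLibraryLayout\tPlatform\tRun\tR1\tR2\n"
    "SRX0000001\tPAIRED\tILLUMINA\tSRR0000001\t\t\n"
    "ecoli_0001\tPAIRED\tILLUMINA\t\t/data/ecoli_0001_R1.fastq.gz\t\n"
)
for problem in check_metadata_rows(io.StringIO(example)):
    print(problem)
```

In this example the second row is reported because it is a paired-end local sample without an `R2` path. For a file on disk, pass an open file handle, e.g. `open(path, newline="")`, instead of the `io.StringIO` object.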
$ nextflow run main.nf --help
N E X T F L O W ~ version 20.01.0
Launching `main.nf` [special_dalembert] - revision: ef90b5fca3
Usage:
nextflow run main.nf -profile [PROFILE] [ARGS]
Required Arguments:
-profile Executor profile name (e.g. local)
--organism Name of organism
--metadata Path to metadata file
--sequence_dir Directory containing *.fasta and *.gff3 files
Optional Arguments:
--outdir Directory to place outputs
--force Overwrite existing processed data
- Go through the steps described in Setup
- Run Nextflow: `nextflow run main.nf -profile local [ARGS]`

  Example: `nextflow run main.nf -profile local --organism bacillus_subtilis --metadata ../test/test_metadata.tsv --sequence_dir ../test/sequence_files/ --outdir ../test/nf_results/`
- Once it has finished running, you may delete the `work` folder in this root directory to save space.
- Go through the steps described in Setup
- Create a new config file for your cloud service/HPC scheduler (see Nextflow executors)
- Add a new profile in the `nextflow.config` file.
- Run `nextflow run main.nf -profile [NEW PROFILE] [ARGS]`
If you get the error `Process requirement exceed available CPUs` or `Process requirements exceed available memory` when using `-profile local`, edit `conf/local.config` and lower the CPU and memory requirements so that they fit within the resources of your local computer.
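Before editing `conf/local.config`, it can help to see what the machine actually offers. The sketch below uses only the Python standard library. `local_resources` is a hypothetical helper, and the memory lookup relies on `os.sysconf`, which is not available on every platform, so memory is reported as unknown where the lookup fails.

```python
import os


def local_resources():
    """Return (cpu_count, total_memory_gb); memory is None when it cannot be determined."""
    cpus = os.cpu_count() or 1
    mem_gb = None
    try:
        # These sysconf names exist on Linux and some other Unix systems.
        page_size = os.sysconf("SC_PAGE_SIZE")
        pages = os.sysconf("SC_PHYS_PAGES")
        if page_size > 0 and pages > 0:
            mem_gb = page_size * pages / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        pass  # os.sysconf is missing (e.g. Windows) or the name is unsupported
    return cpus, mem_gb


cpus, mem_gb = local_resources()
print(f"CPUs available: {cpus}")
print("Memory available: " + (f"{mem_gb:.1f} GB" if mem_gb is not None else "unknown"))
```

Any CPU or memory request in `conf/local.config` should stay at or below the reported values.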
If you get the error `Cannot invoke method split() on null object`, this means you are missing the `R1` and `R2` columns from your metadata file.
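One way to repair such a file is to add the missing columns and leave them empty, which matches the earlier description of `R1` and `R2` for public SRA data. The following is a sketch under that assumption; `add_read_columns` is a hypothetical helper and does not come from the pipeline.

```python
import csv
import io


def add_read_columns(tsv_text):
    """Return tab-separated text with empty R1 and R2 columns appended if absent."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    header = rows[0]
    to_add = [col for col in ("R1", "R2") if col not in header]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header + to_add)
    for row in rows[1:]:
        if not row:
            continue  # skip blank lines
        # Pad short rows so every row matches the original header width.
        padded = row + [""] * (len(header) - len(row))
        writer.writerow(padded + [""] * len(to_add))
    return out.getvalue()


# Placeholder content for illustration only.
fixed = add_read_columns("Experiment\tRun\nSRX0000001\tSRR0000001\n")
print(fixed)
```

A file that already has both columns is returned with its content unchanged, so the helper can be run on any metadata file before starting the pipeline.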
Other pipelines can also be used for alignment and quantification of RNA-seq data; nf-core is one example. For eukaryotic organisms, an alternative pipeline is recommended.