
Process Raw Data

Nextflow pipeline to download and process microbial RNA-seq data from NCBI SRA

Setup

Create the environment with all the requirements using the nextflow_environment.yml file:
1. conda env create -f nextflow_environment.yml --name nextflow
Install Docker
Prepare the metadata file for your dataset. Use the download metadata script to get all metadata for a specified organism. To append local data, add new rows to the TSV file and fill in the following columns:
1. Experiment: For public data, this is your SRX ID. For local data, use a standardized ID (e.g. ecoli_0001)
2. LibraryLayout: Either PAIRED or SINGLE
3. Platform: Usually ILLUMINA, ABI_SOLID, BGISEQ, or PACBIO_SMRT
4. Run: One or more SRR numbers referring to individual lanes from a sequencer. This field is empty for local data.
5. R1: For local data, the complete path to the R1 file. If files are stored on AWS S3, filenames should look like s3://<bucket/path/to>.fastq.gz. R1 and R2 columns are empty for public SRA data.
6. R2: Same as R1. This will be empty for SINGLE-end sequences.
7. Convert the tab-separated metadata file (.tsv) to a .txt file
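The column rules above can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name check_metadata is hypothetical, and it assumes that a row with an empty Run field is local data and therefore needs R1 (and R2 when the layout is PAIRED).

```sh
# check_metadata FILE
# Hypothetical helper: prints one message per problem and returns non-zero
# if the tab-separated metadata file breaks the column rules listed above.
check_metadata() (
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == 1 {
            # Map column names to positions using the header row.
            for (i = 1; i <= NF; i++) col[$i] = i
            n = split("Experiment LibraryLayout Platform Run R1 R2", need, " ")
            for (j = 1; j <= n; j++)
                if (!(need[j] in col)) { print "missing column: " need[j]; bad = 1 }
            if (bad) exit
            next
        }
        {
            layout = $(col["LibraryLayout"])
            if (layout != "PAIRED" && layout != "SINGLE") {
                print "line " NR ": LibraryLayout must be PAIRED or SINGLE"; bad = 1
            }
            # Assumption: an empty Run field marks a row as local data.
            if ($(col["Run"]) == "") {
                if ($(col["R1"]) == "") {
                    print "line " NR ": local row needs R1"; bad = 1
                }
                if (layout == "PAIRED" && $(col["R2"]) == "") {
                    print "line " NR ": PAIRED local row needs R2"; bad = 1
                }
            }
        }
        END { exit bad }
    ' "$1"
)
```

Calling check_metadata with the path to your metadata file prints nothing and returns 0 when every row passes. Extra columns in the file are ignored.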
Download your sequence files:
1. Download FASTA and GFF3 files for your genome and plasmids (if relevant) from NCBI.
2. Put these in a folder named sequence_files, and make sure that this folder only contains files for one organism.
3. Rename the genome files to genome.fasta and genome.gff3.
4. Rename plasmid files to plasmid_<name>.fasta and plasmid_<name>.gff3.
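The naming scheme above is easy to get wrong by one character, so a quick check of the folder can help. This is a sketch under stated assumptions: the function name check_sequence_dir is hypothetical, and it assumes that every plasmid FASTA file should have a GFF3 file with the same name (and vice versa) and that any other file in the folder is unintended.

```sh
# check_sequence_dir DIR
# Hypothetical helper: prints one message per problem and returns non-zero
# if DIR does not follow the genome/plasmid naming scheme described above.
check_sequence_dir() (
    dir=$1
    status=0
    # The genome files must be present under their fixed names.
    for f in genome.fasta genome.gff3; do
        [ -f "$dir/$f" ] || { echo "missing: $f"; status=1; }
    done
    # Assumption: every other file is a plasmid FASTA/GFF3 with a partner file.
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        name=${path##*/}
        case $name in
            genome.fasta|genome.gff3) ;;
            plasmid_*.fasta)
                [ -f "${path%.fasta}.gff3" ] || { echo "no GFF3 for: $name"; status=1; } ;;
            plasmid_*.gff3)
                [ -f "${path%.gff3}.fasta" ] || { echo "no FASTA for: $name"; status=1; } ;;
            *)
                echo "unexpected file: $name"; status=1 ;;
        esac
    done
    exit "$status"
)
```

Running check_sequence_dir sequence_files prints nothing and returns 0 when the folder is laid out as expected.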
[Optional] Update the following fields in conf/user.conf. These can also be passed on the command line:
1. params.organism: Name of your organism, including strain information if relevant
2. params.metadata: File path for your metadata file
3. params.sequence_dir: Location of FASTA/GFF3 files

Run Nextflow

Nextflow usage

$ nextflow run main.nf --help

N E X T F L O W  ~  version 20.01.0
Launching `main.nf` [special_dalembert] - revision: ef90b5fca3
Usage:

nextflow run main.nf -profile [PROFILE] [ARGS]

Required Arguments:
  -profile              Executor profile name (e.g. local)
  --organism            Name of organism
  --metadata            Path to metadata file
  --sequence_dir        Directory containing *.fasta and *.gff3 files

Optional Arguments:
  --outdir              Directory to place outputs
  --force               Overwrite existing processed data

Running Nextflow locally

Go through the steps described in Setup
Run Nextflow: nextflow run main.nf -profile local [ARGS]

Example:

nextflow run main.nf -profile local --organism bacillus_subtilis --metadata ../test/test_metadata.tsv --sequence_dir ../test/sequence_files/ --outdir ../test/nf_results/

Once it's finished running, you may delete the work folder in this root directory to save space.

Running Nextflow on cloud or high-performance computing

Go through the steps described in Setup
Create a new config file for your cloud service/HPC scheduler (see Nextflow executors)
Add a new profile in the nextflow.config file.
Run Nextflow: nextflow run main.nf -profile [NEW PROFILE] [ARGS]

Common errors

Exceeding requirements

If you get the error Process requirement exceed available CPUs or Process requirements exceed available memory when using -profile local, edit conf/local.config and lower the CPU and memory requirements so that they fit within your local machine's resources.
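To pick a CPU value that fits, it helps to know how many CPUs your machine reports. The sketch below covers CPUs only, not memory, and the function names available_cpus and fits_cpus are hypothetical. It relies on nproc, falling back to getconf _NPROCESSORS_ONLN on systems without it.

```sh
# available_cpus: print the number of CPUs visible to this shell.
available_cpus() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# fits_cpus N: succeed if a process that requests N CPUs can run locally.
fits_cpus() {
    [ "$1" -le "$(available_cpus)" ]
}
```

For example, fits_cpus 8 || echo "lower the CPU setting" warns you when a request for 8 CPUs is more than the machine offers.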

Missing R1/R2 columns

If you get the error Cannot invoke method split() on null object, your metadata file is missing the R1 and R2 columns. Both columns must be present in the file, even when they are empty, as they are for public SRA data.
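If your metadata file contains only public SRA data, one way to repair it is to append empty R1 and R2 columns. This is a minimal sketch rather than an official fix: the function name add_read_columns is hypothetical, and it assumes the file is tab-separated with a header row. It writes the result to standard output and leaves the input file untouched.

```sh
# add_read_columns FILE
# Hypothetical helper: prints FILE with R1 and R2 columns appended
# (header names on the first line, empty values on the rest) when missing.
add_read_columns() (
    awk -F '\t' -v OFS='\t' '
        NR == 1 {
            # Look for existing R1/R2 columns in the header row.
            has_r1 = 0; has_r2 = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "R1") has_r1 = 1
                if ($i == "R2") has_r2 = 1
            }
        }
        {
            line = $0
            if (!has_r1) line = line OFS (NR == 1 ? "R1" : "")
            if (!has_r2) line = line OFS (NR == 1 ? "R2" : "")
            print line
        }
    ' "$1"
)
```

Redirect the output to a new file, check it, and then pass that file to --metadata.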

Pipeline Alternatives

Use of other pipelines

Other pipelines can be used to align and quantify RNA-seq data, for example those from nf-core. For eukaryotic organisms, we recommend using one of these alternative pipelines.
