diff --git a/docs/lesson03-qc-pre-processing/02-QC-raw-reads.qmd b/docs/lesson03-qc-pre-processing/02-QC-raw-reads.qmd
index bf8aac3..69eec62 100644
--- a/docs/lesson03-qc-pre-processing/02-QC-raw-reads.qmd
+++ b/docs/lesson03-qc-pre-processing/02-QC-raw-reads.qmd
@@ -1,5 +1,5 @@
---
-title: "Quality Control of Raw Reads"
+title: "Quality of Raw Reads"
---

## Getting started
@@ -44,9 +44,7 @@ Keep this in mind as we continue to navigate the file system, and don't hesitate

Before we can start working properly with our data we need to do some quality control to make sure our data is as high quality as possible. There are lots of ways errors could be introduced during the sampling and sequencing process, such as RNA becoming degraded or bases being called incorrectly.

-Quality control (QC) has two main steps:
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to examine the quality of the reads
-- [Cutadapt](https://cutadapt.readthedocs.io/en/stable/index.html) to trim off low quality bases and remove adapter sequences
+Quality control (QC) has two main steps:
+
+- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to examine the quality of the reads
+- [Cutadapt](https://cutadapt.readthedocs.io/en/stable/index.html) to trim off low quality bases and remove adapter sequences

## Quality control using FastQC

@@ -97,7 +95,7 @@ Create the directories `results` inside `cs_course`:
mkdir -p cs_course/results
```

-You might want to use `ls` to check those nested directories have been made.
+You might want to use `ls` to check those directories have been made.

We are going to have lots of outputs so it makes sense to organise them with some more subdirectories. Let's make one called `qc` (for **q**uality **c**ontrol).

@@ -365,7 +363,7 @@ In SRR6820491_1 (the forward reads) the overall quality is fairly high, although

In SRR6820491_2 (reverse reads) the quality is much more variable though the mean quality remains high. Some reads are dipping into the medium and low quality sections towards the end.

-We definitely want to do some trimming to make sure the lower quality bases are removed from our reads. First though, we should also have a look at the "Adapter Content" graph which will show us where adapter sequences occur in the reads. 
+We definitely want to do some trimming to make sure the lower quality bases are removed from our reads. First though, we should also have a look at the "Adapter Content" graph which will show us where adapter sequences occur in the reads.

Adapter sequences are short sequences that are added to the sample to aid during the preparation of the DNA library. They therefore don't tell us anything biologically important and should be removed if they are present in high numbers. They might also be removed in the case of certain applications, such as when the base sequence needs to be particularly accurate.

@@ -373,7 +371,6 @@ Adapter sequences are short sequences that are added to the sample to aid during
|:----------------------------------:|:----------------------------------:|
| ![.](/images/SRR6820491_1_adapter.png){width="700"} | ![.](/images/SRR6820491_2_adapter.png){width="700"} |

-
These graphs show us that a high proportion of our reads contain adapters, so we'll need to trim these out too. Helpfully, FastQC has identified that the adapters present are Illumina Universal adapters, so we can easily find out what the sequence to trim out will be.
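+Before we move on to trimming, it can be reassuring to cross-check what FastQC is telling us. One rough way to do this (an optional sanity check, not part of the main workflow) is to count how many lines in the raw forward reads contain the start of the Illumina Universal adapter sequence, `AGATCGGAAGAG`. The count is only approximate because it includes any line containing that motif, but a large number supports what the "Adapter Content" graph shows. Run it from the `cs_course` directory:
+
+``` {.bash filename="Code"}
+grep -c "AGATCGGAAGAG" data/raw_reads/SRR6820491_1.fastq
+```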
## Trimming reads
@@ -406,14 +403,11 @@ For paired-end reads:

This output shows us that there is a different usage for single end or paired end reads.

For single ended data we need to use:
-- `-a` to specify the sequence of the adaptor to be removed
-- `-o` to specify the name of the trimmed output
-- the name of the raw sequence we want to be trimmed.
+
+- `-a` to specify the sequence of the adaptor to be removed
+- `-o` to specify the name of the trimmed output
+- the name of the raw sequence we want to be trimmed.

-For paired end data, Cutadapt expects:
-- two adaptor sequences for trimming, specified by `-a` and `-A`
-- two (different) names for the trimmed outputs, specified by `-o` and `-p`
-- the two files to be trimmed.
+For paired end data, Cutadapt expects:
+
+- two adaptor sequences for trimming, specified by `-a` and `-A`
+- two (different) names for the trimmed outputs, specified by `-o` and `-p`
+- the two files to be trimmed.

There are lots of other options to add too, but these are the compulsory ones.

@@ -441,25 +435,25 @@ We should also make a directory to keep our Cutadapt log file in, in case we nee
mkdir results/cutadapt
```

-
Now we can start constructing our command. Here are the flags and options we'll be using:

-| flag/option | meaning |
-| ------- | ---------- |
-|`-q 20`| Bases need to have a quality phred score of at least 20 to be maintained |
-| `-a AGATCGGAAGAG` | The adapter sequence to trim from the forward reads, based on Illumina's Universal Adapters|
-| `-A AGATCGGAAGAG` | The sequence to trim from the reverse reads, also based on Illumina's Universal Adapters |
-| `-o ` | The name of the output of the trimmed forward reads |
-| `-p ` | The name of the output of the trimmed reverse reads |
-| `-m 150` | Remove reads shorter than 150bp after trimmming has occured |
-| `-j 8` | Run the process on 8 threads (to speed up the process) |
+| flag/option | meaning |
+|------------------------------|------------------------------------------|
+| `-q 20` | Bases need to have a quality phred score of at least 20 to be maintained |
+| `-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA` | The adapter sequence to trim from the forward reads, based on Illumina's Universal Adapters |
+| `-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT` | The sequence to trim from the reverse reads, also based on Illumina's Universal Adapters |
+| `-o ` | The name of the output of the trimmed forward reads |
+| `-p ` | The name of the output of the trimmed reverse reads |
+| `-m 150` | Remove reads shorter than 150bp after trimming has occurred |
+| `-j 8` | Run the process on 8 threads (to speed up the process) |

By default, Cutadapt will print updates on its progress to the console (like we saw with FastQC). It will also print a summary of how many reads were trimmed or kept, and other useful information about its process. Ideally we want to keep a copy of this summary so we can refer to it in future.

-We'll therefore add an instruction to the end of the command telling Cutadapt to 'redirect' its console output to a file. The way to do this is `&> `. This will ensure we keep a good record of what we've been doing. Be careful though! If you get an error in your command you'll have to look inside the log to find out what the error is, as it won't get printed to the console like it usually would.
+We'll therefore add an instruction to the end of the command telling Cutadapt to 'redirect' its console output to a file. The way to do this is to add `&>` followed by the name of the file we want the output saved in. This will ensure we keep a good record of what we've been doing. 
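+In general the pattern looks like this (an illustrative sketch only - `some_command` and `name_of_log_file.txt` are placeholders, not things to actually run):
+
+``` {.bash filename="Code"}
+some_command &> name_of_log_file.txt
+```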
Be careful though! If you have an error in your command you'll have to look inside the log to find out what the error is, as it won't get printed to the console like it usually would. ::: callout-note ## Multi-line commands + This command is quite long! When typing a long command into your terminal, you can use the `\` character to separate code chunks onto separate lines. This can make your code more readable. ::: @@ -467,8 +461,8 @@ Here's what our final command looks like: ``` {.bash filename="Code"} cutadapt -q 20 \ - -a AGATCGGAAGAG \ - -A AGATCGGAAGAG \ + -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \ + -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \ -o data/trimmed_reads/SRR6820491_1.trim.fastq \ -p data/trimmed_reads/SRR6820491_2.trim.fastq \ -m 150 \ @@ -476,15 +470,26 @@ cutadapt -q 20 \ data/raw_reads/SRR6820491_1.fastq \ data/raw_reads/SRR6820491_2.fastq \ &> results/cutadapt/SRR6820491_fastq.log - ``` +::: callout-note +## Reminder: pasting in the command line + +The adapter sequences we provide are very long, so you may want to copy and paste these sequences from this page into your command. However, you cannot paste in the shell using Ctrl + V like you can usually. Instead you have two options: + +- right click and select `paste` +- hover the mouse pointer over the terminal window and press the mouse middle button + +The same is true for copying in the shell - Ctrl + C will not work. Instead, you will need to highlight whatever you want to copy, right click and select `copy`. + +You can copy and paste the full command if you wish, but we recommend typing it out yourself to help you remember what each of the flags does. +::: When you have submitted the command, your prompt will disappear for a couple of minutes. When it comes back, your command has finished running. Let's open up our log file and see the summary. -```{.bash filename="Code"} +``` {.bash filename="Code"} less results/cutadapt/SRR6820491_fastq.log ``` @@ -515,16 +520,17 @@ Total written (filtered): 2,470,608,528 bp (58.0%) ::: callout-note ## Exercise -Use the output from your Cutadapt command to answer the -following questions. -1. How many reads had adapters in the R1 reads? -2. What total percent of reads were trimmed/filtered? +Use the output from your Cutadapt command to answer the following questions. + +1. How many reads had adapters in the R1 reads? +2. What total percent of reads were trimmed/filtered? -::: {.callout-note collapse="true"} +::: {.callout-note collapse="true"} ## Solution -1. 4,874,180 (34.6%) -2. 58% + +1. 4,874,180 (34.6%) +2. 58% ::: ::: @@ -544,8 +550,7 @@ Great! Now we have some high quality samples to work with as we move onto our ne ::: callout-note ## Bonus Exercise -Now that our samples have gone through quality control, they should perform better on the quality tests run by FastQC. Go ahead and re-run FastQC on your trimmed FASTQ files and visualise the HTML files to see whether your per base sequence quality is higher after -trimming. +Now that our samples have gone through quality control, they should perform better on the quality tests run by FastQC. Go ahead and re-run FastQC on your trimmed FASTQ files and visualise the HTML files to see whether your per base sequence quality is higher after trimming. ::: {.callout-note collapse="true"} ## Solution @@ -574,4 +579,3 @@ Then take a look at the html files in your browser. After trimming and filtering, our overall quality is higher with no quality scores dipping into the red (low quality) zone. 
We now have a distribution of sequence lengths, and most of the sequences contain no adapters. Some of the metrics are still marked as "failed" (have a red cross next to them) but none that are relevant to our workflow. Cutadapt has worked well! ::: ::: - diff --git a/docs/lesson03-qc-pre-processing/03-rrna-filtering.qmd b/docs/lesson03-qc-pre-processing/03-rrna-filtering.qmd index 9bd281d..86025a8 100644 --- a/docs/lesson03-qc-pre-processing/03-rrna-filtering.qmd +++ b/docs/lesson03-qc-pre-processing/03-rrna-filtering.qmd @@ -1,230 +1,247 @@ --- title: "Ribosomal RNA Filtering" --- - + -Metagenomic sequencing adds another layer to the challenge of assembly! Instead of having one organism to assemble you now have many! Depending on the complexity of a metagenome you could have anywhere from a handful of organisms in a community to thousands. +Now we have some high quality reads to work with, our next step is to filter our reads. We'll use these filtered reads in our lesson on functional analysis. -You no longer have one jigsaw puzzle, but many with all the pieces mixed together. +## Why filter reads? -![](/images/03_genomics_v_metagenomics.png){width="720" fig-align="center" fig-alt="Diagram comparing genomics and metagenomics using the jigsaw metaphor described above."} +We're going to be looking at the functional capacity of our microbial communities AKA what genes are being expressed in what abundances. That means we'll really only be interested in mRNA, the RNA that codes for those proteins. -Many of the communities sequenced using metagenomics contain previously uncultured microbes (often known as microbial dark matter) so they are unlikely to have reference genomes. In addition, you don't usually know what you are sequencing - the community of organisms is unknown. +Our samples were treated to remove free nucleotides and tRNAs as part of the extraction process. They were also enriched for mRNAs using an enrichment kit which binds and removes rRNA fragments. But just to make sure there aren't any rRNA fragments left, we're going to do some more filtering. -Assembling our metaphorical jigsaw will be a challenge. We have many, perhaps thousands, of jigsaws to assemble and no pictures +For our lesson on taxonomic analysis we will use the unfiltered reads, since in this case it doesn't matter whether the reads are mRNA or rRNA. -Luckily there are programs, known as assemblers, that will do this for us! +## Filtering with SortMeRNA -Metagenomic assembly faces additional problems, which means we need an assembler built to handle metagenomes. These additional problems include: +We will use a command line tool called [SortMeRNA](https://sortmerna.readthedocs.io/en/latest/index.html) to do this filtering. -1. Differences in coverage between the genomes, due to differences in abundance across the sample. -2. The fact that different species often share conserved regions. -3. The presence of several strains of a single species in the community +SortMeRNA uses a large RNA database and a complex algorithm to decide whether each read is rRNA or not. It sorts reads into two files - one for rRNA, one for "other" (which in our case is most likely to be mRNA). -The assembly strategy also differs based on the sequencing technology used to generate the raw reads. Here we're using raw data from [Nanopore sequencing](https://nanoporetech.com/applications/dna-nanopore-sequencing) as the basis for this metagenome assembly so we need to use a metagenome assembler appropriate for long-read sequencing. 
+### Making an output directory

-We will be using [Flye](https://github.com/fenderglass/Flye), which is a **long-read** *de novo* assembler for assembling large and complex data with a metagenomic mode. Like all our programs, Flye has been pre-installed onto your instance.

+The first thing we should do is make a directory to store our outputs from SortMeRNA. We'll use the same `mkdir` command we used in the last episode.

-## Flye is a long-read assembler
-
-::: callout-important
-## Important
-
-Make sure you're still logged into your cloud instance. If you can't remember how to log on, visit [the instructions from earlier today](./02-QC-quality-raw-reads.qmd#getting-started).
-:::
-
-Navigate to the `cs_course` directory.
+First, make sure you're in the `cs_course` directory.

``` {.bash filename="Code"}
cd ~/cs_course
```

-Run the `flye` command without any arguments to see a short description of its use:
+Now we can make a new directory for SortMeRNA outputs inside our existing results folder.

``` {.bash filename="Code"}
-flye
+mkdir -p results/sortmerna
```

-``` {.default filename="Output"}
-$ flye
-usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
-	--nano-corr | --nano-hq ) file1 [file_2 ...]
-	--out-dir PATH
-
-	[--genome-size SIZE] [--threads int] [--iterations int]
-	[--meta] [--polish-target] [--min-overlap SIZE]
-	[--keep-haplotypes] [--debug] [--version] [--help]
-	[--scaffold] [--resume] [--resume-from] [--stop-after]
-	[--read-error float] [--extra-params]
-flye: error: the following arguments are required: -o/--out-dir
-```

+### Building the SortMeRNA command

-A full description can be displayed by using the `--help` flag:
+We're ready to start thinking about our command! Let's have a look at the help documentation for SortMeRNA.

``` {.bash filename="Code"}
-flye --help
+sortmerna -h
```

::: {.callout-note collapse="true"}
-## Output --- `flye` help documentation
-
-``` default
-usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
-	--nano-corr | --nano-hq ) file1 [file_2 ...]
-	--out-dir PATH
-
-	[--genome-size SIZE] [--threads int] [--iterations int]
-	[--meta] [--polish-target] [--min-overlap SIZE]
-	[--keep-haplotypes] [--debug] [--version] [--help]
-	[--scaffold] [--resume] [--resume-from] [--stop-after]
-	[--read-error float] [--extra-params]
-
-Assembly of long reads with repeat graphs
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --pacbio-raw path [path ...]
-                        PacBio regular CLR reads (<20% error)
-  --pacbio-corr path [path ...]
-                        PacBio reads that were corrected with other methods (<3% error)
-  --pacbio-hifi path [path ...]
-                        PacBio HiFi reads (<1% error)
-  --nano-raw path [path ...]
-                        ONT regular reads, pre-Guppy5 (<20% error)
-  --nano-corr path [path ...]
-                        ONT reads that were corrected with other methods (<3% error)
-  --nano-hq path [path ...]
-                        ONT high-quality reads: Guppy5+ or Q20 (<5% error)
-  --subassemblies path [path ...]
-                        [deprecated] high-quality contigs input
-  -g size, --genome-size size
-                        estimated genome size (for example, 5m or 2.6g)
-  -o path, --out-dir path
-                        Output directory
-  -t int, --threads int
-                        number of parallel threads [1]
-  -i int, --iterations int
-                        number of polishing iterations [1]
-  -m int, --min-overlap int
-                        minimum overlap between reads [auto]
-  --asm-coverage int    reduced coverage for initial disjointig assembly [not set]
-  --hifi-error float    [deprecated] same as --read-error
-  --read-error float    adjust parameters for given read error rate (as fraction e.g.
0.03) - --extra-params extra_params - extra configuration parameters list (comma-separated) - --plasmids unused (retained for backward compatibility) - --meta metagenome / uneven coverage mode - --keep-haplotypes do not collapse alternative haplotypes - --scaffold enable scaffolding using graph [disabled by default] - --trestle [deprecated] enable Trestle [disabled by default] - --polish-target path run polisher on the target sequence - --resume resume from the last completed stage - --resume-from stage_name - resume from a custom stage - --stop-after stage_name - stop after the specified stage completed - --debug enable debug output - -v, --version show program's version number and exit - -Input reads can be in FASTA or FASTQ format, uncompressed -or compressed with gz. Currently, PacBio (CLR, HiFi, corrected) -and ONT reads (regular, HQ, corrected) are supported. Expected error rates are -<15% for PB CLR/regular ONT; <5% for ONT HQ, <3% for corrected, and <1% for HiFi. Note that Flye -was primarily developed to run on uncorrected reads. You may specify multiple -files with reads (separated by spaces). Mixing different read -types is not yet supported. The --meta option enables the mode -for metagenome/uneven coverage assembly. - -To reduce memory consumption for large genome assemblies, -you can use a subset of the longest reads for initial disjointig -assembly by specifying --asm-coverage and --genome-size options. Typically, -40x coverage is enough to produce good disjointigs. - -You can run Flye polisher as a standalone tool using ---polish-target option. -``` -::: - -Flye has multiple different options available and we need to work out which ones are appropriate for our dataset. - -The most important thing to know is which program was used to basecall our reads, as this determines which of `(--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw | --nano-corr | --nano-hq )` we choose. In the [Methods section of our source paper](https://environmentalmicrobiome.biomedcentral.com/articles/10.1186/s40793-022-00424-2#Sec2) the authors state that: +## Output --- SortMeRNA seq help documentation -> "Nanopore data were basecalled with GPU guppy v4.0.11 using the high-accuracy model and applying a minimum quality score of 7." - -This means we should use the `--nano-raw` option (ONT regular reads, pre-Guppy5), as the reads were called with a version of Guppy that precedes v5. Guppy is a program used to convert the signals that come out of a sequencer into an actual string of bases. - -This option will be followed by the relative path to the long-read .fastq file. - -We also need to choose how many times we want Flye to 'polish' the data after assembly. Polishing is a way to improve the accuracy of the assembly. The number of rounds of polishing is specified using `-i` or `--iterations`. We will do three rounds of polishing, which is a standard practice (though the default in Flye is one round only). - -The other options are a bit easier: - -- We use `-o` or `--outdir` to specify (using a relative path) where the Flye output should be stored -- We also use the `-t` or `--threads` flag in order to run the assembly on multiple threads (aka running several processes at once) in order to speed it up -- Finally we indicate that the dataset is a metagenome using the `--meta` option - -::: callout-note -## Unused parameters - -There are many parameters that we don't need. Some of these are deprecated and some are only appropriate for certain types of data. 
Others are useful to allow tweaking to try to further improve an assembly (e.g. `--genome-size` and `--read-error`). +``` {.default filename="Output"} -Most bioinformatics programs have an associated website (which is often a GitHub page) with a whole manual to use the program. The [Flye Manual](https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md) contains a lot of further information about the parameters available. If you're going to try using Flye on your own long-read dataset this is a good place to start. + Usage: sortmerna -ref FILE [-ref FILE] -reads FWD_READS [-reads REV_READS] [OPTIONS]: + [REQUIRED] + --ref PATH Required Reference file (FASTA) absolute or relative path. + Use multiple times, once per a reference file + + --reads PATH Required Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ). + Use twice for files with paired reads. + The file extensions are Not important. The program automatically + recognizes the file format as flat/compressed, fasta/fastq + + [COMMON] + --workdir PATH Optional Workspace directory USRDIR/sortmerna/run/ + Default structure: WORKDIR/ + idx/ (References index) + kvdb/ (Key-value storage for alignments) + out/ (processing output) + readb/ (pre-processed reads/index) + + + --kvdb PATH Optional Directory for Key-value database WORKDIR/kvdb + KVDB is used for storing the alignment results. + + --idx-dir PATH Optional Directory for storing Reference index. WORKDIR/idx + + --readb PATH Optional Storage for pre-processed reads WORKDIR/readb/ + + Directory storing the split reads, or the random access index of compressed reads + + --fastx BOOL Optional Output aligned reads into FASTA/FASTQ file + --sam BOOL Optional Output SAM alignment for aligned reads. + + --SQ BOOL Optional Add SQ tags to the SAM file + + --blast STR Optional output alignments in various Blast-like formats + Sample values: '0' - pairwise + '1' - tabular (Blast - m 8 format) + '1 cigar' - tabular + column for CIGAR + '1 cigar qcov' - tabular + columns for CIGAR and query coverage + '1 cigar qcov qstrand' - tabular + columns for CIGAR, query coverage, + and strand + + --aligned STR/BOOL Optional Aligned reads file prefix [dir/][pfx] WORKDIR/out/aligned + Directory and file prefix for aligned output i.e. each + output file goes into the specified directory with the given prefix. + The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added. + Both 'dir' and 'pfx' are optional. + The 'dir' can be a relative or an absolute path. + If 'dir' is not specified, the output is created in the WORKDIR/out/ + If 'pfx' is not specified, the prefix 'aligned' is used + Examples: + '-aligned $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta + '-aligned dir_1/apfx' -> $PWD/dir_1/apfx.fasta + '-aligned dir_1/' -> $PWD/aligned.fasta + '-aligned apfx' -> $PWD/apfx.fasta + '-aligned (no argument)' -> WORKDIR/out/aligned.fasta + + --other STR/BOOL Optional Non-aligned reads file prefix [dir/][pfx] WORKDIR/out/other + Directory and file prefix for non-aligned output i.e. each + output file goes into the specified directory with the given prefix. + The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added. + Must be used with 'fastx'. + Both 'dir' and 'pfx' are optional. + The 'dir' can be a relative or an absolute path. 
+ If 'dir' is not specified, the output is created in the WORKDIR/out/ + If 'pfx' is not specified, the prefix 'other' is used + Examples: + '-other $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta + '-other dir_1/apfx' -> $PWD/dir_1/apfx.fasta + '-other dir_1/' -> $PWD/dir_1/other.fasta + '-other apfx' -> $PWD/apfx.fasta + '-other (no argument)' -> aligned_out/other.fasta + i.e. the same output directory + as used for aligned output + + --num_alignments INT Optional Positive integer (INT >=0). + + If used with '-no-best' reports first INT alignments per read reaching + E-value threshold, which allows to lower the CPU time and memory use. + Otherwise outputs INT best alignments. + If INT = 0, all alignments are output + + --no-best BOOL Optional Disable best alignments search False + The 'best' alignment is the highest scoring alignment out of All alignments of a read, + and the read can potentially be aligned (reaching E-value threshold) to multiple reference + sequences. + By default the program searches for best alignments i.e. performs an exhaustive search + over all references. Using '-no-best' will make the program to search just + the first N alignments, where N is set using '-num_alignments' i.e. 1 by default. + + --min_lis INT Optional Search only alignments that have the LIS 2 + of at least N seeds long + LIS stands for Longest Increasing Subsequence. It is computed using seeds, which + are k-mers common to the read and the reference sequence. Sorted sequences of such seeds + are used to filter the candidate references prior performing the Smith-Waterman alignment. + + --print_all_reads BOOL Optional Output null alignment strings for non-aligned reads False + to SAM and/or BLAST tabular files + --paired BOOL Optional Flags paired reads False + If a single reads file is provided, use this option to indicate + the file contains interleaved paired reads when neither + 'paired_in' | 'paired_out' | 'out2' | 'sout' are specified. + + + --paired_in BOOL Optional Flags the paired-end reads as Aligned, False + when either of them is Aligned. + With this option both reads are output into Aligned FASTA/Q file + Must be used with 'fastx'. + Mutually exclusive with 'paired_out'. + + --paired_out BOOL Optional Flags the paired-end reads as Non-aligned, False + when either of them is non-aligned. + With this option both reads are output into Non-Aligned FASTA/Q file + Must be used with 'fastx'. + Mutually exclusive with 'paired_in'. + + --out2 BOOL Optional Output paired reads into separate files. False + Must be used with 'fastx'. + If a single reads file is provided, this options implies interleaved paired reads + When used with 'sout', four (4) output files for aligned reads will be generated: + 'aligned-paired-fwd, aligned-paired-rev, aligned-singleton-fwd, aligned-singleton-rev'. + If 'other' option is also used, eight (8) output files will be generated. + + --sout BOOL Optional Separate paired and singleton aligned reads. False + To be used with 'fastx'. + If a single reads file is provided, this options implies interleaved paired reads + Cannot be used with 'paired_in' | 'paired_out' + + --zip-out STR/BOOL Optional Controls the output compression '-1' + By default the report files are produced in the same format as the input i.e. + if the reads files are compressed (gz), the output is also compressed. + The default behaviour can be overriden by using '-zip-out'. 
+ The possible values: '1/true/t/yes/y' + '0/false/f/no/n' + '-1' (the same format as input - default) + The values are Not case sensitive i.e. 'Yes, YES, yEs, Y, y' are all OK + Examples: + '-reads freads.gz -zip-out n' : generate flat output when the input is compressed + '-reads freads.flat -zip-out' : compress the output when the input files are flat + + --match INT Optional SW score (positive integer) for a match. 2 + + --mismatch INT Optional SW penalty (negative integer) for a mismatch. -3 + + --gap_open INT Optional SW penalty (positive integer) for introducing a gap. 5 + + --gap_ext INT Optional SW penalty (positive integer) for extending a gap. 2 + + -e DOUBLE Optional E-value threshold. 1 + Defines the 'statistical significance' of a local alignment. + Exponentially correllates with the Minimal Alignment score. + Higher E-values (100, 1000, ...) cause More reads to Pass the alignment threshold + + -F BOOL Optional Search only the forward strand. False + + -N BOOL Optional SW penalty for ambiguous letters (N's) scored + as --mismatch + + -R BOOL Optional Search only the reverse-complementary strand. False +``` ::: -Now we've worked out what parameters are appropriate for our data we can put them all together in one command. Since the command is quite long, we will use backward slashes to allow it to span several lines. `\` basically means "start a new line and carry on reading without submitting the command". - -First let's make sure we're in the `cs_course` directory, and make a new directory called `assembly` for the Flye output inside `results`. +There are two required arguments for SortMeRNA - reference (`--ref`) and reads (`--reads`). -``` {.bash filename="Code"} -cd ~/cs_course -mkdir results/assembly -``` +The reference is a database containing rRNA sequences which helps SortMeRNA determine whether reads are rRNA or not. We've already downloaded the rRNA database onto the instance. -Now we can start constructing our command! You can type/copy this code into your command line but don't press enter just yet. - -``` {.bash filename="Code"} -flye --nano-raw data/nano_fastq/ERR5000342_sub12_filtered.fastq \ - --out-dir results/assembly \ - --threads 8 \ - --iterations 3 \ - --meta -``` +The reads are our trimmed FASTQ files. Since our reads are paired, we have to give this argument twice and specify each file separately. The program will take their 'pairedness' into account while it is sorting. -- `--nano-raw` tells Flye that it is receiving pre-Guppy5 reads and that the **input** is found at the path `data/nano_fastq/ERR5000342_sub12_filtered.fastq` (note that we are using our 'filtered reads' - there'd be no point doing quality control and filtering otherwise!) -- `--out-dir` tells Flye that the **output** should be saved in the `results/assembly/` directory -- `--threads` indicates that the number of parallel cores is `8` -- `--iterations` indicates that the data will be polished `3` times -- `--meta` indicates that the dataset is a metagenome +We'll also use some other optional arguments to customise the command and make our output easier to work with later: -[**Don't run this command yet! --- If you have, you can press** Ctrl+z to stop the command.]{style="color:red"} +- `--workdir` tells SortMeRNA to put all outputs in a directory of our choice +- `--threads` tells it how many cores/processors to use +- `--aligned` tells it what we want the file containing aligned (i.e. rRNA) reads to be called +- `--other` tells it what we want unaligned (i.e. 
non-rRNA) reads to be called +- `--fastx` tells it we want our output files in fastq format +- `--paired-out` tells it that if one read in a pair is aligned and the other is not, both reads in the pair should be classed as unaligned -Now we've built our command we could stop here **but** metagenomic assembly takes a long time. If we were to run this command as is we'd have to stay logged into the instance (aka leaving your computer running) for hours. + -Luckily we don't have to do that as we're using a remote computer. +Once again, our command is going to be quite long, so we'll use `\` to separate it into multiple lines. We'll also use `&>` to redirect the terminal outputs to a file, like we did with Cutadapt. This time, though, there's one more step - telling the command to run in the background. ### Running a command in the background @@ -232,84 +249,107 @@ All the commands we have run so far have been in the "foreground", meaning they' Commands can also be run in the "background" so the prompt is returned before the command is finished and we can continue using our terminal. Commands run in the background are often called "jobs". A major advantage of running a long job in the background is that you can log out of your instance without killing the process. +SortMeRNA can take a while to run, which is why we're going to take advantage of this. + ::: callout-warning ## Warning If you run a job in the foreground it will stop as soon as you log out of the instance! This could cause a problem if you momentarily have unstable internet or your computer runs out of battery. Running long commands in the background means you are protected from these circumstances **and** means you can do other things in the terminal in the meantime. ::: -To run a command in the background, you follow it with an ampersand (`&`) symbol. +To run a command in the background, you add an ampersand (`&`) symbol to the end. When you press enter, your prompt will immediately return, but the code is still running. + +### Running SortMeRNA -But we're still not quite done! Flye is going to print a bunch of updates to the terminal while it runs. We need to tell it to send those updates to a file, otherwise the terminal will still be unusable despite using a `&`. We can do this with redirection: `> results/assembly/flye_output.txt` will send any output that would be sent to the terminal to a file called `flye_output.txt` (inside `results/assembly`) instead. +We're now ready to run our command! -The complete command is: +First make sure you're in your `cs_course` folder. +``` {.bash filename="Code"} +cd ~/cs_course +``` +Then enter the full command: ``` {.bash filename="Code"} -flye --nano-raw data/nano_fastq/ERR5000342_sub12_filtered.fastq \ - --out-dir results/assembly \ - --threads 8 \ - --iterations 3 \ - --meta &> results/assembly/flye_output.txt & +sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta \ + --reads data/trimmed_reads/SRR6820491_1.trim.fastq \ + --reads data/trimmed_reads/SRR6820491_2.trim.fastq \ + --workdir results/sortmerna \ + --aligned results/sortmerna/SRR6820491_rrna \ + --other results/sortmerna/SRR6820491_mrna \ + --threads 8 \ + --fastx --paired-out \ + &> results/sortmerna/SRR6820491_sortmerna.log & ``` -The first `&` sends the main command to the background. It is immediately followed by a `>` to redirect the logging and progress information to a file. Finally, the second `&` puts *that* command into the background too. 
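+If you'd like to double-check the reference path before running anything (an optional sanity check - the path is simply the one given to `--ref` above), you can list the database directory and make sure `smr_v4.3_default_db.fasta` is there:
+
+``` {.bash filename="Code"}
+ls databases/sortmerna_databases/
+```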
+Once you press enter your prompt should immediately return and you will see a terminal printout below your command that looks something like this: +``` {.bash filename="Code"} +[1] 274852 +``` -We can now press enter to run the command. Your prompt should immediately return. This doesn't mean that the code has finished already: it is now running in the background. +The first number (in square brackets) is your 'job number'. The second number is a 'process ID (PID)'. These numbers allow you to refer to your job should you need to. ::: callout-note ## Running commands on different servers -There are many different ways to run jobs in the background in a terminal.\ -How you run these commands will depend on the computing resources (and their fair use policies) you are using. The main options include: +There are many different ways to run jobs in the background in a terminal. How you run these commands will depend on the computing resources (and their fair use policies) you are using. The main options include: - `&`, which we've covered here. Depending on the infrastructure you're running the command on, you may also need to use [`nohup`](https://www.digitalocean.com/community/tutorials/nohup-command-in-linux) to prevent the background job from being killed when you close the terminal.\ - The command line program [`screen`](https://linuxize.com/post/how-to-use-linux-screen/), which allows you to create a shell session that can be completely detached from a terminal and re-attached when needed. - Queuing system - many shared computing resources, like the High Performance Computing (HPC) clusters owned by some Universities, operate a queuing system (e.g. SLURM or SGE) so each user gets their fair share of computing resources. With these you submit your command / job to the queueing system, which will then handle when to run the job on the resources available. ::: -As we're running the command in the background we no longer see the output on the terminal but we can still check on the progress of the assembly. There are two ways to do this. +As we're running the command in the background we no longer see the output on the terminal but we can still check on the progress of the command. There are two ways to do this. 1. Using the command `jobs` to view what is running -2. Examining the log file created by `flye` using `less` +2. Examining the log file created by SortMeRNA using `less` ### Checking progress: `jobs` -Jobs command is used to list the jobs that you are running in the background and in the foreground. If the prompt is returned with no information no commands are being run. +The jobs command is used to list the jobs that you are running in the background and in the foreground. If the prompt is returned with no information no commands are being run. ``` {.bash filename="Code"} jobs ``` ``` {.default filename="Output"} -[1]+ Running flye --nano-raw data/nano_fastq/ERR5000342_sub12_filtered.fastq --out-dir results/assembly --threads 8 --iterations 3 --meta &> results/assembly/flye_output.txt & +[1]+ Running sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta --reads data/trimmed_reads/SRR6820491_1.trim.fastq --reads data/trimmed_reads/SRR6820491_2.trim.fastq --workdir results/sortmerna --aligned results/sortmerna/SRR6820491_rrna --other results/sortmerna/SRR6820491_mrna --threads 8 --fastx --paired-out &> results/sortmerna/SRR6820491_sortmerna.log & ``` The `[1]` is the job number. If you need to stop the job running, you can use `kill %1`, where 1 is the job number. 
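+If you'd like to practise these job control commands without touching your SortMeRNA run, you could try them out on a harmless command first (a sketch only - `sleep` simply waits for the given number of seconds and does nothing else):
+
+``` {.bash filename="Code"}
+sleep 300 &   # start a five-minute 'do nothing' job in the background
+jobs          # the sleep job should appear in the list with its own job number
+kill %2       # stop it, replacing 2 with whatever job number jobs reported
+```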
### Checking progress: the log file

-Flye generates a log file when running, which is stored in the output folder it has generated. Using `less` we can navigate through this file.
+We can also look at the log file we generated to store the command's terminal outputs. Using `less` we can navigate through this file.

``` {.bash filename="Code"}
-cd results/assembly
-less flye.log
+cd results/sortmerna
+less SRR6820491_sortmerna.log
```

-The contents of the file will depend on how far through the assembly Flye is. At the start of an assembly you'll probably see something like this:
+The contents of the file will depend on how far through the process SortMeRNA is. At the start of the process you'll probably see something like this:

``` {.default filename="Output"}
-[2022-10-05 17:22:03] INFO: Starting Flye 2.9.1-b1780
-[2022-10-05 17:22:03] INFO: >>>STAGE: configure
-[2022-10-05 17:22:03] INFO: Configuring run
-[2022-10-05 17:22:17] INFO: Total read length: 3023658929
-[2022-10-05 17:22:17] INFO: Reads N50/N90: 5389 / 2607
-[2022-10-05 17:22:17] INFO: Minimum overlap set to 3000
-[2022-10-05 17:22:17] INFO: >>>STAGE: assembly
+[process:1393] === Options processing starts ... ===
+Found value: sortmerna
+Found flag: --ref
+Found value: databases/sortmerna_database/smr_v4.3_default_db.fasta of previous flag: --ref
+Found flag: --reads
+Found value: data/trimmed_reads/SRR6820491_1.trim.fastq.gz of previous flag: --reads
+Found flag: --reads
+Found value: data/trimmed_reads/SRR6820491_2.trim.fastq.gz of previous flag: --reads
+Found flag: --workdir
+Found value: results/sortmerna/SRR6820491_temp/ of previous flag: --workdir
+Found flag: --aligned
+Found value: results/sortmerna/SRR6820491_temp/SRR6820491_rrna of previous flag: --aligned
+Found flag: --other
+Found value: results/sortmerna/SRR6820491_temp/SRR6820491_mrna of previous flag: --other
+Found flag: --threads
+Found value: 8 of previous flag: --threads
+Found flag: --fastx
+Found value: --paired-out of previous flag: --fastx
```

-Different steps in the assembly process take different amounts of time so it might appear stuck. However, it is almost certainly still running if it was run in the background.
-
-Note: this log file will contain data similar to the data in the `flye_output.txt` file we're generating when redirecting the terminal output. But it's easier to look at the log file as flye will always generate that even if you're running the command differently (e.g. in the foreground).
+The file won't update until you close and open it again, so you'll need to keep checking back to see what progress is being made. Different steps in the process take different amounts of time so it might appear stuck. However, it is almost certainly still running if it was run in the background.

::: callout-note
## Navigation commands in `less`:
@@ -322,132 +362,39 @@ Note: this log file will contain data similar to the data in the `flye_output.tx
| G | to go to the end |
| q | to quit |

-See [Prenomics - Working with Files and Directories](https://cloud-span.github.io/prenomics02-command-line/02-working-with-file/index.html) for a full overview on using `less`.
+See [Lesson 2 - Working with Files and Directories](../lesson02-using-the-command-line/02-working-with-file.qmd) for a full overview on using `less`.
:::

-Flye is likely to take **up to 5 hours** to finish assembling - so feel free to leave this running overnight and come back to it tomorrow. 
You don't need to remain connected to the instance during this time (and you can turn your computer off!) but once you have disconnected from the instance it does mean you can no longer use `jobs` to track the job.
+SortMeRNA is likely to take **about 70 minutes** to finish, so feel free to leave this running while you do something else and come back to it later. You don't need to remain connected to the instance during this time but once you have disconnected from the instance it does mean you can no longer use `jobs` to track the job.

-In the meantime, if you wanted to read more about assembly and metagenomics there's a few papers and resources at the end with recommended reading.

+### Determining if SortMeRNA has finished

-### Determining if the assembly has finished
-
-After leaving it several hours, Flye should have finished assembling.
+After leaving it for an hour or so, SortMeRNA should have finished running.

If you remained connected to the instance during the process you will be able to tell it has finished because you get the following output in your terminal when the command has finished.

``` {.default filename="Output"}
-[2]+ Done flye --nano-raw data/nano_fastq/ERR5000342_sub12_filtered.fastq --out-dir results/assembly --threads 8 --iterations 3 --meta &> results/assembly/flye_output.txt &
+[1]+ Done sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta --reads data/trimmed_reads/SRR6820491_1.trim.fastq --reads data/trimmed_reads/SRR6820491_2.trim.fastq --workdir results/sortmerna --aligned results/sortmerna/SRR6820491_rrna --other results/sortmerna/SRR6820491_mrna --threads 8 --fastx --paired-out &> results/sortmerna/SRR6820491_sortmerna.log &
```

-This message won't be displayed if you disconnected from the instance for whatever reason during the assembly process. However, you can still examine the `flye.log` file in the `assembly` directory. If the assembly has finished the log file will have summary statistics and information about the location of the assembly at the end.
+This message won't be displayed if you disconnected from the instance for whatever reason while SortMeRNA was running. However, you can still examine the log file in the `sortmerna` directory. If SortMeRNA has finished, the log file will end with a line confirming that its reports have been written.

-Move to the `assembly` directory and use `less` to examine the contents of the log file:
+Move to the `sortmerna` directory and use `less` to examine the contents of the log file:

``` {.bash filename="Code"}
-cd ~/cs_course/results/assembly/
-less flye.log
+cd ~/cs_course/results/sortmerna/
+less SRR6820491_sortmerna.log
```

Navigate to the end of the file using G. You should see something like:

``` {.default filename="Output"}
-[2024-05-02 16:42:37] root: INFO: Assembly statistics:
-
-	Total length:	11947363
-	Fragments:	787
-	Fragments N50:	19835
-	Largest frg:	95862
-	Scaffolds:	0
-	Mean coverage:	7
-
-[2024-05-02 16:42:37] root: INFO: Final assembly: /home/csuser/cs_course/results/assembly/assembly.fasta
+[writeReports:268] === done Reports in 213.494 sec ===
```

-There are some basic statistics about the final assembly created.

-### What is the Assembly output?

-If we use `ls` in the `assembly` directory we can see the that Flye has created many different files.
+## SortMeRNA Outputs
+
+If we use `ls` in the `sortmerna` directory we can see the files SortMeRNA has created.

``` {.default filename="Output"}
-00-assembly 20-repeat 40-polishing assembly_graph.gfa assembly_info.txt params.json
-10-consensus 30-contigger assembly.fasta assembly_graph.gv flye.log
+idx kvdb readb SRR6820491_mrna.fq SRR6820491_rrna.fq SRR6820491_rrna.log SRR6820491_sortmerna.log
```

+`SRR6820491_rrna.fq` contains the reads SortMeRNA aligned to the rRNA database, while `SRR6820491_mrna.fq` contains the unaligned reads (most likely mRNA) which we'll use in our lesson on functional analysis. The `idx`, `kvdb` and `readb` directories are SortMeRNA's working directories, and the two `.log` files record how the run went.
+
-One of these is `flye.log` which we have already looked at.
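+The file we redirected from the terminal (`SRR6820491_sortmerna.log`) isn't the only log: SortMeRNA should also have written its own summary alongside the aligned reads, which is why `SRR6820491_rrna.log` appears in the listing. If you want to see roughly what proportion of reads were flagged as rRNA, you could open it with `less` (a suggestion rather than a required step - the exact contents depend on your run):
+
+``` {.bash filename="Code"}
+less ~/cs_course/results/sortmerna/SRR6820491_rrna.log
+```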
- -- Flye generates a directory to contain the output for each step of the assembly process. (These are the `00-assembly`, `10-consensus`, `20-repeat`, `30-contigger` and `40-polishing` directories.)\ -- We also have a file containing the parameters we ran the assembly under `params.json` which is useful to keep our results reproducible.\ -- The assembled contigs are in FASTA format (`assembly.fasta`), a common standard file type for storing sequence data without its quality scores.\ -- There's a text file which contains more information about each contig created (`assembly_info.txt`). -- Finally we have two files for a repeat graph (`assembly_graph.gfa` or `assembly_graph.gv`) which is a visual way to view the assembly. - -You can see more about the output for Flye in the [documentation on GitHub](https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md#output). - -::: callout-note -## Contigs vs. reads - -We have seen reads in the raw sequencing data - these are our individual jigsaw pieces. - -Contigs (from the word *contiguous*) are longer fragments of DNA produced after raw reads are joined together by the assembly process. These are like the chunks of the jigsaw puzzle the assembler has managed to complete. - -Contigs are usually much longer than raw reads but vary in length and number depending on how successful the assembly has been. -::: - -## Assembly Statistics - -Flye gave us basic statistics about the size of the assembly but not all assemblers do. We can use [Seqkit](https://bioinf.shenwei.me/seqkit/) to calculate summary statistics from the assembly. We previously used another `Seqkit` command, `seq` to Filter our Nanopore sequences by quality. This time we will use the command `stats`. - -Make sure you are in the `cs_course` folder then run `seqkit stats` on `assembly.fasta`: - -``` {.bash filename="Code"} -cd ~/cs_course -seqkit stats results/assembly/assembly.fasta +[writeReports:268] === done Reports in 213.494 sec === ``` -`SeqKit` is fast so we have run the command in the terminal foreground. It should take just a couple of seconds to process this assembly. Assemblies with more sequencing data can take a bit longer. +## SortMeRNA Outputs -Once it has finished you should see an output table like this: ``` {.default filename="Output"} -file format type num_seqs sum_len min_len avg_len max_len -results/assembly/assembly.fasta FASTA DNA 787 11,947,363 923 15,180.9 95,862 -``` - -This table shows the input file, the format of the file, the type of sequence and other statistics. The assembly process introduces small random variations in the assemly so your table will likely differ slightly. However, you should expect the numbers to be very similar. - -Using this table of statistics, answer the questions below. - -::: callout-note -## Exercise 1: Looking at basic statistics - -Using the output for seqkit stats above, answer the following questions. - -a) How many contigs are in this assembly?\ -b) How many bases in total have been assembled?\ -c) What is the shortest and longest contig produced by this assembly? 
- -::: {.callout-note collapse="true"} -## Solution - -From our table:\ -a) From `num_seqs` we can see that this assembly is made up of 787 contigs\ -b) Looking at `sum_length` we can see that the assembly is 11,947,363bp in total (nearly 12 million basepairs!)\ -c) From `min_length` we can see the shortest contig is 923bp and from `max_length` the longest contig is 95,862bp -::: -::: - -::: callout-tip -## Recommended reading: - -While you're waiting for the assembly to finish here are some things you might want to read about: - -- An overall background to the history of DNA sequencing in [DNA sequencing at 40: past, present and future](https://www.nature.com/articles/nature24286)\ -- An overview of a metagenomics project [Shotgun metagenomics, from sampling to analysis](https://www.nature.com/articles/nbt.3935) - though note this paper is from 2017 so some techniques and software will be different now.\ -- The challenges of genomic and metagenomic assembly and the algorithms that have been built to overcome these in [Assembly Algorithms for Next-Generation Sequencing Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646/)\ -- The approach Flye uses to assemble metagenomes is covered in [metaFlye: scalable long-read metagenome assembly using repeat graphs](https://www.nature.com/articles/s41592-020-00971-x) -- Comparison of genome assembly for bacteria [Comparison of De Novo Assembly Strategies for Bacterial Genomes](https://www.mdpi.com/1422-0067/22/14/7668/htm) -- Benchmarking of assemblers including flye in prokaryotes [Benchmarking of long-read assemblers for prokaryote whole genome sequencing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6966772/) -- Comparison of combined assembly and polishing method [Trycycler: consensus long-read assemblies for bacterial genomes](https://link.springer.com/article/10.1186/s13059-021-02483-z) -- Using nanopore to produce ultra long reads and contiguous assemblies [Nanopore sequencing and assembly of a human genome with ultra-long reads](https://www.nature.com/articles/nbt.4060) -::: +idx kvdb readb SRR6820491_mrna.fq SRR6820491_rrna.fq SRR6820491_rrna.log SRR6820491_sortmerna.log +``` \ No newline at end of file