Commit

Start renaming pages and qmd files to match new lessons/episodes. Content still metagenomics.
evelyngreeves committed Jul 10, 2024
1 parent 7e39cee commit fd84ff9
Showing 10 changed files with 22 additions and 589 deletions.
21 changes: 2 additions & 19 deletions docs/lesson03-qc-pre-processing/01-introduction-meta.qmd
@@ -1,21 +1,5 @@
---
title: "Introduction to Metagenomics"
teaching: 30
exercises: 5
questions:
- What is metagenomics?
- When should we use metagenomics?
- What does a metagenomics project look like?
objectives:
- Explain the difference between genomics, metagenomics and amplicon sequencing.
- Familiarise yourself with the metagenomics dataset used in this course.
keypoints:
- Genomics looks at the whole genome content of an organism
- Metagenomes contain multiple organisms within one sample unlike genomic samples.
- In metagenomes the organisms present are not usually present in the same abundance - except for mock communities.
- We can identify the organisms present in a sample using either amplicon sequencing or whole metagenome sequencing. Amplicon sequencing is cheaper and quicker, but it also limits the amount of downstream analysis that can be done with the data.
- Metagenomes can differ in their levels of complexity and this is determined by how many organisms are in the metagenome.
- Different platforms allow us to perform different analyses. The suitability depends on the question you are asking.
title: "Introduction to Metatranscriptomics"
---

## What is the difference between Genomics and Metagenomics?
@@ -63,7 +47,7 @@ For organisms that are well characterised, establishing identity can give you in
Despite this, there are free, community-led workflows such as [QIIME2](https://qiime2.org/) that use database annotations of reference versions of the organisms identified from the amplicon to suggest which metabolic functions may be present. Amplicon sequencing is also limited because species may differ elsewhere in their genomes yet be indistinguishable from the amplicon sequence alone. This means that amplicon sequencing can rarely resolve taxonomy more finely than genus level.

| Attribute | Amplicon | Whole genome metagenomics |
|---------------------|----------------------|------------------------------|
|---------------------|----------------------|-----------------------------|
| Cost | Cheap | Expensive |
| Coverage depth | High | Lower - medium |
| Taxonomy detection | Specific to amplicons used | All in sample |
@@ -128,7 +112,6 @@ Here is an example of the workflow we will be using for our analysis with a brie
4. Binning - separating out genomes into 'bins' containing related contigs
5. Taxonomic assignment - assigning taxonomy and functional analysis to sequences/contigs


Workflows in bioinformatics often adopt a plug-and-play approach so the output of one tool can be easily used as input to another tool. The use of standard data formats in bioinformatics (such as FASTA or FASTQ, which we will be using here) makes this possible. The tools that are used to analyze data at different stages of the workflow are therefore built under the assumption that the data will be provided in a specific format.
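
To illustrate why a shared format makes tools interchangeable, here is a minimal Python sketch of a FASTQ reader. It assumes the simple four-lines-per-record layout (identifier, sequence, `+` separator, quality string) with no wrapped lines; the function name `read_fastq` and the example reads are made up for illustration and are not part of the course materials.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record and no line wrapping.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


# Two made-up reads, held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, seq, qual)
```

Because every record follows the same layout, any tool that writes FASTQ can feed any tool that reads it, which is what makes the plug-and-play approach work.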

You can find a [more detailed version of the workflow](/docs/miscellanea/extras/workflow.qmd) we will be following by going to `Extras` and selecting `Workflow Reference`. This diagram contains all of the steps followed over the course alongside program names.
21 changes: 2 additions & 19 deletions docs/lesson03-qc-pre-processing/02-QC-raw-reads.qmd
@@ -1,20 +1,5 @@
---
title: "Assessing Read Quality, Trimming and Filtering"
teaching: 45
exercises: 45
questions:
- "How can I describe the quality of my data?"
- "How can we get rid of sequence data that doesn't meet our quality standards?"
- "How do these methods differ when looking at Nanopore data?"
objectives:
- "Interpret a FastQC plot summarizing per-base quality across all reads."
- "Interpret the NanoPlot output summarizing a Nanopore sequencing run"
- "Filter Nanopore reads based on quality using the command line tool SeqKit"
keypoints:
- "Quality encodings vary across sequencing platforms."
- "It is important to know the quality of our data to be able to make decisions in the subsequent steps."
- "Data cleaning is essential at the beginning of metagenomics workflows."
- "Due to differences in the sequencing technology Nanopore data must be handled differently."
title: "Quality Control of Raw Reads"
---

## Getting started
@@ -282,7 +267,7 @@ cd ~/cs_course

Now you can enter the command, using `-o` to tell FastQC to put its output files into our newly-made `illumina_qc/` directory.

::: {.callout-tip}
::: callout-tip
### Why here?

You might be wondering why we're running our command from the `cs_course` directory and not the place where the data is stored (`~/cs_course/data/illumina_fastq`), or where we want our outputs to end up (`~/cs_course/results/qc/illumina_qc`). The reason is that it's *best practice* not to run commands from the same folder as your data, in case you accidentally do something that overwrites your data files. From `cs_course` we can easily "see" both our `data` and `results` directories and refer to them with relative paths.
@@ -605,7 +590,6 @@ ls

<!-- :::{.column-screen} THE WHOLE SCREEN WIDTH -->


``` {.default filename="Output"}
LengthvsQualityScatterPlot_dot.html NanoStats.txt
LengthvsQualityScatterPlot_dot.png Non_weightedHistogramReadlength.html
@@ -619,7 +603,6 @@ NanoPlot_202303307_1630.log Yield_By_Length.png
NanoPlot-report.html
```


We can see that NanoPlot has generated a lot of different files.

Like before, we can't view most of these files in our terminal as we can't open images or HTML files. Instead we'll download the core information to our own computer. Luckily, the `NanoPlot-report.html` file contains all of the plots and information held in the other files so we only need to download that one onto our local computer using `scp`.
15 changes: 1 addition & 14 deletions docs/lesson03-qc-pre-processing/03-rrna-filtering.qmd
@@ -1,18 +1,5 @@
---
title: "Metagenome Assembly"
teaching: 40
exercises: 40
questions:
- "Why do raw reads need to be assembled?"
- "How does metagenomic assembly differ from genomic assembly?"
- "How can we assemble a metagenome?"
objectives:
- "Run a metagenomic assembly workflow."
- "Assess the quality of an assembly using SeqKit"
keypoints:
- "Assembly merges raw reads into contigs."
- "Flye can be used as a metagenomic assembler."
- "Certain statistics can be used to describe the quality of an assembly."
title: "Ribosomal RNA Filtering"
---

::: callout-important
21 changes: 3 additions & 18 deletions docs/lesson04-taxonomic-annotation/01-community-structure.qmd
@@ -1,20 +1,5 @@
---
title: "Polishing an assembly"
teaching: 30
exercises: 10
questions:
- "Why do assemblies need to be polished?"
- "What are the different purposes of polishing with short and long reads?"
- "What software can we use to do long and short read polishing?"

objectives:
- "Understand why polishing metagenomes is important."
- "Understand the different programs used to do short and long read polishing."
keypoints:
- "Short reads have a higher base accuracy than long reads and can be used to remove errors in assemblies generated with long reads."
- "Long reads have a lower accuracy but help generate a more contiguous (less fragmented) assembly, so are used to get the structure of the metagenome, but may have small misassemblies or single nucleotide polymorphisms (SNPs)"
- "Medaka is used to polish an assembly with long reads."
- "Pilon is used to polish an assembly with short reads."
title: "Extracting a Community Profile"
---

In the [previous episode](/docs/lesson03-qc-assembly/03-assembly.qmd) we generated a draft assembly using Flye from our long read Nanopore data.
@@ -91,7 +76,7 @@ To use Medaka we need to specify certain parameters in the command, like we did
Let's have a look at the flags and options we're going to use:

| Flag/option | Meaning | Our input |
|-------------------|-----------------------------------|-------------------|
|-------------------|----------------------------------|-------------------|
| `-i` | Input basecalls (i.e. what we are polishing with) | `-i data/nano_fastq/ERR5000342_sub12_filtered` |
| `-d` | Input assembly (i.e. what is being polished) | `-d results/assembly/assembly.fasta` |
| `-m` | Neural network model to use (described in [the documentation](https://github.com/nanoporetech/medaka#models)) | `-m r941_min_hac_g507` |
@@ -327,7 +312,7 @@ Here are the various flags/options used in these commands and what they mean:
<!-- NB: below, the number of hyphens (---..) in the line below "|---|---...|" does help determine the length of each column. It's a matter of trying how many hyphens to use for each column -->

| Command | Flag/option | Meaning |
|------------------------------|-------------------|------------------------|
|-----------------------------|-------------------|------------------------|
| `bwa mem -t 8 [input assembly] [input short read file(s)]` | -t 8 | Number of threads (8) |
| `samtools view - -Sb` | \- | Take piped output from `bwa mem` as input |
| | -Sb | Convert from SAM to BAM format |
26 changes: 4 additions & 22 deletions docs/lesson04-taxonomic-annotation/02-visualising-structure.qmd
@@ -1,23 +1,5 @@
---
title: "QC polished assembly"
teaching: 30
exercises: 40
questions:
- "Why would we quality control (QC) an assembly?"
- "How can we perform QC on an assembly?"
- "What metrics can we compare between assemblies to understand the quality of an assembly?"
objectives:
- "Understand the terms N50, misassembly and largest contig."
- "Understand what factors might affect the quality of an assembly."
- "Use the help documentation to work out an appropriate flag for seqkit"
- "Apply seqkit to assess multiple assemblies"
- "Use MetaQUAST to identify the quality of an assembly."
keypoints:
- "The N50 is the contig length such that contigs of at least this length contain 50% of the bases in the assembly"
- "A misassembly is when a portion of the assembly is incorrectly put back together"
- "The largest contig is the longest contiguous piece in the assembly"
- "Seqkit can generate summary statistics that will tell us the N50, largest contig and the number of gaps"
- "MetaQUAST can generate additional information in a report which can be used to identify misassemblies"
title: "Visualising Community Structure"
---

## Why QC a metagenome assembly?
@@ -161,7 +143,7 @@ a)
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC(%)
assembly.fasta FASTA DNA 1,161 18,872,828 528 16,255.7 118,427 7,513 11,854 19,634 0 20,921 0 0 66.26
```

(Your numbers will probably be slightly different to the solution given, as the assembly algorithm runs differently each time. As long as they are in the same ballpark there's no need to worry!)

b) The N50 length for this assembly is 20,921 bp. This tells us that 50% of the assembly is in fragments that are nearly 21,000 bases long or longer!
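
As a rough illustration of what the N50 measures, the sketch below computes it from a list of contig lengths in plain Python. This is not how SeqKit is implemented; the function name `n50` and the toy lengths are assumptions made for the example.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bases)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards and stop once the contigs
    # seen so far hold at least half of all bases in the assembly.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly: 20 bases in total, so we need contigs covering 10 bases.
toy_assembly = [2, 3, 4, 5, 6]
print(n50(toy_assembly))
```

In the toy assembly the two longest contigs (6 and 5 bases) together hold 11 of the 20 bases, so the N50 is 5: half of the assembly sits in contigs that are 5 bases or longer.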
@@ -306,9 +288,9 @@ If you are using a Mac, you might be more familiar with the `Command` key, which
::: callout-note
## Copying and pasting in Git bash

Most people will want to use <kbd>Ctrl</kbd>+<kbd>C</kbd> and <kbd>Ctrl</kbd>+<kbd>V</kbd> to copy and paste. However in GitBash these shortcuts have other functions. <kbd>Ctrl</kbd>+<kbd>C</kbd> interrupts the currently running command and <kbd>Ctrl</kbd>+<kbd>V</kbd> tells the terminal to treat every keystroke as a literal character, so will add shortcuts like <kbd>Ctrl</kbd>+<kbd>C</kbd> as characters. Instead you can copy and paste using the mouse:
Most people will want to use <kbd>Ctrl</kbd>+<kbd>C</kbd> and <kbd>Ctrl</kbd>+<kbd>V</kbd> to copy and paste. However in GitBash these shortcuts have other functions. <kbd>Ctrl</kbd>+<kbd>C</kbd> interrupts the currently running command and <kbd>Ctrl</kbd>+<kbd>V</kbd> tells the terminal to treat every keystroke as a literal character, so will add shortcuts like <kbd>Ctrl</kbd>+<kbd>C</kbd> as characters. Instead you can copy and paste using the mouse:

- Left click and drag to highlight text, then right click to copy. Move the cursor to where you want to paste and right click to paste.
- Left click and drag to highlight text, then right click to copy. Move the cursor to where you want to paste and right click to paste.
:::

You should then be able to see this file when you `ls` and view it using `less`.
17 changes: 3 additions & 14 deletions docs/lesson05-functional-annotation/01-functional-info.qmd
@@ -1,16 +1,5 @@
---
title: "Metagenome Binning"
teaching: 50
exercises: 50
questions:
- "How can we obtain the original genomes from a metagenome?"
objectives:
- "Obtain Metagenome-Assembled Genomes (MAGs) from the metagenomic assembly."
- "Understand that there are multiple methods that can be used to perform binning"
keypoints:
- "Metagenome-Assembled Genomes (MAGs) are sometimes obtained from curated contigs grouped into bins."
- "Use MetaBAT2 to assign the contigs to bins of different taxa."
- "Other programs are available that generate different sets of bins, and these can be rationalised using tools such as DAS Tool"
title: "Extracting Functional Information"
---

Now we are ready to start doing analysis of our metagenomic assembly!
@@ -82,7 +71,7 @@ One way to separate contigs that belong to different species is by their taxonom
Most binning tools use short reads for the binning; only a few use Hi-C sequencing. Hi-C is a method of sequencing that gives spatial proximity information, as described [here](https://en.wikipedia.org/wiki/Hi-C_(genomic_analysis_technique)). Different tools use different algorithms for performing the binning. A few popular tools are summarised below. For more information see Section 2.4 (Tools for metagenome binning) of [this review](https://www.sciencedirect.com/science/article/pii/S2001037021004931#s0045).

| Tool | Core algorithm | Website | Publication |
|---------------|---------------|---------------|---------------------------|
|----------------|----------------|----------------|------------------------|
| MaxBin2 | Expectation-maximization | http://sourceforge.net/projects/maxbin/ | [Wu et al, 2016](https://academic.oup.com/bioinformatics/article/32/4/605/1744462) |
| CONCOCT | Gaussian mixture models | https://github.com/BinPro/CONCOCT | [Alneberg et al, 2014](https://www.nature.com/articles/nmeth.3103) |
| MetaBAT2 | Label propagation | https://bitbucket.org/berkeleylab/metabat | [Kang et al, 2019](https://peerj.com/articles/7359/) |
@@ -222,7 +211,7 @@ The penultimate line tells us that MetaBAT has produced 90 bins containing 21216

Using `ls` will show that MetaBAT2 has generated a depth file (`assembly_ERR5000342.fasta.depth.txt`) and a directory (`assembly_ERR5000342.fasta.metabat-bins1500-YYYYMMDD_HHMMSS/`). Annoyingly, the "easy" way of running MetaBAT2 (which we just used) doesn't allow us to specify an output directory, so we'll need to move our outputs into `results/binning` manually using `mv`.

``` {.bash}
``` bash
mv assembly* results/binning
```

@@ -1,17 +1,5 @@
---
title: "QC of metagenome bins"
teaching: 50
exercises: 10
questions:
- "How can we assess the quality of the metagenome bins?"
objectives:
- "Check the quality of the Metagenome-Assembled Genomes (MAGs)."
- "Understanding MIMAG quality standards."
keypoints:
- "CheckM can be used to evaluate the quality of each Metagenome-Assembled Genome."
- "We can use the percentage contamination and completion to identify the quality of these bins."
- "There are MIMAG standards which can be used to categorise the quality of a MAG."
- "Many MAGs will be incomplete, but that does not mean that this data is not still useful for downstream analysis."
title: "Normalising Abundances and Identifying Gene Families"
---

## Quality check
@@ -176,7 +164,7 @@ As part of the standard, a framework to determine MAG quality from statistics is
See the table below for an overview of each category.

| Quality Category | Completeness | Contamination | rRNA/tRNA encoded |
|-----------------|-----------------|-----------------|-----------------------|
|-----------------|-----------------|-----------------|---------------------|
| High | \> 90% | ≤ 5% | Yes (≥ 18 tRNA and all rRNA) |
| Medium | ≥ 50% | ≤ 10% | No |
| Low | \< 50% | ≤ 10% | No |
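
The thresholds in the table can be expressed as a small decision function. The sketch below is only an illustration: the function name `mimag_category`, the boolean `has_rrna_trna` flag (standing in for "≥ 18 tRNA and all rRNA"), and the choice to return `None` when contamination exceeds 10% are all assumptions, since the table does not name a category for that case.

```python
def mimag_category(completeness, contamination, has_rrna_trna=False):
    """Classify a MAG using the completeness/contamination thresholds above.

    completeness and contamination are percentages (0-100).
    Returns "High", "Medium", "Low", or None when contamination is above 10%.
    """
    if completeness > 90 and contamination <= 5 and has_rrna_trna:
        return "High"
    if contamination > 10:
        return None  # assumption: the table does not cover this case
    if completeness >= 50:
        return "Medium"
    return "Low"


# Made-up example bins: (completeness %, contamination %, rRNA/tRNA present)
example_bins = {
    "bin.1": (95.2, 1.3, True),
    "bin.2": (95.2, 1.3, False),
    "bin.3": (62.0, 8.5, False),
    "bin.4": (31.0, 2.0, False),
}
for name, stats in example_bins.items():
    print(name, mimag_category(*stats))
```

Note that a bin with very high completeness and low contamination still falls into the medium category here if the rRNA and tRNA genes are not found, which follows from the last column of the table.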