# Sequencing library preparation
Sequencing library preparation is a crucial step in the process of DNA sequencing. It involves the conversion of fragmented DNA molecules into a format that is compatible with the sequencing platform. The goal of library preparation is to create a collection of DNA fragments, each with sequencing adapters attached, which enables high-throughput sequencing of the DNA molecules. This process ensures that the genetic information contained in the DNA sample can be accurately and efficiently read by the sequencing instrument.
### Sequencing strategies and platforms {- #sequencing-strategies-platforms}
Library features are specific to each sequencing platform, so the sequencing strategy must be selected in advance. Pure nucleic acid sequencing-based strategies can be broadly divided into two groups. Short-read sequencing (SRS) platforms provide large amounts of data, yet with short sequencing reads (typically 150 nucleotides). In contrast, long-read sequencing (LRS) platforms yield much longer sequences (thousands or even millions of nucleotides), yet with a lower throughput and typically lower sequence quality. The SRS market is dominated by two main companies with proprietary platforms, namely Illumina and BGI, although PacBio recently released its own SRS platform called Onso. The LRS market is also dominated by two companies with proprietary technologies, namely Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio).
Sequencing enterprises, as well as auxiliary biotechnological companies, provide library preparation kits that can be more or less customised for different purposes.
|Technology|Platforms|Sequencing type|Company|
|---|---|---|---|
|Sequencing by synthesis (SBS)|MiSeq, NovaSeq|Short-read sequencing|[Illumina](https://www.illumina.com/)|
|Combinatorial probe-anchor synthesis (cPAS)|DNBSeq|Short-read sequencing|[BGI](https://www.bgi.com/global)|
|Sequencing by binding (SBB) technology|Onso|Short-read sequencing|[PacBio](https://www.pacb.com/)|
|Single Molecule Real-Time sequencing (SMRT)|Sequel, Revio|Long-read sequencing|[PacBio](https://www.pacb.com/)|
|Nanopore sequencing|MinION, GridION, PromethION|Long-read sequencing|[Oxford Nanopore](https://nanoporetech.com/)|
:::: {.imagenote}
Some of the most widely used sequencing technologies and platforms.
:::
### PCR-based vs. PCR-free library preparation {- #pcr-library}
Sequencing library preparation procedures can be split into two main groups depending on whether or not they PCR-amplify the DNA templates. Unlike in targeted amplicon sequencing, in which the objective is to amplify a specific target region, the aim of including a PCR step in shotgun-based library preparation is to increase the molarity of the library and/or to attach indices (see below) to the adaptors.
:::: {.tipbox}
Learn more about PCR-based and PCR-free library preparation in [this article](https://www.pnas.org/doi/abs/10.1073/pnas.1519288112) by Jones et al. [@Jones2015-bk].
:::
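To make the notion of library molarity concrete, the following minimal sketch converts a mass concentration into molarity. It assumes the widely used approximation of 660 g/mol per base pair of double-stranded DNA; the function name and the example values are hypothetical and not taken from any specific protocol.

```python
# Approximate average molar mass of one base pair of double-stranded DNA (g/mol).
# This is a common rule-of-thumb value, not an exact constant.
AVG_BP_MASS = 660.0


def library_molarity_nM(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration (ng/µL) into molarity (nM)."""
    if mean_fragment_bp <= 0:
        raise ValueError("mean fragment length must be positive")
    # 1 ng/µL equals 1e-3 g/L, and 1 mol/L equals 1e9 nmol/L,
    # so the net conversion factor is 1e6.
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


# Hypothetical library: 2 ng/µL with a mean fragment length of 500 bp.
molarity = library_molarity_nM(2.0, 500)
print(f"{molarity:.2f} nM")
```

Under these assumptions, a library that falls short of the molarity required for loading is one case in which a PCR step becomes useful.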
### Indices and multiplexing {- #indices-multiplexing}
Usually, library preparation also entails tagging molecules with unique sample identifiers known as indices, which enable pooling molecules derived from multiple samples in a single sequencing run. This can be achieved in PCR-free protocols by using adaptors containing unique indices per sample, or by using indexed amplification primers in PCR-based library preparation protocols.
:::: {.tipbox}
Learn more about indices and multiplexing in [this article](https://academic.oup.com/nar/article/40/1/e3/1287690) by Kircher et al. [@Kircher2012-vy].
:::
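The way indices allow pooled reads to be assigned back to their samples can be sketched as follows. This is a simplified illustration, not the algorithm of any particular demultiplexing tool: the index sequences, the reads and the one-mismatch tolerance are all hypothetical assumptions.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_indices, max_mismatches=1):
    """Assign (index, insert) reads to samples according to their index.

    A read is assigned only if exactly one sample index lies within
    max_mismatches of the observed index; otherwise it is 'undetermined'.
    """
    bins = defaultdict(list)
    for index_seq, insert in reads:
        hits = [sample for sample, idx in sample_indices.items()
                if hamming(index_seq, idx) <= max_mismatches]
        bins[hits[0] if len(hits) == 1 else "undetermined"].append(insert)
    return dict(bins)


# Hypothetical 6-nt indices and pooled reads.
sample_indices = {"sample_A": "ACGTAC", "sample_B": "TGCATG"}
reads = [
    ("ACGTAC", "GATCTGACTG"),  # exact match to sample_A
    ("TGCATC", "TTGACCGTAA"),  # one mismatch to sample_B
    ("GGGGGG", "CCATGGTACA"),  # matches neither index
]
pools = demultiplex(reads, sample_indices)
print(pools)
```

Tolerating mismatches only works when the indices in a pool differ from each other at enough positions, which is one reason index sets are designed rather than chosen at random.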
### Unique molecular identifiers (UMIs) {- #unique-molecular-identifiers}
Unique molecular identifiers (UMIs) are a type of molecular barcoding that provides error correction and increased accuracy during sequencing by uniquely tagging each molecule (rather than each pool of molecules derived from a sample) in a sample library. UMIs are used for a wide range of sequencing applications, many of which revolve around identifying PCR duplicates in DNA and cDNA libraries. UMI deduplication is also useful for RNA-seq gene expression analysis and other quantitative sequencing methods.
:::: {.tipbox}
Learn more about unique molecular identifiers in [this article](https://www.nature.com/articles/nmeth.1778) by Kivioja et al. [@Kivioja2011-fe].
:::
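The logic of UMI-based deduplication can be illustrated with a minimal sketch. It assumes that reads sharing both the mapping position and the UMI derive from the same original molecule. The tuple layout and the example reads are hypothetical, and real tools additionally tolerate sequencing errors within the UMI, which this sketch does not.

```python
def deduplicate_by_umi(reads):
    """Keep the first read seen for each (mapping position, UMI) combination.

    reads: iterable of (position, umi, sequence) tuples.
    """
    seen = set()
    unique = []
    for position, umi, sequence in reads:
        key = (position, umi)
        if key not in seen:
            seen.add(key)
            unique.append((position, umi, sequence))
    return unique


# Hypothetical reads: the first two are PCR duplicates of one molecule,
# while the third maps to the same position but carries a different UMI.
reads = [
    (1050, "ACGTTGCA", "GATCTGACTGATGC"),
    (1050, "ACGTTGCA", "GATCTGACTGATGC"),
    (1050, "TTGACCGA", "GATCTGACTGATGC"),
    (2300, "ACGTTGCA", "CCATGGTACATTGA"),
]
unique_reads = deduplicate_by_umi(reads)
print(len(unique_reads))
```

Without UMIs, the first three reads would be indistinguishable, and collapsing them by position alone would discard a genuinely distinct molecule.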
:::: {.authorbox}
Contents of this section were created by [Antton Alberdi](#antton-alberdi).
:::
## Host genomics and microbial metagenomics {#library-meta-genomics}
The library preparation strategies for generating host genomic (HG) and microbial metagenomic (MG) data are generally the same.
#### DNA fragmentation {- #dna-fragmentation}
Short-read sequencing libraries require DNA to be sheared to the desired fragment length (usually 400-500 nucleotides), which can be achieved using either chemical (e.g., restriction enzymes) or physical (e.g., ultrasonication) procedures. Some long-read sequencing protocols aim to keep DNA molecules as long as possible, while others recommend fragmenting to optimal mid-length molecules (e.g., around 10,000 nucleotides for PacBio HiFi). After fragmentation, many library preparation protocols require repairing molecule ends by converting 5'-protruding and/or 3'-protruding ends to 5'-phosphorylated, blunt-end (see below) molecules.
#### Adaptor ligation {- #adaptor-ligation}
In shotgun libraries, adaptors are joined to DNA template molecules through ligation (e.g., using a ligase enzyme). The ligation process is slightly different depending on whether the DNA template has blunt or sticky ends. In blunt ends, both strands are of equal length (i.e., they end at the same base position, leaving no unpaired bases on either strand), while in sticky ends one strand is longer than the other. Some protocols deliberately create sticky ends from blunt-end fragmented DNA molecules by adding a single adenine base to form an overhang through an A-tailing reaction. This A overhang allows adaptors containing a single overhanging thymine base to pair with the DNA fragments.
An example of a blunt-end molecule:
```
5'-GATCTGACTGATGCGTATGCTAGT-3'
3'-CTAGACTGACTACGCATACGATCA-5'
```
An example of a sticky-end molecule:
```
5'-GATCTGACTGATGCGTATGCTAGT-3'
3'-CTAGACTGACTACGCATACGATC-5'
```
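The difference between the two example molecules can also be checked programmatically. The following sketch assumes the notation used in the examples above, with the top strand written 5'→3', the bottom strand written 3'→5', and both aligned at their left end. The function name is hypothetical and only the right-hand end of the duplex is classified.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def classify_right_end(top_5to3: str, bottom_3to5: str) -> str:
    """Classify the right-hand end of a duplex as blunt or sticky.

    Both strands are assumed to be aligned at their left end, with the top
    strand written 5'->3' and the bottom strand written 3'->5'.
    """
    # Every aligned position must form a Watson-Crick base pair.
    for top_base, bottom_base in zip(top_5to3, bottom_3to5):
        if COMPLEMENT[top_base] != bottom_base:
            raise ValueError("strands are not complementary")
    if len(top_5to3) == len(bottom_3to5):
        return "blunt"
    if len(top_5to3) > len(bottom_3to5):
        return "sticky (3' overhang)"  # the 3' end of the top strand protrudes
    return "sticky (5' overhang)"      # the 5' end of the bottom strand protrudes


blunt = classify_right_end("GATCTGACTGATGCGTATGCTAGT", "CTAGACTGACTACGCATACGATCA")
sticky = classify_right_end("GATCTGACTGATGCGTATGCTAGT", "CTAGACTGACTACGCATACGATC")
print(blunt, "|", sticky)
```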
### List of available protocols {- #library-meta-genomics-protocols}
|Type|Name|Author/owner|Protocol/Article|
|---|---|---|---|
|SRS|Blunt-End Single-Tube (BEST) library prep protocol|Open access|[Article](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12871)|
|SRS|Santa Cruz Reaction (SCR) single-stranded library prep protocol|Open access|[Article](https://academic.oup.com/jhered/article/112/3/241/6188529?login=true)|
|SRS| |Illumina|[Protocol](#)|
|LRS|SMRTbell prep kit 3.0 for PacBio HiFi Sequencing|PacBio|[Protocol](https://www.pacb.com/wp-content/uploads/Procedure-checklist-Preparing-whole-genome-and-metagenome-libraries-using-SMRTbell-prep-kit-3.0.pdf)|
## Host transcriptomics {#library-host-transcriptomics}
Library preparation for host transcriptomics (HT) requires some extra steps in addition to the procedures already described for [host genomics and microbial metagenomics](#library-meta-genomics). This is due to two main reasons. First, RNA molecules cannot be directly built into most sequencing libraries, so complementary DNA (cDNA) must be generated before library preparation. Second, transcript pools tend to be overwhelmingly dominated by transcripts of rRNA and mtDNA genes, which are often not of interest to the researcher.
#### Sample quality assessment {- #rrna-sample-quality-assessment}
Before starting any library preparation protocol, assessing the quality of RNA samples is strongly recommended. While traditionally assessed through agarose gel electrophoresis, RNA quality is nowadays assessed on electropherogram profiles, which are produced by nucleic acid fragment analysis instruments (e.g., Bioanalyzer, Fragment Analyzer). Traditionally, a simple model evaluating the 28S to 18S rRNA ratio was used as a criterion for RNA quality. However, the most common metric currently employed for assessing the preservation quality of RNA is the RNA integrity number (RIN), which accounts for more RNA features [@Schroeder2006-dn]. RIN values range from 10 (intact RNA) to 1 (totally degraded RNA). For example, the poly(A) enrichment procedures explained below require high-quality RNA (RIN > 8). RNA degradation leads to breaks within the transcript body, and because molecules are selected through their poly(A) tail, the 3' ends are enriched while the more 5' sequences are not captured, leading to a strong 3' bias for degraded RNA inputs.
#### DNA removal {- #DNA-removal}
Depending on the RNA extraction method employed, it is not rare for trace amounts of genomic DNA (gDNA) to be co-purified with RNA. Contaminating gDNA can interfere with reverse transcription and may lead to false positives, higher background, or lower detection in sensitive applications such as RT-qPCR. The traditional method of gDNA removal is the addition of DNase I to RNA extracts. DNase I must be removed prior to cDNA synthesis, since any residual enzyme would degrade single-stranded DNA. Unfortunately, RNA loss or damage can occur during the DNase I inactivation treatment. As an alternative to DNase I, double-strand-specific DNases are available to eliminate contaminating gDNA without affecting RNA or single-stranded DNA.
#### Stranded vs. non-stranded transcriptomics {- #stranded-transcriptomics}
RNA-Seq libraries can be stranded or non-stranded (unstranded), a decision that affects data analysis and interpretation. Stranded RNA-Seq (also referred to as strand-specific or directional RNA-Seq) enables researchers to determine the orientation of the transcript, whereas this information is lost in non-stranded, or standard, RNA-Seq. Non-stranded RNA-Seq is often sufficient for measuring gene expression in organisms with well-annotated genomes, as with a reference transcriptome, it is possible to infer orientation for most of the sequencing reads. As there are fewer steps than stranded library preparation, the benefits of this approach are lower cost, simpler execution, and greater recovery of material, which renders non-stranded RNA-Seq the preferred option for holo-omic analyses. In contrast, stranded RNA-Seq is useful if the aims include annotating genomes, identifying antisense transcripts or discovering novel transcripts.
#### cDNA conversion {- #cDNA-conversion}
Most RNA-Seq experiments are carried out on instruments that sequence DNA molecules, rather than RNA. This implies that RNA conversion to cDNA is a required step before library preparation. The synthesis of cDNA from an RNA template is carried out via reverse transcription using reverse transcriptases. In nature, these enzymes convert the viral RNA genome into a complementary DNA (cDNA) molecule, which can integrate into the host’s genome, among other processes.
Reverse transcription, similar to PCR, requires the use of primers. Two main types of primers are used:
* **Random primers**: these primers are oligonucleotides with random base sequences. They are often six nucleotides long and are usually referred to as random hexamers. While random primers help improve cDNA synthesis for detection, they are not suitable for full-length reverse transcription of long RNA. Increasing the concentration of random hexamers in reverse transcription reactions improves cDNA yield but results in shorter cDNA fragments due to increased binding at multiple sites on the same template.
* **Oligo(dT) primers**: these primers consist of a stretch of 12–18 deoxythymidines that anneal to the poly(A) tails of eukaryotic mRNAs (see the section below for further details).
Reverse transcription reactions for cDNA library construction and sequencing involve two main steps: first-strand cDNA synthesis and second-strand cDNA synthesis.
* **First-strand cDNA synthesis**: this initial step generates a cDNA:RNA hybrid through the three-step process described below.
    * **Primer annealing**: in this step primers anneal to the RNA template, which usually happens before the reverse transcriptase and other necessary components (e.g., buffer, dNTPs, RNase inhibitor) are added.
    * **DNA polymerisation**: in this step the complementary DNA is polymerised by the reverse transcriptase enzyme. With oligo(dT) primers (Tm ~35–50°C), the reaction is often incubated directly at the optimal temperature of the reverse transcriptase (37–50°C), while random hexamers typically have a lower Tm (~10–15°C) due to their shorter length. Using a thermostable reverse transcriptase allows a higher reaction temperature (e.g., 50°C), which helps denature RNA with high GC content or secondary structures without impacting enzyme activity. With such enzymes, high-temperature incubation can result in an increase in cDNA yield, length, and representation.
    * **Enzyme deactivation**: in this final step the temperature is increased to 70–85°C, depending upon the thermostability of the enzyme, to deactivate the reverse transcriptase.
* **Second-strand cDNA synthesis**: in this second step the first-strand cDNA is used as a template to generate double-stranded cDNA representing the RNA targets. Synthesis of double-stranded cDNA often employs a different DNA polymerase to produce the strand complementary to the first cDNA strand.
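The outcome of oligo(dT)-primed first-strand synthesis can be sketched in a few lines. This is a deliberately simplified model that assumes the primer anneals at the very 3' end of the poly(A) tail and that the reverse transcriptase copies the whole template. The function name, the default primer length and the example transcript are hypothetical.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna_5to3: str, oligo_dt_len: int = 18) -> str:
    """Return the first-strand cDNA (written 5'->3') of a polyadenylated mRNA.

    Raises ValueError if the poly(A) tail is too short for the oligo(dT)
    primer to anneal along its full length.
    """
    tail_len = len(mrna_5to3) - len(mrna_5to3.rstrip("A"))
    if tail_len < oligo_dt_len:
        raise ValueError("poly(A) tail shorter than the oligo(dT) primer")
    # The cDNA is the reverse complement of the RNA, in the DNA alphabet.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna_5to3))


# Hypothetical short transcript followed by a 20-nt poly(A) tail.
mrna = "AUGGCUUAA" + "A" * 20
cdna = first_strand_cdna(mrna)
print(cdna)
```

The stretch of thymines at the 5' end of the resulting cDNA corresponds to the oligo(dT) primer and the copied poly(A) tail, which is why this priming strategy only captures polyadenylated transcripts.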
#### rRNA depletion through poly-A enrichment {- #rrna-depletion-polya}
Ribosomal RNA (rRNA) helps translate the information in messenger RNA (mRNA) into protein. It is the predominant form of RNA found in most cells, and can make up over 80% of cellular RNA despite never being translated into protein itself. Consequently, most reads derived from RNA belong to rRNAs unless depletion strategies are implemented.
Excessively abundant rRNA sequences can be depleted using multiple strategies, which are covered in the [Microbial metatranscriptomics](#library-microbial-metatranscriptomics) section. The most broadly employed strategy when dealing with eukaryotic organisms is rRNA depletion through poly(A) enrichment. This strategy relies on the fact that mature coding mRNAs of eukaryotic organisms contain poly(A) tails, long chains (tens to hundreds) of adenine nucleotides that are added to primary RNA transcripts to increase the stability of the molecule. However, not all transcripts contain poly(A) tails. microRNAs, small nucleolar RNAs (snoRNAs), transfer RNAs (tRNAs), some long non-coding RNAs (lncRNAs), and even protein-coding mRNAs such as histone mRNAs do not contain poly(A) tails, and will thus be removed together with rRNA during poly(A) selection. If the expression of such transcripts is of interest, the use of [alternative methods](#library-microbial-metatranscriptomics) is recommended.
The most broadly employed strategies to deplete rRNA through poly(A) enrichment rely either on hybridisation with oligo(dT)-attached magnetic beads or on oligo(dT) priming during the cDNA conversion step. In the former strategy, poly(A)-containing RNA molecules hybridise with oligo(dT) stretches attached to magnetic beads. Following hybridisation, the supernatant containing the non-polyadenylated molecules is removed. The beads are then washed prior to elution of the poly(A)-selected RNA in water or buffer.
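The effect of bead-based poly(A) selection on a mixed RNA pool can be mimicked in silico. The sketch below is only an illustration of the principle: the minimum tail length of 20 adenines, the transcript names and the sequences are hypothetical assumptions, and real hybridisation does not follow a strict length threshold.

```python
def poly_a_tail_length(rna_5to3: str) -> int:
    """Length of the uninterrupted run of adenines at the 3' end."""
    return len(rna_5to3) - len(rna_5to3.rstrip("A"))


def poly_a_select(transcripts: dict, min_tail: int = 20):
    """Split transcripts into bead-bound (eluted) and supernatant fractions."""
    eluted, supernatant = {}, {}
    for name, sequence in transcripts.items():
        fraction = eluted if poly_a_tail_length(sequence) >= min_tail else supernatant
        fraction[name] = sequence
    return eluted, supernatant


# Hypothetical RNA pool: one polyadenylated mRNA and two molecules without tails.
transcripts = {
    "mRNA_1": "AUGGCUUGCGAUUGA" + "A" * 60,
    "rRNA_fragment": "GGCUACCACAUCCAAGGAAGGCAGC",
    "histone_like_mRNA": "AUGUCUGGACGUGGCAAGUAAGGCUCUUUUCAGAGCCACCA",
}
eluted, supernatant = poly_a_select(transcripts)
print(sorted(eluted), sorted(supernatant))
```

The example shows both the intended depletion of rRNA and the side effect described above, namely the loss of protein-coding transcripts that lack a poly(A) tail.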
### List of available protocols {- #rrna-polya-depletion-protocols}
|Type|Name|Author/owner|Protocol/Article|
|---|---|---|---|
|oligo(dT) hybridisation|Dynabeads Oligo (dT)25-61005|Thermo Fisher|[Protocol](https://www.thermofisher.com/dk/en/home/references/protocols/nucleic-acid-purification-and-analysis/mrna-protocols/dynabeads-oligo-dt-25-61002.html)|
|oligo(dT) priming| | | |
:::: {.authorbox}
Contents of this section were created by [Antton Alberdi](#antton-alberdi).
:::
## Microbial metatranscriptomics {#library-microbial-metatranscriptomics}
Sequencing library preparation for microbial metatranscriptomics faces the same challenges as host transcriptomics, but the fact that prokaryotic mRNAs have no poly(A) tails makes it impossible to apply oligo(dT)-based rRNA depletion strategies. There are three alternatives through which prokaryotic rRNA can be depleted. These three strategies require designing oligos, probes or guides whose sequences complement the sequences that should be removed. Most commercial kits contain probes designed to remove rRNA sequences of the most commonly employed animal hosts (human, mouse, rat), as well as bacteria, but custom probes targeting any genes could be employed. The first two methods shown below are implemented before library preparation, so independent reactions must be run for each sample. The last strategy is implemented after library preparation, which enables multiple indexed libraries to be pooled and a single reaction to be performed per pool.
### Capture-based rRNA depletion {- #capture-based-rrna-depletion}
This method relies on capturing rRNA with complementary oligos that are coupled to paramagnetic beads. Unwanted transcripts bind to the beads, which can then be retained using a magnet, while the non-hybridising transcripts remain in solution.
### RNase-based rRNA depletion {- #rnase-based-rrna-depletion}
A more recent alternative to capture-based rRNA depletion replaces the paramagnetic beads with RNase H, which degrades the RNA strand of the RNA:DNA hybrids formed between the unwanted transcripts and the complementary DNA oligos [@Huang2020-xf].
### CRISPR/Cas9-based rRNA depletion {- #cas9-based-rrna-depletion}
The newest of the three methods relies on the DNA cleavage capacity of the Cas9 enzyme [@Gu2016-yf]. In this method, custom-designed guides direct the Cas9 enzyme to cleave unwanted sequences. This strategy is applied once libraries are prepared, and before the final PCR amplification is conducted. Once the targeted molecules are cleaved, they lack one of the two adaptors and are therefore not amplified, resulting in a considerable depletion compared to the rest of the library.
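The principle behind Cas9-based depletion can be sketched as a simple filter over library inserts. This is a strong simplification that treats any insert containing a guide target on either strand as cleaved, and it ignores PAM requirements, mismatches and cleavage efficiency. The guide sequence, the inserts and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def amplifiable_after_cas9(inserts, guides):
    """Return the inserts that would still carry both adaptors after cleavage.

    An insert is considered cleaved, and therefore not amplifiable, if it
    contains any guide target on either strand.
    """
    targets = set(guides) | {reverse_complement(g) for g in guides}
    return [ins for ins in inserts if not any(t in ins for t in targets)]


# Hypothetical 20-nt guide target and three library inserts.
guide = "GATTACAGATTACAGATTAC"
inserts = [
    "ACGT" + guide + "TTGCA",                  # target on the forward strand
    "CC" + reverse_complement(guide) + "GG",   # target on the reverse strand
    "ACGTACGTACGTACGTACGTACGTA",               # no target
]
kept = amplifiable_after_cas9(inserts, [guide])
print(kept)
```

Because the filter acts on adaptor-ligated molecules, the same reaction can be applied to a pool of indexed libraries, in line with the single reaction per pool described for this strategy.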
### List of available protocols {- #rrna-depletion-protocols}
|Name|Strategy|Author/owner|Protocol/Article|
|---|---|---|---|
| Custom capture-based depletion | Capture-based | Open Source | [Article](https://www.nature.com/articles/s41598-019-48692-2) [@Kraus2019-fq] |
| Legacy Ribo-Zero | Capture-based | Illumina | |
| Custom RNase-based depletion | RNase-based | Open Source | [Article](https://academic.oup.com/nar/article/48/4/e20/5687826) [@Huang2020-xf] |
| Ribo-Zero Plus | RNase-based | Illumina | |
| NEBNext® rRNA Depletion Kit | RNase-based | NEB | |
| DASH | Cas9-based | Open Source | [Article](https://rnajournal.cshlp.org/content/26/8/1069.full) [@Prezza2020-ln] |
:::: {.authorbox}
Contents of this section were created by [Antton Alberdi](#antton-alberdi).
:::