forked from UCLouvain-CBIO/WSBIM2122
-
Notifications
You must be signed in to change notification settings - Fork 0
/
60-ms.Rmd
540 lines (397 loc) · 19.1 KB
/
60-ms.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
# Mass spectrometry {#sec-ms}
The **objectives** of this chapter it to learn how mass spectrometry
works and how mass spectrometry data is structured. At the end,
students should be familiar with
- Acquisition of spectra over retention time
- Spectra and chromatograms
- MS1 and MS2 spectra and their relation
- Precursor and fragment ions
- Peptide-spectrum matching
- An understanding of the various MS1, MS2, labelled and label-free
quantitative approaches.
```{r, message = FALSE, echo = FALSE}
library("tidyverse")
library("Spectra")
library("magrittr")
library("ggplot2")
library("PSMatch")
```
Proteins as the functional units in all living organisms, and they are
highly dynamic. The caterpillar and the resulting butterfly have the
same genome. The complement of all the expressed proteins, termed the
proteome is however very different.
There are different modalities of the proteome that are of
interest. In addition to the presence or amount of protein in a
biological samples, it is also important to study the interactions
between proteins forming protein-protein complexes, the presence of
post-transcriptional modification (such as, for example,
phosphorylations), the rate at which proteins are produced and
degraded, or where the proteins reside inside a cell.
The technique of choice to study proteins in a high throughput way is
*mass spectrometry*.
For approachable introductions to sample preparation, mass
spectrometry, data interpretation and analysis, readers are redirected
to:
- *A beginner's guide to mass spectrometry–based proteomics* [@Sinha:2020]
- *The ABC's (and XYZ's) of peptide sequencing* [@Steen:2004]
- *How do shotgun proteomics algorithms identify proteins?* [@Marcotte:2007]
- *An Introduction to Mass Spectrometry-Based Proteomics* [@Shuken:2023]
## Setup
```{r sticker, fig.cap = "The `Spectra` package.", out.width = '50%', fig.margin=TRUE, echo=FALSE}
knitr::include_graphics("figs/Spectra.png")
```
We are going to use the Bioconductor [@Huber:2015] [`Spectra`
package](https://rformassspectrometry.github.io/Spectra/articles/Spectra.html),
which can be install with the `BiocManager` package, available from
CRAN. We will also make use of some test data from the `msdata`
package. Identification data will be handled by the
[`PSMatch`](https://rformassspectrometry.github.io/PSMatch/) package.
```{r, eval = FALSE}
BiocManager::install("msdata")
BiocManager::install("ProtGenerics")
BiocManager::install("Spectra")
BiocManager::install("PSMatch")
```
## How does mass spectrometry work?
Mass spectrometry (MS) is a technology that *separates* charged
molecules (ions) based on their mass to charge ratio (M/Z). It is
often coupled to chromatography (liquid LC, but can also be gas-based
GC). The time an analytes takes to elute from the chromatography
column is the *retention time*.
```{r chrom, fig.cap="A chromatogram, illustrating the total amount of analytes over the retention time.", echo=FALSE, out.width = "100%"}
knitr::include_graphics("./figs/chromatogram.png")
```
An mass spectrometer is composed of three components:
1. The *source*, that ionises the molecules: examples are Matrix-assisted
laser desorption/ionisation (MALDI) or electrospray ionisation.
(ESI)
2. The *analyser*, that separates the ions: Time of flight (TOF) or Orbitrap.
3. The *detector* that quantifies the ions.
When using mass spectrometry for proteomics, the proteins are first
digested with a protease such as trypsin. In mass shotgun proteomics,
the analytes assayed in the mass spectrometer are peptides.
Often, ions are subjected to more than a single MS round. After a
first round of separation, the peaks in the spectra, called MS1
spectra, represent peptides. At this stage, the only information we
possess about these peptides are their retention time and their
mass-to-charge (we can also infer their charge be inspecting their
isotopic envelope, i.e the peaks of the individual isotopes, see
below), which is not enough to infer their identify (i.e. their
sequence).
In MSMS (or MS2), the settings of the mass spectrometer are set
automatically to select a certain number of MS1 peaks (for example
20). Once a narrow M/Z range has been selected (corresponding to one
high-intensity peak, a peptide, and some background noise), it is
fragmented (using for example collision-induced dissociation (CID),
higher energy collisional dissociation (HCD) or electron-transfer
dissociation (ETD)). The fragment ions are then themselves separated
in the analyser to produce a MS2 spectrum. The unique fragment ion
pattern can then be used to infer the peptide sequence using de novo
sequencing (when the spectrum is of high enough quality) of using a
search engine such as, for example Mascot, MSGF+, ..., that will match
the observed, experimental spectrum to theoratical spectra (see
details below).
```{r ms2, fig.cap="Schematics of a mass spectrometer and two rounds of MS.", echo=FALSE, out.width = "100%"}
knitr::include_graphics("figs/SchematicMS2.png")
```
The animation below show how 25 ions different ions (i.e. having
different M/Z values) are separated throughout the MS analysis and are
eventually detected (i.e. quantified). The final frame shows the
hypothetical spectrum.
```{r mstut, fig.cap="Separation and detection of ions in a mass spectrometer.", echo=FALSE, out.width = "80%"}
knitr::include_graphics("figs/mstut.gif")
```
See also the Johannes Rainer's animations in the
[SpectraTutorials](https://jorainer.github.io/SpectraTutorials/articles/analyzing-MS-data-from-different-sources-with-Spectra.html)
vignette.
The figures below illustrate the two rounds of MS. The spectrum on the
left is an MS1 spectrum acquired after 21 minutes and 3 seconds of
elution. 10 peaks, highlited by dotted vertical lines, were selected
for MS2 analysis. The peak at M/Z 460.79 (488.8) is highlighted by a
red (orange) vertical line on the MS1 spectrum and the fragment
spectra are shown on the MS2 spectrum on the top (bottom) right
figure.
```{r ms, fig.cap="Parent ions in the MS1 spectrum (left) and two sected fragment ions MS2 spectra (right).", echo=FALSE, out.width = "100%"}
knitr::include_graphics("figs/MS1-MS2-spectra.png")
```
The figures below represent the 3 dimensions of MS data: a set of
spectra (M/Z and intensity) of retention time, as well as the
interleaved nature of MS1 and MS2 (and there could be more levels)
data.
```{r ms1s, fig.cap="MS1 spectra (blue) over retention time (left). MS2 spectra (pink) interleaved between two MS1 spectra (right),", echo=FALSE, fig.fullwidth=TRUE, out.width = "90%"}
knitr::include_graphics("figs/ms3D.png")
```
## Reading and accessing MS data
Let's start by accessing an MS raw data file from the `msdata`
package. The `proteomics()` function checks the package content for a
file matching the `"2014"` pattern and returns its full name.
```{r}
basename(f <- msdata::proteomics(pattern = "2014", full.names = TRUE))
```
There exist a set of community-maintained file formats for mass
spectrometry data. The most widely used formats are `mzML` and
`mzXML`, both relying on the
[XML](https://www.w3schools.com/xml/default.asp) markup
language. These two formats, as well as others, can be read into R as
`Spectra` objects:
```{r Spectra}
library("Spectra")
ms <- Spectra(f)
ms
```
Once a `Spectra` object has been created, it can be queried and
manipulated using a whole set of accessor functions, documented in the
[`Spectra`-class
documentation](https://rformassspectrometry.github.io/Spectra/reference/Spectra.html).
`Spectra` objects can be subset with the `[` operator. In addition to
the raw data (the M/Z and intensity values) that compose the
individual scans, the annotation of each spectrum is available as a
single and large table called *spectraData*, composed of a set of
*spectraVariables* columns.
```{r, fig.cap = "Schematic representation of what is referred to by MS data: a collection of spectra and a table containing spectrum-level annotations.", fig.fullwidth=TRUE, echo=FALSE}
knitr::include_graphics("figs/spectra-schematics.png")
```
`r msmbstyle::question_begin()`
Create a `Spectra` object as described above and reply to the
following questions:
- Use the `length()` function to check how many spectra have been
acquired in this MS run?
- How many MS levels are there in this run? You can use the
`msLevel()` accessor to get the MS level of all the scans.
- How many spectra of each MS level are there?
- For how long was the data acquired? Use the `rtime()` accessor to
obtain the retention times.
- Which MS2 spectrum has the highest precursor intensity? The
precursor intensities can be retrieved with the
`precursorIntensity()` function.
- What is that's spectrum retention time and precuror M/Z? Use the
`precursorMz()` and `rtime()` accessors.
- Visualise that spectrum using using the `plotSpectra()`
function. Reproduce that figure youself by accessing the scan's raw
data with the `intensity()` and `mz()` functions.
`r msmbstyle::question_end()`
The figure below show is an illustration of how mass spectrometry
works:
1. The chromatogram at the top display to total ion current along the
retention time. The vertical line identifies one scan in particular
at retention time 1800.68 seconds (the 2807th scan).
2. The spectra on the second line represent the full MS1 spectrum
marked by the red line. The vertical lines identify the 10
precursor ions that where selected for MS2 analysis. The zoomed in
on the right shows one specific precursor peak.
3. The MS2 spectra displayed along the two rows at the bottom are
those resulting from the fragmentation of the 10 precursor peaks
identified by the vertical bars above.
```{r, fig.cap = "", out.width = "100%", echo = FALSE}
knitr::include_graphics("figs/msvisfig.png")
```
`r msmbstyle::question_begin()`
Reproduce the figure above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
1. The chromatogram can be created by extracting the `totIonCurrent`
and `rtime` variables for all MS1 spectra:
```{r}
with(spectraData(filterMsLevel(ms, 1)),
plot(rtime, totIonCurrent, type = "l"))
abline(v = rtime(ms)[2807], col = "red")
```
2. The `filterPrecursorScan()` function can be used to retains parent
(MS1) and children scans (MS2) of a scan, as defined by its
acquisition number.
```{r}
ms_2 <- filterPrecursorScan(ms, 2807)
plotSpectra(ms[2807], xlim = c(400, 1000))
abline(v = precursorMz(ms_2)[-1], col = "grey")
abline(v = precursorMz(ms_2)[2], col = "red")
```
```{r}
plotSpectra(ms[2807], xlim = c(521.2, 522.5), type = "l")
abline(v = precursorMz(ms_2)[2], col = "red")
```
3. The `plotSpectra()` function is used to plot all 10 MS2 spectra in
one call.
```{r, fig.height = 12, fig.width = 8}
plotSpectra(ms_2[-1])
```
`r msmbstyle::solution_end()`
## Identification: matching spectra to peptides
The raw data is still a long way of obtaining biologically relevant
proteomics data. The first step is to identify the peptides that have
been acquired in the MS. Peptide identification works by comparing
expected and observed spectra. As shown below, when a precursor
peptide ion is fragmented in a CID cell, it breaks at specific bonds,
producing sets of peaks (*a*, *b*, *c* and *x*, *y*, *z*) that can be
predicted.
```{r frag, results='markup', fig.margin=FALSE, fig.cap="Peptide fragmentation.", echo=FALSE, out.width = "80%"}
knitr::include_graphics("figs/frag.png")
```
It is thus possible to calculate the expected set of fagment peaks for
a given peptide, such as *THSQEEMQHMQR* below and match these
theoretical fragments to those observed in MS2 spectra.
```{r, message = FALSE}
library("PSMatch")
calculateFragments("THSQEEMQHMQR")
```
In a full experiment, all possible peptides from the proteome of
interest (such as databases that can be downloaded from the [UniProt
site](https://www.uniprot.org/)[^uniprot]) are compared to the
millions of observed spectra.
[^uniprot]: The Universal Protein Resource (UniProt) is a freely and
accessible comprehensive resource for protein sequence and
annotation data.
The matching between millions of observed and possible spectra causes
real challenges due to the large search space and the risk of false
positives. See [@Kall:2008] for further reading.
### Handling identification results
Identification results are also stored in an XML-based file format,
called `mzIdentML` (or `mzId` for short). Below, we access such a file
available in the `msdata` package and load its content into R with the
`PSMs::PSM()` function.
```{r readPSMs}
basename(idf <- msdata::ident(full.names = TRUE))
id <- PSM(idf)
head(id)
```
`r msmbstyle::question_begin()`
How many of PSM entries originate from the real and decoy databases?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
table(id$isDecoy)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Visualise the `MS.GF.RawScore` densitues for the real and decoy
hits. Interprete the figures.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
library("ggplot2")
data.frame(id) %>%
ggplot(aes(x = MS.GF.RawScore, colour = isDecoy)) +
geom_density()
```
`r msmbstyle::solution_end()`
Before proceeding, we want to only focus on relevant peptides. The
`filterPSMs()` function will remove decoy peptides, only keep PSMs of
rank 1 (i.e. the best respective matches), and remove peptides that
match multiple proteins[^noteID].
[^noteID]: A more rigorous processing of these identifications would
be required, for instance a way to evalutation the false discovery
rate in these hits (which we know from above is very high). The
`r BiocStyle::Biocpkg("MSnID")` is highly recommended.
```{r}
id <- filterPSMs(id)
```
For simplicity, we are also going to ignore the location of the
peptide modifications by removing the `modLocation` column.
```{r}
id$modLocation <- NULL
id <- unique(id)
```
Now that we have matching raw data (the `ms` variable) and
identification data (the `id` variable), it is possible to join the
latter to the former's spectra data:
```{r joinSpectraData}
ms <- joinSpectraData(ms, id,
by.x = "spectrumId",
by.y = "spectrumID")
ms
```
We are going to conlude this part by annotating an MS2 spectrum with
the calculated peptide fragment.
`r msmbstyle::question_begin()`
- Choose two MS2 spectra that have an `MS.GF.RawScore` greater than
100 and are matched to the same peptide sequence.
- Then visualise them annotating their fragment ion with its fragment
ions by passing the `addFragments` function to `plotSpectra`'s
`labels` argument.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, fig.height = 9, fig.width = 7, message = FALSE, fig.cap = "Two MS2 spectra that were matched to the `TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR` peptide."}
seq <- as_tibble(spectraData(ms)) %>%
filter(`MS.GF.RawScore` > 100) %>%
group_by(sequence) %>%
summarise(n = n()) %>%
filter(n == 2) %>%
head(1) %>%
pull(sequence)
i <- which(ms$sequence == seq)
plotSpectra(ms[i], labels = addFragments,
labelPos = 3, labelCol = "orange",
main = ms[i]$sequence)
```
For a more direct comparision, the two spectra can also be represented
on the same graph pointing upwards and downwards:
```{r, fig.height = 5, fig.width = 7, message = FALSE, warning = FALSE, fig.cap = "Mirror representation of to matches to the `TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR` peptide."}
plotSpectraMirror(ms[i[1]], ms[i[2]],
labels = addFragments,
labelPos = 3, labelCol = "orange")
```
`r msmbstyle::solution_end()`
From the list of identified peptides, it is then necessary to infer
the most provable proteins that were present in the biological
sample. **Protein inference** is a difficult task, as peptides often
match multiple proteins (either different isoforms or proteins
stemming from different gene but with identical domains), which leads
to the definition of protein groups, i.e. sets of proteins that can't
be distinguished with the set of identified peptides at hand. See
[@Nesvizhskii:2005] for further reading.
## Quantitative proteomics
The last step of MS data processing is to quantify peptide abundances
in the biological samples. The table below summarises the different
possibilites depending whether the proteins or peptides are labelled,
and whether the quantitation is performed in MS1 or MS2.
| |Label-free |Labelled |
|:---|:----------|:----------|
|MS1 |XIC |SILAC, 15N |
|MS2 |Counting |iTRAQ, TMT |
### Label-free MS2: Spectral counting
In spectral counting, on simply counts the number of quantified
peptides that are assigned to a protein.
```{r sc, echo=FALSE, out.width = "75%", fig.cap = "Spectral counting. Figure from the `Pbase` package."}
knitr::include_graphics("./figs/pbase.png")
```
### Labelled MS2: Isobaric tagging
Isobaric tagging refers to the labelling using isobaric tags,
i.e. chemical tags that have the same mass and hence can't be
distinguish by the spectrometer. The peptides of different samples (4,
6, 10, 11 or 16) are labelled with different tags and combined prior
to mass spectrometry acquisition. Given that they are isobaric, all
identical peptides, irrespective of the tag and this the sample of
origin, are co-analysed, up to fragmentation prior to MS2
analysis. During fragmentation, the isobaric tags fall of, fragment
themselves, and result in a set of sample specific peaks. These
specific peaks can be used to infer sample-specific quantitation,
while the rest of the MS2 spectrum is used for identification.
```{r itraq, echo=FALSE, out.width = "75%", fig.cap = "iTRAQ 4-plex isobaric tagging. Tandem Mass Tags (TMT) offer up to 16 tags."}
knitr::include_graphics("./figs/itraq.png")
```
### Label-free MS1: extracted ion chromatograms
In label-free quantitation, the precursor peaks that match an
identified peptide are integrated of retention time and the area under
that *extracted ion chromatogram* is used to quantify that peptide in
that sample.
```{r lf, echo=FALSE, out.width = "75%", fig.cap = "Label-free quantitation. Figure credit [Johannes Rainer](https://github.com/jorainer/)."}
knitr::include_graphics("./figs/chrompeaks.png")
```
### Labelled MS1: SILAC
In SILAc quantitation, sample are grown in a medium that contains
heavy amino acids (typically arginine and lysine). All proteins gown
in this *heavy* growth medium contain the heavy form of these amino
acids. Two samples, one grown in heavy medium, and one grown in normal
(light) medium are then combined and analysed together. The heavy
peptides precursor peaks are systematically shifted compared to the
light ones, and the ratio between the height of a heavy and light
peaks can be used to calculate peptide and protein fold-changes.
```{r silab, echo=FALSE, out.width = "75%", fig.cap = "Silac quantitation. Figure credit Wikimedia Commons."}
knitr::include_graphics("./figs/Silac.png")
```
These different quantitation techniques come with their respective
benefits and distinct challenges, such as large quantities of raw data
processing, data transformation and normalisation, missing values, and
different underlying statistical models for the quantitative data
(count data for spectral counting, continuous data for the others).
The next chapter will focus on the handling and processing of
quantitative proteomics data.