generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathTargetDecoyTutorial.Rmd
222 lines (154 loc) · 13.7 KB
/
TargetDecoyTutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
---
title: "Introduction to TargetDecoy"
author:
- name: Lieven Clement
affiliation:
- Ghent University
output:
html_document:
code_download: true
theme: flatly
toc: true
toc_float: true
highlight: tango
number_sections: true
linkcolor: blue
urlcolor: blue
citecolor: blue
---
```{r setup, include=FALSE, class.source = 'fold-hide'}
library(knitr)
opts_chunk$set(
collapse = TRUE,
comment = "#>",
crop = NULL,
fig.width = 6,
dpi = 72
)
```
```{r "libraries", message=FALSE, warning=FALSE, class.source = 'fold-hide'}
library(TargetDecoy)
library(ggplot2)
library(tidyverse)
data("ModSwissXT")
hlp <- TargetDecoy:::.getDF(ModSwissXT)
names(hlp) <- names(hlp) %>%
str_replace(pattern=":",replacement="_")
hlp <- hlp %>%
mutate(omssa_evalue=as.double(omssa_evalue))
names(hlp) <- names(hlp) %>% str_replace(pattern="-",replacement = "_")
hlp <- hlp %>% mutate(ms_gf_specevalue=as.double(ms_gf_specevalue))
```
# Intro
Shotgun proteomics relies on the assignment of a large number of spectra to theoretical peptides derived from a sequence database. Multiple search engines have been developed for this task, each with its own advantages and drawbacks. Most proteomics database searches are performed as so-called target/decoy searches. A crucial assumption of the target/decoy approach is that the decoy peptide-spectral
match (PSM) hits have similar properties as bad target hits so that the decoys can be used to characterize the distribution of bad hits. In this tutorial we will introduce diagnostic plots that can be used to evaluate these assumptions.
## Basic Statistical Concepts
We first introduce some notation. With x we denote the PSM score and we assume that larger score values indicate a better match to the theoretical spectrum. Then the scores will follow a mixture distribution:
$$ f(x)=\pi_b f_b (x)+(1-\pi_b ) f_g(x), $$
with $f(x)$ the target PSM score distribution, $f_b(x)$ the mixture component corresponding to incorrect PSMs, $f_g(x)$ the mixture component corresponding to the correct PSMs and $\pi_b$ the fraction of incorrect PSMs.
Based on the mixture distribution we can calculate the posterior probability that a PSM with score x is a bad match:
$$ P[\text{Bad hit} \vert \text{score }x]=\frac{\pi_b f_b (x)}{f(x)}, $$
which is also referred to as the posterior error probability (PEP) in mass spectrometry based proteomics.
Based on the mixture model, we can also calculate the posterior probability that a random PSM in the set of all PSMs with scores above a score threshold t is a bad hit (see e.g. Figure 1):
$$ P[\text{Bad hit} \vert \text{score }x>t]=\pi_b \frac{1-f_b (t)}{1-F(t)}, $$
With $F(t)$ and $F_b(t)$ the cumulative distribution functions of all target PSMs and of the bad PSMs, respectively. Hence, $1-F_b(t)$ is the probability to observe a bad PSM hit above the threshold and, $1-F(t)$ the probability to observe a target PSM hit above the threshold. The probability $P[\text{Bad hit} \vert \text{score }x>t]$ is also referred to as the false discovery rate (FDR) of the set of PSMs with scores above the threshold t. Hence, the FDR has the interpretation of the probability on a bad hit in the set of all target hits that are returned in the final PSM list.
```{r class.source = 'fold-hide'}
library(mgcv)
dec <- -log10(hlp$ms_gf_specevalue[hlp$isdecoy]) %>% na.exclude()
tar <- -log10(hlp$ms_gf_specevalue[!hlp$isdecoy]) %>% na.exclude()
breaks <- seq(0,30,.5)
#binWidth <-2
#breaks <- seq(floor(min(c(dec,tar))/binWidth)*binWidth,ceiling(max(c(dec,tar))/binWidth)*binWidth,binWidth)
#code if we register the modes by substracting the mode from the target scores and the decoy scores.
#breaks=seq(-(ceiling(abs(min(c(dec,tar))/binWidth))+.5)*binWidth,(ceiling(max(c(dec,tar))/binWidth)+.5)*binWidth,binWidth)
histDec <- hist(dec,breaks=breaks,plot = FALSE)
histTar <- hist(tar,breaks=breaks,plot=FALSE)
histSam <- hist(c(dec,tar),breaks=breaks, plot = FALSE)
grid<-seq(0,30,.1)
countsTarG<-data.frame(y=histTar$counts-histDec$counts,x=histTar$mids)
countsTarG$y[countsTarG$y<0]<-0
fitTarG<-gam(y~s(x),data=countsTarG,family=poisson)
fitTarGrid<-exp(predict(fitTarG,newdata=data.frame(x=grid)))
countsDec<-data.frame(y=histDec$counts,x=histDec$mids)
fitDec<-gam(y~s(x),data=countsDec,family=poisson)
fitBad<-exp(predict(fitDec,newdata=data.frame(x=grid)))
plot(histTar,xlab="MS-GF+ Score",ylab="# PSMs",main="Pyrococcus Search",border="white",col="grey",cex.axis=1.5,cex.main=1.5,cex.lab=1.5,ylim=c(0,1500), axes =FALSE)
axis(side=2,at=c(0,750,1500))
axis(side=1,at=c(0,10,20,30))
lines(grid,fitBad+fitTarGrid,col="black",lwd=2)
lines(grid,fitBad,col="red",lwd=2)
lines(grid,fitTarGrid,col="blue",lwd=2)
lines(histDec$mids,histDec$counts,col="red",type="h")
legend("topright",legend=c("targets","decoys"),col=c("grey","red"), lty =rep(1,2))
```
We would like to calculate the FDR corresponding to the set op PSMs with a target score above the threshold t.
In order to calculate the FDR, we thus have to characterize the distribution of the bad hits and of all PSMs.
In proteomics this is done by the use of the target/decoy approach.
## Target Decoy Approach
When using a competitive target decoy search, the FDR of the set of returned PSMs is estimated by dividing the number of accepted decoys PSMs by the number of accepted target PSMs above a certain score cutoff [1].
$$ \widehat{\text{FDR}}(t)=\frac{\# decoys | x>t}{\#targets |x>t} $$
This can be rewritten as:
$$ \widehat{\text{FDR}}(t)=\frac{\#decoys}{\#targets}\frac{\frac{\# decoys | x>t}{\#decoys}}{\frac{\#targets |x>t}{\#targets}} $$
$$ \widehat{\text{FDR}}(t)=\hat\pi_b\frac{1-\bar{F}_0(t)}{1-\bar{F}(t)} $$
Hence, the proportion of bad hits \\( \pi_b \\) is estimated as the number of decoys divided by the number of targets, and the competitive TDA assumes that it is equally likely that a bad hit matches to a bad target or to a decoy; the probability of a (bad) target PSM hit above the threshold is estimated based on the empirical cumulative distribution in the sample, i.e. as the fraction of targets (decoys) that are above the threshold. Hence, a second assumption is that the decoy matches provide a good simulation of the target matches. See e.g. [2]. These assumptions can be evaluated with our TargetDecoy R/Bioconductor package.
# Example: MSGF+ search on Swiss Prot
The Pyrococcus furiosus (strain ATCC 43587 / DSM 3638 / JCM 8422 / Vc1) reference proteome. The resulting database has 2,051 proteins in total ([https://www.uniprot.org/uniprot/?query=taxonomy:186497](https://www.uniprot.org/uniprot/?query=taxonomy:186497), taxonomy:"Pyrococcus furiosus (strain ATCC 43587 / DSM 3638 / JCM 8422 / Vc1) [186497]").
The data can be found on [https://github.com/statOmics/PDA22GTPB/tree/data](https://github.com/statOmics/PDA22GTPB/tree/data).
We will use the mzid file for the pyrococcus example, which can be found at data/identification/pyrococcusMSGF+.mzid.
1. Download the code of this vignette by clicking the code button in the top of this page and downloading.
2. Open the downloaded file in Rstudio
3. Load libraries
```{r}
library(TargetDecoy)
library(RCurl)
library(mzID)
```
4. Download data in your working directory. You can skip this step if you downloaded all data on your computer.
```{r}
download.file("https://raw.githubusercontent.com/statOmics/PDA22GTPB/data/identification/pyrococcusMSGF%2B.mzid","pyrococcusMSGF+.mzid")
```
5. Set a path to your working directory and import the mzID file in R.
```{r}
path2File <- "pyrococcusMSGF+.mzid"
msgf <- mzID(path2File)
```
6. Launch the evalTargetDecoys Shiny Gadget. Note, that you have to erase the eval=FALSE option in order to run the code.
```{r eval=FALSE}
evalTargetDecoys(msgf)
```
7. You can immediately make the plots in the script using the code below
```{r}
evalTargetDecoys(msgf,"isdecoy","ms-gf:evalue")
```
Two diagnostic plots are generated. A histogram of target and decoy scores (top panels) and Probability-Probability plots (PP-plots). In the histogram the shape of the decoys (red) should be equal to that of bad target hits (first mode in the target distribution indicated in green). The height of the decoys can be slightly lower than the first mode in the target distribution because some good target hits also have a low score. Here, the histogram does not indicate problems with the TDA.
Deviations from the assumptions of TDA can be better evaluated in the PP-plot (lower panel). The PP-plot displays the empirical cumulative distribution (ECDF) from the target distribution in function that of the decoy distribution. PP-plots have the property that they show a straight 45 degree line through the origin if and only if both distributions are equivalent. Any deviation from this straight line indicates that the distributions differ. For targets and decoys this is obviously the case. The target distribution is stochastically larger than the decoy distribution as larger scores indicate more reliable hits and as decoys are believed to behave similarly to bad target hits. Hence, the PP-plot for targets vs decoys will always lay below the 45 degree line. When the decoys are a good simulation for the bad target hits, however, the lower values in the PP-plot should lay on a straight line that is located around $\hat{\pi}_b$. Indeed, at small target PSM scores are most likely bad hits. The slope of the black line in the p-plot is based on the estimate of $\hat{\pi}_b$. The first part of the PP-plot for the pyrococcus example is linear with a slope that equals $\hat{\pi}_0$. This indicates that the decoy distribution and the mixture component for incorrect PSMs of the target mixture distribution coincide. The second part of the plot deviates from the line towards higher percentiles because the upper tail of the mixture component for incorrect subset PSMs and the lower tail of the mixture component for correct subset PSMs overlap and therefore target PSMs occur at a higher probability at intermediate scores than in the decoy distribution. Finally, the PP-plot becomes vertical, because all the decoys have been observed (decoy percentile=1) before all the scores of the target PSMs are observed. If we see this profile in the PP-plot, we have a good indication that the set of decoys from the complete search is representative for the mixture component for incorrect PSMs of the target mixture distribution. Hence, the assumptions of the concatenated TDA approach are not violated for the Pyrococcus example.
When the assumptions of the concatenated TDA approach are violated, the dots in the PP-plot at lower percentiles will deviate from the $ \hat{\pi}_b$ line. In case the PP-plot is still a straight line at lower percentiles, then the shape of the decoy distribution is correct, but there are less (or more) decoys than expected under the concatenated TDA assumption, which could occur if the decoy database is different in size than the target database or when a bad hit is less likely to match to a decoy than to a target. This would also be visible in the histograms: the decoy histogram would be considerably lower (higher) than the first mode of the target distribution.
When the PP-plot at lower percentiles deviates from a straight line, the distribution of decoys and the bad target PSMs is not equivalent, indicating that the decoys are not a good simulation of bad target hits.
Both type of deviations should be of concern as they indicate that the FDR returned by the conventional concatenated TDA is incorrect.
# Exercises
1. All data can be downloaded via this link: [Download data](https://github.com/statOmics/PDA22GTPB/archive/refs/heads/data.zip).
2. Unzip the data
## Assess the search you performed in "Tutorial 1. Peptide and Protein Identification" at [https://compomics.com/bioinformatics-for-proteomics/identification/](https://compomics.com/bioinformatics-for-proteomics/identification/)
Open the search from tutorial 1.3. in Peptide Shaker (see your own results or resources for tutorial 1.4) and export the search to an mzid file by clicking export > Peptide Shaker Project As > mzIdentML. Evaluate the TDA for the ommsa, X!Tandem and the Peptide Shaker score.
Evaluate the TDA for the X!Tandem, OMSSA and Peptide Shaker scores. What do you observe and try to explain. [1.4.a]
1. Complete the code below by filling out the path to your file.
Note, that you have to erase the eval=FALSE option in order to run the code.
```{r eval=FALSE}
path2File <- "..."
mzidTutorialCompomics <- mzID(path2File)
```
2. Launch the Shiny Gadget and explore the results for the different search engines in your mzID file. Note, do not forget to erase the eval=FALSE to run the code.
```{r eval=FALSE}
evalTargetDecoys(mzidTutorialCompomics)
```
## Pyrococcus - Peptide Shaker - Uniprot search
Do the analysis for the search MSGF+, X!Tandem, OMSSA and Peptide Shaker scores based on all Pyrococcus proteins in a search against all pyrococcus peptides in Uniprot (data/identification/pyroUniprot.mzid). What do you observe explain. [1.4.b]
## Pyrococcus/Peptide Shaker - Swiss prot search
Do the analysis for the search MSGF+, X!Tandem, OMSSA and Peptide Shaker scores for Pyrococcus based on the curated proteins from swissprot only (data/identification/pyroSwissprot.mzid). What do you observe. Try to explain. [1.4.c]
## FDR Elias and Gygi, 2007
Elias and Gygi, 2007, reported the following target decoy FDR estimation:
$$\widehat{\text{FDR}}(t)=\frac {2 \times (\#decoys >t)}{\#decoys > t + \#targets > t}$$
Do you agree with this expression? Why, why not? [1.4.d]
# References
[1] Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007; 4:207--214.
[2] Sticker, A., Martens, L. and Clement, L. (2017). Mass spectrometrists should search for all peptides, but assess only the ones they care about. Nature Methods 14, 643--644.