-
Notifications
You must be signed in to change notification settings - Fork 0
/
modelbased_DEA.Rmd
382 lines (279 loc) · 20.3 KB
/
modelbased_DEA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
---
title: "**Differential Expression Analysis** strategy comparison for **Model-based** analysis of isobarically labeled proteomic data."
author: "Piotr Prostko, Joris Van Houtven"
date: '`r format(Sys.time(), "%B %d, %Y,%H:%M")`'
output:
html_document:
toc: true
toc_depth: 2
toc_float: true
number_sections: true
theme: flatly
code_folding: "hide"
editor_options:
chunk_output_type: console
params:
input_data_p: 'data/input_data.rds'
suffix_p: ''
load_outputdata_p: TRUE
save_outputdata_p: FALSE
subsample_p: 0
---
```{r, setup, include=FALSE}
# default knitr options for all chunks
knitr::opts_chunk$set(
message=FALSE,
warning=FALSE,
fig.width=12,
fig.height=7
)
```
<span style="color: grey;">
_This notebook is one in a series of many, where we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or 'workflow' can be divided up into different components; it is recommend you read more about that in the [introduction notebook](intro.html)._
</span>
In this notebook specifically, we investigate the effect of varying the Differential Expression testing component on the outcome of the differential expression results. The four component variants are: **linear mixed-effects model**, **DEqMS**, **ANOVA** applied to the **protein-level** data, and **ANOVA** applied to the **PSM-level** data.
<span style="color: grey;">
_The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the 'Code' button. Each code section can be expanded in a similar fashion. You can also download the [entire notebook source code](modelbased_unit.Rmd)._
</span>
```{r}
library(caret)
library(lme4)
library(lmerTest)
library(ggplot2)
library(stringi)
library(gridExtra)
library(ggfortify)
library(dendextend)
library(psych)
library(kableExtra)
library(tidyverse)
library(dtplyr)
library(DEqMS)
source('util/other_functions.R')
source('util/plotting_functions.R')
```
Let's load our PSM-level data set:
```{r}
data.list <- readRDS(params$input_data_p)
dat.l <- data.list$dat.l # data in long format
dat.w <- data.list$dat.w # data in wide format
display_dataframe_head(dat.l)
```
After the filtering done in `data_prep.R`, there are 19 UPS1 proteins remaining, even though 48 were originally spiked in.
```{r}
# which proteins were spiked in?
spiked.proteins <- dat.l %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull %>% as.character
tmp=dat.l %>% distinct(Protein) %>% pull %>% as.character
# protein subsampling
if (params$subsample_p>0 & params$subsample_p==floor(params$subsample_p) & params$subsample_p<=length(tmp)){
sub.prot <- tmp[sample(1:length(tmp), size=params$subsample_p)]
if (length(spiked.proteins)>0) sub.prot <- c(sub.prot,spiked.proteins)
dat.l <- dat.l %>% filter(Protein %in% sub.prot)
}
```
We store the metadata in `sample.info` and show some entries below. We also pick technical replicates with a dilution factor of 0.5 as the reference condition of interest. Each condition is represented by two of eight reporter Channels in each Run.
```{r}
# specify names of component variants
variant.names <- c('LMM', 'DEqMS', 'ANOVA_protein', 'ANOVA_PSM')
n.comp.variants <- length(variant.names)
# get some data parameters created in the data_prep script
referenceCondition <- data.list$data.params$referenceCondition
condition.color <- data.list$data.params$condition.color
ma.onesample.num <- data.list$data.params$ma.onesample.num
ma.onesample.denom <- data.list$data.params$ma.onesample.denom
ma.allsamples.num <- data.list$data.params$ma.allsamples.num
ma.allsamples.denom <- data.list$data.params$ma.allsamples.denom
# create data frame with sample information
sample.info <- get_sample_info(dat.l, condition.color)
# get channel names
channelNames <- remove_factors(unique(sample.info$Channel))
```
```{r}
display_dataframe_head(sample.info)
referenceCondition
channelNames
```
# Unit scale component: log2 transformation of reporter ion intensities
We use the default unit scale: the log2-transformed reporter ion intensities.
```{r}
dat.unit.l <- dat.l %>% mutate(response=log2(intensity)) %>% select(-intensity)
```
# Summarization component: no summarization
As a default approach (consult the manuscript or the [introduction notebook](intro.html)) we opt for no summarization, meaning that all PSM-level data is going to be exploited in further analyses. This also means that multiple variants of the same peptide within a sample carrying different charge, modifications or detected at different retention times are kept as is.
```{r}
# no summarization
dat.summ.l <- dat.unit.l
```
# Normalization component: linear mixed-effects model
The manuscripts of [Hill et al.](https://doi.org/10.1021/pr070520u) and [Oberg et al.](https://doi.org/10.1021/pr700734f) illustrated the application of linear models for removing various biases potentially present in isobarically labelled data. Inspired by this approach, we fit the following linear mixed-effect model, which corrects the observed reporter ion intensities $y_{i, j(i), q, l, s}$ for imbalance stemming from run $b_q$ and run-channel $v_{l(q)}$ fixed effects, as well as protein $p_i$ and run-protein $b_q \times f_{j(i)}$ random effects:
$$ \log_2y_{i, j(i), q, l, s} = u + b_q + v_{l(q)} + p_i + (b_q \times f_{j(i)}) + \varepsilon_{i, j(i), q, l, s} $$
where $p_i \sim N(0, \sigma_p^2),\, (b_q \times f_{j(i)}) \sim N(0, \sigma_f^2),\, \varepsilon_{i, j(i), q, l, s} \sim N(0, \sigma^2)$
The model is fitted using the `lmer()` function based on the REML criterion. Afterwards, the "subject-specific" residuals (which involve subtraction of the empirical bayes estimates of the random effects) of the model are treated as normalized values and used in further analyses.
```{r, eval=!params$load_outputdata_p}
dat.norm.l <- dat.summ.l
LMM1 <- lmer(response ~ Run + Run:Channel + (1|Protein) + (1|Run:Peptide), data=dat.summ.l)
dat.norm.l$response <- residuals(LMM1)
```
# QC plots
Before getting to the DEA section, let's do some basic quality control and take a sneak peek at the differences between the component variants we've chosen. First, however, we should make the data completely wide, so that each sample gets it's own unique column.
```{r, eval=!params$load_outputdata_p}
dat.nonnorm.summ.l <- aggFunc(dat.summ.l, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein', 'Peptide'), 'median')
dat.nonnorm.summ.l <- aggFunc(dat.nonnorm.summ.l, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein'), 'median')
dat.norm.summ.l <- aggFunc(dat.norm.l, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein', 'Peptide'), 'median')
dat.norm.summ.l <- aggFunc(dat.norm.summ.l, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein'), 'median')
# make data completely wide (also across runs)
## normalized data
dat.norm.summ.w2 <- pivot_wider(data=dat.norm.summ.l, id_cols=Protein, names_from=Run:Channel, values_from=response, names_sep=':')
## non-normalized data
dat.nonnorm.summ.w2 <- pivot_wider(data=dat.nonnorm.summ.l, id_cols=Protein, names_from=Run:Channel, values_from=response, names_sep=':')
```
```{r, echo=FALSE, eval=params$load_outputdata_p}
load(paste0('modelbased_DEA_outdata', params$suffix_p, '.rda'))
```
## Boxplot
The normalization model consistently aligns the reporter ion intensity values.
```{r}
par(mfrow=c(1,2))
boxplot_ils(dat.nonnorm.summ.l, 'raw')
boxplot_ils(dat.norm.summ.l, 'normalized')
```
## MA plot
We then make MA plots of two single samples taken from condition `r ma.allsamples.num` and condition `r ma.allsamples.denom`, measured in different MS runs (samples *`r ma.onesample.num`* and *`r ma.onesample.denom`*, respectively). Clearly, the normalization had a strong variance-reducing effect on the fold changes.
```{r}
p1 <- maplot_ils(dat.nonnorm.summ.w2, ma.onesample.num, ma.onesample.denom, scale='log', 'raw', spiked.proteins)
p2 <- maplot_ils(dat.norm.summ.w2, ma.onesample.num, ma.onesample.denom, scale='log', 'normalized', spiked.proteins)
grid.arrange(p1, p2, ncol=2)
```
To increase the robustness of these results, let's make some more MA plots, but now for all samples from condition `r ma.allsamples.num` and condition `r ma.allsamples.denom` (quantification values averaged within condition).
Both the unnormalized and normalized data now show less variability as using more samples (now 8 in both the enumerator and denominator instead of just one) in the fold change calculation makes the rolling average more robust. It also seems the spike-in proteins induce a small positive bias (blue curve is rolling average) for low abundance proteins.
```{r}
channels.num <- sample.info %>% filter(Condition==ma.allsamples.num) %>% distinct(Sample) %>% pull
channels.denom <- sample.info %>% filter(Condition==ma.allsamples.denom) %>% distinct(Sample) %>% pull
p1 <- maplot_ils(dat.nonnorm.summ.w2, channels.num, channels.denom, scale='log', 'raw', spiked.proteins)
p2 <- maplot_ils(dat.norm.summ.w2, channels.num, channels.denom, scale='log', 'normalized', spiked.proteins)
grid.arrange(p1, p2, ncol=2)
```
## PCA plot
Now, let's check if these multi-dimensional data contains some kind of grouping; It's time to make PCA plots.
### Using all proteins
After normalization, samples are much more grouped according to dilution factor instead of run.
```{r}
par(mfrow=c(1, 2))
pcaplot_ils(dat.nonnorm.summ.w2 %>% select(-'Protein'), info=sample.info, 'raw')
pcaplot_ils(dat.norm.summ.w2 %>% select(-'Protein'), info=sample.info, 'normalized')
```
There are only 19 proteins supposed to be differentially expressed in this data set, which is only a very small amount in both relative (to the 4083 proteins total) and absolute (for a biological sample) terms.
### Using spiked proteins only
Therefore, let's see what the PCA plots look like if we were to only use the spiked proteins in the PCA.
```{r, eval=length(spiked.proteins)>0}
par(mfrow=c(1, 2))
pcaplot_ils(dat.nonnorm.summ.w2 %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, 'raw')
pcaplot_ils(dat.norm.summ.w2 %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, 'normalized')
```
Notice how for all PCA plots, the percentage of variance explained by PC1 is now much greater than when using data from all proteins.
In a real situation without spiked proteins, you might plot data corresponding to the top X most differential proteins instead.
## HC (hierarchical clustering) plot
The PCA plots of all proteins has a rather lower fraction of variance explained by PC1. We can confirm this using the hierarchical clustering dendrograms below: when considering the entire multidimensional space, the different conditions are not very separable at all. This is not surprising as there is little biological variation between the conditions: there are only 19 truly differential proteins, and they all (ought to) covary in exactly the same manner (i.e., their variation can be captured in one dimension).
### Using all proteins
```{r}
par(mfrow=c(1, 2))
dendrogram_ils(dat.nonnorm.summ.w2 %>% select(-'Protein'), info=sample.info, 'raw')
dendrogram_ils(dat.norm.summ.w2 %>% select(-'Protein'), info=sample.info, 'normalized')
```
## Run effect p-value plot
Our last quality check involves a measure of how well each variant was able to assist in removing the run effect. Below are the distributions of p-values from a linear model for the `response` variable with `Run` as a covariate. If the run effect was removed successfully, these p-values ought to be large.
```{r}
dat <- list(dat.nonnorm.summ.l,dat.norm.summ.l)
names(dat) <- c('raw','normalized')
run_effect_plot(dat)
```
# DEA component
A typical approach to Differential Expression Analysis, which we also employ here, assumes testing only one protein at a time. After a DEA analysis, we make a correction for multiple testing using the Benjamini-Hochberg method in order to keep the FDR under control.
## linear (mixed-effects) model
We fit a linear mixed-effect model given by:
$$ w_{j(i), q, l, s} = m + r_c + z_{l(q)} + \eta_{j(i), c, q, l, s} $$
with $w$ as the normalized values (the subject-specific residuals of the normalization model), $m$ as the model intercept, $r_c$ as the difference in expression levels between the biological conditions, $z_{l(q)}$ as the random effect accounting for the potential correlation within each sample induced by the protein repeated measurements, and $\eta_{j(i),c,q,l,s}$ as the random error. Note the index $s$ which implies the PSM-level data (i.e. not aggregated data).
**Technical comment 1**: obtaining log fold changes corresponding to the contrasts of interest when working with log intensities and 'treatment' model parametrization (i.e., model intercept represents the reference condition) is immediately straightforward: these are coefficients corresponding to the $r_c$ effect.
**Technical comment 2**: while introducing the $z{(l(q)}$ random effect into the DEA model is justified, not every protein will have enough repeated measurements (i.e., multiple peptides and/or PSMs corresponding to different peptide modifications, charge states and retention times) for the random effect being estimable. However, in such cases the fixed effect is estimable and its inference remain valid.
```{r, eval=!params$load_outputdata_p}
dat.dea <- emptyList(variant.names)
dat.dea$LMM <- lmm_dea(dat=dat.norm.l, mod.formula='response ~ Condition + (1|Run:Channel)', referenceCondition, scale='log')
```
```{r}
# character vectors containing logFC and p-values columns
dea.cols <- colnames(dat.dea$LMM)
logFC.cols <- dea.cols[stri_detect_fixed(dea.cols, 'logFC')]
significance.cols <- dea.cols[stri_detect_fixed(dea.cols, 'q.mod')]
n.contrasts <- length(logFC.cols)
```
## DEqMS
DEqMS, introduced by [Zhu et al.](https://doi.org/10.1074/mcp.TIR119.001646), is an extension of the popular moderated t-test from limma. DEqMS also shrinks protein sample variance towards a pooled estimate, but in a slightly different way, by relating the protein variance estimate to the number of PSMs or peptides used for protein quantification. The method exploits the information on PSM counts, but after that being absorbed, the rest of the modelling process takes place on the protein-level data. For this reason we applied the necessary two-step median aggregation, first from PSM to peptide, then from peptide to protein.
```{r, eval=!params$load_outputdata_p}
# first compute PSM count per protein within run & channel, then for each protein separately take the minimum of the PSM counts (as in DEqMS vignette)
PSMcounts.df <- dat.norm.l %>% ungroup %>% group_by(Run, Channel, Protein) %>% summarize(PSMcount=n()) %>% group_by(Protein) %>% summarize(PSMcount=min(PSMcount,na.rm=TRUE)) %>% data.frame
design.matrix <- get_design_matrix(referenceCondition, sample.info)
d <- column_to_rownames(as.data.frame(dat.norm.summ.w2), 'Protein')
dat.dea$DEqMS <- deqms_test(dat=d, design.matrix, scale='log', PSMcounts.df)
```
## One-way ANOVA (protein level data)
So far we considered two distinct approaches to differential protein testing, namely, a linear mixed-effects model that includes a random effect addressing potential correlation lurking in the PSM-level data of a protein, and DEqMS operating at the protein-level data and giving flexbility with respect to protein variance modelling. Hence, now it is time for a somewhat more rudimentary approach, for instance ANOVA applied to the protein-level data.
```{r, eval=!params$load_outputdata_p}
dat.dea$ANOVA_protein <- get_anova(d, design.matrix, scale='log')
```
## One-way ANOVA (PSM level data)
Since we still feel a little bit fixated with this unpopular idea of using the available data to the fullest extent, we add to our set of analyses an ANOVA conducted on the PSM level. This final piece of the puzzle will hopefully allow us to spot some interesting differences between the two ANOVAs, but also between the linear mixed-effect modelling and ANOVA on PSM-level data (if the LMM random effect impacts the results in a meaningful way).
```{r, eval=!params$load_outputdata_p}
dat.dea$ANOVA_PSM <- lm_dea(dat=dat.norm.l, mod.formula='response ~ Condition', referenceCondition, scale='log')
```
```{r, echo=FALSE, eval=params$save_outputdata_p}
# save output data
save(dat.nonnorm.summ.l
,dat.norm.summ.l
,dat.nonnorm.summ.w2
,dat.norm.summ.w2
,dat.norm.l
,dat.summ.l
,dat.dea, file=paste0('modelbased_DEA_outdata', params$suffix_p, '.rda'))
```
# Results comparison
Now, the most important part: let's find out how our component variants have affected the outcome of the DEA.
## Confusion matrix
A confusion matrix shows how many true and false positives/negatives each variant has given rise to. Spiked proteins that are DE are true positives, background proteins that are not DE are true negatives. We calculate this matrix for all conditions and then calculate some other informative metrics based on the confusion matrices: accuracy, sensitivity, specificity, positive predictive value and negative predictive value.
Turns out that all four testing approaches performed fairly similar.
Finally, the biological difference in the `0.667 vs 0.5` contrast, however, seems to be too small to be picked by the proposed modelling approach, regardless of the DEA methods.
```{r, results='asis'}
cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
print_conf_mat(cm, referenceCondition)
```
## Scatter plot
To see whether the three Unit scales produce similar results on the detailed level of individual proteins, we make scatter plots and check the correlation between their fold changes and between their significance estimates (q-values, in our case).
Regarding q-values, we can merely notice the higher correlation between the variants working with the same type of data (PSM or protein-level data).
```{r}
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
```
Log fold changes originating from all variants are well correlated. We can merely notice that using PSM-level data lead to a bit more scattered values around the center of distributions.
```{r}
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
```
## Volcano plot
The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue; the magenta, dashed line indicates the theoretical fold change of the spike-ins.
Here too we cannot spot any qualitative difference in the selected four DEA variants.
```{r}
for (i in 1:n.contrasts){
volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)
}
```
## Violin plot
A good way to assess the general trend of the fold change estimates on a more 'macroscopic' scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.
Clearly, the empirical results _tend towards_ the theoretical truth, but not a single observation attained the fold change it should have attained. There is clearly a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here.
Lastly, the only thing that can be noticed in these visualisations is the similarity between methods exploiting the same data aggregation type (PSM versus protein).
```{r}
# plot theoretical value (horizontal lines) and violin per variant
if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition) else violinplot_ils(lapply(dat.dea, function(x) x[,logFC.cols]), referenceCondition, show_truth = FALSE)
```
# Conclusions
All in all, for the given dataset all DEA methods did similarly good job at detecting significant changes in protein expression levels. It is also possible that the data at hand does not contain enough truly DE proteins (only 19) to expose potential discrepancies between the variants. The question whether it is worthwhile to work with PSM-level data requires further investigation.
# Session information
```{r}
sessionInfo()
```