generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathparathyroid.Rmd
292 lines (219 loc) · 9.67 KB
/
parathyroid.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
---
title: "Parathyroid: DE analysis"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
html_document:
code_download: true
theme: flatly
toc: true
toc_float: true
highlight: tango
number_sections: true
pdf_document:
toc: true
number_sections: true
linkcolor: blue
urlcolor: blue
citecolor: blue
---
```{r,echo=FALSE}
suppressPackageStartupMessages({
library(edgeR)
library(SummarizedExperiment)
library(tidyverse)
if(!"parathyroidSE" %in% installed.packages()[,1]) BiocManager::install("parathyroidSE")
})
```
# Introduction
Paired-end sequencing was performed on primary cultures from parathyroid tumors of 4 patients at 2 time points over 3 conditions (control, treatment with diarylpropionitrile (DPN) and treatment with 4-hydroxytamoxifen (OHT)). DPN is a selective estrogen receptor agonist and OHT is a selective estrogen receptor modulator. One sample (patient 4, 24 hours, control) was omitted by the paper authors due to low quality. Data, the count table and information on the experiment is available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211.
#Count data and meta data
```{r}
data("parathyroidGenesSE", package="parathyroidSE")
se1 <- parathyroidGenesSE
rm(parathyroidGenesSE)
dupExps <- colData(se1) %>%
as.data.frame() %>%
filter(duplicated(experiment)) %>%
pull(experiment)
counts <- assays(se1)$counts
newCounts <- counts
cd <- colData(se1)
for(ss in 1:length(dupExps)){
# check which samples are duplicates
relevantId <- which(colData(se1)$experiment == dupExps[ss])
# sum counts
newCounts[,relevantId[1]] <- rowSums(counts[,relevantId])
# keep which columns / rows to remove.
if(ss == 1){
toRemove <- relevantId[2]
} else {
toRemove <- c(toRemove, relevantId[2])
}
}
# remove after summing counts (otherwise IDs get mixed up)
newCounts <- newCounts[,-toRemove]
newCD <- cd[-toRemove,]
# Create new SummarizedExperiment
se <- SummarizedExperiment(assays = list("counts" = newCounts),
colData = newCD,
metadata = metadata(se1))
rm(se1)
```
# Data analysis
## Count object
```{r}
dge <- DGEList(counts=assay(se))
dge$sample
```
## Design
There can be an effect of agent, time interaction and agent x time interaction. We also expect blocking for patient. We can assess all effects of interest within patient.
```{r}
design <- model.matrix(~time*treatment+patient,colData(se))
rownames(design) = colnames(dge)
design
ExploreModelMatrix::VisualizeDesign(colData(se),~ time*treatment + patient)$plotlist
```
## Filtering
```{r}
keep <- filterByExpr(dge,design)
table(keep)
dge <- dge[keep, , keep.lib.sizes=FALSE]
```
## Normalisation
```{r}
dge <- calcNormFactors(dge)
dge$samples
```
## Data exploration
An MDS plot shows the leading fold changes (differential expression) between the 23 samples.
```{r}
plotMDS(dge,labels=paste(colData(se)$treatment,colData(se)$time,colData(se)$patient,sep="-"),col=as.double(colData(se)$treatment))
```
There is a very strong patient effect! To further assess the treatment effects we can make MDS plots per patient
```{r}
for (i in 1:4)
plotMDS(dge[,colData(se)$patient==i], col=as.double(colData(se)$treatment)[colData(se)$patient==i],
labels=paste(colData(se)$treatment[colData(se)$patient==i],
colData(se)$time[colData(se)$patient==i],
colData(se)$patient,sep="-")[colData(se)$patient==i])
```
## Parameter estimation
We will use the default Quasi likelihood approach of edgeR.
For quasi-likelihood we do not specify the full distribution, only the first two moments: the mean and the variance, which is sufficient to do inference on the mean.
$$
\left\{
\begin{array}{lcl}
E[y_{ig}\vert \mathbf{x}_{ig}]&=&\mu_{ig}\\
log(\mu_{ig})&=&\eta_{ig}\\
\eta_{ig}&=&\beta_0 + \beta_{t2} x_{t2,i} + \beta_{DPN} x_{DPN,i} + \beta_\text{DPN:t2} x_{DPN,i}x_{t2,i} \\
&& \quad + \beta_{OHT}x_{OHT,i} + \beta_\text{OHT:t2} x_{OHT,i}x_{t2,i} \\
&& \quad + \beta_{p2}x_{p2,i} + \beta_{p3}x_{p3,i} + \beta_{p4}x_{p4,i}\ + \log N_i\\
\text{Var}[y_{ig}\vert \mathbf{x}_{ig}]&=&\sigma^2_g\left(\mu_{ig}+\phi\mu_{ig}^2\right)
\end{array}\right.
$$
with $\sigma^2_g$ an additional dispersion parameter the scales the negative binomial variance function, $x_{DPN,i}$, $x_{DPN,i}$, $x_{t2,i}$, $x_{p.,i}$ dummy variables that is 1 if cell line was treated with DPN, OHT, incubated for 48 h, from patient $p.$, respectively and is 0 otherwise, and, $\log{N}_i$ a normalisation offset to correct for sequencing depth. Note, that $\beta_{DPN}$ is the main effect for the DPN treatment, and corresponds to the average log fold change between treated and control mice after 24h. The interaction $\beta_\text{DPN:t2}$ can be interpreted as the average change in log2 FC between DPN treated and control cell lines at the late and early timepoint. The researchers are also interested in a assessing third contrast: the effect of the DPN treatment at the late time point.
$$ \log_2\text{FC}^\text{48h}_\text{DPN - C}= \beta_{DPN}+\beta_{DPN,t2}$$
For the OHT treatment we will assess similar contrasts.
$$ \log_2\text{FC}^\text{24h}_\text{OHT - C}= \beta_{OHT}$$
$$ \log_2\text{FC}^\text{48h}_\text{OHT - C}= \beta_{OHT}+\beta_{OHT,t2}$$
$$ \log_2\text{FC}^\text{48h}_\text{OHT - C} -\log_2\text{FC}^\text{24h}_\text{OHT - C}= \beta_{OHT,t2}
$$
Finally, we also have to assess if there is a difference between DPN and OHT treatment
$$ \log_2\text{FC}^\text{24h}_\text{OHT - DPN}= \beta_{OHT} - \beta_{DPN}$$
$$ \log_2\text{FC}^\text{48h}_\text{OHT - C}= \beta_{OHT}+\beta_{OHT,t2} - \beta_{DPN} - \beta_{DPN,t2}
$$
$$ \log_2\text{FC}^\text{48h}_\text{OHT - DPN} -\log_2\text{FC}^\text{24h}_\text{OHT - DPN}= \beta_{OHT,t2} - \beta_{DPN,t2}
$$
```{r}
dge <- estimateDisp(dge, design)
plotBCV(dge)
```
The quasi-negative binomial model can be fitted using the function `glmQLFit`
```{r}
fit <- glmQLFit(dge,design)
```
## Contrasts
We now implement all 9 contrasts of interest
```{r}
L <- msqrob2::makeContrast(
c("treatmentDPN = 0",
"treatmentOHT = 0",
"treatmentOHT - treatmentDPN = 0",
"treatmentDPN + time48h:treatmentDPN = 0",
"treatmentOHT + time48h:treatmentOHT = 0",
"treatmentOHT + time48h:treatmentOHT - treatmentDPN - time48h:treatmentDPN = 0",
"time48h:treatmentDPN = 0",
"time48h:treatmentOHT = 0",
"time48h:treatmentOHT - time48h:treatmentDPN = 0"),
parameterNames = colnames(design))
L
```
## Tests
We have to perform a quasi- F-test for each contrast.
The quasi F-test involves fitting a different model for each contrast so that we can compare the full model with a reduced model that implies that one specific contrast is zero.
Because we estimated the additional dispersion parameter $\sigma^2_g$ using a sum of squared deviance residuals:
i.e.
$$e_{i,d} = 2 (l_i(y_i,y_i) - l_i(\mu_i,y_i)$$
and
$$
\hat \sigma_g^2 = \frac{\sum_{i=1}^n e_{i,d}^2}{n-p} = \frac{2\left[l(\mathbf{y},\mathbf{y}) - l(\boldsymbol{\mu},\mathbf{y})\right]}{n-p}
$$
With edgeR we will then further adopt empirical Bayes to borrow strength across genes to stabilise the parameter estimator, which will also increase the degrees if this gene wise dispersion parameter estimator which we refer to as $df_\text{res}^{EB}$.
We can use a quasi F -test that can also correct for the degrees of freedom that have been used to estimate mean model parameters and the residual degrees of freedom that were available for estimating the additional dispersion parameter. The quasi F test will thus perform better in a small sample setting. It is defined as:
$$
F = \frac{\frac{LRT_\text{g, full - reduced}}{df_{LRT}}}{\sigma^2_g}
$$
which follows an F - distribution with $df_{LRT}$ degrees of freedom in the nominator and $df_{res}^{EB}$ degrees of freedom in the denominator under the null hypothesis that the full and reduced model are equivalent and that the assessed contrasts are thus equal to zero. Indeed, the dispersion estimator in the denominator follows a scaled $\chi^2$ distribution with $df_{res}^{EB}$ degrees of freedom.
We perform all tests and loop over the columns of L for this purpose.
```{r}
testsF <- apply(L, 2, function(fit,contrast)
glmQLFTest(fit,contrast=contrast),
fit = fit)
topTablesF<- lapply(testsF, topTags, n=nrow(dge))
sapply(topTablesF, function(x) sum(x$table$FDR <0.05))
```
We only find significant fold changes for very few genes between OHT and the control treatment at the late time point.
We also did not find significant interactions.
# Plots
## Volcano plots
```{r}
for (i in 1:ncol(L))
{
volcano<- ggplot(topTablesF[[i]]$table,aes(x=logFC,y=-log10(PValue),color=FDR < 0.05)) + geom_point() + scale_color_manual(values=c("black","red")) + ggtitle(paste("contrast",names(topTablesF)[i]))
print(volcano)
}
```
## Histograms of p-values
```{r}
histsP <- lapply(topTablesF, function(x)
x$table %>%
ggplot(aes(x=PValue)) +
geom_histogram(breaks =seq(0,1,.1) ,col=1)
)
for (i in 1:ncol(L))
histsP[[i]] <- histsP[[i]] +
ggtitle(paste("contrast",names(topTablesF)[i]))
histsP
```
## heatmaps
```{r}
for (i in 1:ncol(L))
{
sigID <- topTablesF[[i]]$table %>%
filter(FDR<0.05) %>%
rownames
if (length(sigID)>0)
heatmap(dge$counts[sigID,], main = colnames(L)[i], cex.main=.2)
}
```
# EdgeR traditional
```{r}
fitGlm <- glmFit(dge,design)
testLRT2 <- apply(L, 2, function(fit,contrast)
glmLRT(fit,contrast=contrast),
fit = fitGlm)
topTablesLRT2 <- lapply(testLRT2, topTags, n=nrow(dge))
sapply(topTablesLRT2, function(x) sum(x$table$FDR <0.05))
```
We find more genes for the traditional edgeR workflow, however, it is known that this workflow is often too liberal.