generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathairway.rmd
462 lines (331 loc) · 16.6 KB
/
airway.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
---
title: "Airway: DE analysis"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
html_document:
code_download: true
theme: cosmo
toc: true
toc_float: true
highlight: tango
number_sections: true
---
# Background
The data used in this workflow comes from an RNA-seq experiment where airway smooth muscle cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects (Himes et al. 2014). Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. In the experiment, four human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample.
For more description of the experiment see the article, PubMed entry 24926665, and for raw data see the GEO entry GSE52778.
Many parts of this tutorial are based on parts of a published RNA-seq workflow
available via Love et al. 2015 [F1000Research](http://f1000research.com/articles/4-1070) and as a [Bioconductor
package](https://www.bioconductor.org/packages/release/workflows/html/rnaseqGene.html) and on Charlotte Soneson's material from the [bss2019](https://uclouvain-cbio.github.io/BSS2019/rnaseq_gene_summerschool_belgium_2019.html) workshop.
# Data
FastQ files with a small subset of the reads can be found on https://github.com/statOmics/SGA2019/tree/data-rnaseq
Here, we will not use our count table because it is based on a small subset of the reads.
We will use the count table that was provided after mapping all the reads with the read mapper star.
```{r}
library(edgeR)
library(tidyverse)
library(GEOquery)
```
# MetaData
## GEO data
```{r}
gse<-getGEO("GSE52778")
pdata<-pData(gse[[1]])
pdata$SampleName<-rownames(pdata) %>% as.factor
head(pdata)[1:6,]
```
## SRA info
Download SRA info. To link sample info to info sequencing: Go to corresponding SRA page and save the information via the “Send to: File button” This file can also be used to make a script to download sequencing files from the web. Note that sra files can be converted to fastq files via the fastq-dump function of the sra-tools.
File is also available on course website.
```{r}
sraInfo<-read.csv("https://raw.githubusercontent.com/statOmics/SGA/airwaySeqData/SraRunInfo.csv")
sraInfo$SampleName <- as.factor(sraInfo$SampleName)
levels(pdata$SampleName)==levels(sraInfo$SampleName)
```
SampleNames are can be linked.
```{r}
pdata<-merge(pdata,sraInfo,by="SampleName")
pdata$Run
```
# Read featurecounts object
We import the featurecounts object that we have stored.
```{r}
fc <- readRDS(
url("https://github.com/statOmics/SGA2020/blob/data-rnaseq/airway/featureCounts/star_featurecounts.rds?raw=true")
)
colnames(fc$counts)
```
## Read Meta Data
```{r}
#pdata <- readRDS("airwayMetaData.rds")
```
# Data Analysis
## Setup count object edgeR
```{r}
dge <- DGEList(fc$counts)
colnames(dge)==pdata$Run
pdata$Run%in%colnames(dge)
```
```{r}
rownames(pdata) <- pdata$Run
pdata <- pdata[colnames(dge),]
pdata[,grep(":ch1",colnames(pdata))]
```
```{r}
pdata$cellLine <- as.factor(pdata[,grep("cell line:ch1",colnames(pdata))])
pdata$treatment <- as.factor(pdata[,grep("treatment:ch1",colnames(pdata))])
pdata$treatment <- relevel(pdata$treatment,"Untreated")
```
```{r}
colnames(dge) <- paste0(
substr(pdata$cellLine, 1, 3),
"_",
substr(pdata$treatment, 1, 3)
)
```
## Filtering and normalisation
```{r}
design <- model.matrix( ~ treatment + cellLine, data = pdata)
keep <- filterByExpr(dge, design)
dge <- dge[keep, , keep.lib.sizes=FALSE]
dge<-calcNormFactors(dge)
```
## Data exploration
One way to reduce dimensionality is the use of multidimensional scaling (MDS). For MDS, we first have to calculate all pairwise distances between our objects (samples in this case), and then create a (typically) two-dimensional representation where these pre-calculated distances are represented as accurately as possible. This means that depending on how the pairwise sample distances are defined, the two-dimensional plot can be very different, and it is important to choose a distance that is suitable for the type of data at hand.
edgeR contains a function plotMDS, which operates on a DGEList object and generates a two-dimensional MDS representation of the samples. The default distance between two samples can be interpreted as the "typical" log fold change between the two samples, for the genes that are most different between them (by default, the top 500 genes, but this can be modified). We generate an MDS plot from the DGEList object dge, coloring by the treatment and using different plot symbols for different cell lines.
```{r}
plotMDS(dge, top = 500,col=as.double(pdata$cellLine))
```
## Differential analysis
### Model
We first estimate the overdispersion.
```{r}
dge <- estimateDisp(dge, design)
plotBCV(dge)
```
Finally, we fit the generalized linear model and perform the test. In the glmLRT function, we indicate which coefficient (which column in the design matrix) that we would like to test for. It is possible to test more general contrasts as well, and the user guide contains many examples on how to do this. The topTags function extracts the top-ranked genes. You can indicate the adjusted p-value cutoff, and/or the number of genes to keep.
```{r}
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, coef = "treatmentDexamethasone")
ttAll <- topTags(lrt, n = nrow(dge)) # all genes
hist(ttAll$table$PValue)
tt <- topTags(lrt, n = nrow(dge), p.value = 0.05) # genes with adj.p<0.05
nrow(tt)
```
There are `r nrow(tt)` genes statistiscally significant at the 5% FDR level.
### Plots
We first make a volcanoplot and an MA plot.
```{r}
library(ggplot2)
volcano <- ggplot(ttAll$table,aes(x=logFC,y=-log10(PValue),color=FDR<0.05)) + geom_point() + scale_color_manual(values=c("black","red"))
volcano
plotSmear(lrt, de.tags = row.names(tt$table))
```
Another way of representing the results of a differential expression analysis is to construct a heatmap of the top differentially expressed genes. Here, we would expect the contrasted sample groups to cluster separately. A heatmap is a "color coded expression matrix", where the rows and columns are clustered using hierarchical clustering. Typically, it should not be applied to counts, but works better with transformed values. Here we show how it can be applied to the variance-stabilized values generated above. We choose the top 30 differentially expressed genes. There are many functions in R that can generate heatmaps, here we show the one from the pheatmap package.
```{r}
heatmap(cpm(dge,log=TRUE)[rownames(tt$table)[1:30],])
```
# Issues
We find a huge number of DE genes. If this is the case it is always important to carefully check your results.
## p-values
```{r}
ttAll$table %>%
ggplot(aes(x=PValue)) +
geom_histogram()
```
We indeed see a large enrichment of small p-values.
We know that the inference is only asymptotically valid, i.e. for large experiments.
EdgeR is using a likelihood ratio test for statistical inference. The null distribution of this test statistic follows an asymptotic chi-square distribution.
If we test for a single contrast (e.g. one model parameter or one linear combination of model parameters) then the chi-square distribution has one degree of freedom and you will perform a two-sided test.
## Empirical null distribution
Efron 2008 relaxes the use of a theoretical null distribution for assessing statistical significance.
- He shows that we can exploit the massive parallel data structure of high throughput experiments to estimate the empirical null distribution.
- He assumes that the test statistics follow a normal distribution under the null, but proposes to estimate the location and the variance empirically.
Assuming that the test statistic has a standard normal null distribution is not restrictive. For example, suppose that $t$-tests have been applied and that the null distribution is $t_d$, with $d$ representing the degrees of freedom. Let $F_{td}$ denote the distribution function of $t_d$ and let $\Phi$ denote the distribution function of the standard normal distribution. If $T$ denotes the $t$-test statistic, then, under the null hypothesis,
\[
T \sim t_d
\]
and hence
\[
F_{td}(T) \sim U[0,1]
\]
and
\[
Z = \Phi^{-1}(F_{td}(T)) \sim N(0,1).
\]
If all $m$ null hypotheses are true, then each of the $Z_g$ is $N(0,1)$ and the set of $m$ calculated $z_g$ test statistics may be considered as a sample from $N(0,1)$. Hence, under these conditions we expect the histogram of the $m$ $z_g$'s to look like the density of the null distribution.
If a LRT test is conducted for a single contrast, we can obtain the corresponding $Z_g$ by converting the p-values into quantiles of the normal, i.e.
$z_g = \Phi^{-1} (1-p/2) \times \text{sign}(FC)$
The normal distribution is very useful for diagnostics because we are used to work with it.
So in general we
1. Start of the two-sided p-values.
2. We divide the p-values by two to have the probability in one of the tails
3. We calculate which quantile of the normal corresponds to it
4. We give the z-value the sign of our contrast.
```{r}
ttAll$table <- ttAll$table %>%
mutate(z = sign(logFC) * abs(qnorm(PValue/2)))
ttAll$table %>%
ggplot(aes(x=z)) +
geom_histogram(aes(y = ..density..), color = "black") +
stat_function(fun = dnorm,
args = list(
mean = 0,
sd=1)
)
```
The histogram clearly shows that the theoretical null distribution N(0,1) does not seen to match the null distribution in the airway dataset.
We can empirically estimate the null distribution using the locfdr package.
The locfdr package allows you to calculate local fdrs.
It uses the two groups model:
\[f(z) = \pi_0f_0(z) + (1-\pi_0) f_1(z)\]
The local fdr or posterior probability that the gene with a certain test statistic z is a false positive then becomes
\[lfdr = \frac{\pi_0 f_0(z)}{f(z)} \]
Efron suggests to use a cutoff for the lfdr of 20%. So we return all genes for which the probability that they are false positives is below 20%. He claims that this corresponds to a conventional FDR of about 5% in the set that you return.
It can be shown that the FDR:
\[
FDR = E[lfdr] \approx \frac{\sum\limits_{k \in DE-set} lfdr_k}{n_\text{DE}}
\]
We can also estimate the global FDR by simply calculating the average lfdr in the set that we return.
```{r}
library(locfdr)
ttAll$table <- ttAll$table %>%
mutate(lfdr = locfdr(z)$fdr)
```
In the plot we see the histogram of the z-values.
- The green line is the estimate of the marginal distribution $f(z)$. A non-parameteric density estimator is used for this purpose.
- The blue line is the estimate of the null distribution.
- We see that null distribution appears to be much broader than the standard normal and it shows a slight shift.
- We see that the marginal distribution (green distribution) is not estimated well.
- We also get a warning message that indicates that we can increase the degrees of freedom.
We re-estimate $f(z)$ by using more degrees of freedom, e.g. `df = 20`.
```{r}
ttAll$table <- ttAll$table %>%
mutate(lfdr = locfdr(z, df = 20)$fdr)
```
Estimated FDR corresponding to the set of genes with lfdr < 0.2 and number of significant genes with lfdr < 0.2?
```{r}
ttAll$table %>%
filter(lfdr < 0.2) %>%
summarize(
FDR = mean(lfdr, na.rm = TRUE),
nSig = lfdr %>%
is.na %>%
`!` %>%
sum)
```
Note, that
- the FDR of the set that is returned is indeed close to 0.5.
- the number of significant genes has been reduced considerably.
- It is very important to evaluate the fit of the null and marginal distribution in the plot.
If you would like to opt to use the FDR method of Benjamini and Hochberg, you also could convert the z-values back in p-values by using a normal $N(\hat \mu, \hat \sigma^2)$, where we use the estimated parameters for the null distribution by locfdr.
```{r}
h <- ttAll$table %>%
pull(z) %>%
locfdr(df = 20, plot=FALSE)
ttAll$table <- ttAll$table %>%
mutate(pEmp = pnorm(
abs(z),
mean = h$fp0["mlest","delta"],
sd = h$fp0["mlest","sigma"],
lower = FALSE) * 2
) %>%
mutate(pAdjEmp = p.adjust(
pEmp,
method="BH")
)
ttAll$table %>%
filter(pAdjEmp < 0.05) %>%
nrow
```
We observe that the number of significant genes reduces to about the same amount as with the lfdr method of Efron.
We, however, did not have to use a density estimator for the marginal cumulative distribution function of the test statistics. Instead, the BH method uses the empirical distribution of the statistics.
## Assess why deviations from the null occur?
### Failed mathematical assumptions?
If we permute the gene expression data of all observations differently for every gene, we simulate expressions under the null and breakup the correlation structure between the genes.
```{r}
set.seed(1004)
dgePermAll <- dge
dgePermAll$counts <- apply(dge$counts, 1, sample) %>% t
dgePermAll$samples$lib.size <- dgePermAll$counts %>% colSums
dgePermAll <- calcNormFactors(dgePermAll)
dgePermAll <- estimateDisp(dgePermAll, design)
plotBCV(dgePermAll)
fitPermAll <- glmFit(dgePermAll, design)
lrtPermAll <- glmLRT(fitPermAll, coef = "treatmentDexamethasone")
ttAllPermAll <- topTags(lrtPermAll, n = nrow(dgePermAll))
ttAllPermAll$table <- ttAllPermAll$table %>%
mutate(z = sign(logFC) * abs(qnorm(PValue/2)))
ttAllPermAll$table %>%
ggplot(aes(x=z)) +
geom_histogram(aes(y = ..density..), color = "black") +
stat_function(fun = dnorm,
args = list(
mean = 0,
sd=1)
)
```
We observe that the null distribution is approximately following a standard normal distribution.
There is no shift and the standard deviation in no longer larger than 1.
Hence, there do not seem to be issues related to failed mathematical assumptions.
### Correlation of the genes?
We will now only permute the samples across conditions, but, we keep all genes together, e.g. by permuting the design matrix.
This simulates data from the null and breaks the association between the predictors and potential confounders, however, correlation between the genes remains.
Here, we have a block design so we will permute within block.
We will perform a balanced permutation.
We will select two cellines at random and we will permute the labels of the treatment.
```{r}
designPerm <- design
cellLine <- sample(0:3,2)
cellLine
for (i in cellLine)
designPerm[i*2 + 1:2, 2] <- c(1,0)
designPerm
fitPermSamp <- glmFit(dge, designPerm)
lrtPermSamp <- glmLRT(fitPermSamp, coef = "treatmentDexamethasone")
ttAllPermSamp <- topTags(lrtPermSamp, n = nrow(dge))
ttAllPermSamp$table <- ttAllPermSamp$table %>%
mutate(z = sign(logFC) * abs(qnorm(PValue/2)))
ttAllPermSamp$table %>%
ggplot(aes(x=z)) +
geom_histogram(aes(y = ..density..), color = "black") +
stat_function(fun = dnorm,
args = list(
mean = 0,
sd=1)
)
```
We observe that the null distribution is approximately matching with a standard normal.
There is no shift and the standard deviation in no longer larger than 1.
Here, the distribution seems to be a bit more narrow than the the theoretical distribution.
#### Repeat for other permutation:
```{r}
designPerm <- design
cellLine <- c(0,2)
cellLine
for (i in cellLine)
designPerm[i*2 + 1:2, 2] <- c(1,0)
designPerm
fitPermSamp <- glmFit(dge, designPerm)
lrtPermSamp <- glmLRT(fitPermSamp, coef = "treatmentDexamethasone")
ttAllPermSamp <- topTags(lrtPermSamp, n = nrow(dge))
ttAllPermSamp$table <- ttAllPermSamp$table %>%
mutate(z = sign(logFC) * abs(qnorm(PValue/2)))
pPermSamp <- ttAllPermSamp$table %>%
ggplot(aes(x=z)) +
geom_histogram(aes(y = ..density..), color = "black") +
stat_function(fun = dnorm,
args = list(
mean = 0,
sd=1)
)
pPermSamp
pPermSamp + xlim(-4,4)
```
Both permutations show that there are no deviations from the theoretical null when we simulate under the null and break potential confounding but leave correlation between genes intact.
So there seems to be another reason for deviations of the null, than failed mathematical assumptions and correlation between genes!
Perhaps confounders that we do not account for?
## Concluding remark
This example also clearly shows that although permutation methods offer a simple and attractive way to deal with mathematically complicated hypothesis testing situations, they often cannot remedy failures of the theoretical null.
So the use of permutations to establish the null distribution should also be used with care in high throughput contexts. Indeed, depending on the permutation scheme that is used you might break correlations between genes, samples and/or potential confounding.
So even if you use a permutation approach in a high throughput context, e.g. a wilcoxon rank sum test or other permutation based tests, you still might need to check for deviations of the null!