forked from ElkonLab/scAPA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathscAPA_vignette.rmd
221 lines (172 loc) · 7.51 KB
/
scAPA_vignette.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
---
title: "scAPA package vignette"
output: rmarkdown::github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Install and load scAPA package.
```{r load packeges}
if(!require(devtools)) install.packages("devtools")
require(devtools)
if(!require(scAPA)) devtools::install_github("ElkonLab/scAPA/scAPA")
require(scAPA)
```
The file Peaks.RDS used for this example is the result of running scAPA.shell.script.R on data from [Lukassen et al](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6132189/).
This **is not** a downsampled version, but an analysis of the full bams.
Download and load the scAPAList object Peaks.RDS.
```{r Download, echo=TRUE, eval=FALSE}
peaks.url <- "https://github.com/ElkonLab/scAPA/blob/master/Rfiles/Peaks.RDS"
download.file(url = peaks.url, destfile = "Peaks.RDS")
```
```{r load, echo=TRUE}
a <- readRDS("Peaks.RDS")
a
```
The object contains the following:
* The peak count matrix:
```{r show count matrix, echo=T}
head([email protected][,1:3])
```
Rows are peaks, and columns are cells. The cells name is composed of its sample and barcodes (the sample is added to cope with cells barcodes that are duplicated between samples).
* Cell-cluster annotions:
```{r show cluster anotations}
head([email protected])
```
The first column is the cell barcodes as it appears in the peak count matrix. The second is their assigned cell clusters. Other columns, such as the tsne coordinates, are optional.
* A data.frame with the sequence downstream the 3' edge of each peak:
```{r show down seq}
head([email protected],2)
```
This is required for filtering peaks that may result from internal priming.
* Information regarding the peaks genomic location (optional slot):
```{r show row.Data}
head([email protected])
```
First, calculate the sum of the counts from each peak in each cell cluster.
```{r calc_clusters_counts}
a <- calc_clusters_counts(a)
a
```
This step is required if the option sc was specified in scAPA.shell.scipt.R If sc = false, clusters' counts have already been calculated.
##Peak filtering
First, we calculate the peak counts-per-million (CPM) peak cluster matrix.
```{r calc_cpm}
a <- calc_cpm(a)
head(a@norm$CPM)
```
Then, only peaks whose total sum over all cell clusters is above a certion CPMs threshold are considered. Here we choose 10 as the threshold:
```{r filter law CPMs}
keep.cpm <- rowSums(a@norm$CPM) > 10
a <- a[keep.cpm,]
a
```
To exclude internal priming suspected peaks, peaks having a stretch of at least 8 consecutive As in the region between 10 nt to 140 nt. to the peak's end are filtered:
```{r filter_IP}
a <- filter_IP(x = a, int.priming.seq = "AAAAAAAA",
left = 10,
right = 140)
a
```
The sequence and the regions can be varied using the arguments "int.priming.seq", "left", "right". See the function manual page for more details by typing ?filter_IP in R studio.
## Statistical analysis – detection of dynamic APA events between cell clusters
* Set an scAPAresults object:
```{r}
results <- set_scAPAresults(a)
results
```
* The slot clus.counts is a list such that each 3’UTR with more than one peak is represented by a table where rows are its associated peaks and columns are cell clusters.
```{r show results clus.counts}
[email protected][1:2]
```
An analogous list is produced for the 'cells.counts' slot:
```{r show results cells.counts}
[email protected][[1]][,1:3]
```
Here each column is a cell.
* For each such table, perform a Chi-squared test, p-values are corrected for multiple testing using BH FDR.
```{r test_APA}
results <- test_APA(results, clus = "all")
head(results@pvalues$all)
```
The clus argument specifies the clusters to be tested. Default is "all" for all cluster. specify, for example, clus = c("ES", "RS") to test ES vs RS.
* For 3' UTR with more than two peaks that show significant usage change, for each peak, chi-square test for goodness of fit is performed. The threshold for significant change is set by the argument sig.level (FDR value)
```{r test_peaks}
results <- test_peaks(results, clus = "all", sig.level = 0.05)
```
We get a slot 'peak.pvaues' with such peaks (p-values, q-values, and peak PUI for each cluster).
```{r}
head([email protected]$Peaks_all)
```
## Inferring global trends of APA modulation
* The proximal pA site usage index (proximal PUI) is used to quantify the relative usage of the most proximal pA site within a 3' UTR (with two or more peaks). See [pipeline.description.pdf](pipeline.description.pdf) for the definition of this index.
```{r calc_p_pui_mat}
results <- calc_p_pui_mat(results)
head([email protected])
```
This function also calculates the mean PUI index for each cell and assin it to the slot 'ppui.cells'.
```{r ppui.cells}
head([email protected])
```
Typing the name of the scAPAresult object in R now will show a summary of the results:
```{r summary}
results
```
* We creat a scAPAresults object containing only 3'UTR with significant APA events.
```{r significants}
sig.utrs <- as.vector(results@pvalues$all[,2] < 0.05)
sig <- results[which(sig.utrs),]
```
* Plot the cumulative distribution function of the proximal PUI of the significant 3'UTRs:
```{r plot ppui, eval=F}
require(ggplot2)
tidy.ppui <- as.data.frame([email protected])
colnames(tidy.ppui) <- gsub(x = colnames(tidy.ppui),
pattern = "_proximal_PUI", replacement = "")
tidy.ppui <- tidyr::gather(data = tidy.ppui)
colnames(tidy.ppui) <- c("Cluster", "value")
p <- ggplot(data = tidy.ppui, aes(x = value, color = Cluster)) +
stat_ecdf(size = 1) + theme_bw() + xlab("Proximal peak usage index") +
ylab("Cumulative fraction") + ggplot2::coord_cartesian(xlim = c(-3, 4.1))
print(p)
```
* Consider 3' UTRs with exactly two peaks to identify events of 3'UTR shortening
(or lengthening).
```{r}
two.peaks <- sapply([email protected], function(x){nrow(x) == 2})
sig.two <- sig[which(two.peaks),]
sig.two
```
For such 3' UTRs, an increase in the value of the proximal PUI in cell type 1 relative to cell type 2 indicates 3' UTR shortening in cell type 1. For example, comparing RS to SC:
```{r RS vs SC}
RS.SC.shortening = sum([email protected][,2] > [email protected][,3],
na.rm = T)
RS.SC.lengthening = sum([email protected][,2] < [email protected][,3],
na.rm = T)
RS.SC.shortening
RS.SC.lengthening
```
* We can plot a tSNE coloring cells according to their mean proximal PUI, as follows:
```{r tsne, eval=F}
require(ggplot2)
mean_pui <- as.data.frame([email protected])
mean_pui$Cell_BC <- gsub(pattern = "_proximal_PUI", replacement = "",
x = rownames(mean_pui))
colnames(mean_pui)[1] <- "Mean Proximal PUI"
tsne <- merge([email protected], mean_pui, by.x = "Cell_BC")
require(gridExtra)
colnames(tsne)[2:4] <- c("Cell Cluster", "tsne1", "tsne2")
p <-ggplot(tsne, ggplot2::aes(x = tsne1 ,
y = tsne2,
color = `Mean Proximal PUI`)) +
geom_point(size=0.8) + xlab("tSNE1") + ylab("tSNE2") +
theme_bw() +scale_color_gradient(low = "blue", high = "red") +
ggtitle("tSNE mean proximal PUI") +
theme(legend.position="top")
g <- ggplot(tsne, ggplot2::aes(x = tsne1, y = tsne2,
color = `Cell Cluster`)) +
ggplot2::geom_point(size=0.8) + xlab("tSNE1") +
ylab("tSNE2") + theme_bw() + ggtitle("tSNE cell clusters") +
theme(legend.position="top")
gridExtra::grid.arrange(g, p, nrow = 1)
```