-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathASCN-RNA.Rmd
205 lines (162 loc) · 7.35 KB
/
ASCN-RNA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
---
title: "Allele Specific Copy Number Inference in scRNAseq"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Allele Specific Copy Number Inference in scRNAseq}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning=FALSE, message=FALSE,
comment = "#>"
)
```
```{r setup}
library(signals)
library(tidyr)
library(dplyr)
library(cowplot)
library(ggplot2)
```
## Background
This vignette illustrates how to perform allele specific copy number inference in scRNAseq data. There are 2 methods to perform this inference, one which issues prior knowledge from allele specific scDNAseq inference and the second without prior knowledge. We'll demonstrate both approaches here.
## Data
The data needed to perform the allele specific copy number inference are SNP calls of heterozygous positions in individual single cells. Ideally haplotype block calling should also be performed, so that each allele in each cell can be assigned to a haplotype block. To get this data you will ideally need a whole genome sequenced bulk sample of normal tissue to identify heterozygous SNPs and then you can use a tool such as cellSNP or vartrix to get per cell counts. An example dataset from ~1000 cells is included here. For this dataset we also have paired scDNAseq, where we have called ASCN using `signals`, we'll use this as input for the method that utilzes prior knowledge.
```{r}
data(SA1049haps)
data(SA1049hscn)
```
## ASCN inference with prior knowledge
The first step is to format this dataframe. This will phase the haplotypes across all the cells and add additional columns such as the B-Allele Frequency. Our scDNAseq includes information on the phasing of haplotype blocks which we'll use as input.
```{r}
haplotypes <- format_haplotypes_rna(SA1049haps, phased_haplotypes = SA1049hscn$haplotype_phasing)
```
As a first look at the data we can calculate the BAF per arm in each cell and plot these distributions using `per_chr_baf`. Notice on this dataset that the BAFs in chromosomes `3p`, `9p`, `13q`, `17` and `22` in almost all cells are very left skewed towards 0 suggesting clonal LOH events in these chromosomes.
```{r, fig.width = 6, fig.height=6}
per_chr_baf(haplotypes, perarm = TRUE)
```
For comparison we can plot a similar distribution from our scDNAseq dataset. Here we notice similar trends but note that data from the scRNAseq is much more dispersed due to the lower number of counts per cell. scRNAseq will have 1-2 orders of magnitude lower reads so our ability to identify SNPs is reduced.
```{r, fig.width = 6, fig.height=6}
per_chr_baf(SA1049hscn$data, perarm = TRUE)
```
To illustrate these difference, the plot below shows the total number of SNP counts we see per cell in the 2 modalities.
```{r, fig.width = 6}
g1 <- SA1049hscn$data %>%
group_by(cell_id) %>%
summarise(totalcounts = sum(totalcounts) / 1e3) %>%
ggplot(aes(x = totalcounts)) +
geom_histogram(bins = 50) +
cowplot::theme_cowplot() +
xlab("1000's of counts") +
ggtitle("DNA")
g2 <- SA1049haps %>%
group_by(cell_id) %>%
summarise(totalcounts = sum(allele0 + allele1) / 1e3) %>%
ggplot(aes(x = totalcounts)) +
geom_histogram(bins = 50) +
cowplot::theme_cowplot() +
xlab("1000's of counts") +
ggtitle("RNA")
cowplot::plot_grid(g2, g1)
```
We can also plot heatmaps of the 2 modalities as shown below.
```{r, fig.width=7}
plotHeatmapBAF(haplotypes)
```
```{r, fig.width=7}
plotHeatmapBAF(SA1049hscn$data)
```
Now we'll move onto inferreing allele specific copy number in our dataset. As illustrated above, the scRNAseq is typically very sparse so we only attempt to identify arm level events. As we have prior knowledge of the states from scDNAseq here we use the `assign_states` function. This method identifies arm level states in the scDNAseq data that are represented in more than `minf` proportion of cells, and then uses maximum likelihood to assign each arm in each cell to the most probably state based on the observed counts. You can also apply an empricial bayes based shrinkage, which we will do here.
```{r}
hscn_rna_arm <- assign_states(haplotypes, SA1049hscn$data, minf = 0.05, shrinkage = TRUE)
```
In order to compare we'll also pull out arm level CN alterations in the scDNAseq.
```{r}
hscn_dna_arm <- per_chrarm_cn(SA1049hscn$data, arms = unique(hscn_rna_arm$chrarm))
```
### QC
With our output data we can then generate some plots to QC the output. First of all we can plot the distribution of BAF values per state.
```{r, fig.width=8, fig.height=12}
library(ggplot2)
cowplot::plot_grid(plotBAFperstate(hscn_rna_arm) + ggtitle("RNA"),
plotBAFperstate(hscn_dna_arm) + ggtitle("DNA"), ncol = 1)
```
As in this example the cells have come from the same aliquot we'd expect the proportion of cells with different alterations to be similar. We can vizualize this with the following plot.
```{r, fig.width=8.5}
plot_proportions(hscn_dna_arm, hscn_rna_arm)
```
### Heatmaps
Finally we can plot heatmaps of the resulting outputs.
Firstly using the scRNA.
```{r, fig.width=8.5}
cl <- umap_clustering(hscn_rna_arm, field = "BAF")
plotHeatmap(hscn_rna_arm,
clusters = cl$clustering,
reorderclusters = TRUE,
plotfrequency = TRUE,
spacer_cols = 15,
plotcol = "state_BAF",
plottree = FALSE,
widenarm = TRUE,
show_legend = FALSE)
```
And here for scDNA.
```{r, fig.width=8.5}
plotHeatmap(hscn_dna_arm,
plotfrequency = TRUE,
spacer_cols = 15,
plotcol = "state_BAF",
plottree = FALSE,
widenarm = TRUE,
show_legend = FALSE)
```
## ASCN inference without prior knowledge
In this section, we'll perform the inference without prior knowledge. In this case we don't attempt to identify accurate integer level allele specific copy number but rather identify chromosome arms that are either heterezygous, have imbalance or are balanced.
```{r}
haplotypes_noprior <- format_haplotypes_rna(SA1049haps)
```
We'll plot the distribution as before.
```{r, fig.width = 6, fig.height=6}
per_chr_baf(haplotypes_noprior, perarm = TRUE)
```
```{r}
hscn_rna_arm_noprior <- assign_states_noprior(haplotypes_noprior, shrinkage = TRUE)
```
We can also plot the raw BAF heatmap as before.
```{r, fig.width=7}
plotHeatmapBAF(haplotypes_noprior)
```
As well as the inferred state level heatmaps, using the same cluster ordering as before.
```{r, fig.width=8.5}
plotHeatmap(hscn_rna_arm_noprior,
clusters = cl$clustering,
reorderclusters = TRUE,
plotfrequency = TRUE,
spacer_cols = 15,
plotcol = "state_BAF",
plottree = FALSE,
widenarm = TRUE,
show_legend = FALSE)
```
## Dirichlet Process clustering
```{r, fig.width=8.5}
res <- assign_states_dp(hscn_dna_arm, samples = 5)
plotHeatmap(res$ascn,
clusters = res$clusters,
reorderclusters = TRUE,
plotcol = "state_phase",
widenarm = TRUE,
plottree = FALSE)
```
```{r, fig.width=8.5}
x <- per_arm_baf_mat(haplotypes)
res <- assign_states_dp(x$bafperchr, samples = 5, overwrite_chr = c("1q"))
plotHeatmap(res$ascn,
clusters = res$clusters,
reorderclusters = TRUE,
plotcol = "state_phase",
widenarm = TRUE,
plottree = FALSE)
```