forked from iaradsouza1/intro-single-cell
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04_data_integration.qmd
272 lines (212 loc) · 11.4 KB
/
04_data_integration.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
---
title: "Data integration"
format: html
---
## Motivation
In single-cell RNA-seq analysis, the integration analysis refers to combine data from multiple samples or experiments to create an unified dataset. This integrated dataset can then be used for cell type annotation, where computational methods are applied to assign cell types to individual cells based on their gene expression patterns, just like we did before for a unique sample. By integrating data from multiple samples, we increase the diversity and the number of cells in out dataset, which can improve the accuracy and resolution of cell type annotation (and other analyses).
The integration process aims to remove unwanted differences between samples, so we can compare their cell types. However, common single-cell integration methods are not optimal, meaning that when they try to correct batch effects among the samples, they may also inadvertently remove biological variability (which is the kind of variability we actually care about). Another important limitation of integration methods is the assumption of similar cell type content across different samples. If you have samples with very different expected cell type content (e.g., a sample from liver and another one from brain cortex), there's no sense on integrating them.
Before integrating datasets, one needs to evaluate if there's actually a necessity to do that, since the integration process will impact the data. Also, there's no simple rule about how to integrate datasets, this decision will depend on the experimental design used and the biological questions associated with the analysis.
## Integrating data with Seurat
```{r}
library(Seurat)
library(tidyverse)
library(biomaRt)
library(patchwork)
library(SingleR)
library(SingleCellExperiment)
```
We're loading the whole dataset (four samples: two samples from healthy individuals and two from individuals who died from sepsis).
```{r}
load("results/sc_qc_split.rda")
```
Now let's do the standard analysis...
```{r}
sc_qc_split_unintegrated <- sc_qc_split %>%
NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA() %>%
FindNeighbors(reduction = "pca", dims = 1:20) %>%
FindClusters(resolution = 0.5) %>%
RunUMAP(dims = 1:20, n.components = 2, seed.use = 123) %>%
RunTSNE(dims = 1:20, n.components = 2, seed.use = 123)
```
For the Seurat object split into layers, the steps run previously (normalization, find variable features, scaling, etc.) are performed in each layer separatedly, but a consensus of HVG is automatically defined.
```{r}
DimPlot(sc_qc_split_unintegrated, reduction = "umap", group.by = c("group", "seurat_clusters", "orig.ident"))
```
Let's visualize the number of cells by clusters and groups:
```{r}
table(sc_qc_split_unintegrated$seurat_clusters, sc_qc_split_unintegrated$group)
```
Clearly, the number of cells in each cluster/group is unbalanced. This dataset needs to be integrated.
## Integrate using all methods made available by Seurat
The integration procedure returns a dimensional reduction that captures the shared sources of variance (it does not return the matrix of gene expression corrected by batch). This reduction is used to clustering analysis.
```{r}
# Use all possible integration methods made available by Seurat v5
options(future.globals.maxSize = 1000 * 1024^2)
# CCA (Seurat's default)
sc_qc_integrated_cca <- IntegrateLayers(
object = sc_qc_split_unintegrated, method = CCAIntegration,
orig.reduction = "pca", new.reduction = "integrated.cca",
verbose = FALSE
) %>%
FindNeighbors(reduction = "integrated.cca", dims = 1:30, reduction.name = "umap.cca") %>%
FindClusters(resolution = 0.3) %>%
RunUMAP(reduction = "integrated.cca", dims = 1:30)
# Harmony
sc_qc_integrated_harmony <- IntegrateLayers(
object = sc_qc_split_unintegrated, method = HarmonyIntegration,
orig.reduction = "pca", new.reduction = "integrated.harmony",
verbose = FALSE
) %>%
FindNeighbors(reduction = "integrated.harmony", dims = 1:30, reduction.name = "umap.harmony") %>%
FindClusters(resolution = 0.3) %>%
RunUMAP(reduction = "integrated.harmony", dims = 1:30)
# RPCA
sc_qc_integrated_rpca <- IntegrateLayers(
object = sc_qc_split_unintegrated, method = RPCAIntegration,
orig.reduction = "pca", new.reduction = "integrated.rpca",
verbose = FALSE
) %>%
FindNeighbors(reduction = "integrated.rpca", dims = 1:30, reduction.name = "umap.rpca") %>%
FindClusters(resolution = 0.3) %>%
RunUMAP(reduction = "integrated.rpca", dims = 1:30)
# Joint PCA
sc_qc_integrated_jointpca <- IntegrateLayers(
object = sc_qc_split_unintegrated, method = JointPCAIntegration,
orig.reduction = "pca", new.reduction = "integrated.jointpca",
verbose = FALSE
) %>%
FindNeighbors(reduction = "integrated.jointpca", dims = 1:30, reduction.name = "umap.jointpca") %>%
FindClusters(resolution = 0.3) %>%
RunUMAP(reduction = "integrated.jointpca", dims = 1:30)
```
Let's check the results from CCA and Harmony methods:
```{r}
p1 <- DimPlot(sc_qc_integrated_cca, group.by = c("orig.ident", "group"), reduction = "integrated.cca") + ggtitle("CCA")
p2 <- DimPlot(sc_qc_integrated_harmony, group.by = c("orig.ident", "group"), reduction = "integrated.harmony") + ggtitle("Harmony")
p3 <- DimPlot(sc_qc_integrated_rpca, group.by = c("orig.ident", "group"), reduction = "integrated.rpca") + ggtitle("RPCA")
p4 <- DimPlot(sc_qc_integrated_jointpca, group.by = c("orig.ident", "group"), reduction = "integrated.jointpca") + ggtitle("Joint PCA")
```
```{r}
p1 / p2
```
```{r}
p3 / p4
```
Calculate mixing metrics (the lower the better):
```{r}
median(MixingMetric(sc_qc_integrated_cca, grouping.var = "orig.ident"))
median(MixingMetric(sc_qc_integrated_rpca, grouping.var = "orig.ident"))
median(MixingMetric(sc_qc_integrated_harmony, grouping.var = "orig.ident"))
median(MixingMetric(sc_qc_integrated_jointpca, grouping.var = "orig.ident"))
```
By Seurat's mixing metric, we can choose any of the results. For our next steps, we're choosing the CCA integration.
```{r}
sc_qc_integrated_cca[["RNA"]] <- JoinLayers(sc_qc_integrated_cca[["RNA"]])
DimPlot(sc_qc_integrated_cca, group.by = c("orig.ident", "group"), reduction = "integrated.cca")
DimPlot(sc_qc_integrated_cca, group.by = c("seurat_clusters", "group"), reduction = "umap")
```
Disclaimer: there's other ways of evaluating if our dataset was well integrated. As all steps in single-cell analysis, the integration analysis is iterative and can be redone if further steps indicate so. We can have better information if we annotate the integrated dataset and check if the cell labels inferred make sense. You can also try the approach detailed on [this](https://www.pnas.org/doi/10.1073/pnas.2313719121) paper, which implements a statistical test to evaluate the alignability of different datasets.
```{r}
save(sc_qc_integrated_cca, file = "results/integrated_cca.rda")
```
## Find conserved markers
Next, assuming the similar cell type content across samples, we can select the markers conserved across the conditions of interest. This means that genes are markers for that cell type irrespective of the cell condition or experimental group.
```{r}
# Find conserved markers for each cluster
map_dfr(unique([email protected]$RNA_snn_res.0.3), function(x) {
df <- FindConservedMarkers(sc_qc_integrated_cca, grouping.var = "group", ident.1 = x)
if(nrow(df) > 0) {
df$genes <- rownames(df)
df$cluster <- x
rownames(df) <- NULL
return(df)
}
}) -> conserved_markers
# Filter markers by proportion
conserved_markers %>%
mutate(diff_sepsis = non_survival_sepsis_pc.1 - non_survival_sepsis_pc.2,
diff_healthy = healthy_pct.1 - healthy_pct.2) %>%
filter(diff_sepsis > 0.5, diff_healthy > 0.5) -> conserved_markers_filtered
conserved_markers_filtered %>%
head()
```
## Annotate cells with `TransferData`
Different from the integration, with the transfer data approach, we do not modify the query data. The transfer data approach is similar to annotate cells with a reference-based approach. To do that, we need a reference single-cell dataset (you can download the reference atlas [here](https://datasets.cellxgene.cziscience.com/9074bc5a-a303-48ea-afb3-d429ac5b9dbe.rds)).
```{r}
ref <- readRDS("data/blood_tabula_sapiens/blood_atlas.rds")
DimPlot(ref, group.by = "cell_type", reduction = "umap")
```
The reference data genes are in ensembl gene id. Create a new Seurat object with HGNC symbol genes to match our query.
```{r}
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", ensembl)
ids <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name"),
filters = "ensembl_gene_id",
values = rownames(ref),
mart = ensembl)
genes_query <- rownames(sc_qc_integrated_cca)
ids <- ids %>%
filter(chromosome_name %in% c(1:22, "X", "Y"),
hgnc_symbol %in% genes_query)
duplicated_genes <- ids %>%
dplyr::count(hgnc_symbol) %>%
arrange(desc(n)) %>%
filter(n == 2) %>%
pull(hgnc_symbol)
# There are five genes with 2 ensembl gene ids for each symbol. For technical reasons we're excluding the duplicated genes. However, they need to be included on the count table. The best way is selecting the ensembl id of greatest variability.
duplicated_genes <- ids %>%
filter(hgnc_symbol %in% duplicated_genes) %>%
pull(ensembl_gene_id)
counts <- LayerData(ref, layer = "counts")
counts <- counts[-which(rownames(counts) %in% duplicated_genes),]
counts <- counts[which(rownames(counts) %in% ids$ensembl_gene_id),]
rownames(counts) <- ids$hgnc_symbol[match(rownames(counts), ids$ensembl_gene_id)]
blood_ref <- CreateSeuratObject(counts = counts, meta.data = [email protected])
rm(ref)
blood_ref <- blood_ref %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA %>%
RunUMAP(reduction = "pca", dims = 1:30)
DimPlot(blood_ref, group.by = c("assay", "donor_id"), reduction = "umap")
```
Select only the cells gathered from 10x 3'v3 assay:
```{r}
table([email protected]$assay)
blood_ref <- subset(blood_ref, assay == "10x 3' v3")
dim(blood_ref)
```
Integrate the reference considegring the donors as batches:
```{r}
blood_ref[["RNA"]] <- split(blood_ref[["RNA"]], f = blood_ref$donor_id)
blood_ref <- blood_ref %>%
NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA() %>%
FindNeighbors(reduction = "pca", dims = 1:30) %>%
FindClusters() %>%
RunUMAP(dims = 1:20, n.components = 2, seed.use = 123)
DimPlot(blood_ref, group.by = c("assay", "donor_id"), reduction = "pca")
```
```{r}
blood_ref <- IntegrateLayers(object = blood_ref, method = CCAIntegration, orig.reduction = "pca",
new.reduction = "integrated.cca", verbose = FALSE) %>%
FindNeighbors(reduction = "integrated.cca", dims = 1:30) %>%
FindClusters()
DimPlot(blood_ref, group.by = c("assay", "donor_id"), reduction = "integrated.cca")
```
Use TransferData()` to predict labels on our query dataset:
```{r}
# select two technologies for the query datasets
query <- JoinLayers(sc_qc_split_unintegrated)
anchors <- FindTransferAnchors(reference = blood_ref, query = query, dims = 1:30,
reference.reduction = "pca")
predictions <- TransferData(anchorset = anchors, refdata = blood_ref$cell_type, dims = 1:30)
query <- AddMetaData(query, metadata = predictions)
DimPlot(query, group.by = "predicted.id", label = T)
```