adding markdown templates for top separators to the snakemake workflow
1 parent 8a5144b, commit 45450f4
Showing 16 changed files with 6,896 additions and 1 deletion.
141 changes: 141 additions & 0 deletions
experiments/assessing_cluster_clonality/data/markdowns/Br11_topSeparators.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
--- | ||
title: "Top separating mutations for CTC cluster cell pairs" | ||
author: "Katharina Jahn, Johannes Gawron" | ||
date: "March 2024" | ||
output: | ||
html_document: | ||
keep_md: yes | ||
--- | ||
|
||
## Index | ||
|
||
1. Data | ||
2. Method (explained based on LM2) | ||
3. Results for other cases | ||
|
||
|
||
## Data | ||
|
||
```{r initialization} | ||
source("../../workflow/resources/annotateVariants.R") | ||
library(dplyr) # provides inner_join(), filter() and %>% used below | ||
sample_name <- "Br11" | ||
input_folder <- "/cluster/work/bewi/members/jgawron/projects/CTC/input_folder" | ||
``` | ||
|
||
#### Mutation distance matrix | ||
For each cluster (defined by color), we computed a pairwise distance for each | ||
mutation pair that indicates how often the two mutations occur in the same | ||
private branch of cells from the cluster: | ||
|
||
dist(M1, M2) = 0 (for M1 = M2) | ||
dist(M1, M2) = 1 - (fraction of samples where M1 and M2 are both in the same | ||
private branch of a cell from the cluster) (otherwise) | ||
|
||
A **private branch** is defined as the path from a leaf to the node just below | ||
the lowest common ancestor (LCA) of this leaf and another leaf from the same cluster. | ||
|
||
This is a generalization of the earlier method to find the top separating | ||
mutations of pairs of leaves. The generalization was necessary to handle the | ||
larger clusters that were broken into more than two pieces. | ||
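
The distance definition above can be illustrated with a minimal sketch. It assumes a hypothetical input, `branch_of`, that records for every sampled tree which private branch (named here after the cell it leads to) carries each mutation, with `NA` when the mutation lies on no private branch. The input format and the helper `toy_mutation_dist()` are made up for illustration and are not part of the workflow.

```r
# Hypothetical input: one row per sampled tree, one column per mutation.
# Each entry names the private branch that carries the mutation in that sample,
# or NA if the mutation is on no private branch of the cluster.
branch_of <- rbind(
  c(M1 = "cellA", M2 = "cellA", M3 = "cellB"),
  c(M1 = "cellA", M2 = "cellA", M3 = NA),
  c(M1 = "cellA", M2 = "cellB", M3 = "cellB"),
  c(M1 = NA,      M2 = "cellB", M3 = "cellB")
)

# Illustrative helper (not from the workflow): pairwise mutation distance as
# 1 - fraction of samples in which both mutations share a private branch.
toy_mutation_dist <- function(branch_of) {
  muts <- colnames(branch_of)
  n <- length(muts)
  out <- matrix(0, n, n, dimnames = list(muts, muts))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) next # dist(M, M) = 0 by definition
      same <- !is.na(branch_of[, i]) & !is.na(branch_of[, j]) &
        branch_of[, i] == branch_of[, j]
      out[i, j] <- 1 - mean(same)
    }
  }
  out
}

toy_dist <- toy_mutation_dist(branch_of)
toy_dist
```

In this toy input, M1 and M2 share a private branch in two of four samples, so their distance is 0.5, while M1 and M3 never do and get distance 1.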
|
||
```{r} | ||
cluster_name <- "lightcoral" | ||
d <- read.table( | ||
file.path( | ||
input_folder, sample_name, | ||
paste0(sample_name, "_postSampling_", cluster_name, ".txt") | ||
), | ||
header = TRUE, sep = "\t", stringsAsFactors = FALSE, row.names = 1 | ||
) | ||
mat <- as.matrix(d) | ||
mat[1:4, 1:4] | ||
``` | ||
|
||
#### Position-wise coverage score | ||
For each position, we computed the percentage of samples that have a coverage of | ||
at least 3 at this position. This is meant as a simple score of the data quality | ||
of a position that can be used in addition to the separation score to pick | ||
mutations for the wet lab experiments. Furthermore, we added simple functional | ||
annotations to the variants. | ||
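
As a rough illustration of the coverage score described above, the following sketch computes the fraction of samples with a read depth of at least 3 per position. The depth matrix, the position names and the column name `covScore` are assumptions for this example; the actual `covScore.txt` file may be laid out differently.

```r
# Hypothetical read-depth matrix: rows are positions, columns are samples.
depth <- matrix(
  c(10, 0, 5, 3,
     2, 1, 0, 4,
     7, 8, 9, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("chr1_100", "chr2_200", "chr3_300"), paste0("s", 1:4))
)

min_cov <- 3 # coverage threshold stated in the text

# Fraction of samples per position whose depth reaches the threshold.
cov_score <- rowMeans(depth >= min_cov)

# Assumed output layout: one row per variant with its score.
toy_coverage <- data.frame(
  variantName = names(cov_score),
  covScore = unname(cov_score),
  stringsAsFactors = FALSE
)
toy_coverage
```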
|
||
```{r message=FALSE} | ||
coverage <- read.table( | ||
file.path( | ||
input_folder, sample_name, | ||
paste(sample_name, "covScore.txt", sep = "_") | ||
), | ||
header = TRUE, | ||
sep = "\t", stringsAsFactors = FALSE, row.names = 1 | ||
) | ||
coverage$variantName <- rownames(coverage) | ||
head(coverage) | ||
annotations <- annotate_variants(sample_name, input_folder) | ||
coverage <- inner_join(coverage, annotations, by = "variantName") | ||
``` | ||
|
||
## Method | ||
#### Mutation clustering | ||
1. Overview: Raw plot of the distance matrix. | ||
2. Filter distant mutations: Remove all mutations that are not close to any | ||
other mutations (minDist>0.5) | ||
3. Dendrogram: Use the distance matrix to cluster the mutations using | ||
hierarchical clustering. | ||
4. Cluster remaining mutations: Re-do the hierarchical clustering with the | ||
remaining mutations. | ||
5. Define cut point: Cut the dendrogram to get about as many groups as there are cluster pieces. | ||
6. Rank top separating mutations: Within each group, reduce distance matrix to | ||
mutations in the group, rank them by their average distance to other mutations | ||
in the group. | ||
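
The steps listed above can be sketched end to end on a small toy distance matrix. Everything here is illustrative: the matrix, the helper `set_pair()`, the choice of `k = 2` groups, and the ascending sort order in the ranking step are assumptions, since the list does not say whether a small or a large average distance ranks first. The 0.5 threshold is the one named in step 2.

```r
# Toy distance matrix: M1-M3 form one group, M4-M5 another, M6 is far from all.
muts <- paste0("M", 1:6)
toy <- matrix(0.9, 6, 6, dimnames = list(muts, muts))
set_pair <- function(m, a, b, d) { # illustrative helper: symmetric assignment
  m[a, b] <- d
  m[b, a] <- d
  m
}
toy <- set_pair(toy, "M1", "M2", 0.1)
toy <- set_pair(toy, "M1", "M3", 0.3)
toy <- set_pair(toy, "M2", "M3", 0.2)
toy <- set_pair(toy, "M4", "M5", 0.3)
toy["M6", ] <- 1
toy[, "M6"] <- 1
diag(toy) <- 0

# Step 2: drop mutations whose nearest neighbour is not closer than 0.5.
masked <- toy
diag(masked) <- 1 # mask self-distances so they do not count as neighbours
keep <- apply(masked, 1, min) < 0.5
filtered <- toy[keep, keep]

# Steps 3-4: average-linkage hierarchical clustering of the remaining mutations.
hc <- hclust(as.dist(filtered), method = "average")

# Step 5: cut the tree into as many groups as there are cluster pieces (assumed 2).
groups <- cutree(hc, k = 2)

# Step 6: within each group, rank mutations by average distance to the others.
ranking <- lapply(split(names(groups), groups), function(g) {
  sub <- filtered[g, g, drop = FALSE]
  avg <- rowSums(sub) / max(length(g) - 1, 1) # diagonal is 0, so self is excluded
  sort(avg) # ascending order is an assumption of this sketch
})
ranking
```

Cutting with `cutree(hc, k = ...)` fixes the number of groups directly; `cutree(hc, h = ...)` would instead cut at a chosen height in the dendrogram.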
|
||
|
||
### Overview | ||
To get an overview, we plot the full distance matrix: | ||
|
||
```{r raw plot} | ||
library(heatmaply) | ||
heatmaply(mat) | ||
``` | ||
|
||
|
||
### Filter out distant mutations | ||
|
||
```{r} | ||
mat2 <- mat | ||
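diag(mat2) <- 1 # mask the zero self-distances so they do not win the minimum | ||
min_dist <- apply(mat2, 1, min) # minimum distance to any other mutation | ||
selected_muts <- which(min_dist < 0.9) # keep mutations closer than 0.9 to another one (the method outline names 0.5) | ||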
mat2 <- mat[selected_muts, selected_muts] | ||
``` | ||
|
||
|
||
This is what the distance matrix looks like now: | ||
|
||
```{r} | ||
heatmaply(mat2) | ||
``` | ||
|
||
```{r} | ||
coverage %>% filter(variantName %in% colnames(mat2)) | ||
``` | ||
|
||
|
||
### Dendrogram of the remaining mutations | ||
|
||
To cluster mutations, we create a dendrogram based on the pairwise distances: | ||
```{r} | ||
d_mat <- as.dist(mat) # note: built from the full matrix `mat`, not the filtered `mat2` | ||
# average-linkage hierarchical clustering of mutations based on the distance matrix | ||
hc <- hclust(d_mat, "average") | ||
par(cex = 0.6) | ||
plot( | ||
hc, | ||
main = "Dendrogram based on average pairwise distance", sub = "", | ||
xlab = "Separating mutations" | ||
) | ||
``` | ||
|
||
|
||
No apparent clustering visible. |