adding markdown templates for top separators to the snakemake workflow

cbg-ethz · Jul 18, 2024 · 45450f4 · 45450f4
1 parent 8a5144b
commit 45450f4
Show file tree

Hide file tree

Showing 16 changed files with 6,896 additions and 1 deletion.
diff --git a/.gitignore b/.gitignore
@@ -12,7 +12,6 @@ experiments/.snakemake/*
 *.Rhistory
 experiments/assessing_cluster_clonality/data/markdowns/
 !experiments/assessing_cluster_clonality/data/markdowns/*topSeparators.Rmd
-experiments/assessing_cluster_clonality/data/htmls/*files
 experiments/assessing_cluster_clonality/logs/
 experiments/assessing_cluster_clonality/workflow/scripts/CTC_parseSampling/testData
 experiments/assessing_cluster_clonality/data/htmls/*

diff --git a/experiments/assessing_cluster_clonality/data/markdowns/Br11_topSeparators.Rmd b/experiments/assessing_cluster_clonality/data/markdowns/Br11_topSeparators.Rmd
@@ -0,0 +1,141 @@
+---
+title: "Top separating mutations for CTC cluster cell pairs"
+author: "Katharina Jahn, Johannes Gawron"
+date: "March 2024"
+output:
+  html_document:
+    keep_md: yes
+---
+
+## Index
+
+1. Data
+2. Method (explained based on LM2)
+3. Results for other cases
+
+
+## Data 
+
+```{r initialization}
+source("../../workflow/resources/annotateVariants.R")
+sample_name <- "Br11"
+input_folder <- "/cluster/work/bewi/members/jgawron/projects/CTC/input_folder"
+```
+
+#### Mutation distance matrix
+For each cluster (defined by color), we computed a pairwise distance for each
+mutation pair that indicates how often the two mutations occur in the same
+private branch of cells from the cluster:
+
+	dist(M1, M2) = 0 (for M1 = M2)
+	dist(M1,M2) = 1 - (%samples where M1 and M2 are both in the same private
+	branch of a cell from the cluster) (elsewise)
+
+A **private branch** is defined as the path from a leaf to the node just below
+the LCA of this leaf to another leaf from the same cluster.
+
+This is a generalization of the earlier method to find the top seperating
+mutations of pairs of leafs. The generalization was necessary to handle the
+larger clusters that were broken in more than 2 pieces.
+
+```{r}
+cluster_name <- "lightcoral"
+
+d <- read.table(
+  file.path(
+    input_folder, sample_name,
+    paste0(sample_name, "_postSampling_", cluster_name, ".txt")
+  ),
+  header = TRUE, sep = "\t", stringsAsFactors = FALSE, row.names = 1
+)
+mat <- as.matrix(d)
+mat[1:4, 1:4]
+```
+
+#### Position-wise coverage score
+For each position, we computed the percentage of samples that have a coverage of
+at least 3 at this position. This is meant as a simple score of the data quality
+of a position that can be used in addition to the separation score to pick
+mutations for the wet lab experiments. Furthermore, we added simple functional
+annotations to the variants.
+
+```{r message=FALSE}
+coverage <- read.table(
+  file.path(
+    input_folder, sample_name,
+    paste(sample_name, "covScore.txt", sep = "_")
+  ),
+  header = TRUE,
+  sep = "\t", stringsAsFactors = FALSE, row.names = 1
+)
+coverage$variantName <- rownames(coverage)
+head(coverage)
+annotations <- annotate_variants(sample_name, input_folder)
+
+coverage <- inner_join(coverage, annotations, by = "variantName")
+```
+
+## Method
+#### Mutation clustering
+1. Overview: Raw plot of the distance matrix.
+2. Filter distant mutations: Remove all mutations that are not close to any
+other mutations (minDist>0.5)
+3. Dendrogram: Use the distance matrix to cluster the mutations using
+hierarchical clustering.
+4. Cluster remaining mutations: Re-do the hierarchical clustering witht the
+remaining mutations
+5. Define cut point to get about as many groups as there are cluster pieces
+6. Rank top separating mutations: Within each group, reduce distance matrix to
+mutations in the group, rank them by their average distance to other mutations
+in the group.
+
+
+###Overview
+To get an overview, we plot the full distance matrix:
+
+```{r raw plot}
+library(heatmaply)
+
+heatmaply(mat)
+```
+
+
+### Filter out distant mutations
+
+```{r}
+mat2 <- mat
+diag(mat2) <- 1
+min_dist <- apply(mat2, 1, min) # find minimum distance to other mutations
+selected_muts <- which(min_dist < 0.9) # select those below 0.5 say
+mat2 <- mat[selected_muts, selected_muts]
+```
+
+
+This is what the distance matrix looks like now:
+
+```{r}
+heatmaply(mat2)
+```
+
+```{r}
+coverage %>% filter(variantName %in% colnames(mat2))
+```
+
+
+### Dendrogram of the remaining mutations
+
+To cluster mutations, we create a dendrogram based on the pairwise distances:
+```{r}
+d_mat <- as.dist(mat)
+hc <- hclust(d_mat, "average") ## hierarchical clustering of mutations based on
+# distance matrix
+par(cex = 0.6)
+plot(
+  hc,
+  main = "Dendrogram based on average pairwise distance", sub = "",
+  xlab = "Separating mutations"
+)
+```
+
+
+No apparent clustering visible.