adding markdown templates for top separators to the snakemake workflow
JohannesGawron committed Jul 18, 2024
1 parent 8a5144b commit 45450f4
Showing 16 changed files with 6,896 additions and 1 deletion.
1 change: 0 additions & 1 deletion .gitignore
@@ -12,7 +12,6 @@ experiments/.snakemake/*
*.Rhistory
experiments/assessing_cluster_clonality/data/markdowns/
!experiments/assessing_cluster_clonality/data/markdowns/*topSeparators.Rmd
experiments/assessing_cluster_clonality/data/htmls/*files
experiments/assessing_cluster_clonality/logs/
experiments/assessing_cluster_clonality/workflow/scripts/CTC_parseSampling/testData
experiments/assessing_cluster_clonality/data/htmls/*
@@ -0,0 +1,141 @@
---
title: "Top separating mutations for CTC cluster cell pairs"
author: "Katharina Jahn, Johannes Gawron"
date: "March 2024"
output:
  html_document:
    keep_md: yes
---

## Index

1. Data
2. Method (explained based on LM2)
3. Results for other cases


## Data

```{r initialization}
source("../../workflow/resources/annotateVariants.R")
sample_name <- "Br11"
input_folder <- "/cluster/work/bewi/members/jgawron/projects/CTC/input_folder"
```

#### Mutation distance matrix
For each cluster (defined by color), we computed a pairwise distance for each
mutation pair that indicates how often the two mutations occur in the same
private branch of cells from the cluster:

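$$
\mathrm{dist}(M_1, M_2) =
\begin{cases}
0 & \text{if } M_1 = M_2, \\
1 - f(M_1, M_2) & \text{otherwise,}
\end{cases}
$$

where $f(M_1, M_2)$ is the fraction of samples in which $M_1$ and $M_2$ both lie
in the same private branch of a cell from the cluster.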

A **private branch** is defined as the path from a leaf to the node just below
the lowest common ancestor (LCA) of this leaf and another leaf from the same
cluster.

This is a generalization of the earlier method for finding the top separating
mutations of pairs of leaves. The generalization was necessary to handle the
larger clusters that were broken into more than two pieces.
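
As a minimal sketch of the distance definition above, the following toy example computes the matrix from a hypothetical table that records, for each sample, which private branch carries each mutation. The table `branch`, its labels `"a"` and `"b"`, and the name `toy_dist` are illustrative assumptions, not the format of the `_postSampling_` files read below.

```r
# Hypothetical toy input: one row per sample, one column per mutation.
# Each entry names the private branch carrying the mutation in that sample,
# or NA if the mutation is not on any private branch of the cluster.
branch <- rbind(
  s1 = c(M1 = "a", M2 = "a", M3 = "b"),
  s2 = c(M1 = "a", M2 = "b", M3 = "b"),
  s3 = c(M1 = "a", M2 = "a", M3 = NA),
  s4 = c(M1 = NA, M2 = "b", M3 = "b")
)

muts <- colnames(branch)
toy_dist <- matrix(0, length(muts), length(muts), dimnames = list(muts, muts))
for (i in muts) {
  for (j in muts) {
    if (i != j) {
      # TRUE for samples in which both mutations share a private branch
      together <- !is.na(branch[, i]) & !is.na(branch[, j]) &
        branch[, i] == branch[, j]
      # distance = 1 - fraction of samples with co-occurrence
      toy_dist[i, j] <- 1 - mean(together)
    }
  }
}
toy_dist
```

In this toy table, mutations that never share a private branch get a distance of 1, and the diagonal stays at 0, as in the definition.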

```{r}
cluster_name <- "lightcoral"
d <- read.table(
  file.path(
    input_folder, sample_name,
    paste0(sample_name, "_postSampling_", cluster_name, ".txt")
  ),
  header = TRUE, sep = "\t", stringsAsFactors = FALSE, row.names = 1
)
mat <- as.matrix(d)
mat[1:4, 1:4]
```

#### Position-wise coverage score
For each position, we computed the percentage of samples that have a coverage of
at least 3 at this position. This is meant as a simple score of the data quality
of a position that can be used in addition to the separation score to pick
mutations for the wet lab experiments. Furthermore, we added simple functional
annotations to the variants.

```{r message=FALSE}
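# inner_join() comes from dplyr; attaching it here is harmless if
# annotateVariants.R already does so.
library(dplyr)
coverage <- read.table(
  file.path(
    input_folder, sample_name,
    paste(sample_name, "covScore.txt", sep = "_")
  ),
  header = TRUE,
  sep = "\t", stringsAsFactors = FALSE, row.names = 1
)
coverage$variantName <- rownames(coverage)
head(coverage)
# add functional annotations and keep only variants present in both tables
annotations <- annotate_variants(sample_name, input_folder)
coverage <- inner_join(coverage, annotations, by = "variantName")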
```
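
The coverage score described above can be illustrated with a small sketch. The read-depth matrix `depth` and the names `min_cov` and `cov_score` are hypothetical; how the `covScore.txt` file was actually produced is not shown in this template.

```r
# Hypothetical toy read depths: rows are positions, columns are samples.
depth <- rbind(
  pos1 = c(5, 0, 3, 10),
  pos2 = c(1, 2, 0, 4),
  pos3 = c(3, 3, 3, 3)
)

min_cov <- 3 # a sample counts as covered at depth >= 3

# Percentage of samples per position that reach the minimum coverage.
cov_score <- rowMeans(depth >= min_cov) * 100
cov_score
```

A position with a low score is covered in few samples, so a mutation there may be a poor candidate even if it separates the cluster pieces well.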

## Method
#### Mutation clustering
1. Overview: Raw plot of the distance matrix.
2. Filter distant mutations: Remove all mutations that are not close to any
other mutation (minimum distance at or above the cutoff, set to 0.9 in the code
below).
3. Dendrogram: Use the distance matrix to cluster the mutations using
hierarchical clustering.
4. Cluster remaining mutations: Re-do the hierarchical clustering with the
remaining mutations.
5. Define cut point to get about as many groups as there are cluster pieces
6. Rank top separating mutations: Within each group, reduce distance matrix to
mutations in the group, rank them by their average distance to other mutations
in the group.
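
Steps 3 to 6 can be sketched on a toy distance matrix as follows. The matrix `toy`, the choice `k = 2`, and the ascending sort order are assumptions made for illustration; the template does not state the ranking direction, and this is not necessarily how the full workflow implements these steps.

```r
# Hypothetical toy distance matrix with two obvious groups of mutations.
muts <- paste0("M", 1:5)
toy <- matrix(0.9, 5, 5, dimnames = list(muts, muts))
toy[1:3, 1:3] <- 0.2 # M1-M3 often share a private branch
toy[4:5, 4:5] <- 0.1 # M4-M5 often share a private branch
toy[1, 2] <- toy[2, 1] <- 0.1
diag(toy) <- 0

# Steps 3/4: average-linkage hierarchical clustering of the mutations.
hc_toy <- hclust(as.dist(toy), method = "average")

# Step 5: cut the tree into k groups (k = assumed number of cluster pieces).
groups <- cutree(hc_toy, k = 2)

# Step 6: within each group, rank mutations by their mean distance to the
# other group members (self-distance is 0, so divide by group size - 1).
ranking <- lapply(split(names(groups), groups), function(g) {
  sub <- toy[g, g, drop = FALSE]
  avg <- rowSums(sub) / pmax(length(g) - 1, 1)
  sort(avg) # assumption: smallest average distance ranks first
})
ranking
```

Here `cutree()` recovers the two planted groups, and within the first group the mutation with the largest average distance to its group members ranks last.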


### Overview
To get an overview, we plot the full distance matrix:

```{r raw plot}
library(heatmaply)
heatmaply(mat)
```


### Filter out distant mutations

```{r}
mat2 <- mat
diag(mat2) <- 1 # mask the zero self-distances so they cannot be the minimum
min_dist <- apply(mat2, 1, min) # minimum distance to any other mutation
selected_muts <- which(min_dist < 0.9) # keep mutations with a neighbour closer than 0.9
mat2 <- mat[selected_muts, selected_muts]
```


This is what the distance matrix looks like now:

```{r}
heatmaply(mat2)
```

```{r}
coverage %>% filter(variantName %in% colnames(mat2))
```


### Dendrogram of the remaining mutations

To cluster mutations, we create a dendrogram based on the pairwise distances:
```{r}
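# use the filtered matrix mat2 so the dendrogram shows the remaining mutations,
# as the section heading states
d_mat <- as.dist(mat2)
# hierarchical clustering of mutations based on the distance matrix
hc <- hclust(d_mat, method = "average")
par(cex = 0.6)
plot(
  hc,
  main = "Dendrogram based on average pairwise distance", sub = "",
  xlab = "Separating mutations"
)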
```


No apparent clustering visible.