Testing readability thing #5

Closed
wants to merge 2 commits into from
16 changes: 0 additions & 16 deletions .github/ISSUE_TEMPLATE

This file was deleted.

81 changes: 0 additions & 81 deletions README.md
@@ -1,86 +1,5 @@
# GI_mapping

This repository takes paired guide RNA counts data and calculates Genetic Interaction (GI) scores. It also performs some QC and filtering along the way.

### How it can be run

The entire analysis can be re-run on the Fred Hutch clusters if this repository is cloned there and run from the Berger lab folders using the `run_pipeline.sh` script.

The `run_pipeline.sh` script calls a Snakemake workflow (`workflows/Snakefile`). This is the core of the analysis and does the following:

- It uses a conda environment so it has the Python libraries and other software it needs to run. This information comes from the `env/config.sample.yaml` file.
- Next, it gets the count file names from a specific directory and stores them in a list, along with the cell line names extracted from the file names.
- It defines a rule named `all`, which specifies the final output files that the workflow should produce. The `expand` function generates multiple file paths by substituting the `{cell_line}` wildcard with each cell line name in `cell_line_list`, so the same rule can be used for multiple cell lines.
- Next comes a series of steps, each defined by a "rule". Each rule specifies its own input, output, conda environment, parameters, log file, and shell command to execute.
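
To make the wildcard idea concrete, here is a minimal plain-Python sketch of the file listing and substitution described above. It is not taken from the Snakefile: the `_counts.txt` suffix, the example file names, and the `results/...` output pattern are all hypothetical, and `expand_paths` only imitates what Snakemake's `expand` does for a single wildcard.

```python
from pathlib import Path


def cell_line_from_filename(filename: str, suffix: str = "_counts.txt") -> str:
    """Extract the cell line name from a count file name.

    The suffix is a hypothetical naming convention, not the repository's.
    """
    name = Path(filename).name
    if not name.endswith(suffix):
        raise ValueError(f"unexpected count file name: {filename}")
    return name[: -len(suffix)]


def expand_paths(pattern: str, cell_lines: list[str]) -> list[str]:
    """Imitate Snakemake's expand() for one wildcard named {cell_line}."""
    return [pattern.format(cell_line=cell_line) for cell_line in cell_lines]


# Hypothetical count files, standing in for a directory listing.
count_files = ["data/HeLa_counts.txt", "data/PC9_counts.txt"]
cell_line_list = [cell_line_from_filename(f) for f in count_files]

# One final output path per cell line, as rule `all` would request.
targets = expand_paths("results/{cell_line}/GI_scores.tsv", cell_line_list)
print(targets)
```

In an actual Snakefile the last step would be written with Snakemake's own `expand`, along the lines of `expand("results/{cell_line}/GI_scores.tsv", cell_line=cell_line_list)`.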

Each step has these settings (described here in plain language):
```
input:
"This builds together the input file name using the wildcards specified at the start of the file"
output:
"This specifies where the output results files should be stored"
conda:
"This tells us where the conda environment file is for this step so we have the packages we need"
params:
"This is defining the files and folders and other parameters"
log:
"Where the log should be stored"
shell:
"A shell command to be run that has the wildcards that are defined above -- this is what is doing the work"
```
### Core steps in the workflow:

In the Snakefile you can see where each step is called, but to see what actually happens in a step you need to look at the corresponding Rmd file in the `scripts` folder.

### `scripts/pgRNA_counts_QC.Rmd`

This Rmd runs QC and applies a low count filter

- It makes a cumulative distribution function
- Prints out the counts per million (CPM) per sample
- Does a sample-to-sample correlation
- Flags samples that may not have enough counts for the plasmid
  - This requires a cutoff to be set -- how low is too low?
- Then prints out what was removed.
- `get_pgRNA_annotations` - This annotates the data
  - It grabs Entrez IDs and Ensembl annotation
  - Grabs copy number and transcripts per million (TPM) data from a cancer dependency dataset using [depMap](https://bioconductor.org/packages/release/data/experiment/vignettes/depmap/inst/doc/depmap.html#1_introduction)
  - It labels genes as `negative_control`, `positive_control`, `single_targeting` or `double_targeting`.
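
As a rough illustration of the CPM and low count steps, the sketch below is written in Python rather than in the R used by the Rmd. The function names, the example counts, and the cutoff value are all made up for the example, and applying the cutoff to plasmid CPM per pgRNA is an assumption about how the filter works.

```python
def counts_per_million(counts: list[int]) -> list[float]:
    """Scale raw counts for one sample so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [count / total * 1_000_000 for count in counts]


def low_count_mask(plasmid_cpm: list[float], cutoff: float) -> list[bool]:
    """Return True for each pgRNA whose plasmid CPM falls below the cutoff."""
    return [cpm < cutoff for cpm in plasmid_cpm]


# Toy plasmid counts for three pgRNAs (hypothetical values).
plasmid_counts = [500, 20, 480]
plasmid_cpm = counts_per_million(plasmid_counts)

# The cutoff is a judgment call; this value is only for illustration.
flagged = low_count_mask(plasmid_cpm, cutoff=50_000.0)
kept = [cpm for cpm, is_low in zip(plasmid_cpm, flagged) if not is_low]
print(flagged, len(kept))
```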

### `scripts/calculate_LFC.Rmd`

This Rmd calculates log fold change and makes heatmaps

- Does filtering based on annotations
- Uses custom plotting functions to make heatmaps and violin plots
- Calculates SSMD?
- Investigates correlations across replicates
- How log fold change is normalized:
  - `Take LFC, then subtract median of negative controls. This will result in the median of the nontargeting being set to 0. Then, divide by the median of negative controls (double non-targeting) minus median of positive controls (targeting 1 essential gene). This will effectively set the median of the positive controls (essential genes) to -1.`
  - `Since the pgPEN library uses non-targeting controls, we adjusted for the fact that single-targeting pgRNAs generate only two double-strand breaks (1 per allele), whereas the double-targeting pgRNAs generate four DSBs. To do this, we set the median (adjusted) LFC for unexpressed genes of each group to zero.`
- Does different handling for single level targeting versus double level targeting
- Calculates target level values
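
The first normalization quoted above can be sketched as follows. This is a Python illustration, not the code in the Rmd; the function name and the toy LFC values are hypothetical. It shows that after the transformation the negative controls have a median of 0 and the positive controls a median of -1.

```python
from statistics import median


def normalize_lfc(lfc: list[float], neg_ctrl: list[float], pos_ctrl: list[float]) -> list[float]:
    """Shift and scale LFC values using the control medians.

    adjusted = (LFC - median(neg)) / (median(neg) - median(pos))
    """
    neg_median = median(neg_ctrl)
    pos_median = median(pos_ctrl)
    scale = neg_median - pos_median
    if scale == 0:
        raise ValueError("control medians are identical; cannot scale")
    return [(value - neg_median) / scale for value in lfc]


# Toy LFC values (hypothetical): negative controls near 0, positive controls depleted.
neg_ctrl = [0.2, -0.1, 0.1]
pos_ctrl = [-2.0, -1.5, -2.5]

adjusted_neg = normalize_lfc(neg_ctrl, neg_ctrl, pos_ctrl)
adjusted_pos = normalize_lfc(pos_ctrl, neg_ctrl, pos_ctrl)
print(median(adjusted_neg), median(adjusted_pos))
```

The second adjustment (setting the median adjusted LFC of unexpressed genes to zero within the single-targeting and double-targeting groups) would be a further per-group median subtraction and is not shown here.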

### `scripts/calculate_GI_scores.Rmd`

This Rmd calculates Genetic Interaction scores

- Calculates CRISPR mean score and handles single versus double targeted genes differently
- Creates a linear model
- Plots the GI scores
- Lastly, a Wilcoxon rank-sum test and t-tests are performed using the calculated GI scores
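
One way the linear model step could work is sketched below in Python. This README does not spell out the model, so the sketch rests on stated assumptions: the expected score for a double-targeting pair is taken to be the sum of the two single-targeting scores, observed scores are regressed on expected scores, and the GI score is the residual. All names and numbers are hypothetical.

```python
def fit_line(x: list[float], y: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def gi_scores(expected: list[float], observed: list[float]) -> list[float]:
    """GI score per pair: observed score minus the value predicted by the fit."""
    slope, intercept = fit_line(expected, observed)
    return [obs - (slope * exp + intercept) for exp, obs in zip(expected, observed)]


# Toy scores for four gene pairs (hypothetical values).
# expected: assumed to be the sum of the two single-targeting scores.
expected = [-1.0, -0.5, 0.0, -1.5]
observed = [-1.1, -0.4, 0.1, -3.0]

scores = gi_scores(expected, observed)
print(scores)
```

Under these assumptions, a strongly negative GI score marks a pair whose double knockout is more depleted than the fit predicts.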

### Utils

Additionally, there is a script with utility functions that other steps borrow from: `scripts/shared_functions_and_variables.R`.


### Original README is below:

* include background on file name formatting
* define the goals of the package, include a figure
* write out all requirements for formatting files, etc. (or make a readthedocs?)

## To Do

### Phoebe