FredHutch · cansavvy · Nov 25, 2024 · Nov 26, 2024
diff --git a/.github/ISSUE_TEMPLATE b/.github/ISSUE_TEMPLATE
diff --git a/README.md b/README.md
@@ -1,86 +1,5 @@
 # GI_mapping
 
-This repository takes paired guide RNA counts data and calculates Genetic Interaction (GI) scores. It also performs some QC and filtering along the way.
-
-### How it can be run
-
-The entire analysis can be re-run on the Fred Hutch clusters if this repository is cloned there and ran from the Berger lab folders using the `run_pipeline.sh` script.
-
-The `run_pipeline.sh` script calls a Snakemake workflow (`workflows/Snakefile`). This is the core of the analysis and does the following:
-
-- It uses a conda environment so it shas the Python libraries and other software it needs to run. This information is from the `env/config.sample.yaml` file.
-- Next, it gets the count file names from a specific directory and stores them in a list, along with the cell line names extracted from the file names.
-- It defines a rule named `all` which specifies the final output files that the workflow should produce. The expand function is used to generate multiple file paths by substituting the `{cell_line}` wildcard with each cell line name in the `cell_line_list`. `{cell_line}` is a wildcard that gets replaced with actual cell line names when the workflow is run. This allows the same rule to be used for multiple cell lines.
-- Next there's a series of steps defined by each of these "rules". Each of these steps has their input, output, and separate conda environment, parameters, log file, and shell command to execute that need to be specified.
-
-Each step has these settings (I've described in plain speak what these are generally for).
-```
-input:
-    "This builds together the input file name using the wildcards specified at the start of the file"
-output:
-    "This specifies where the output results files should be stored"
-conda:
-    "This tells us where the conda environment file is for this step so we have the packages we need"
-params:
-    "This is defining the files and folders and other parameters"
-log:
-    "Where the log should be stored"
-shell:
-    "A shell command to be run that has the wildcards that are defined above -- this is what is doing the work"
-```
-### Core steps in the workflow:
-
-In the snakefile you can see where this is called, but if you want to see what is happening in the actual step you need to look at the corresponding Rmd file in the `scripts` folder.
-
-### `scripts/pgRNA_counts_QC.Rmd`
-
-This Rmd runs QC and applies a low count filter
-
-  - It makes a cummulative distribution function
-  - Prints out the Counts per million (CPM) per sample
-  - Does a sample to sample correlation
-  - Flags samples that maybe don't have enough counts for the plasmid
-    - This requries a cutoff to be set -- how low is too low?
-  - Then prints out what was removed.
-- `get_pgRNA_annotations` - This annotates the data
-  - It grabs Entrez Ids and Ensembl annotation
-  - Grabs Copy Number and Transcript per Millions data from a cancer dependency dataset using [depMap](https://bioconductor.org/packages/release/data/experiment/vignettes/depmap/inst/doc/depmap.html#1_introduction)
-  - It labels genes as `negative_control`, `positive_control`, `single_targeting` or `double_targeting`.
-
-### `scripts/calculate_LFC.Rmd`
-
-This Rmd calculates log fold change and makes heatmaps
-
-  - Does filtering based on annotations
-  - Uses custom plotting functions to make heatmaps and violin plots
-  - Calculates  SSMD?
-  - Investigates correlations across replicates
-  - How log fold changed is normalized:
-    - `Take LFC, then subtract median of negative controls. This will result in the median of the nontargeting being set to 0. Then, divide by the median of negative controls (double non-targeting) minus median of positive controls (targeting 1 essential gene). This will effectively set the median of the positive controls (essential genes) to -1.`
-    - `Since the pgPEN library uses non-targeting controls, we adjusted for the fact that single-targeting pgRNAs generate only two double-strand breaks (1 per allele), whereas the double-targeting pgRNAs generate four DSBs. To do this, we set the median (adjusted) LFC for unexpressed genes of each group to zero.`
-  - Does different handling for single level targeting versus double level targeting
-  - Calculates target level values
-
-### `scripts/calculate_GI_scores.Rmd`
-
-This Rmd calculates Genetic Interaction scores
-
-  - Calculates CRISPR mean score and handles single versus double targeted genes differently
-  - Creates a linear model
-  - Plots the GI scores
-  - Lastly a a Wilcoxon rank-sum test and t tests are performed using the calculated GI scores
-
-### Utils
-
-Additionally there is a script that has some util functionality that other steps borrow from:`scripts/shared_functions_and_variables.R`.
-
-
-### Original README is below:
-
-* include background on file name formatting
-* define the goals of the package, include a figure
-* write out all requirements for formatting files, etc. (or make a readthedocs?)
-
 ## To Do
 
 ### Phoebe