From 720657b890f5db9bae5cbc87667e51ab0a8d0550 Mon Sep 17 00:00:00 2001
From: Candace Savonen
Date: Mon, 25 Nov 2024 14:09:06 -0500
Subject: [PATCH 1/2] Testing readability thing

---
 .github/ISSUE_TEMPLATE | 16 ----------------
 1 file changed, 16 deletions(-)
 delete mode 100644 .github/ISSUE_TEMPLATE

diff --git a/.github/ISSUE_TEMPLATE b/.github/ISSUE_TEMPLATE
deleted file mode 100644
index 17a4e7a..0000000
--- a/.github/ISSUE_TEMPLATE
+++ /dev/null
@@ -1,16 +0,0 @@
-
-
-### Issue Description
-Please provide a brief description of the issue here.
-
-### Inputs
-Please describe your experimental/sequencing strategy (e.g. pgPEN library and screening data, sequenced with an Illumina MiSeq, etc.)
-
-### Your environment
-Describe how you ran the pipeline (e.g. high-performance computing [HPC] cluster, linux/shell interface (Ubuntu, MacOS terminal, Windows powershell, etc.)
-
-### Steps to reproduce
-Please include relevant commands and steps taken. If your counts table was generated outside of the Berger Lab's pgPEN analysis workflow, please also describe the steps taken to pre-process your data.
-
-### Pipeline behavior and error messages
-Please include relevant screenshots or codeblocks with any relevant error messages. If your code ran up to a certain point before encountering an error, please include this information as well.

From c30a9bb6cf16609fc2ffb9bf3fa02b93996bad4d Mon Sep 17 00:00:00 2001
From: Candace Savonen
Date: Tue, 26 Nov 2024 11:03:27 -0500
Subject: [PATCH 2/2] Update README.md

---
 README.md | 81 -------------------------------------------------------
 1 file changed, 81 deletions(-)

diff --git a/README.md b/README.md
index 3c8c8a2..7f4f311 100644
--- a/README.md
+++ b/README.md
@@ -1,86 +1,5 @@
 # GI_mapping
 
-This repository takes paired guide RNA counts data and calculates Genetic Interaction (GI) scores. It also performs some QC and filtering along the way.
-
-### How it can be run
-
-The entire analysis can be re-run on the Fred Hutch clusters if this repository is cloned there and ran from the Berger lab folders using the `run_pipeline.sh` script.
-
-The `run_pipeline.sh` script calls a Snakemake workflow (`workflows/Snakefile`). This is the core of the analysis and does the following:
-
-- It uses a conda environment so it shas the Python libraries and other software it needs to run. This information is from the `env/config.sample.yaml` file.
-- Next, it gets the count file names from a specific directory and stores them in a list, along with the cell line names extracted from the file names.
-- It defines a rule named `all` which specifies the final output files that the workflow should produce. The expand function is used to generate multiple file paths by substituting the `{cell_line}` wildcard with each cell line name in the `cell_line_list`. `{cell_line}` is a wildcard that gets replaced with actual cell line names when the workflow is run. This allows the same rule to be used for multiple cell lines.
-- Next there's a series of steps defined by each of these "rules". Each of these steps has their input, output, and separate conda environment, parameters, log file, and shell command to execute that need to be specified.
-
-Each step has these settings (I've described in plain speak what these are generally for).
-```
-input:
- "This builds together the input file name using the wildcards specified at the start of the file"
-output:
- "This specifies where the output results files should be stored"
-conda:
- "This tells us where the conda environment file is for this step so we have the packages we need"
-params:
- "This is defining the files and folders and other parameters"
-log:
- "Where the log should be stored"
-shell:
- "A shell command to be run that has the wildcards that are defined above -- this is what is doing the work"
-```
-### Core steps in the workflow:
-
-In the snakefile you can see where this is called, but if you want to see what is happening in the actual step you need to look at the corresponding Rmd file in the `scripts` folder.
-
-### `scripts/pgRNA_counts_QC.Rmd`
-
-This Rmd runs QC and applies a low count filter
-
- - It makes a cummulative distribution function
- - Prints out the Counts per million (CPM) per sample
- - Does a sample to sample correlation
- - Flags samples that maybe don't have enough counts for the plasmid
- - This requries a cutoff to be set -- how low is too low?
- - Then prints out what was removed.
-- `get_pgRNA_annotations` - This annotates the data
- - It grabs Entrez Ids and Ensembl annotation
- - Grabs Copy Number and Transcript per Millions data from a cancer dependency dataset using [depMap](https://bioconductor.org/packages/release/data/experiment/vignettes/depmap/inst/doc/depmap.html#1_introduction)
- - It labels genes as `negative_control`, `positive_control`, `single_targeting` or `double_targeting`.
-
-### `scripts/calculate_LFC.Rmd`
-
-This Rmd calculates log fold change and makes heatmaps
-
- - Does filtering based on annotations
- - Uses custom plotting functions to make heatmaps and violin plots
- - Calculates SSMD?
- - Investigates correlations across replicates
- - How log fold changed is normalized:
- - `Take LFC, then subtract median of negative controls. This will result in the median of the nontargeting being set to 0. Then, divide by the median of negative controls (double non-targeting) minus median of positive controls (targeting 1 essential gene). This will effectively set the median of the positive controls (essential genes) to -1.`
- - `Since the pgPEN library uses non-targeting controls, we adjusted for the fact that single-targeting pgRNAs generate only two double-strand breaks (1 per allele), whereas the double-targeting pgRNAs generate four DSBs. To do this, we set the median (adjusted) LFC for unexpressed genes of each group to zero.`
- - Does different handling for single level targeting versus double level targeting
- - Calculates target level values
-
-### `scripts/calculate_GI_scores.Rmd`
-
-This Rmd calculates Genetic Interaction scores
-
- - Calculates CRISPR mean score and handles single versus double targeted genes differently
- - Creates a linear model
- - Plots the GI scores
- - Lastly a a Wilcoxon rank-sum test and t tests are performed using the calculated GI scores
-
-### Utils
-
-Additionally there is a script that has some util functionality that other steps borrow from:`scripts/shared_functions_and_variables.R`.
-
-
-### Original README is below:
-
-* include background on file name formatting
-* define the goals of the package, include a figure
-* write out all requirements for formatting files, etc. (or make a readthedocs?)
-
 ## To Do
 
 ### Phoebe