gimap - Genetic Interaction MAPping for dual target CRISPR screens

Background on paired guide CRISPR

Some genes have "backup copies" - these are called paralog genes. Think of it like having a spare tire in your car - if one fails, the other can often do a similar job. This redundancy makes it tricky to study what these genes do, because if you knock out just one gene, its partner might pick up the slack.

CRISPR allows genes to be knocked out so we can see what their function is. But because of paralog redundancy it can be hard to parse out what genes are actually involved. So instead of just targeting one gene at a time, paired guide CRISPR screening allows us to knock out two genes simultaneously. It's particularly useful for understanding:

  • What happens when you disable both backup copies of a gene
  • How different genes might work together
  • Which gene combinations are essential for cell survival

What is gimap?

gimap is a software tool that helps make sense of paired CRISPR screening data. Here's what it does:

  1. Takes data from paired CRISPR screens that has been pre-processed by the pgmap software, or any counts table of paired gRNA reads
  2. The input data will have cell counts for how well cells grow (or don't grow) when different genes or pairs of genes are disabled
  3. gimap takes this data and helps identify interesting patterns, like:
    • When disabling two genes together is more devastating than you'd expect from disabling them individually (called synthetic lethality)
    • When genes work together cooperatively
    • Which gene combinations might be important in diseases like cancer

gimap can help find meaningful patterns in complex genetic experiments. It's particularly focused on analyzing data generated by screening cells with the Berger Lab paired gRNA CRISPR screening library, called pgPEN (paired guide RNAs for genetic interaction mapping).

The gimap package is based on the code and research from the Berger Lab.

pgPEN library design

The gimap package is based on an original paired CRISPR knockout library design from the Berger Lab.

There are four target types included in the custom Berger lab pgPEN library:

  • double_targeting - these cells have two different genes that have been knocked out with pgRNAs. This is sometimes noted as "gene_gene"
  • single_targeting - these cells have one gene that has been knocked out with pgRNA and another pgRNA that is designed to not target any gene. This includes "gene_ctrl" and "ctrl_gene".
  • positive_control - these cells have one essential control gene that has been knocked out with pgRNA and another pgRNA that is designed to NOT target any genes. These are also noted as "gene_ctrl" and "ctrl_gene", but here the gene is an essential gene, i.e. one required in most cell lines for survival.
  • negative_control - these cells have two pgRNAs designed to NOT target any genes. This is noted as "ctrl_ctrl".

For a single gene pair, e.g. geneA_geneB, there are 32 different constructs related to it: 16 single targeting and 16 double targeting.

  • There are 16 double targeting constructs: geneA_geneBpg1, geneA_geneBpg2, ... geneA_geneBpg16.
    • 4 unique targeting sequences for gene A * 4 unique targeting sequences for gene B = 16 unique combos of double targeting constructs.
  • There are 16 single targeting constructs in relation to this geneA_geneB pair: the geneA_ctrl and ctrl_geneB sequences.
    • 4 unique targeting sequences ("geneA") * 2 non targeting sequences ("ctrl") = 8 constructs
    • 2 non targeting sequences ("ctrl") * 4 unique targeting sequences ("geneB") = 8 constructs
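The construct counts above can be checked with a short base R sketch. The guide labels (geneA_g1, nt1, and so on) are hypothetical placeholders, not the actual pgPEN sequence IDs.

```r
# Hypothetical guide labels; the real pgPEN sequence IDs are not shown here.
geneA_guides <- paste0("geneA_g", 1:4)  # 4 targeting sequences for gene A
geneB_guides <- paste0("geneB_g", 1:4)  # 4 targeting sequences for gene B
ctrl_guides  <- paste0("nt", 1:2)       # 2 non-targeting sequences

# Double targeting: every gene A guide paired with every gene B guide
double_targeting <- expand.grid(first = geneA_guides, second = geneB_guides,
                                stringsAsFactors = FALSE)

# Single targeting: gene A guide + control, and control + gene B guide
geneA_ctrl <- expand.grid(first = geneA_guides, second = ctrl_guides,
                          stringsAsFactors = FALSE)
ctrl_geneB <- expand.grid(first = ctrl_guides, second = geneB_guides,
                          stringsAsFactors = FALSE)
single_targeting <- rbind(geneA_ctrl, ctrl_geneB)

n_double <- nrow(double_targeting)  # 4 * 4 = 16
n_single <- nrow(single_targeting)  # 8 + 8 = 16
n_total  <- n_double + n_single     # 32
```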

About Genetic Interaction Scores

The output of the gimap package is a set of genetic interaction scores, each of which is the distance between the observed CRISPR score and the expected CRISPR score. The expected CRISPR score is what we would expect the CRISPR score to be for two unrelated genes. The further an observed CRISPR score is from its expected score, the more we suspect a genetic interaction.

This can be true in a positive way (a CRISPR knockout pair caused more cell proliferation than expected -- called cooperativity) or in a negative way (a CRISPR knockout pair caused more cell lethality than expected -- called synthetic lethality).

The genetic interaction scores are based on a linear model calculated for each sample where observed_crispr_single is the outcome variable and expected_crispr_single is the predictor variable.

For each sample:

lm(observed_crispr_single ~ expected_crispr_single)

Using y = mx+b, we can fill in the following values:

  • y = observed CRISPR score
  • x = expected CRISPR score
  • m = slope from linear model for this sample
  • b = intercept from linear model for this sample

The intercept and slope from this linear model are used to adjust the CRISPR scores for each sample:

single_target_gi_score = observed single crispr - (intercept + slope * expected single crispr)
double_target_gi_score = observed double crispr - (intercept + slope * expected double crispr)
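As a minimal sketch of the adjustment for one sample, the following base R code fits the linear model on simulated single-targeting scores and applies the intercept and slope to both target types. All values are made up, and the variable names for the double-targeting scores are assumptions for illustration; this is not gimap's internal code.

```r
set.seed(1)

# Simulated CRISPR scores for one sample (made-up values, not gimap output)
expected_crispr_single <- rnorm(50)
observed_crispr_single <- 0.1 + 0.9 * expected_crispr_single + rnorm(50, sd = 0.05)
expected_crispr_double <- rnorm(20)
observed_crispr_double <- 0.1 + 0.9 * expected_crispr_double + rnorm(20, sd = 0.05)

# Fit the per-sample linear model on the single-targeting constructs
fit <- lm(observed_crispr_single ~ expected_crispr_single)
intercept <- unname(coef(fit)[1])  # b
slope     <- unname(coef(fit)[2])  # m

# GI score = observed score minus the model's prediction from the expected score
single_target_gi_score <- observed_crispr_single -
  (intercept + slope * expected_crispr_single)
double_target_gi_score <- observed_crispr_double -
  (intercept + slope * expected_crispr_double)
```

Because the model is fit on the single-targeting constructs, their GI scores are simply the model residuals and are centered on zero. The double-targeting GI scores measure departure from that same fitted line.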

These single and double target genetic interaction scores are calculated at the construct level and are then summarized using a t-test to see if the distribution of the set of double targeting constructs is significantly different from the overall distribution of single targeting constructs. After multiple testing correction, FDR values are reported. A low FDR value for a double construct means a high likelihood of paralog redundancy, also known as synthetic lethality.
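The summary step could look roughly like the sketch below. The GI scores are simulated, and two choices are assumptions because the text does not specify them: R's default Welch two-sample t-test, and Benjamini-Hochberg as the multiple testing correction.

```r
set.seed(2)

# Simulated construct-level GI scores (made-up values)
single_gi <- rnorm(200, mean = 0, sd = 0.2)  # all single-targeting constructs
double_gi <- list(
  geneA_geneB = rnorm(16, mean = -1, sd = 0.2),  # simulated negative interaction
  geneC_geneD = rnorm(16, mean = 0,  sd = 0.2)   # simulated no interaction
)

# One t-test per gene pair: its 16 double-targeting constructs
# against the overall single-targeting distribution
p_values <- vapply(double_gi,
                   function(x) t.test(x, single_gi)$p.value,
                   numeric(1))

# Multiple testing correction (Benjamini-Hochberg is an assumption here)
fdr <- p.adjust(p_values, method = "BH")
```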

Expected CRISPR scores

Expected CRISPR scores are calculated based on the pgRNAs included.

For a double_targeting construct we would expect the CRISPR score to be the sum of the single targeting CRISPR scores which use the same gRNA sequences present in the pgRNA.

expected crispr double = single target crispr 1 + single target crispr 2

So for the double construct GSN_SCIN_pg1, we would expect its CRISPR score to be the same as GSN_ctrl + ctrl_SCIN, where the gRNAs used to target GSN and SCIN, respectively, are the same as those in the double construct.

For single_targeting constructs we would expect the CRISPR score to be the single target CRISPR score plus the mean of the control pgRNAs from the negative control constructs (the double non-targeting CRISPR constructs).

expected crispr single = single target crispr 1 + mean negative control for the pgRNA

So for the TRPC5_nt1 single target construct, we would take its CRISPR score + the mean CRISPR score of the double non-targeting constructs that include the nt1 sequence.
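Both formulas can be worked through with toy numbers. The CRISPR scores below are invented for illustration, and the negative control construct names (nt1_nt2 and so on) are hypothetical.

```r
# Made-up CRISPR scores; construct names follow the examples in the text
single_crispr <- c(GSN_ctrl = -0.40, ctrl_SCIN = -0.25)

# Expected score for the double construct GSN_SCIN_pg1:
# sum of the two single-targeting scores that share its gRNAs
expected_double <- single_crispr[["GSN_ctrl"]] + single_crispr[["ctrl_SCIN"]]

# Hypothetical double non-targeting constructs and their CRISPR scores
negative_controls <- data.frame(
  construct = c("nt1_nt2", "nt1_nt3", "nt4_nt1", "nt2_nt3"),
  has_nt1   = c(TRUE, TRUE, TRUE, FALSE),
  crispr    = c(0.02, -0.04, 0.05, 0.30)
)

# Mean score of the negative controls that contain the nt1 sequence
mean_nt1 <- mean(negative_controls$crispr[negative_controls$has_nt1])

# Expected score for the single construct TRPC5_nt1
trpc5_nt1_crispr <- -0.60
expected_single <- trpc5_nt1_crispr + mean_nt1
```

With these numbers, expected_double is -0.65 and expected_single is -0.59.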

Normalization

gimap takes in a counts matrix that represents the number of cells that have each type of pgRNA. This data needs some normalization before CRISPR scores and Genetic Interaction scores can be calculated.

There are four steps of normalization.

  1. Calculate log2CPM - First we account for different read depths across samples and transform data to log2 counts per million reads.
log2(((counts / total counts for sample) * 1 million) + 1)
  2. Calculate log2 fold change - This is done by subtracting the pretreatment log2CPM from the log2CPM of each sample. The pretreatment is day 0 of CRISPR treatment, before the CRISPR pgRNAs have taken effect.
log2FC = log2CPM for each sample - pretreatment log2CPM
  3. Normalize by negative and positive controls - Calculate a negative control median and a positive control median for each sample. Subtract the negative control median from each log2FC, then divide by the difference between the negative and positive control medians.
# FOR EACH SAMPLE:
log2FC adjusted =
(log2FC - log2FC median negative control) /
(log2FC median negative control - median log2FC positive control)
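The steps listed above can be sketched in base R on a toy counts matrix. The counts, construct names, and sample names (day0, rep1, rep2) are made up, and this is an illustration of the formulas rather than gimap's own implementation.

```r
# Toy counts matrix: rows are pgRNA constructs, columns are samples
counts <- matrix(
  c(100, 120, 110,   # negative control
    100,  30,  25,   # positive control
    100,  80,  60,   # single targeting
    100,  20,  10),  # double targeting
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ctrl_ctrl", "essential_ctrl", "geneA_ctrl", "geneA_geneB"),
                  c("day0", "rep1", "rep2"))
)
target_type <- c("negative_control", "positive_control",
                 "single_targeting", "double_targeting")

# Step 1: log2 counts per million, with a pseudocount of 1
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
log2_cpm <- log2(cpm + 1)

# Step 2: log2 fold change relative to the pretreatment (day0) sample
log2_fc <- log2_cpm[, c("rep1", "rep2")] - log2_cpm[, "day0"]

# Step 3: per-sample medians of the negative and positive controls
neg_median <- apply(log2_fc[target_type == "negative_control", , drop = FALSE],
                    2, median)
pos_median <- apply(log2_fc[target_type == "positive_control", , drop = FALSE],
                    2, median)

# (log2FC - negative median) / (negative median - positive median)
log2_fc_adj <- sweep(sweep(log2_fc, 2, neg_median, "-"),
                     2, neg_median - pos_median, "/")
```

After this scaling, the negative control median sits at 0 and the positive control median at -1 in every sample, which puts samples on a comparable scale.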

CRISPR scores

Since the pgPEN library uses non-targeting controls, we adjust for the fact that single-targeting pgRNAs generate only two double-strand breaks (1 per allele), whereas the double-targeting pgRNAs generate four DSBs. To do this, we set the median LFC of each group to zero.

Calculate the medians of the single targeting and double targeting groups and subtract these medians from the adjusted log2FC:

crispr score = log2FC adjusted - median for each target type
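A minimal sketch of this median centering for one sample, using made-up adjusted log2FC values:

```r
# Made-up adjusted log2 fold changes for one sample
log2_fc_adj <- c(-0.2, -0.4, -0.6, -1.0, -1.4, -1.8)
target_type <- c("single_targeting", "single_targeting", "single_targeting",
                 "double_targeting", "double_targeting", "double_targeting")

# ave() returns each group's median, aligned to the original rows
group_median <- ave(log2_fc_adj, target_type, FUN = median)

# crispr score = log2FC adjusted - median for each target type
crispr_score <- log2_fc_adj - group_median
```

Each target type now has a median CRISPR score of zero, so the single and double targeting groups are directly comparable.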

Prerequisites

To run this pipeline you will need R, the gimap package, and its dependencies. In R you can run this to install the package:

install.packages("remotes")
remotes::install_github("FredHutch/gimap")

Getting Started Tutorial

Now you can go to our quick start tutorial to get started!

We also have tutorial examples that show how to run timepoint or treatment experimental setups with gimap.

Follow the steps there, which will walk you through the example data. Then you can tailor that tutorial to use your own data.

Citations:

See metrics about this repository here