
Welcome to the EPInformer framework repository! EPInformer is a scalable deep learning framework for gene expression prediction by integrating promoter-enhancer sequences with epigenomic signals. EPInformer is designed for three key applications: 1) predict gene expression levels using promoter-enhancer sequences, epigenomic signals, and chromatin contacts; 2) identify cell-type-specific enhancer-gene interactions and conduct in-silico perturbation; 3) predict enhancer activity and recapitulate transcription factor binding motifs from sequences. The framework is described in the following bioRxiv preprint:

https://www.biorxiv.org/content/10.1101/2024.08.01.606099v1

This repository can be used to run the EPInformer model to predict gene expression (e.g., CAGE-seq and RNA-seq) and prioritize enhancer-gene interactions for input DNA sequences and epigenomic signals (e.g., DNase, H3K27ac and Hi-C).

We also provide information and instructions for training different versions of EPInformer on different inputs, including DNA sequence, epigenomic signals and chromatin contacts.

Requirements

EPInformer requires Python 3.6+ and PyTorch (>=2.1). To install PyTorch, follow the official PyTorch installation steps.
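As a quick sanity check, the stated minimums can be compared against the active environment. The snippet below is a minimal sketch, not part of EPInformer: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and it assumes Python 3.8 or newer so that `importlib.metadata` is available.

```python
import sys
from importlib import metadata


def leading_int(piece):
    """Return the integer formed by the leading digits of 'piece' (0 if none)."""
    digits = ""
    for ch in piece:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits) if digits else 0


def parse_version(text):
    """Turn a string such as '2.1.0+cu121' into a (major, minor) tuple."""
    core = text.split("+")[0]  # drop a local build suffix such as '+cu121'
    parts = [leading_int(piece) for piece in core.split(".")[:2]]
    while len(parts) < 2:  # pad '3' to (3, 0) so tuples compare cleanly
        parts.append(0)
    return tuple(parts)


def meets_minimum(found, minimum):
    """Return True when the 'found' version is at least 'minimum'."""
    return parse_version(found) >= parse_version(minimum)


def check_environment(min_python="3.6", min_torch="2.1"):
    """Report whether Python and PyTorch satisfy the minimums stated above."""
    python_found = "{}.{}".format(*sys.version_info[:2])
    report = {"python": meets_minimum(python_found, min_python)}
    try:
        report["torch"] = meets_minimum(metadata.version("torch"), min_torch)
    except metadata.PackageNotFoundError:
        report["torch"] = False  # PyTorch is not installed in this environment
    return report


if __name__ == "__main__":
    print(check_environment())
```

Each entry in the printed dictionary is `True` only when the corresponding requirement is met, so a `False` value for `torch` means PyTorch is either missing or older than 2.1.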

Setup

EPInformer requires ABC enhancer-gene data for training and predicting gene expression. You can obtain the ABC data from ENCODE, or run the ABC pipeline (available on GitHub) to acquire cell-type-specific gene-enhancer links. For the K562 and GM12878 cell lines, you can download the EPInformer training resources from Zenodo by running the following command from the root of the cloned repository:

sh ./download_data.sh

To try the three applications below with EPInformer, first run the following commands to set up the environment:

# Clone this repository
git clone https://github.com/JasonLinjc/EPInformer.git
cd EPInformer

# create 'EPInformer_env' conda environment by running the following:
conda create --name EPInformer_env python=3.8 pandas scipy scikit-learn jupyter seaborn
source activate EPInformer_env

# GPU version of PyTorch (run only one of the two install commands)
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
# CPU version of PyTorch
conda install pytorch cpuonly -c pytorch

# Other packages
pip install pyranges pyfaidx kipoiseq openpyxl tangermeme

1. Gene expression prediction

An end-to-end example of predicting gene expression from promoter-enhancer sequences, epigenomic signals and chromatin contacts is in 1_predict_gene_expression.ipynb. You can run this notebook yourself to experiment with different EPInformer variants.

2. Enhancer-gene links prediction

To prioritize the enhancer-gene links tested by CRISPRi-FlowFISH in K562, we obtain the original data from the supplementary table of the CRISPRi-FlowFISH study. We provide a Jupyter notebook (2_prioritize_enhancer_gene_links.ipynb) for pre-processing the CRISPRi-FlowFISH data and scoring enhancer-gene links using EPInformer-derived attention scores and the Attention-ABC score. Additionally, this notebook provides an end-to-end example of in-silico perturbation of candidate elements within 100kb of KLF1 and prediction of their effects, with KLF1 excluded from the training data to prevent overfitting.
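The general idea of in-silico perturbation can be sketched as follows: remove one candidate element at a time, re-run the predictor, and record the relative change in predicted expression. This is an illustrative sketch only. `toy_predict` is a hypothetical stand-in for a trained model, and the element names and values are made up; the notebook above contains the actual workflow.

```python
def perturbation_effects(predict, promoter, enhancers):
    """Score each enhancer by the relative change in prediction when it is removed."""
    baseline = predict(promoter, enhancers)
    if baseline == 0:
        raise ValueError("baseline prediction is zero; relative effect is undefined")
    effects = {}
    for name in enhancers:
        # Re-predict with this one element left out of the input.
        kept = {k: v for k, v in enhancers.items() if k != name}
        perturbed = predict(promoter, kept)
        effects[name] = (perturbed - baseline) / baseline
    return effects


def toy_predict(promoter, enhancers):
    """Hypothetical model: promoter signal plus contact-weighted enhancer activity."""
    return promoter + sum(e["activity"] * e["contact"] for e in enhancers.values())


if __name__ == "__main__":
    # Made-up candidate elements, for illustration only.
    candidates = {
        "e1": {"activity": 4.0, "contact": 0.5},
        "e2": {"activity": 1.0, "contact": 1.0},
    }
    for name, effect in perturbation_effects(toy_predict, 2.0, candidates).items():
        print(name, round(effect, 3))
```

A more negative effect means that removing the element lowers the predicted expression more, which is the signal used to rank candidate enhancers in this kind of analysis.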

3. Enhancer activity prediction and TF motif discovery

To predict cell-type-specific enhancer activity, we provide sequence-based predictors trained on H3K27ac and DNase signals in K562 and GM12878 cell lines separately. Enhancer activity was calculated using the ABC score. Additionally, Tangermeme was used to perform in-silico saturation mutagenesis (ISM) on the enhancer sequence to identify key motifs contributing to the predicted activity. The notebook (3_predict_enhancer_activity.ipynb) is available for experimenting with enhancer activity prediction and transcription factor motif discovery.
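For readers unfamiliar with ISM, the procedure can be sketched in plain Python: substitute every alternative base at every position, re-score the sequence, and record the change relative to the reference. This sketch does not use the Tangermeme API. `toy_score` is a hypothetical scorer that counts occurrences of an assumed `GATA` motif, standing in for a trained enhancer activity predictor.

```python
ALPHABET = "ACGT"


def saturation_mutagenesis(sequence, score):
    """Return, per position, the score change for each single-base substitution."""
    reference = score(sequence)
    effects = []
    for position, original in enumerate(sequence):
        row = {}
        for base in ALPHABET:
            if base == original:
                continue  # only true substitutions are scored
            mutant = sequence[:position] + base + sequence[position + 1:]
            row[base] = score(mutant) - reference
        effects.append(row)
    return effects


def position_importance(effects):
    """Summarize each position by its largest absolute score change."""
    return [max(abs(delta) for delta in row.values()) for row in effects]


def toy_score(sequence, motif="GATA"):
    """Hypothetical scorer: number of (possibly overlapping) motif occurrences."""
    width = len(motif)
    return sum(sequence[i:i + width] == motif for i in range(len(sequence) - width + 1))


if __name__ == "__main__":
    importance = position_importance(saturation_mutagenesis("CCGATACC", toy_score))
    print(importance)
```

Positions whose substitutions change the score the most mark the bases the predictor depends on; with a real model, contiguous runs of such positions are what get compared against known transcription factor motifs.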

Training

You can re-train EPInformer models on K562 and GM12878 data using the following commands:

# Download K562 and GM12878 data
sh ./download_data.sh

# Train EPInformer-PE on K562 to predict CAGE-seq expression
python train_EPInformer.py --cell K562 --model_type EPInformer-PE --expr_assay CAGE --use_pretrained_encoder --batch_size 16 --cuda

# Train EPInformer-PE-Activity on GM12878 to predict RNA-seq expression
python train_EPInformer.py --cell GM12878 --model_type EPInformer-PE-Activity --expr_assay RNA --use_pretrained_encoder --batch_size 16 --cuda

# Train EPInformer-PE-Activity-HiC on K562 to predict RNA-seq expression
python train_EPInformer.py --cell K562 --model_type EPInformer-PE-Activity-HiC --expr_assay RNA --use_pretrained_encoder --batch_size 16 --cuda
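When launching several training runs, it can be convenient to assemble these command lines programmatically. The helper below is a hypothetical sketch that uses only the flags shown in the commands above; `build_train_command` is not part of the repository. It also assumes that omitting `--cuda` or `--use_pretrained_encoder` simply disables the corresponding behavior, which should be checked against `train_EPInformer.py`.

```python
import shlex
import subprocess


def build_train_command(cell, model_type, expr_assay, batch_size=16,
                        cuda=True, pretrained_encoder=True):
    """Assemble a train_EPInformer.py command using only the flags shown above."""
    command = [
        "python", "train_EPInformer.py",
        "--cell", cell,
        "--model_type", model_type,
        "--expr_assay", expr_assay,
    ]
    if pretrained_encoder:
        command.append("--use_pretrained_encoder")
    command += ["--batch_size", str(batch_size)]
    if cuda:
        # Assumption: leaving this flag out makes the script run on CPU.
        command.append("--cuda")
    return command


# The three runs listed above, as (cell, model_type, expr_assay) tuples.
RUNS = [
    ("K562", "EPInformer-PE", "CAGE"),
    ("GM12878", "EPInformer-PE-Activity", "RNA"),
    ("K562", "EPInformer-PE-Activity-HiC", "RNA"),
]

if __name__ == "__main__":
    for cell, model_type, expr_assay in RUNS:
        command = build_train_command(cell, model_type, expr_assay)
        print(shlex.join(command))
        # Uncomment to launch the run from the repository root:
        # subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues, and `shlex.join` (Python 3.8+) prints a copy-pasteable form of each command for logging.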

Help

Please post in the GitHub issues or e-mail Jiecong Lin (jieconglin(at)@outlook.com) with any questions about the repository, requests for more data, etc.