- General Description
- Dependencies
- Installation via Docker
- Installation from source
- Usage
- Input Files
- Setting parameters
- Output Files
- Contact
- Algorithmic Flow Chart
RNA editing is a source of transcriptomic diversity, mainly in non-coding regions, and is altered in cancer. Recent studies have demonstrated that A-to-I RNA editing events are manifested at the proteomic level and contribute to protein heterogeneity in cancer. Given somatic RNA editing mutations as input, PREP identifies and evaluates the potential immunogenicity of RNA-editing-based peptides. For detailed information, please refer to the citation.
PREP is currently tested on x86_64 with Ubuntu 16.04.
- Python 2.7
- NetMHCpan 4.0
- Variant Effect Predictor (VEP)
- BWA
- STAR
- samtools
- OptiType
- GATK 3.8
- Picard tools
- Java 8
- kallisto
- trimmomatic
- vcftools
- blast
- tabix
- gawk
- yaml
- XGBoost
- biopython
- scikit-learn==0.19.1
- pandas
- numpy
- Pyomo
- tables
- pysam
- future
- multiprocessing
- subprocess
- math
- matplotlib
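Before running the pipeline, it can help to confirm that the dependencies listed above are actually visible from your environment. The following is a minimal sketch, not part of PREP itself. The executable names in `TOOLS` and the import names in `MODULES` are assumptions based on how these packages are commonly installed (for example, biopython is imported as `Bio` and scikit-learn as `sklearn`); adjust them to your setup.

```python
import importlib

try:
    from shutil import which  # Python 3
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2.7

# Assumed executable names; your installation may use different ones.
TOOLS = ["bwa", "STAR", "samtools", "kallisto", "vcftools", "tabix", "gawk", "java"]

# Assumed import names for the Python packages listed above.
MODULES = ["yaml", "xgboost", "Bio", "sklearn", "pandas", "numpy",
           "pyomo", "tables", "pysam", "future", "matplotlib"]


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if which(name) is None]


def missing_modules(names):
    """Return the modules from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Calling `missing_tools(TOOLS)` and `missing_modules(MODULES)` returns the names that still need to be installed; two empty lists mean that everything in the sketch was found.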
Docker image of PREP is at https://hub.docker.com/r/bm2lab/prep/.
- Install Docker on your computer and make sure it works.
- Call `docker pull bm2lab/prep`, which will download the Docker image.
- Run the image in interactive mode with your dataset: `docker run -it -v /your/path/to/dataset/:/home/bioworker/dataset bm2lab/prep /bin/bash`
- Change directory into `/home/bioworker/project/PREP`: `cd /home/bioworker/project/PREP`
- Download reference data: `bash data_download.sh`
- Edit `config.yaml` and fill in the proper paths of the input files.
- Run the program with the following command: `python PREP.py RE -i config.yaml`
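The `docker run` command above mounts your dataset directory at `/home/bioworker/dataset` inside the container, so the paths written into `config.yaml` there must be container paths, not host paths. The sketch below illustrates that mapping. The helper name `to_container_path` is hypothetical and not part of PREP.

```python
import posixpath

# Mount target taken from the `docker run -v` command above.
CONTAINER_ROOT = "/home/bioworker/dataset"


def to_container_path(host_path, host_root, container_root=CONTAINER_ROOT):
    """Map a file under the mounted host directory to its path in the container."""
    rel = posixpath.relpath(posixpath.normpath(host_path),
                            posixpath.normpath(host_root))
    # A relative path that climbs upwards means the file is outside the mount.
    if rel == ".." or rel.startswith("../"):
        raise ValueError("%s is not inside %s" % (host_path, host_root))
    return posixpath.normpath(posixpath.join(container_root, rel))
```

For example, with `/your/path/to/dataset/` as the mounted directory, a file `sample_1.fastq.gz` stored there would be referenced in `config.yaml` as `/home/bioworker/dataset/sample_1.fastq.gz`.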
- Install all software, Python packages and R packages listed above, and make sure each one works on your system.
- Install multiprocessing and other packages with the `pip` command: `pip install -U multiprocessing ...`
- Download or clone the PREP repository to your local system: `git clone https://github.com/bm2-lab/PREP.git`
- Reference data, including the genome FASTA and peptide sequences (GRCh38 build), can be downloaded and processed with `bash data_download.sh`. A few reference files are placed in the `database` folder and processed by a custom script so that the pipeline can run, including:
  - [Fasta] This folder contains the reference FASTA file, its BWA index and some other files resulting from `huamn.fasta`: `Homo_sapiens.GRCh38.dna.primary_assembly.fa`, `Homo_sapiens.GRCh38.dna.primary_assembly.fa.amb`, `Homo_sapiens.GRCh38.dna.primary_assembly.fa.ann`, etc.
  - [Annotation file] This folder contains the VCF files used to run RNAEditor: `1000GenomeProject.vcf`, `HAPMAP.vcf`, `ESP.vcf`, `dbSNP.vcf`, `Mills_and_1000G_gold_standard.indels.hg38.vcf.gz`
  - [Protein] This folder contains the reference cDNA and protein sequences of human: `Homo_sapiens.GRCh38.pep.all.fa`
- Among the required software listed above, GATK, kallisto, Picard, samtools and trimmomatic-0.36 are provided in the software directory. The other software must be installed by the user because of its complexity; please refer to the software links above.
- Fill in the `config.yaml` file with your local paths, and make sure you have installed all of the above software and downloaded the reference data. Note that the version of the VEP library you use should match the references used (peptide and cDNA). For example, the example above uses release 89 of GRCh38.
Run PREP with the following command:
`python PREP.py RE -i config.yaml`
PREP accepts paired-end or single-end RNA sequencing data as input, in `.fastq.gz` or `.fastq` format.
Specify the correct path to each sequencing file in `config.yaml`, for example:
#your path to first RNA-seq fastq file
tumor_rna_fastq_1: ~/ncbi/dbGaP-14145/sra/SRR2673065_1.fastq.gz
#your path to second RNA-seq fastq file
tumor_rna_fastq_2: ~/ncbi/dbGaP-14145/sra/SRR2673065_2.fastq.gz
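A quick sanity check of these entries can catch path mistakes before a long run starts. The sketch below is not part of PREP and works on an already parsed configuration dictionary. It assumes that `tumor_rna_fastq_1` is always required and that `tumor_rna_fastq_2` may be left out for single-end data; how PREP itself treats a missing second file is not stated here. It also expands `~` before checking, since YAML does not do that on its own.

```python
import os

# Key names taken from the config.yaml example above.
FASTQ_KEYS = ("tumor_rna_fastq_1", "tumor_rna_fastq_2")
FASTQ_SUFFIXES = (".fastq.gz", ".fastq")


def check_fastq_entries(config, require_exists=True):
    """Return a list of human-readable problems found in the FASTQ entries."""
    problems = []
    for key in FASTQ_KEYS:
        value = config.get(key)
        if not value:
            # Assumption: only the first FASTQ file is mandatory.
            if key == "tumor_rna_fastq_1":
                problems.append("%s is not set" % key)
            continue
        path = os.path.expanduser(str(value))
        if not path.endswith(FASTQ_SUFFIXES):
            problems.append("%s: %s does not end in .fastq.gz or .fastq" % (key, value))
        elif require_exists and not os.path.isfile(path):
            problems.append("%s: %s does not exist" % (key, value))
    return problems

# Typical use, assuming PyYAML is installed:
#   import yaml
#   with open("config.yaml") as handle:
#       print(check_fastq_entries(yaml.safe_load(handle)))
```

An empty list means that both entries look plausible; otherwise each string names the key and the problem found.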
Users should set all the parameters in the configuration file `config.yaml`. The configuration file contains three groups of parameters:
- Input data parameters, including the path of the RNA sequencing data, the output folder and the run name.
- Executable paths of the software: OptiType, VEP and NetMHCpan.
The output files are the following:
- `final_neoantigen.tsv`: a TSV file containing the extracted mutated peptides derived from RNA editing, each with a quantitative score that measures the immunogenicity of the neoepitope.
The prediction output (`final_neoantigen.tsv`) for each peptide pair consists of the following columns:
Column Name | Description |
---|---|
Position | Mutation position in genome. |
HLA_type | HLA allele name. |
Gene | HUGO symbol name of the mutated gene. |
Transcript_name | Ensembl transcript ID. |
Mutation | Nucleotide change of the mutated gene. |
AA_change | Amino acid change annotated in the VEP file. |
WT_pep | The extracted normal peptide. |
WT_Binding_EL | %Rank of the prediction score for normal peptides using NetMHCpan 4.0 (default model). |
WT_Binding_Rank | %Rank of the prediction score for normal peptides using NetMHCpan 4.0 (-ba model). |
MT_pep | The extracted mutant peptide. |
MT_Binding_EL | %Rank of the prediction score for mutant peptides using NetMHCpan 4.0 (default model). |
MT_Binding_Rank | %Rank of the prediction score for mutant peptides using NetMHCpan 4.0 (-ba model). |
DriverGene_Lable | TRUE if the HUGO symbol is in the COSMIC reference list, FALSE if it is not. |
MT_Binding_level_des | Binding level description of the mutant peptide. |
WT_Binding_level_des | Binding level description of the normal peptide. |
Editing_ratio | RNA editing level of the mutation. |
Neo_score | Immunogenicity score for the RNA editing neoepitope. |
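The columns above can be read with any TSV parser. As a minimal sketch, the following uses Python's standard `csv` module to keep peptides whose `MT_Binding_EL` %Rank is at or below a cutoff and to sort them by `Neo_score`. It assumes that the file header uses exactly the column names in the table, that both columns hold numeric values, and that a higher `Neo_score` indicates a more immunogenic candidate. The 2.0 cutoff is only an illustrative default, and the function names are hypothetical.

```python
import csv


def load_neoantigens(handle):
    """Read final_neoantigen.tsv rows as dictionaries keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def top_candidates(rows, max_el_rank=2.0, limit=10):
    """Keep rows with MT_Binding_EL <= max_el_rank, best Neo_score first."""
    kept = [row for row in rows if float(row["MT_Binding_EL"]) <= max_el_rank]
    # Assumption: a higher Neo_score means a stronger candidate.
    kept.sort(key=lambda row: float(row["Neo_score"]), reverse=True)
    return kept[:limit]

# Typical use:
#   with open("final_neoantigen.tsv") as handle:
#       for row in top_candidates(load_neoantigens(handle)):
#           print(row["Gene"], row["MT_pep"], row["Neo_score"])
```

Because the rows stay plain dictionaries, the other columns (for example `HLA_type` or `Editing_ratio`) remain available for further filtering.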
[email protected] or [email protected]
Biological and Medical Big data Mining Lab
Tongji University