proteogenomics_python

Python scripts for proteogenomics analysis.

The whole workflow has been automated in a single Nextflow pipeline.

map_peptide2genome.py is a Python script that maps known peptides back to the genome. It needs three input files:
a GTF annotation file
a FASTA file containing protein sequences
an IDmap file that contains gene ID, transcript ID and protein ID

The IDmap file can be downloaded with the Ensembl BioMart tool. See IDmap_file_example.txt.
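A minimal sketch of how such an ID map could be loaded, in case it helps to check a BioMart export before running the script. It assumes a tab-delimited file with one header line and the gene, transcript and protein IDs in the first three columns; that layout, the function name `load_idmap` and the example IDs are illustrative assumptions, so compare against IDmap_file_example.txt.

```python
import csv
import io


def load_idmap(handle):
    """Return {protein_id: (gene_id, transcript_id)} from a tab-delimited ID map.

    Assumed layout (not confirmed by the repository): one header line, then
    gene ID, transcript ID and protein ID in the first three columns.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the assumed header line
    idmap = {}
    for row in reader:
        # Rows without a protein ID (e.g. non-coding transcripts) are skipped
        if len(row) < 3 or not row[2]:
            continue
        gene_id, transcript_id, protein_id = row[:3]
        idmap[protein_id] = (gene_id, transcript_id)
    return idmap


# Made-up IDs, for illustration only
example = io.StringIO(
    "Gene stable ID\tTranscript stable ID\tProtein stable ID\n"
    "ENSG00000000001\tENST00000000001\tENSP00000000001\n"
    "ENSG00000000002\tENST00000000002\t\n"
)
idmap = load_idmap(example)
```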

python map_peptide2genome.py --input input_filename --gtf Homo_sapiens.GRCh37.ensembl87.gtf --fasta Homo_sapiens.GRCh37.ensembl87.pep.all.fa --IDmap Ensembl87_IDlist.txt --output output_filename

--input: peptide sequence in first column, protein accession in second column
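The idea behind mapping a peptide to the genome can be sketched as follows: locate the peptide in its protein, convert the amino acid offsets to CDS nucleotide offsets, then walk through the CDS segments taken from the GTF. This is only an illustration under stated assumptions, not the script's actual implementation; `peptide_to_genome` and its arguments are hypothetical names, coordinates are 1-based and inclusive as in GTF, and only the first occurrence of the peptide is mapped.

```python
def peptide_to_genome(peptide, protein, cds_segments, strand="+"):
    """Map a peptide to genomic intervals (1-based, inclusive).

    cds_segments: list of (start, end) CDS intervals ordered in the direction
    of transcription (for the '-' strand, highest coordinates first).
    Returns [] if the peptide is not found in the protein.
    """
    index = protein.find(peptide)
    if index == -1:
        return []
    # 0-based CDS offsets of the first and last nucleotide coding the peptide
    cds_start = index * 3
    cds_end = (index + len(peptide)) * 3 - 1
    intervals = []
    consumed = 0  # CDS nucleotides contained in segments already visited
    for seg_start, seg_end in cds_segments:
        seg_len = seg_end - seg_start + 1
        lo = max(cds_start, consumed)
        hi = min(cds_end, consumed + seg_len - 1)
        if lo <= hi:  # the peptide overlaps this segment
            if strand == "+":
                intervals.append((seg_start + lo - consumed,
                                  seg_start + hi - consumed))
            else:
                intervals.append((seg_end - (hi - consumed),
                                  seg_end - (lo - consumed)))
        consumed += seg_len
    return intervals


# Toy protein of 8 residues encoded by two 12-nt CDS segments
plus = peptide_to_genome("AYIA", "MKTAYIAK", [(101, 112), (201, 212)], "+")
minus = peptide_to_genome("AYIA", "MKTAYIAK", [(201, 212), (101, 112)], "-")
```

A peptide that spans an exon junction yields more than one interval, which is why the result is a list.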

3frame_translation.py is a Python script that performs three-frame translation (default: standard genetic code). Example:

python 3frame_translation.py genome.fasta genome.3FT.fasta
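For readers unfamiliar with the operation, three-frame translation can be sketched as below. The sketch assumes the three frames are the forward-strand frames at offsets 0, 1 and 2 and uses the standard genetic code (NCBI translation table 1); the function names are illustrative and the script itself may differ in how it handles stop codons or ambiguous bases.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TCAG
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) give X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def three_frame_translation(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna[offset:]) for offset in range(3)]


frames = three_frame_translation("ATGGCCTAA")
```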

sixframetranslation.py is a Python script that performs six-frame translation and full trypsin digestion in one step. Example:

python sixframetranslation.py --input genome.fasta --output genome.6FT.txt --nuclear_trans_table 1 --mito_trans_table 2 --min_length 8 --max_length 30

--nuclear_trans_table: translation table used for nuclear DNA
--mito_trans_table: translation table used for mitochondrial DNA
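The two steps that distinguish this script from the three-frame case can be sketched as follows: the reverse complement supplies the three additional frames, and each translated frame is cut at stop codons and digested. The sketch assumes that "full" digestion means no missed cleavages and that trypsin cleaves after every K or R without a proline exception; the script's actual rules may differ, and the function names are illustrative.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Reverse complement; translating it gives the three reverse frames."""
    return dna.translate(COMPLEMENT)[::-1]


def tryptic_peptides(translated, min_length=8, max_length=30):
    """Fully digest a translated frame (no missed cleavages).

    '*' marks a stop codon and ends a segment. Cleavage is assumed to occur
    after every K or R, with no proline exception.
    """
    peptides = []
    for segment in translated.split("*"):
        for peptide in re.findall(r"[^KR]*[KR]|[^KR]+$", segment):
            if min_length <= len(peptide) <= max_length:
                peptides.append(peptide)
    return peptides


peptides = tryptic_peptides("MAAAAAAAKGGR*PEPTIDEPEPTIDEK")
```

The default length limits mirror the --min_length 8 and --max_length 30 values in the example command.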

Curation of novel peptides from VarDB search

Map novel peptides back to genome

python map_novelpeptide2genome.py --input novpep.txt --gtf VarDB.gtf --fasta VarDB.fasta --gff_output example_novpeps.gff3 --tab_out example_novpep.hg19cor.txt

The input file novpep.txt must contain two columns named Peptide and Protein, which are required to map the peptides to the genome.
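A quick way to check the header before running the mapping is sketched below. It assumes the input is tab-delimited with the column names in the first line; `missing_columns` is an illustrative helper, not part of the repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("Peptide", "Protein")


def missing_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the header line."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]


ok = missing_columns(io.StringIO("Peptide\tProtein\tScore\nPEPTIDEK\tP1\t0.9\n"))
bad = missing_columns(io.StringIO("Sequence\tProtein\nPEPTIDEK\tP1\n"))
```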

Make fasta file for novel peptides

python to_fasta.py example_novpeps.txt example_novpeps.fasta
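As an illustration of this conversion, the sketch below writes each unique peptide as one FASTA record and uses the peptide sequence itself as the record ID. Both choices are assumptions made for the example; the record naming used by to_fasta.py is not documented here.

```python
def peptides_to_fasta(peptides):
    """Return FASTA text with one record per unique peptide, in input order."""
    seen = set()
    records = []
    for peptide in peptides:
        if peptide in seen:
            continue  # write each peptide only once
        seen.add(peptide)
        # Assumption: the peptide sequence doubles as the record ID
        records.append(f">{peptide}\n{peptide}\n")
    return "".join(records)


fasta_text = peptides_to_fasta(["PEPTIDEK", "SAMPLER", "PEPTIDEK"])
```

Using the sequence as the ID would let the query ID in the BLASTP output be matched back to the peptide table without a separate lookup.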

BLASTP analysis

blastp -db UniProteome+Ensembl87+refseq+GENCODE24.proteins.fasta -query ../example_novpep.fasta -outfmt '6 qseqid sseqid pident qlen slen qstart qend sstart send mismatch positive gapopen gaps qseq sseq evalue bitscore' -num_threads 8 -max_target_seqs 1 -evalue 1000 -out example_novpeps.blastp.out.txt

Parse BLASTP output

python parse_BLASTP_out.py --input example_novpep.hg19cor.txt --blastp_result example_novpeps.blastp.out.txt --fasta UniProteome+Ensembl87+refseq+GENCODE24.proteins.fasta --output example_novpeps.blastp.parsed.txt
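The tabular output requested with -outfmt above can be read as sketched below. The column names are taken from the blastp command; the classification rule and its labels are assumptions made for illustration and are not necessarily the categories the parsing script reports.

```python
BLAST_COLUMNS = (
    "qseqid sseqid pident qlen slen qstart qend sstart send mismatch "
    "positive gapopen gaps qseq sseq evalue bitscore"
).split()
INT_COLUMNS = ("qlen", "slen", "qstart", "qend", "sstart", "send",
               "mismatch", "positive", "gapopen", "gaps")


def parse_blast_line(line):
    """Turn one tab-separated BLASTP line into a dict keyed by column name."""
    hit = dict(zip(BLAST_COLUMNS, line.rstrip("\n").split("\t")))
    for key in INT_COLUMNS:
        hit[key] = int(hit[key])
    return hit


def classify_hit(hit):
    """Hypothetical rule: only ungapped, full-length alignments are labeled."""
    full_length = hit["qstart"] == 1 and hit["qend"] == hit["qlen"]
    if full_length and hit["gaps"] == 0:
        if hit["mismatch"] == 0:
            return "exact match"
        if hit["mismatch"] == 1:
            return "single substitution"
    return "other"


# Made-up hit, for illustration only
line = ("PEPTIDEK\tsp|P00000|TEST\t87.5\t8\t300\t1\t8\t10\t17\t1\t7\t0\t0\t"
        "PEPTIDEK\tPEPTIDER\t0.5\t20.1\n")
hit = parse_blast_line(line)
label = classify_hit(hit)
```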

Annotate loci - annovar

python prepare_annovar_input.py --input example_novpep.hg19cor.txt --output example_novpep_avinput.txt

You need to install ANNOVAR before you can run the next command (see the ANNOVAR download page).

./annotate_variation.pl -out example_novpep -build hg19 example_novpep_avinput.txt humandb/

Parse annovar result

python parse_annovar_out.py --annovar_out example_novpep.variant_function --input example_novpeps.blastp.parsed.txt --output example_novpeps.blastp.annovar.txt

Extract PSMs of novel peptide with single substitution

The peptide sequence column should be named "Peptide"; otherwise, use --peptide_column to specify a different name.

python extract_1mismatch_novpsm.py example_novpeps.blastp.annovar.txt example_novpeps.psms.txt example_novpep_1mismatch.psm.txt
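The extraction step amounts to filtering a PSM table by peptide sequence, which can be sketched as below. The sketch assumes a tab-delimited PSM table with a header line and takes the set of single-substitution peptides as given; how the script derives that set from the annotated peptide table is not shown, and `extract_psms` is an illustrative name.

```python
import csv
import io


def extract_psms(psm_handle, target_peptides, peptide_column="Peptide"):
    """Return the PSM rows whose peptide is in target_peptides."""
    reader = csv.DictReader(psm_handle, delimiter="\t")
    return [row for row in reader if row[peptide_column] in target_peptides]


# Made-up PSM table, for illustration only
psm_table = io.StringIO(
    "SpecID\tPeptide\tScore\n"
    "scan1\tPEPTIDEK\t0.91\n"
    "scan2\tSAMPLER\t0.85\n"
    "scan3\tPEPTIDEK\t0.77\n"
)
selected = extract_psms(psm_table, {"PEPTIDEK"})
```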

Run SpectrumAI

SpectrumAI is a separate tool and must be downloaded before this step.

Parse SpectrumAI result

python parse_spectrumAI_out.py --spectrumAI_out specAI_file --input example_novpeps.blastp.annovar.txt --output output_filename

Validation with orthogonal evidence

Calculate conservation scores

python calculate_phastcons.py novel_peptides.gff3 hg19.100way.phastCons.bw novpeps.phastcons.scores.txt

Predict PhyloCSF coding potential

python calculate_phylocsf.py novel_peptides.gff3 file_path_to_bigwig novpeps.phyloCSF.scores.txt

Count supporting reads for novel peptides in BAM files

python scan_bams.py --input_gff novel_peptides.gff3 --bam_files bam_files_list.txt --output novelpep_readcount.txt

Build mutant protein and peptide sequence DB from COSMIC

sftp 'your_email_address'@sftp-cancer.sanger.ac.uk

sftp> get cosmic/grch38/cosmic/v85/CosmicMutantExport.tsv.gz
sftp> get cosmic/grch38/cosmic/v85/All_COSMIC_Genes.fasta.gz
sftp> exit

python convertCOSMIC2_mutant_protein.py All_COSMIC_Genes.fasta CosmicMutantExport.tsv Cosmic_v85_mutant_protein.fasta
python digest_mutant_protein.py All_COSMIC_Genes.fasta Cosmic_v85_mutant_protein.fasta Cosmic_v85_mutant_peptides.fasta
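The core of building a mutant protein sequence can be sketched for the simplest case, a missense substitution written in the p.V600E style. The sketch only illustrates the idea: it assumes that notation for the amino acid change, ignores every other mutation type, and uses illustrative names that are not taken from the conversion script.

```python
import re

# Matches single amino acid substitutions such as p.V600E
MISSENSE = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def apply_missense(protein, mutation):
    """Return the mutant protein, or None if the mutation cannot be applied."""
    match = MISSENSE.match(mutation)
    if match is None:
        return None  # not a simple substitution (frameshift, deletion, ...)
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    # Positions are 1-based; the reference residue must agree with the protein
    if pos < 1 or pos > len(protein) or protein[pos - 1] != ref:
        return None
    return protein[:pos - 1] + alt + protein[pos:]


mutant = apply_missense("MKVLA", "p.V3E")
```

Checking the reference residue guards against applying a mutation to a protein sequence from a different transcript or release.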

