Skip to content

Latest commit

 

History

History

decode_analysis

Analysis of Amino Acid Substitution Site Context

Bash and R scripts to analyze the protein and genomic context of sites of amino acid substitution, as provided by Dr. Shiri Tsour.

All figures and data tables for raim's part of the analysis can be generated by running the blast-based mapping pipeline in run_blast.sh and the analysis pipeline in run_analysis.sh.

Figures only: If you are only interested in the downstream analysis and generation of published figures, you just need to download the additionalData.zip (step 3 below), and run the R scripts called from run_analysis.sh (step 5) in your favorite R environment with these five required packages. Simply adjust the proj.path variable in raas_init.R, the directory where you unpacked additionalData.zip.

To run the full mapping and analysis pipeline you need to follow these steps:

  1. Make sure all software dependencies are met,

Bash scripts should run on a standard Linux terminal (cat, grep, cut, awk, gunzip, rsync), and additionally require NCBI's blast (we used version 2.15.0+) and a recent R version (>3.6.3) to be installed. For R, see install_packages.R.

On Debian/Ubuntu this may work, but likely comes with a different version of blast:

sudo apt-get update
sudo apt install ncbi-blast+
sudo apt install r-base
## install five R packages
R --vanilla < install_packages.R
  1. Choose a location where you want to store input and output data, and tell the bash terminal about it:
export DECODE=/home/raim/data/decode
mkdir $DECODE

where you replace the path by your favored location.

  1. Download the additionalData.zip (described below) and unpack it into this new directory $DECODE:
unzip <DOWNLOADPATH>/additionalData.zip -d $DECODE
  1. Run the bash script that blasts all BP/SAAP data, and collects various structural information for the best-matching protein sites, and its transcript and genome coordinates.
./run_blast.sh

and follow instructions, if errors occur. Note, that log files are generated in $DECODE/log. Check these, if unsure what the errors are. At this point, some QC figures were generated in $DECODE/figures. All generated data is in $DECODE/processedData.

NOTE, that the last step in this script additionally requires a large external data base to generate saap_mapped.tsv. This external data is generated by the script setup.sh in the project folder data/mammary/ of the https://gitlab.com/raim/genomeBrowser/ project (release RAAS_preprint). Since this involves many steps and high performance computing, we provide saap_mapped.tsv with the additionalData.zip.

If you do want to re-run the full pipeline, you need to generate this input data or modify the R script map_peptides.R, and the downstream analysis scripts to not use all data. Many analyses do not require all data. To use a freshly generated saap_mapped.tsv in downstream analyses (next step), you need to copy it from the processedData to the additionalData folder, overwriting the downloaded original, or adjust the path for this file in raas_init.R.

  1. Run the bash script that analyses data and generates results plots
./run_analysis.sh

or alternatively open the R scripts called from this bash script in Rstudio, make sure data is available as above, and modify the main input/output path in the file raas_init.R.

Additional Data

The archive additionalData.zip contains the following files, all required at different stages of the blast pipeline or the analysis.

  • Tab-delimited and unfiltered versions of three official supplemental data tables; these include BP/SAAP from immunoglobulin genes and other artefacts:
    • All_SAAP_TMTlevel_quant_df.txt.gz (Supplemental_Data_3.SAAP_precursor_quant.xlsx),
    • All_SAAP_patient_level_quant_df.txt.gz (Supplemental_Data_4.SAAP_reporter_quant.xlsx),
    • All_SAAP_protein_filter_df.txt.gz (Supplemental_Data_2.SAAP_proteins.xlsx),
  • saap_mapped.tsv.gz: The main blast-based mapping file that holds various collected information for each unique BP/SAAP pair. The columns of this file are detailed below.
  • Three Supplemental Data tables from other publications:
    • Mathieson et al. 2018, Supplementary Data 2: protein half-lives, 41467_2018_3106_MOESM5_ESM.xlsx,
    • Wu et al. 2019, Figure 1 source data 2: codon stability coefficients, elife-45396-fig1-data2-v2.csv.gz,
    • McCormick et al. 2024, Supplementary Table 1: psi sites in six cell lines, as provided by the authors, six_cell_lines_minimal.xlsx.
  • Human genome data downloaded and generated by scripts in https://gitlab.com/raim/genomeBrowser, project path data/mammary, and release RAAS_preprint; see script data/mammary/setup.sh, which calls all download and processing steps:
    • Various Ensembl <-> refseq/uniprot/synonym mappings, uniprot_ensembl.dat.gz, uniprot_name.dat.gz, gene_synonyms.tsv.gz, ensembl_refseq_20240528.tsv.gz, protein_transcript_map.tsv.gz,
    • A GOslim term definition file, goslim.tsv.gz,
    • A fasta of all coding regions of all Ensembl transcripts, coding.fa.gz,
    • Codon counts for all Ensembl transcripts, coding_codons.tsv.gz,
    • A chromosome length/name index file, sequenceIndex.csv.gz,
    • A human genome feature table, features_GRCh38.110.tsv.gz,
    • A human genome tRNA table, codons_GRCh38.tsv.gz.

Columns of saap_mapped.tsv

The file saap_mapped.tsv provides the blast-based re-mapping of base peptides (BP) to the Ensembl human genome version GCRh38, release 110, and including proteins with patient-specific single amino acid substitutions detected by one of two variant calling pipelines.

  • BP: "base peptide" as encoded by the genome or by patient-specific transcripts, these were blasted against the Ensembl proteome (GRCh38, release 110) fasta, supplemented with mutated proteins detected by one of two variant calling pipelines.
  • SAAP: "substituted amino acid peptide", same as BP with one AA difference.
  • protein: Ensembl protein ID of the main selected blast hit, which includes proteins with patient-specific substitutions, indicated by suffixes (_Q23G); Note, that these names were shortened to 47 chars but are unique, full original names (from the variant calling pipeline) are available if required.
  • identity, ..., bitscore: blast output.
  • ensembl: the Ensembl protein ID without suffixes.
  • gene: the Ensembl ID of the gene that encodes for the transcript and protein.
  • transcript: the Ensembl ID for the transcript encoding for the protein in column ensembl.
  • MANE: the Ensembl transcript ID for the MANE subset of transcripts associated with the same gene, mostly but not always identical with the transcript column. If it differs the blast hit was NOT for the MANE isoform of the same gene.
  • MANE.protein: the Ensembl ID for the protein encoded by the MANE transcript.
  • numGO: number of GO terms associated with the gene.
  • match: classification of the blast hit quality, good where full-length matches with 100% identity and 0 mismatches, only those should be used for analysis of protein, transcript or genome coordinates. Note, that we were not able to account for mutations detected by the second variant calling pipeline. Doing this should remove the bad (<100 % identity, >0 mismatches, or not a full length hit) and wrong (>2 mismatches) blast hits.
  • nogene: no Ensembl gene could be mapped.
  • IG: the blast hit is tagged as an immunoglobulin (pattern biotype=IG_.* in the Ensembl gff3 file); since these are expected to be mutated in individual immune cells but not detected by patient-specific genome or transcript sequencing these ARE EXCLUDED from all analyses.
  • albumin: contains "albumin" in the gff3 file's description column.
  • globin: contains "globin" in the gff3 file's description column.
  • extracellular: the gene is is annotated with the GO term GO:0005576 for "extracellular region".
  • exclude: outdated tag to exclude the BP/SAAP from analyses, if either IG, albumin or globin is TRUE.
  • name: human-readable gene name (Name= tag in the gff3 description column).
  • site: position of the substituted amino acid in the BP and the SAAP,
  • pos: position of the substituted amino acid in the protein indicated in column protein (blast hit).
  • len: length of the protein.
  • rpos: pos/len, relative position of the substituted amino acid in the protein.
  • from: the encoded amino acid (in the BP).
  • to: the substituted amino acid (in the SAAP).
  • codon: the codon coding for the encoded amino acid (from) in the matching Ensembl transcript.
  • tpos: the position of the codon (2nd position) in the Ensembl transcript sequence as provided in the fasta file: Homo_sapiens.GRCh38.cdna.all.fa.gz.
  • chr, coor, strand: the chromosome, chromosome position and strand of the codon (2nd position).
  • s4pred: protein secondary structure prediction by S4Pred at the substituted amino acid; C: coil, E: beta-sheet, H: alpha helix.
  • C.protein: total number of S4Pred-predicted C in the protein.
  • E.protein: total number of S4Pred-predicted E in the protein.
  • H.protein: total number of S4Pred-predicted H in the protein.
  • iupred3: IUPred3-based disordered score at the substituted amino acid.
  • iupred3.protein: mean IUPred3 score of the whole protein.
  • anchor2: ANCHOR2 (via IUPred3) score of disordered protein interaction at the substituted amino acid.
  • anchor2.protein: mean ANCHOR2 score of the whole protein.
  • MMSeq2: MMSeq2-based sequence conservation score, via the describePROT database.
  • ASAquick: Accessible Surface Area score, via the describePROT database.
  • DisoRDPbind: disordered protein binding score (similar to anchor2), via the describePROT database.
  • SCRIBER: protein-binding prediction score, via the describePROT database.
  • flDPnn: disordered score (similar to iupred3), via the describePROT database.
  • AA: -25 to +25 amino acid sequence around the substituted amino acid, - fills up positions beyond the protein ends, such that the central position still corresponds to the substituted amino acid.
  • NT: the transcript sequence coding for the amino acids in AA.
  • pfam: custom hmmer3-based Pfam domains that overlap with the substituted amino acid.
  • clan: Pfam clans of the domains in column pfam.
  • pfam.ebi: Pfam domains downloaded from InterPro that overlap with the substituted amino acid.
  • clan.ebi: Pfam clans of the domains in column pfam.ebi.