Bash and R scripts to analyze the protein and genomic context of sites of amino acid substitution, as provided by Dr. Shiri Tsour.
All figures and data tables for raim's part
of the analysis can be generated by running the blast-based mapping
pipeline in run_blast.sh
and the analysis pipeline
in run_analysis.sh
.
Figures only: If you are only interested in the downstream
analysis and generation of published figures, you just need to
download
the additionalData.zip
(step 3 below), and run
the R scripts called from run_analysis.sh
(step
5) in your favorite R environment with these five required
packages. Simply adjust the proj.path
variable
in raas_init.R
, the directory where you unpacked
additionalData.zip
.
To run the full mapping and analysis pipeline you need to follow these steps:
- Make sure all software dependencies are met,
Bash scripts should run on a standard Linux terminal (cat
, grep
,
cut
, awk
, gunzip
, rsync
), and additionally require NCBI's
blast (we used version 2.15.0+) and a recent R version (>3.6.3) to
be installed. For R, see install_packages.R
.
On Debian/Ubuntu this may work, but likely comes with a different version of blast:
sudo apt-get update
sudo apt install ncbi-blast+
sudo apt install r-base
## install five R packages
R --vanilla < install_packages.R
- Choose a location where you want to store input and output data, and tell the bash terminal about it:
export DECODE=/home/raim/data/decode
mkdir $DECODE
where you replace the path by your favored location.
- Download the
additionalData.zip
(described below) and unpack it into this new directory$DECODE
:
unzip <DOWNLOADPATH>/additionalData.zip -d $DECODE
- Run the bash script that blasts all BP/SAAP data, and collects various structural information for the best-matching protein sites, and its transcript and genome coordinates.
./run_blast.sh
and follow instructions, if errors occur. Note, that log files are
generated in $DECODE/log
. Check these, if unsure what the errors are. At
this point, some QC figures were generated in $DECODE/figures
. All
generated data is in $DECODE/processedData
.
NOTE, that the last step in this script additionally requires a
large external data base to generate
saap_mapped.tsv
. This external data
is generated by the script setup.sh
in the project folder
data/mammary/
of the https://gitlab.com/raim/genomeBrowser/ project
(release RAAS_preprint
). Since this involves many steps and high
performance computing, we provide saap_mapped.tsv
with the
additionalData.zip
.
If you do want to re-run the full pipeline, you need to generate this
input data or modify the R script
map_peptides.R
, and the downstream analysis
scripts to not use all data. Many analyses do not require all data. To
use a freshly generated saap_mapped.tsv
in downstream analyses (next
step), you need to copy it from the processedData to the
additionalData folder, overwriting the downloaded original, or adjust
the path for this file in raas_init.R
.
- Run the bash script that analyses data and generates results plots
./run_analysis.sh
or alternatively open the R scripts called from this bash script in
Rstudio, make sure data is available as above, and modify the main
input/output path in the file raas_init.R
.
The archive additionalData.zip
contains the following files, all required at different stages of the blast pipeline or the analysis.
- Tab-delimited and unfiltered versions of three official supplemental
data tables; these include BP/SAAP from immunoglobulin genes and
other artefacts:
All_SAAP_TMTlevel_quant_df.txt.gz
(Supplemental_Data_3.SAAP_precursor_quant.xlsx
),All_SAAP_patient_level_quant_df.txt.gz
(Supplemental_Data_4.SAAP_reporter_quant.xlsx
),All_SAAP_protein_filter_df.txt.gz
(Supplemental_Data_2.SAAP_proteins.xlsx
),
saap_mapped.tsv.gz
: The main blast-based mapping file that holds various collected information for each unique BP/SAAP pair. The columns of this file are detailed below.- Three Supplemental Data tables from other publications:
- Mathieson et
al. 2018,
Supplementary Data 2: protein half-lives,
41467_2018_3106_MOESM5_ESM.xlsx
, - Wu et al. 2019, Figure 1
source data 2: codon stability coefficients,
elife-45396-fig1-data2-v2.csv.gz
, - McCormick et
al. 2024, Supplementary Table 1: psi sites in six cell lines, as provided by the authors,
six_cell_lines_minimal.xlsx
.
- Mathieson et
al. 2018,
Supplementary Data 2: protein half-lives,
- Human genome data downloaded and generated by scripts in
https://gitlab.com/raim/genomeBrowser, project path
data/mammary
, and releaseRAAS_preprint
; see scriptdata/mammary/setup.sh
, which calls all download and processing steps:- Various Ensembl <-> refseq/uniprot/synonym mappings,
uniprot_ensembl.dat.gz
,uniprot_name.dat.gz
,gene_synonyms.tsv.gz
,ensembl_refseq_20240528.tsv.gz
,protein_transcript_map.tsv.gz
, - A GOslim term definition file,
goslim.tsv.gz
, - A fasta of all coding regions of all Ensembl transcripts,
coding.fa.gz
, - Codon counts for all Ensembl transcripts,
coding_codons.tsv.gz
, - A chromosome length/name index file,
sequenceIndex.csv.gz
, - A human genome feature table,
features_GRCh38.110.tsv.gz
, - A human genome tRNA table,
codons_GRCh38.tsv.gz
.
- Various Ensembl <-> refseq/uniprot/synonym mappings,
The file saap_mapped.tsv
provides the blast-based re-mapping of base
peptides (BP) to the Ensembl human genome version GCRh38, release 110,
and including proteins with patient-specific single amino acid
substitutions detected by one of two variant calling pipelines.
BP
: "base peptide" as encoded by the genome or by patient-specific transcripts, these were blasted against the Ensembl proteome (GRCh38, release 110) fasta, supplemented with mutated proteins detected by one of two variant calling pipelines.SAAP
: "substituted amino acid peptide", same as BP with one AA difference.protein
: Ensembl protein ID of the main selected blast hit, which includes proteins with patient-specific substitutions, indicated by suffixes (_Q23G
); Note, that these names were shortened to 47 chars but are unique, full original names (from the variant calling pipeline) are available if required.identity
, ...,bitscore
: blast output.ensembl
: the Ensembl protein ID without suffixes.gene
: the Ensembl ID of the gene that encodes for the transcript and protein.transcript
: the Ensembl ID for the transcript encoding for the protein in column ensembl.MANE
: the Ensembl transcript ID for the MANE subset of transcripts associated with the same gene, mostly but not always identical with thetranscript
column. If it differs the blast hit was NOT for the MANE isoform of the same gene.MANE.protein
: the Ensembl ID for the protein encoded by the MANE transcript.numGO
: number of GO terms associated with the gene.match
: classification of the blast hit quality,good
where full-length matches with 100% identity and 0 mismatches, only those should be used for analysis of protein, transcript or genome coordinates. Note, that we were not able to account for mutations detected by the second variant calling pipeline. Doing this should remove thebad
(<100 % identity, >0 mismatches, or not a full length hit) andwrong
(>2 mismatches) blast hits.nogene
: no Ensembl gene could be mapped.IG
: the blast hit is tagged as an immunoglobulin (patternbiotype=IG_.*
in the Ensembl gff3 file); since these are expected to be mutated in individual immune cells but not detected by patient-specific genome or transcript sequencing these ARE EXCLUDED from all analyses.albumin
: contains "albumin" in the gff3 file's description column.globin
: contains "globin" in the gff3 file's description column.extracellular
: the gene is is annotated with the GO termGO:0005576
for "extracellular region".exclude
: outdated tag to exclude the BP/SAAP from analyses, if eitherIG
,albumin
orglobin
is TRUE.name
: human-readable gene name (Name=
tag in the gff3 description column).site
: position of the substituted amino acid in the BP and the SAAP,pos
: position of the substituted amino acid in the protein indicated in columnprotein
(blast hit).len
: length of the protein.rpos
:pos
/len
, relative position of the substituted amino acid in the protein.from
: the encoded amino acid (in the BP).to
: the substituted amino acid (in the SAAP).codon
: the codon coding for the encoded amino acid (from
) in the matching Ensembltranscript
.tpos
: the position of thecodon
(2nd position) in the Ensembl transcript sequence as provided in the fasta file:Homo_sapiens.GRCh38.cdna.all.fa.gz
.chr
,coor
,strand
: the chromosome, chromosome position and strand of thecodon
(2nd position).s4pred
: protein secondary structure prediction by S4Pred at the substituted amino acid; C: coil, E: beta-sheet, H: alpha helix.C.protein
: total number of S4Pred-predicted C in theprotein
.E.protein
: total number of S4Pred-predicted E in theprotein
.H.protein
: total number of S4Pred-predicted H in theprotein
.iupred3
: IUPred3-based disordered score at the substituted amino acid.iupred3.protein
: mean IUPred3 score of the whole protein.anchor2
: ANCHOR2 (via IUPred3) score of disordered protein interaction at the substituted amino acid.anchor2.protein
: mean ANCHOR2 score of the whole protein.MMSeq2
: MMSeq2-based sequence conservation score, via the describePROT database.ASAquick
: Accessible Surface Area score, via the describePROT database.DisoRDPbind
: disordered protein binding score (similar toanchor2
), via the describePROT database.SCRIBER
: protein-binding prediction score, via the describePROT database.flDPnn
: disordered score (similar toiupred3
), via the describePROT database.AA
: -25 to +25 amino acid sequence around the substituted amino acid,-
fills up positions beyond the protein ends, such that the central position still corresponds to the substituted amino acid.NT
: the transcript sequence coding for the amino acids inAA
.pfam
: customhmmer3
-based Pfam domains that overlap with the substituted amino acid.clan
: Pfam clans of the domains in columnpfam
.pfam.ebi
: Pfam domains downloaded from InterPro that overlap with the substituted amino acid.clan.ebi
: Pfam clans of the domains in columnpfam.ebi
.