
Running MPNN with PSSM bias

Andrew Reckers edited this page Oct 12, 2023 · 1 revision

I modified one MPNN file to get this to run, because the out-of-the-box workflow requires the PSSM as an .npz file. Rather than convert the PSSM to .npz, I build pssm.jsonl separately with make_pssm_dict_ARedit.py. All of these files are in projects/andrew.

  1. Make .fasta of the sequences to be included in the PSSM

  2. Make blast db:

  • install psi-blast with conda install -c bioconda blast
  • makeblastdb -in /ifs/scratch/home/arr2230/data/mpnn/ar23_pssm/nh160to168_ar23first.fasta -dbtype prot -parse_seqids -out /ifs/scratch/home/arr2230/data/mpnn/ar23_pssm/db/ar23
    • makes a handful of files that are all part of the db
  3. Make pssm:
  • cd into the folder with the makeblastdb output files, and refer to the created db files by the prefix given to the -out option in the previous step (here, ar23):
  • psiblast -db ar23 -query /ifs/scratch/home/arr2230/data/mpnn/ar23_pssm/ar23.fasta -num_iterations 3 -out_ascii_pssm output.pssm

  4. Parse original pdb into jsonl format

  • using submit_example_parse_chains.sh in helper_scripts, parse the starting pdb into parsed_pdbs.jsonl

  5. Make pssm.jsonl

  • Run helper_scripts/other_tools/make_pssm_dict_ARedit.py
  • inputs:
    • unconverted pssm (maybe still called output.pssm)
    • the parsed pdb in jsonl format from previous step (maybe still called parsed_pdbs.jsonl)
  • output:
    • the converted pssm.jsonl
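
For orientation, here is a minimal sketch of what a PSSM-to-jsonl conversion could look like. It is not the contents of make_pssm_dict_ARedit.py. It assumes the output layout that stock ProteinMPNN expects (per chain: pssm_coef, pssm_bias, pssm_log_odds, each bias/log-odds row having 21 entries) and the ProteinMPNN residue order ACDEFGHIKLMNPQRSTVWYX. The function names, the default coef of 1.0, and the choice to set coef to 0.0 at positions with no observed data are all illustrative assumptions. The main point is that PSI-BLAST columns are in a different residue order than MPNN, so they have to be reordered.

```python
import json

# Assumed ProteinMPNN residue ordering (20 amino acids plus X for unknown).
MPNN_ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def pssm_to_mpnn_entry(psi_alphabet, percents, scores, coef=1.0):
    """Reorder PSI-BLAST PSSM columns into the assumed MPNN alphabet.

    psi_alphabet: the 20 residue letters in the order of the PSSM header row
    percents:     per-position lists of 20 weighted observed percentages
    scores:       per-position lists of 20 log-odds scores
    coef:         hypothetical per-position trust in the PSSM (0.0 to 1.0)
    """
    col = {aa: i for i, aa in enumerate(psi_alphabet)}
    entry = {"pssm_coef": [], "pssm_bias": [], "pssm_log_odds": []}
    for pct_row, score_row in zip(percents, scores):
        total = sum(pct_row)
        bias_row, odds_row = [], []
        for aa in MPNN_ALPHABET:
            if aa not in col:
                # 'X' has no PSSM column, so it gets zero weight.
                bias_row.append(0.0)
                odds_row.append(0.0)
                continue
            if total > 0:
                # Percentages are rounded down, so renormalize to sum to 1.
                bias_row.append(pct_row[col[aa]] / total)
            else:
                # No observed data at this position: fall back to uniform.
                bias_row.append(1.0 / len(col))
            odds_row.append(float(score_row[col[aa]]))
        # Do not trust the PSSM where it carries no information.
        entry["pssm_coef"].append(coef if total > 0 else 0.0)
        entry["pssm_bias"].append(bias_row)
        entry["pssm_log_odds"].append(odds_row)
    return entry


def to_jsonl_line(pdb_name, chain, entry):
    """Serialize one structure as a single JSON line keyed by name, then chain."""
    return json.dumps({pdb_name: {chain: entry}})
```

The pdb_name passed to to_jsonl_line would need to match the name field in the parsed pdb jsonl from the previous step, and the number of positions should match the chain length.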

  6. Run MPNN

  • Run helper_scripts/submit_pssm_ar23.sh
  • inputs:
    • the converted pssm.jsonl
    • folder with reference pdb
  • To run multiple bias values (called pssm_multi), use helper_scripts/run_submit_pssm.sh
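
As a rough illustration of a bias sweep, the sketch below builds one ProteinMPNN command per pssm_multi value. It is not the contents of run_submit_pssm.sh. It assumes the stock protein_mpnn_run.py entry point and its --jsonl_path, --pssm_jsonl, --pssm_multi, --pssm_bias_flag, and --out_folder options; the function name, the output folder naming, and passing a parsed jsonl instead of a pdb folder are assumptions made for the example.

```python
def build_mpnn_commands(jsonl_path, pssm_jsonl, out_root, pssm_multis):
    """Return one argument list per pssm_multi value (nothing is executed)."""
    commands = []
    for multi in pssm_multis:
        commands.append([
            "python", "protein_mpnn_run.py",
            "--jsonl_path", jsonl_path,      # parsed pdb from step 4
            "--pssm_jsonl", pssm_jsonl,      # converted PSSM from step 5
            "--pssm_multi", str(multi),      # 0.0 ignores the PSSM, 1.0 ignores MPNN
            "--pssm_bias_flag", "1",         # turn on the PSSM bias term
            # Hypothetical layout: one output folder per bias value.
            "--out_folder", f"{out_root}/pssm_multi_{multi}",
        ])
    return commands
```

Each returned list can be handed to subprocess.run or written into a cluster submission script, depending on how jobs are launched.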

Side note: how to interpret the PSSM. After the position index and the query residue, each row of output.pssm has 42 numeric columns:

  • Columns 1-20: PSSM (log-odds) scores for the 20 standard amino acids.
  • Columns 21-40: weighted observed percentages for each amino acid at that position, rounded down.
  • Column 41: information content for the position, which reflects how conserved it is. A higher value means the position is more conserved (less variable) across the sequences in the alignment.
  • Column 42: relative weight of gapless real matches to pseudocounts at that position.
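
The column layout above can be read programmatically. The sketch below parses the text of an ASCII PSSM into per-position records. It assumes whitespace-separated rows of 44 tokens (position, query residue, then the 42 numeric columns) and a header row listing the 20 residue letters twice; the function name and dictionary keys are made up for the example.

```python
def parse_ascii_pssm(text):
    """Parse the text of a psiblast -out_ascii_pssm file.

    Returns (alphabet, rows): the residue order from the header row and one
    dict per sequence position.
    """
    alphabet = None
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if alphabet is None:
            # Header row: the 20 residue letters, listed once per 20-column block.
            if len(tokens) == 40 and all(len(t) == 1 and t.isalpha() for t in tokens):
                alphabet = tokens[:20]
            continue
        # Data rows start with an integer position; footer statistics do not.
        if len(tokens) == 44 and tokens[0].isdigit():
            rows.append({
                "position": int(tokens[0]),
                "residue": tokens[1],
                "scores": [int(t) for t in tokens[2:22]],      # columns 1-20
                "percents": [int(t) for t in tokens[22:42]],   # columns 21-40
                "information": float(tokens[42]),              # column 41
                # Column 42: the file header calls this the relative weight
                # of gapless real matches to pseudocounts.
                "weight": float(tokens[43]),
            })
    return alphabet, rows
```

Reading the residue order from the header row, rather than hardcoding it, keeps the score and percentage columns tied to the right amino acid.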
