Deep MSA and Statistical Coupling Analysis

Belen Sundberg edited this page Aug 23, 2023 · 6 revisions

1. Compile a database of homologous sequences

All of the scripts I used are available in /ifs/scratch/home/mm6732/. Update the file paths in each script to match your own setup before running.

A. Run phmmer

Download and install HMMER, which includes phmmer.
phmmer compares a query protein sequence against a database of protein sequences. The database used here is stored on the server at /ifs/data/glab/uniref90/uniref90.fasta
Run phmmer -o output.txt query_protein.fasta /path/to/database, which takes an amino acid sequence in FASTA format and writes a HMMER text file of ranked homologs.
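The default phmmer report is meant for reading, so a script that needs the hit list is easier to write against the tabular output that phmmer produces with its --tblout option. The sketch below is only an illustration, not one of the scripts in the directory above: the function names and the hits.tbl file name are hypothetical, and it assumes the standard tblout layout in which the first whitespace-separated field is the target name and the fifth is the full-sequence E-value.

```python
def build_phmmer_command(query_fasta, database_fasta, out_txt, tbl_txt):
    """Return the phmmer argument list; pass it to subprocess.run(cmd, check=True)."""
    return ["phmmer", "-o", out_txt, "--tblout", tbl_txt, query_fasta, database_fasta]


def parse_tblout(lines):
    """Return (target_name, full_sequence_evalue) for each hit, in file order."""
    hits = []
    for line in lines:
        # Comment lines start with '#'; blank lines carry no hit.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append((fields[0], float(fields[4])))
    return hits
```

With phmmer on the PATH, the command list can be run with subprocess.run, and the resulting table read back with parse_tblout(open("hits.tbl")).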

B. Filter homologs by Enzyme Commission (EC) number. This makes sure all of your homologs catalyze the same reaction.

The phmmer output contains only accession numbers, not EC numbers or sequences, so you will need to map each accession number to its EC number.
Start by making a copy of the phmmer output text file and adding a column for EC numbers, obtained by mapping the accession numbers against the UniRef database. [accession_to_ec.py]
Then filter the original phmmer output text file to keep only the accession numbers that map to the EC number of your protein. [filter_phmmer_ec.py]
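The filtering logic in this step can be sketched as follows. This is not the code in accession_to_ec.py or filter_phmmer_ec.py. It assumes you already have a dictionary from accession number to a set of EC numbers (how that mapping is built is not shown here), and that hit names may carry a "UniRef90_" prefix. The accessions in the usage example are placeholders.

```python
def strip_uniref_prefix(hit_id, prefix="UniRef90_"):
    """Turn 'UniRef90_ABC123' into 'ABC123'; leave other identifiers unchanged."""
    return hit_id[len(prefix):] if hit_id.startswith(prefix) else hit_id


def filter_hits_by_ec(hit_ids, accession_to_ec, target_ec):
    """Keep hits whose accession maps to target_ec, preserving the phmmer ranking."""
    kept = []
    for hit_id in hit_ids:
        ec_numbers = accession_to_ec.get(strip_uniref_prefix(hit_id), set())
        if target_ec in ec_numbers:
            kept.append(hit_id)
    return kept


# Placeholder accessions; 2.7.1.11 is the EC number of 6-phosphofructokinase.
mapping = {"ACC001": {"2.7.1.11"}, "ACC002": {"2.7.1.90"}}
hits = ["UniRef90_ACC001", "UniRef90_ACC002", "UniRef90_ACC003"]
kept = filter_hits_by_ec(hits, mapping, "2.7.1.11")
```

Hits with no EC annotation (ACC003 here) are dropped along with hits that map to a different EC number.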

C. Convert the phmmer text file to a FASTA file with sequences

phmmer_to_fasta.py
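As a rough illustration of what this conversion involves (not the contents of phmmer_to_fasta.py), the sketch below streams a FASTA database and keeps the records whose identifier is in the filtered hit list. It assumes the identifier is the first whitespace-delimited token of each header line. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_hits(fasta_lines, wanted_ids):
    """Yield only the records whose first header token is in wanted_ids."""
    wanted = set(wanted_ids)
    for header, seq in read_fasta(fasta_lines):
        fields = header.split()
        if fields and fields[0] in wanted:
            yield header, seq


def write_fasta(records, handle, width=60):
    """Write (header, sequence) records with sequence lines wrapped at width."""
    for header, seq in records:
        handle.write(">" + header + "\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

Because read_fasta is a generator, a large database such as uniref90.fasta can be passed as an open file handle without loading it into memory.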

D. Filter for unique sequences

checkuniquespecies.py
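One simple reading of "unique sequences" is to drop records whose sequence is identical to one already seen. The sketch below does only that. It is an assumption about the intent, not a description of checkuniquespecies.py, whose name suggests it may also consider the source species.

```python
def unique_by_sequence(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Placeholder records: the second sequence duplicates the first.
records = [("seqA", "MKVLAA"), ("seqB", "mkvlaa"), ("seqC", "MKVLAG")]
deduplicated = unique_by_sequence(records)
```

Keeping the first occurrence preserves the phmmer ranking, so the retained copy of each sequence is the one attached to the best-scoring hit.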

2. Make an MSA

Install MAFFT.
Command-line instructions for running MAFFT, including the available alignment algorithms, are on the MAFFT website. L-INS-i tends to be faster than E-INS-i.
Example using L-INS-i (which the MAFFT manual defines as --localpair --maxiterate 1000) and all 128 server threads: mafft --thread 128 --localpair --maxiterate 1000 pfk.fasta > pfk.aln
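Before passing the alignment on, it can be worth checking that the output is a well-formed MSA. The sketch below is an optional sanity check rather than part of the original workflow: it reads an aligned FASTA file, confirms that every row has the same length, and reports the fraction of gap characters per sequence. The function names are hypothetical.

```python
def read_alignment(lines):
    """Return (names, aligned_sequences) from aligned-FASTA lines."""
    names, parts = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            names.append(line[1:])
            parts.append([])
        elif parts:
            parts[-1].append(line)
    return names, ["".join(p) for p in parts]


def alignment_summary(names, seqs):
    """Return (alignment_width, {name: gap_fraction}); raise if rows differ in length."""
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("not a valid alignment: row lengths are %s" % sorted(lengths))
    width = lengths.pop()
    if width == 0:
        raise ValueError("alignment has zero columns")
    gap_fraction = {n: s.count("-") / width for n, s in zip(names, seqs)}
    return width, gap_fraction
```

For example, names, seqs = read_alignment(open("pfk.aln")) followed by alignment_summary(names, seqs) gives the number of alignment columns and highlights sequences that are mostly gaps.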

3. Statistical Coupling Analysis

Follow the installation, processing, and calculation steps on the pySCA website: https://ranganathanlab.gitlab.io/pySCA/install/
My code for visualizing and bootstrapping SCA data is on the server at share/PFK_Project/melody/pySCA/data