Skip to content

MMseqs2: Make MSA and analyze taxonomy

Belen Sundberg edited this page Nov 9, 2023 · 6 revisions

Derived from MMseqs2 User Guide

MMseqs2 Make MSA from input fasta:

1. Get input fasta

2. Make query database from query fasta:

a. Mmseqs createdb <inputquery.fasta>

3. Searching:

a. "# The queryDB and queryDB.index files contain the amino acid sequences, while the queryDB_h and queryDB_h.index file contain the FASTA headers. The queryDB.lookup file contains a list of tab separated fields that map from the internal identifier to the FASTA identifiers. # The alignment consists of two steps the prefilter and **alignment. **Search as standard does compute the score only. If you need the alignment information add the option "-a"." (From MMseqs2 github)

b. mmseqs search <targetDB (ie. Uniref90)> tmp --max-seqs 20000 -e 0.1 --num-iterations 3 --db-load-mode 2 -a

c. If you want to just search your query in a database of eukaryotes for example, make a database of just eukaryotes from UniRef90 database:

i. mmseqs filtertaxseqdb <UniRef_EukOnlyDB> --taxon-list 2759

ii. 2759 is the taxonomy id for eukaryotes

iii. Then use this filteredtaxseqdb as your db for making the MSA

4. Compute a MSA for each cluster

a. mmseqs result2msa <i:queryDB> <i:targetDB> <i:resultDB> <o:msaDB>

b. _ Params I used (From Foldit github script):_ mmseqs result2msa <result.a3m> --msa-format-mode 6 --db-load-mode 2 --threads 100 --filter-msa 1 --filter-min-enable 1000 --diff 3000 --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95

5. Convert .a3m to fasta (I used jalview)


MMseqs2 taxonomy:

1. Create Database from your MSA:

a. mmseqs createdb examples/QUERY.fasta queryDB

b. mmseqs createdb msa.fasta seqTaxDB_new

2. Reference MSA database to Taxonomy Database (UniRef90); map taxonomy IDs to MSA

a. mmseqs taxonomy queryDB seqTaxDB taxonomyResult tmp [--tax-linage options] b. mmseqs taxonomy seqTaxDB_new UniRef90 taxonomyResult tmp --tax-lineage 2

i. The mode --tax-lineage 1 will add a column with the full lineage names, prefixed with their short rank (e.g., -_cellular organisms;d_Eukaryota;...;g_Saccharomyces;s_Saccharomyces cerevisiae) and mode --tax-lineage 2 will add a column with the full lineage NCBI taxids (e.g., 131567;2759;...;4930;4932)

3. Convert taxonomyResult to .tsv file

a. mmseqs createtsv queryDB taxonomyResult taxonomyResult.tsv b. mmseqs createtsv seqTaxDB_new ./TaxonomyResults_BS/taxonomyResult taxonomyresult.tsv

4. Make taxonomy report file:

a. mmseqs taxonomyreport seqTaxDB taxonomyResult taxonomyResult_report

b. mmseqs taxonomyreport ../databases/UniRef90 taxonomyResult taxonomyResult_report

c. Use this to make taxonomy tree via Docker or Pavian

5. You can sort the sequences in your MSA by taxonomy (ie. if you only want to look at Eukaryotes). Make a list of sequences from your MSA that are only Euk; add taxonomy info:

a. mmseqs filtertaxdb seqTaxDB taxonomyResult taxonomyResult.virus --taxon-list b. mmseqs filtertaxdb /ifs/scratch/home/cl4354/app/mmseq2/databases/UniRef90 ./TaxonomyResults_BS/taxonomyResult taxonomyResultEuk --taxon-list 2759

6. Convert taxonomyResult to .tsv file

a. mmseqs createtsv queryDB taxonomyResult taxonomyResult.tsv

If you want to label your MSA with taxonomy info from your .tsv taxonomyResult, I wrote a script to do this. Ask for Belen's github

Clone this wiki locally