-
Notifications
You must be signed in to change notification settings - Fork 1
MMseqs2: Make MSA and analyze taxonomy
Derived from MMseqs2 User Guide
MMseqs2 Make MSA from input fasta:
1. Get input fasta
2. Make query database from query fasta:
a. Mmseqs createdb <inputquery.fasta>
3. Searching:
a. "# The queryDB and queryDB.index files contain the amino acid sequences, while the queryDB_h and queryDB_h.index file contain the FASTA headers. The queryDB.lookup file contains a list of tab separated fields that map from the internal identifier to the FASTA identifiers. # The alignment consists of two steps the prefilter and **alignment. **Search as standard does compute the score only. If you need the alignment information add the option "-a"." (From MMseqs2 github)
b. mmseqs search <targetDB (ie. Uniref90)> tmp --max-seqs 20000 -e 0.1 --num-iterations 3 --db-load-mode 2 -a
c. If you want to just search your query in a database of eukaryotes for example, make a database of just eukaryotes from UniRef90 database:
i. mmseqs filtertaxseqdb <UniRef_EukOnlyDB> --taxon-list 2759
ii. 2759 is the taxonomy id for eukaryotes
iii. Then use this filteredtaxseqdb as your db for making the MSA
4. Compute a MSA for each cluster
a. mmseqs result2msa <i:queryDB> <i:targetDB> <i:resultDB> <o:msaDB>
b. _ Params I used (From Foldit github script):_ mmseqs result2msa <result.a3m> --msa-format-mode 6 --db-load-mode 2 --threads 100 --filter-msa 1 --filter-min-enable 1000 --diff 3000 --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95
5. Convert .a3m to fasta (I used jalview)
MMseqs2 taxonomy:
1. Create Database from your MSA:
a. mmseqs createdb examples/QUERY.fasta queryDB
b. mmseqs createdb msa.fasta seqTaxDB_new
2. Reference MSA database to Taxonomy Database (UniRef90); map taxonomy IDs to MSA
a. mmseqs taxonomy queryDB seqTaxDB taxonomyResult tmp [--tax-linage options] b. mmseqs taxonomy seqTaxDB_new UniRef90 taxonomyResult tmp --tax-lineage 2
i. The mode --tax-lineage 1 will add a column with the full lineage names, prefixed with their short rank (e.g., -_cellular organisms;d_Eukaryota;...;g_Saccharomyces;s_Saccharomyces cerevisiae) and mode --tax-lineage 2 will add a column with the full lineage NCBI taxids (e.g., 131567;2759;...;4930;4932)
3. Convert taxonomyResult to .tsv file
a. mmseqs createtsv queryDB taxonomyResult taxonomyResult.tsv b. mmseqs createtsv seqTaxDB_new ./TaxonomyResults_BS/taxonomyResult taxonomyresult.tsv
4. Make taxonomy report file:
a. mmseqs taxonomyreport seqTaxDB taxonomyResult taxonomyResult_report
b. mmseqs taxonomyreport ../databases/UniRef90 taxonomyResult taxonomyResult_report
c. Use this to make taxonomy tree via Docker or Pavian
5. You can sort the sequences in your MSA by taxonomy (ie. if you only want to look at Eukaryotes). Make a list of sequences from your MSA that are only Euk; add taxonomy info:
a. mmseqs filtertaxdb seqTaxDB taxonomyResult taxonomyResult.virus --taxon-list b. mmseqs filtertaxdb /ifs/scratch/home/cl4354/app/mmseq2/databases/UniRef90 ./TaxonomyResults_BS/taxonomyResult taxonomyResultEuk --taxon-list 2759
6. Convert taxonomyResult to .tsv file
a. mmseqs createtsv queryDB taxonomyResult taxonomyResult.tsv
If you want to label your MSA with taxonomy info from your .tsv taxonomyResult, I wrote a script to do this. Ask for Belen's github
- New member onboarding
- Lab jobs
- Seminar schedules
- How to order
- Group meeting schedule
- Lab notebooks
- Funding opportunities
- Philosophy of science
- Wet lab basics
- Lab safety
- Waste disposal
- Chemical inventory
- -20C inventory
- Molecular biology
- Buffers and reagents
- Protocols library
- DNA synthesis and primers
- 80C freezer organization
- Using server
- C2B2 HPC access
- Update lab website
- Cluster parallel processing
- Mercury at CUIMC
- Getting started with Rosetta
- Install Pyrosetta
- Tutorials
- Clone Github
- Gromacs-Tutorial
- Cluster Specs
- Deep MSA and Statistical Coupling Analysis
- MMseqs2: Make MSA and analyze taxonomy
- Useful tools