Skip to content

Classify genes

Alessio Milanese edited this page Aug 6, 2020 · 19 revisions

Basic function

You need two inputs to classify gene sequences:

  • a fasta file with the genes to annotate;
  • a stag database constructed for the genes that you want to annotate.

The fasta file (let's say unknown_seq.fasta) that contains the marker genes to annotate; would look like this:

>geneA
ATCTTACGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCA
>geneB
GGATCGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGAGCATTCGT
>geneC
AAGCTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCAA

The stag database should be trained on the same gene families of the sequence to annotate. For example, the 16S gene is a marker gene that is widely used, and there are stag databases available for this marker gene. See Marker genes databases for a full list.

Other marker genes are used for building taxonomic trees and can be used for stag. If you don't find your favorite marker gene in the list of available databases, you can build your own database: Build a STAG database.

And a stag database test_db.stagDB which was trained on the gene family of interest, you can annotate the sequences in unknown_seq.fasta with:

stag classify -d test_db.stagDB -i unknown_seq.fasta

The output is:

sequence	taxonomy
geneA	d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus
geneB	d__Bacteria
geneC	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria

The stag database can be constructed, or you can check from the list of available

You can re-direct the output with -o. For example, with the call:

stag classify -d test_db.stagDB -i unknown_seq.fasta -o res.tax

The taxonomy annotation results are saved in res.tax.

Use protein sequences

Many hmms (which are used at the basis of the stag database) are trained on proteins. It's possible in stag to align protein sequences and then convert it to a nucleotide alignment that is used for the classification. For this you will need the -p command.

In this case we need three inputs:

  • the nucleotide fasta file,
  • the protein fasta file,
  • the stag database.

Note that the nucleotide and protein fasta files should have the same number of sequences and the same headers (not necessarily in the same order). Example gene input (unk_seq.fna):

>RS_GCF_003862755.1
ATGGCTGCCCAAAAAAATGACGGTGCAAAAACTAAAAAAAGGTTTAGTTTGTTTGGATACTTAAAAGAAACTAAACAAGAATTAAAAAGAGTTACATGGCCGACAAAAAAGGAACTGTTTAAAAACACAAGTATAGTTTTAACTGTAGTTATTTCATGCACTATATTAGTATGGGGTATAGACACTATATTATCTGGTGCACTAGCGTTATTACTAAAATAG
>RS_GCF_003336745.1
ATGTTTACCCGGATTGTTCGCTACTTCCAGGAGGCCCGGGCCGAGCTTGCCCGGGTCACCTGGCCCACGCGGGAGCAGATCGTGGAGGGCACCCAGGCCATTTTGGTTTTCACCGTGGTGTCCATGGTGATCCTGGGGTTCTACGATCTCGTCTTCCGGTTCCTGATAGGGCTTGTGCGATGA

Example protein input (unk_seq.faa):

>RS_GCF_003862755.1
MAAQKNDGAKTKKRFSLFGYLKETKQELKRVTWPTKKELFKNTSIVLTVVISCTILVWGIDTILSGALALLLK*
>RS_GCF_003336745.1
MFTRIVRYFQEARAELARVTWPTREQIVEGTQAILVFTVVSMVILGFYDLVFRFLIGLVR*

You can then annotate the sequences with:

stag classify -d test_db.stagDB -i unk_seq.fna -p unk_seq.faa
Clone this wiki locally