Skip to content

Classify genes

Alessio Milanese edited this page Aug 7, 2020 · 19 revisions

Basic function

You need two inputs to classify gene sequences:

  • a fasta file with the genes to annotate;
  • a stag database constructed for the genes that you want to annotate.

The fasta file (let's say unknown_seq.fasta) that contains the marker genes to annotate; would look like this:

>geneA
ATCTTACGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCA
>geneB
GGATCGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGAGCATTCGT
>geneC
AAGCTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCAA

The stag database should be trained on the same gene families of the sequence to annotate. For example, the 16S gene is a marker gene that is widely used, and there are stag databases available for this marker gene. See Marker genes databases for a full list.

Other marker genes are used for building taxonomic trees and can be used for stag. If you don't find your favorite marker gene in the list of available databases, you can build your own database: Build a STAG database.

Given a stag database test_db.stagDB, you can annotate the sequences in unknown_seq.fasta with:

stag classify -d test_db.stagDB -i unknown_seq.fasta

The output is:

sequence	taxonomy
geneA	d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus
geneB	d__Bacteria
geneC	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria

Where there are two columns, the gene and the taxonomy annotation.

You can re-direct the output with -o. For example, with the call:

stag classify -d test_db.stagDB -i unknown_seq.fasta -o res.tax

The taxonomy annotation results are saved in res.tax.

Use protein sequences

Many hmms (which are used at the basis of the stag database) are trained on proteins. It's possible in stag to align protein sequences and then convert it to a nucleotide alignment that is used for the classification. For this you will need the -p command.

In this case we need three inputs:

  • the nucleotide fasta file,
  • the protein fasta file,
  • the stag database.

Note that the nucleotide and protein fasta files should have the same number of sequences and the same headers (not necessarily in the same order). Example gene input (unk_seq.fna):

>RS_GCF_003862755.1
ATGGCTGCCCAAAAAAATGACGGTGCAAAAACTAAAAAAAGGTTTAGTTTGTTTGGATACTTAAAAGAAACTAAACAAGAATTAAAAAGAGTTACATGGCCGACAAAAAAGGAACTGTTTAAAAACACAAGTATAGTTTTAACTGTAGTTATTTCATGCACTATATTAGTATGGGGTATAGACACTATATTATCTGGTGCACTAGCGTTATTACTAAAATAG
>RS_GCF_003336745.1
ATGTTTACCCGGATTGTTCGCTACTTCCAGGAGGCCCGGGCCGAGCTTGCCCGGGTCACCTGGCCCACGCGGGAGCAGATCGTGGAGGGCACCCAGGCCATTTTGGTTTTCACCGTGGTGTCCATGGTGATCCTGGGGTTCTACGATCTCGTCTTCCGGTTCCTGATAGGGCTTGTGCGATGA

Example protein input (unk_seq.faa):

>RS_GCF_003862755.1
MAAQKNDGAKTKKRFSLFGYLKETKQELKRVTWPTKKELFKNTSIVLTVVISCTILVWGIDTILSGALALLLK*
>RS_GCF_003336745.1
MFTRIVRYFQEARAELARVTWPTREQIVEGTQAILVFTVVSMVILGFYDLVFRFLIGLVR*

You can then annotate the sequences with:

stag classify -d test_db.stagDB -i unk_seq.fna -p unk_seq.faa

Print more information about the classification

Using the -l option is possible to print more information. For example, with the same 3 genes as in the basic function, if we call:

stag classify -d test_db.stagDB -i unknown_seq.fasta -l

We obtain:

sequence	taxonomy	full_taxonomy	selected_level	prob_from_classifiers	prob_per_level
geneA	d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus	d__Bacteria/p__Firmicutes/c__Bacilli/o__Staphylococcales/f__Staphylococcaceae/g__Staphylococcus/s__Staphylococcus simiae	5	1/0.9996797005962708/0.9996797005962708/0.9977143628716445/0.9979136086663344/0.9981659184151056/0.5	0:0.006393431853422304/1:0.005266701312510876/2:0.011690768768250214/3:0.031145855210536134/4:0.08039881801318638/5:0.6854467029792982/6:0.3791790859162732
geneB	d__Bacteria	d__Bacteria/p__Fusobacteriota/c__Fusobacteriia/o__Fusobacteriales/f__Leptotrichiaceae/g__Leptotrichia/s__Leptotrichia massiliensis	0	1/0.5317705490518341/0.5317705490518341/0.5317705490518341/0.8271658064870854/0.8008692773154671/0.5	0:0.8258755469556766/1:0.05530574892561932/2:0.004060494630268627/3:0.0007414667827310939/4:0.004223877043888062/5:0.02033437410752632/6:0.02010865554565433
geneC	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria	d__Bacteria/p__Proteobacteria/c__Gammaproteobacteria/o__SZUA-21/f__SZUA-21/g__SZUA-41/s__SZUA-41 sp003233075	2	1/0.822081586782837/0.9339442170905476/0.5/0.5/0.5/0.5	0:0.07472817462386734/1:0.026932445258896206/2:0.43918667728487215/3:0.047259797346965476/4:0.01603450820878486/5:0.0029800040806866957/6:0.0030950123752895753

Let's analyse the result for the first gene, by column:

1. sequence: name of the gene:

geneA

2. taxonomy: Assigned taxonomy

d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus

3. full_taxonomy: Taxonomy till the species level:

d__Bacteria/p__Firmicutes/c__Bacilli/o__Staphylococcales/f__Staphylococcaceae/g__Staphylococcus/s__Staphylococcus simiae

4. selected_level: Where to to stop in the taxonomy classification:

5

In this case, 5 means genus (we start from 0 kingdom to 6 species).

5. prob_from_classifiers: The probability returned from the node classifiers:

1/0.9996797005962708/0.9996797005962708/0.9977143628716445/0.9979136086663344/0.9981659184151056/0.5

6. prob_per_level: The probability to select the correct level:

0:0.006393431853422304/1:0.005266701312510876/2:0.011690768768250214/3:0.031145855210536134/4:0.08039881801318638/5:0.6854467029792982/6:0.3791790859162732

Where 0:0.006393431853422304 means that the probability that we should stop at the 0 level is 0.0063. The highest probability is 0.6854 of level 5, and that is why selected_level is 5 and we stop at the genus level.

Clone this wiki locally