-
Notifications
You must be signed in to change notification settings - Fork 2
Classify genes
You need two inputs to classify gene sequences:
- a fasta file with the genes to annotate;
- a stag database constructed for the genes that you want to annotate.
The fasta file (let's say unknown_seq.fasta
) that contains the marker genes to annotate; would look like this:
>geneA
ATCTTACGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCA
>geneB
GGATCGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGAGCATTCGT
>geneC
AAGCTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCAA
The stag database should be trained on the same gene families of the sequence to annotate. For example, the 16S gene is a marker gene that is widely used, and there are stag databases available for this marker gene. See Marker genes databases for a full list.
Other marker genes are used for building taxonomic trees and can be used for stag. If you don't find your favorite marker gene in the list of available databases, you can build your own database: Build a STAG database.
Given a stag database test_db.stagDB
, you can annotate the sequences in unknown_seq.fasta
with:
stag classify -d test_db.stagDB -i unknown_seq.fasta
The output is:
sequence taxonomy
geneA d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus
geneB d__Bacteria
geneC d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria
Where there are two columns, the gene and the taxonomy annotation.
You can re-direct the output with -o
. For example, with the call:
stag classify -d test_db.stagDB -i unknown_seq.fasta -o res.tax
The taxonomy annotation results are saved in res.tax
.
Many hmms (which are used at the basis of the stag database) are trained on proteins. It's possible in stag
to align protein sequences and then convert it to a nucleotide alignment that is used for the classification. For this you will need the -p
command.
In this case we need three inputs:
- the nucleotide fasta file,
- the protein fasta file,
- the stag database.
Note that the nucleotide and protein fasta files should have the same number of sequences and the same headers (not necessarily in the same order).
Example gene input (unk_seq.fna
):
>RS_GCF_003862755.1
ATGGCTGCCCAAAAAAATGACGGTGCAAAAACTAAAAAAAGGTTTAGTTTGTTTGGATACTTAAAAGAAACTAAACAAGAATTAAAAAGAGTTACATGGCCGACAAAAAAGGAACTGTTTAAAAACACAAGTATAGTTTTAACTGTAGTTATTTCATGCACTATATTAGTATGGGGTATAGACACTATATTATCTGGTGCACTAGCGTTATTACTAAAATAG
>RS_GCF_003336745.1
ATGTTTACCCGGATTGTTCGCTACTTCCAGGAGGCCCGGGCCGAGCTTGCCCGGGTCACCTGGCCCACGCGGGAGCAGATCGTGGAGGGCACCCAGGCCATTTTGGTTTTCACCGTGGTGTCCATGGTGATCCTGGGGTTCTACGATCTCGTCTTCCGGTTCCTGATAGGGCTTGTGCGATGA
Example protein input (unk_seq.faa
):
>RS_GCF_003862755.1
MAAQKNDGAKTKKRFSLFGYLKETKQELKRVTWPTKKELFKNTSIVLTVVISCTILVWGIDTILSGALALLLK*
>RS_GCF_003336745.1
MFTRIVRYFQEARAELARVTWPTREQIVEGTQAILVFTVVSMVILGFYDLVFRFLIGLVR*
You can then annotate the sequences with:
stag classify -d test_db.stagDB -i unk_seq.fna -p unk_seq.faa
Using the -l
option is possible to print more information. For example, with the same 3 genes as in the basic function, if we call:
stag classify -d test_db.stagDB -i unknown_seq.fasta -l
We obtain:
sequence taxonomy full_taxonomy selected_level prob_from_classifiers prob_per_level n_aligned_characters
geneA d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus d__Bacteria/p__Firmicutes/c__Bacilli/o__Staphylococcales/f__Staphylococcaceae/g__Staphylococcus/s__Staphylococcus simiae 5 1/0.9996797005962708/0.9996797005962708/0.9977143628716445/0.9979136086663344/0.9981659184151056/0.5 0:0.006393431853422304/1:0.005266701312510876/2:0.011690768768250214/3:0.031145855210536134/4:0.08039881801318638/5:0.6854467029792982/6:0.3791790859162732 327
geneB d__Bacteria d__Bacteria/p__Fusobacteriota/c__Fusobacteriia/o__Fusobacteriales/f__Leptotrichiaceae/g__Leptotrichia/s__Leptotrichia massiliensis 0 1/0.5317705490518341/0.5317705490518341/0.5317705490518341/0.8271658064870854/0.8008692773154671/0.5 0:0.8258755469556766/1:0.05530574892561932/2:0.004060494630268627/3:0.0007414667827310939/4:0.004223877043888062/5:0.02033437410752632/6:0.02010865554565433 318
geneC d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria d__Bacteria/p__Proteobacteria/c__Gammaproteobacteria/o__SZUA-21/f__SZUA-21/g__SZUA-41/s__SZUA-41 sp003233075 2 1/0.822081586782837/0.9339442170905476/0.5/0.5/0.5/0.5 0:0.07472817462386734/1:0.026932445258896206/2:0.43918667728487215/3:0.047259797346965476/4:0.01603450820878486/5:0.0029800040806866957/6:0.0030950123752895753 325
Let's analyse the result for the first gene, by column:
1. sequence
: name of the gene:
geneA
2. taxonomy
: Assigned taxonomy
d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus
3. full_taxonomy
: Taxonomy till the species level:
d__Bacteria/p__Firmicutes/c__Bacilli/o__Staphylococcales/f__Staphylococcaceae/g__Staphylococcus/s__Staphylococcus simiae
4. selected_level
: Where to to stop in the taxonomy classification:
5
In this case, 5 means genus (we start from 0 kingdom to 6 species).
5. prob_from_classifiers
: The probability returned from the node classifiers:
1/0.9996797005962708/0.9996797005962708/0.9977143628716445/0.9979136086663344/0.9981659184151056/0.5
6. prob_per_level
: The probability to select the correct level:
0:0.006393431853422304/1:0.005266701312510876/2:0.011690768768250214/3:0.031145855210536134/4:0.08039881801318638/5:0.6854467029792982/6:0.3791790859162732
Where 0:0.006393431853422304
means that the probability that we should stop at the 0
level is 0.0063
. The highest probability is 0.6854
of level 5
, and that is why selected_level
is 5
and we stop at the genus level.
7. n_aligned_characters
: number of characters that map to internal states of the hmm:
327