-
Notifications
You must be signed in to change notification settings - Fork 2
Classify genes
You need two inputs to classify gene sequences:
- a fasta file with the genes to annotate;
- a stag database constructed for the genes that you want to annotate.
The fasta file (let's say unknown_seq.fasta
) that contains the marker genes to annotate; would look like this:
>geneA
ATCTTACGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCA
>geneB
GGATCGTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGAGCATTCGT
>geneC
AAGCTTAGCTACGATCGATCGATTATTGCAGTTTCGATCGATCGATCGATCGGGCTATACACGATCAA
The stag database should be trained on the same gene families of the sequence to annotate. For example, the 16S gene is a marker gene that is widely used, and there are stag databases available for this marker gene. See Marker genes databases for a full list.
Other marker genes are used for building taxonomic trees and can be used for stag. If you don't find your favorite marker gene in the list of available databases, you can build your own database: Build a STAG database.
And a stag database test_db.stagDB
which was trained on the gene family of interest, you can annotate the sequences in unknown_seq.fasta
with:
stag classify -d test_db.stagDB -i unknown_seq.fasta
The output is:
sequence taxonomy
geneA d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus
geneB d__Bacteria
geneC d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria
The stag database can be constructed, or you can check from the list of available
You can re-direct the output with -o
. For example, with the call:
stag classify -d test_db.stagDB -i unknown_seq.fasta -o res.tax
The taxonomy annotation results are saved in res.tax
.
Many hmms (which are used at the basis of the stag database) are trained on proteins. It's possible in stag
to align protein sequences and then convert it to a nucleotide alignment that is used for the classification. For this you will need the -p
command.
In this case we need three inputs:
- the nucleotide fasta file,
- the protein fasta file,
- the stag database.
Note that the nucleotide and protein fasta files should have the same number of sequences and the same headers (not necessarily in the same order).
Example gene input (unk_seq.fna
):
>RS_GCF_003862755.1
ATGGCTGCCCAAAAAAATGACGGTGCAAAAACTAAAAAAAGGTTTAGTTTGTTTGGATACTTAAAAGAAACTAAACAAGAATTAAAAAGAGTTACATGGCCGACAAAAAAGGAACTGTTTAAAAACACAAGTATAGTTTTAACTGTAGTTATTTCATGCACTATATTAGTATGGGGTATAGACACTATATTATCTGGTGCACTAGCGTTATTACTAAAATAG
>RS_GCF_003336745.1
ATGTTTACCCGGATTGTTCGCTACTTCCAGGAGGCCCGGGCCGAGCTTGCCCGGGTCACCTGGCCCACGCGGGAGCAGATCGTGGAGGGCACCCAGGCCATTTTGGTTTTCACCGTGGTGTCCATGGTGATCCTGGGGTTCTACGATCTCGTCTTCCGGTTCCTGATAGGGCTTGTGCGATGA
Example protein input (unk_seq.faa
):
>RS_GCF_003862755.1
MAAQKNDGAKTKKRFSLFGYLKETKQELKRVTWPTKKELFKNTSIVLTVVISCTILVWGIDTILSGALALLLK*
>RS_GCF_003336745.1
MFTRIVRYFQEARAELARVTWPTREQIVEGTQAILVFTVVSMVILGFYDLVFRFLIGLVR*
You can then annotate the sequences with:
stag classify -d test_db.stagDB -i unk_seq.fna -p unk_seq.faa