Build STAG database for genes

You need three inputs:

1. a set of reference sequences to use to learn the taxonomy
2. a taxonomy file for the sequences in point 1
3. an HMM file for the sequences in point 1 (see also: What to do if you don't have a hmm file)

The reference sequences should be in fasta format (example file):

>gene1
ATATGCATTTTACGATATGCA...
>gene2
GCATTATTTCAGGGCTAGGCA...
>gene3
CCGGATTGGGATCAAAAAGCG...
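
Because the taxonomy file is keyed by sequence ID, repeated IDs in the FASTA file are likely to cause trouble later. The sketch below lists any ID that appears more than once. The function name `duplicate_fasta_ids` is hypothetical, and it assumes the ID is the first whitespace-delimited word of each header line; whether STAG reads headers the same way is not confirmed here.

```shell
# Hypothetical helper (not part of STAG): print every sequence ID that
# occurs more than once in a FASTA file.
# Assumption: the ID is the first word after ">" on each header line.
duplicate_fasta_ids() {
    awk '/^>/ { sub(/^>/, ""); print $1 }' "$1" | sort | uniq -d
}
```

Running `duplicate_fasta_ids <fasta_seqs>` prints nothing when all IDs are unique.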

The taxonomy file should be tab-separated: the first column contains the same IDs as the fasta file, and the second column contains the taxonomy, with the levels separated by ";", like:

gene\tKingdom;Phylum;Class;...

Example (example file):

gene1 d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
gene2 d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria;s__Listeria monocytogenes
gene3 d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus suis
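
The two requirements above (matching IDs, ";"-separated levels) can be checked by hand before running STAG. The following is a minimal sketch, not a replacement for STAG's own checks: `compare_ids` and `taxonomy_depths` are hypothetical names, the FASTA ID is assumed to be the first word of the header, and the idea that every row should have the same number of levels is an assumption rather than something stated on this page.

```shell
# Hypothetical helper: report IDs found in only one of the two files.
# Usage: compare_ids <fasta_seqs> <taxonomy_file>
# Note: relies on the FASTA file being non-empty (FNR == NR idiom).
compare_ids() {
    awk -F'\t' '
        FNR == NR {
            # First file (FASTA): collect the ID from each header line.
            if (/^>/) { sub(/^>/, ""); split($0, a, /[ \t]/); fasta[a[1]] = 1 }
            next
        }
        # Second file (taxonomy): the ID is the first tab-separated column.
        { tax[$1] = 1 }
        END {
            for (id in fasta) if (!(id in tax)) print "only_in_fasta\t" id
            for (id in tax) if (!(id in fasta)) print "only_in_taxonomy\t" id
        }
    ' "$1" "$2" | sort
}

# Hypothetical helper: count taxonomy rows per number of ";"-separated levels.
# Prints "<levels><TAB><rows>"; more than one output line means mixed depths.
taxonomy_depths() {
    awk -F'\t' '
        { n = split($2, levels, ";"); count[n]++ }
        END { for (n in count) print n "\t" count[n] }
    ' "$1" | sort -n
}
```

An empty result from `compare_ids` means both files contain exactly the same set of IDs.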

Example HMM file: (example file).

To check that your files are correct, you can run:

stag check_input -i <fasta_seqs> -x <taxonomy_file> -a <hmmfile>

Once the check reports no errors, you can train the database:

stag train -i <fasta_seqs> -x <taxonomy_file> -a <hmmfile> -o test_db.stagDB
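
The check and the training step can be chained so that training only starts when the check passes. This is a sketch under one important assumption: that `stag check_input` exits with a non-zero status when it finds errors, which this page does not confirm. The wrapper name `build_stag_db` is hypothetical; the `stag` options are the ones shown above.

```shell
# Hypothetical wrapper: run the input check, then train only if it succeeds.
# Assumption: `stag check_input` returns a non-zero exit status on errors.
# Usage: build_stag_db <fasta_seqs> <taxonomy_file> <hmmfile> <output_db>
build_stag_db() {
    fasta="$1"; taxonomy="$2"; hmm="$3"; out_db="$4"
    if stag check_input -i "$fasta" -x "$taxonomy" -a "$hmm"; then
        stag train -i "$fasta" -x "$taxonomy" -a "$hmm" -o "$out_db"
    else
        echo "Input check failed; fix the reported errors before training." >&2
        return 1
    fi
}
```

If the exit status turns out not to reflect errors, read the output of `stag check_input` manually before training.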

While the database is being built, a log file is saved under the same name as the database with ".log" appended. Training takes between 1 and 4 hours, depending on the number of sequences.

For ~40k sequences of ~1,500 nucleotides it takes around 3 hours. You can check the time from the log file:

grep "MAIN" <db_output_file>.log
[2020-08-02 18:30:17,493] MAIN:Load taxonomy
[2020-08-02 18:30:17,944] MAIN:Load alignment
[2020-08-02 18:31:40,047] MAIN:Check taxonomy and alignment
[2020-08-02 18:31:43,295] MAIN:Train all classifiers
[2020-08-02 18:38:52,453] MAIN:Learn taxonomy selection function
[2020-08-02 21:14:22,120] MAIN:Save to file
[2020-08-02 21:14:58,667] MAIN:Finished
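
To see where the time goes, the timestamps of consecutive MAIN lines can be subtracted. The sketch below does this with awk; `stage_durations` is a hypothetical helper, and it assumes the log layout shown above and that no single stage runs longer than 24 hours (a negative difference is treated as crossing midnight).

```shell
# Hypothetical helper: print the seconds spent in each MAIN stage of a log.
# Assumes lines of the form "[YYYY-MM-DD HH:MM:SS,mmm] MAIN:<stage>".
# Usage: stage_durations <db_output_file>.log
stage_durations() {
    grep 'MAIN:' "$1" | awk '
        {
            # Field 2 looks like "18:30:17,493]"; keep hours, minutes, seconds.
            split($2, t, /[:,]/)
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (NR > 1) {
                delta = now - prev
                if (delta < 0) delta += 86400   # the stage crossed midnight
                printf "%6d s  %s\n", delta, stage
            }
            prev = now
            stage = $0
            sub(/^.*MAIN:/, "", stage)
        }
    '
}
```

Applied to the log above, the longest stage is "Learn taxonomy selection function" at 9330 s (about 2 h 35 min), followed by "Train all classifiers" at 429 s.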
