Build STAG database for genes
Alessio Milanese edited this page Sep 21, 2020
·
2 revisions
You need three inputs:
- a set of reference sequences from which to learn the taxonomy
- a taxonomy file for those reference sequences
- an HMM file for those reference sequences (see also: What to do if you don't have a hmm file)
The reference sequences should be in FASTA format (example file):
>gene1
ATATGCATTTTACGATATGCA...
>gene2
GCATTATTTCAGGGCTAGGCA...
>gene3
CCGGATTGGGATCAAAAAGCG...
The taxonomy file should be tab-separated: the first column contains the same ids as the FASTA file, and the second column contains the taxonomy, with the levels separated by ";", like:
gene\tKingdom;Phylum;Class;...
Example (example file):
gene1	d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
gene2	d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria;s__Listeria monocytogenes
gene3	d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus suis
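The id correspondence between the two files can also be inspected by hand before running the tool. The following is a minimal sketch, not STAG's own validation: the function names `read_fasta_ids`, `read_taxonomy`, and `compare_inputs` are hypothetical, and it assumes that the id is the first whitespace-delimited token of each FASTA header and that every taxonomy line has exactly two tab-separated columns.

```python
def read_fasta_ids(lines):
    """Return sequence ids: the first token after '>' on each header line."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]


def read_taxonomy(lines):
    """Map id -> list of taxonomy levels from 'id<TAB>level;level;...' lines."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        gene_id, lineage = line.split("\t")  # assumes exactly two columns
        taxonomy[gene_id] = lineage.split(";")
    return taxonomy


def compare_inputs(fasta_ids, taxonomy):
    """Report ids that appear in only one of the two inputs."""
    fasta_set = set(fasta_ids)
    tax_set = set(taxonomy)
    return {
        "missing_in_taxonomy": sorted(fasta_set - tax_set),
        "missing_in_fasta": sorted(tax_set - fasta_set),
    }


# Illustrative data only: shortened sequences and lineages, gene3 left
# out of the taxonomy on purpose.
fasta = [">gene1", "ATATGCA", ">gene2", "GCATTAT", ">gene3", "CCGGATT"]
tax = ["gene1\td__Bacteria;p__Firmicutes", "gene2\td__Bacteria;p__Firmicutes"]
report = compare_inputs(read_fasta_ids(fasta), read_taxonomy(tax))
print(report)
# {'missing_in_taxonomy': ['gene3'], 'missing_in_fasta': []}
```

Both lists should be empty for a consistent pair of files. To run it on real files, pass open file handles instead of the lists.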
Example HMM file: (example file).
To check that your files are correct, you can run:
stag check_input -i <fasta_seqs> -x <taxonomy_file> -a <hmmfile>
Once there are no errors, you can run:
stag train -i <fasta_seqs> -x <taxonomy_file> -a <hmmfile> -o test_db.stagDB
While the database is being built, a log file is saved under the same name as the database with ".log" appended. Training takes between 1 and 4 hours, depending on the number of sequences; for ~40k sequences of ~1,500 nucleotides it takes around 3 hours. You can check the timing in the log file:
grep "MAIN" <db_output_file>.log
[2020-08-02 18:30:17,493] MAIN:Load taxonomy
[2020-08-02 18:30:17,944] MAIN:Load alignment
[2020-08-02 18:31:40,047] MAIN:Check taxonomy and alignment
[2020-08-02 18:31:43,295] MAIN:Train all classifiers
[2020-08-02 18:38:52,453] MAIN:Learn taxonomy selection function
[2020-08-02 21:14:22,120] MAIN:Save to file
[2020-08-02 21:14:58,667] MAIN:Finished
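To see where the time goes, the timestamps can be turned into per-step durations. This is a minimal sketch, assuming the log lines have exactly the format shown above and that each MAIN entry marks the start of a step; `step_durations` is a hypothetical helper and not part of STAG.

```python
import re
from datetime import datetime

# Matches lines such as "[2020-08-02 18:30:17,493] MAIN:Load taxonomy".
LOG_LINE = re.compile(r"\[(?P<ts>[^\]]+)\] MAIN:(?P<step>.*)")


def step_durations(lines):
    """Return (step, timedelta) pairs: time from each MAIN entry to the next."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
            entries.append((ts, match["step"]))
    # The last entry ("Finished") has no following entry, so it gets no duration.
    return [
        (step, nxt - ts)
        for (ts, step), (nxt, _) in zip(entries, entries[1:])
    ]


log = """\
[2020-08-02 18:30:17,493] MAIN:Load taxonomy
[2020-08-02 18:30:17,944] MAIN:Load alignment
[2020-08-02 18:31:40,047] MAIN:Check taxonomy and alignment
[2020-08-02 18:31:43,295] MAIN:Train all classifiers
[2020-08-02 18:38:52,453] MAIN:Learn taxonomy selection function
[2020-08-02 21:14:22,120] MAIN:Save to file
[2020-08-02 21:14:58,667] MAIN:Finished
"""

durations = step_durations(log.splitlines())
for step, delta in durations:
    print(f"{step}: {delta}")
```

For the log above, almost all of the run (about 2 hours 35 minutes) falls between "Learn taxonomy selection function" and "Save to file".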