-
Notifications
You must be signed in to change notification settings - Fork 2
Classify genomes
You need three inputs:
- the genome fasta file(s) (
-i
or-D
) - a STAG database for genomes (
-d
) - the path where to save the result (
-o
)
You can provide the input genome fasta with -i
, if you want to classify only one genome (example unknown_genome.fa
):
stag classify_genome -i unknown_genome.fa [...]
You can give as input all fasta files inside a directory with the -D
command:
stag classify_genome -D directory/with/fasta/genomes [...]
Second, you need the STAG database (-d
). You can either use one that we already compiled (link), or you can create one yourself (link).
Finally, you need to provide a path where to save the result:
stag classify_genome -o result/dir [...]
The directory result/dir
contains three directories and one file:
- file
genome_annotation
contains the annotation for each genome (one per line) - dir
genes_predictions
contains one file per genome, with the classification of the single marker genes. Note that if a marker gene was not present in the genome, then it will not be present in this file. - dir
MG_sequences
contains all the marker gene extracted from the genomes. There are two files per marker gene, one containing the gene sequences and one containing the protein sequences. - dir
MG_ali
containing the stag alignments of the genes.
NOTE: the genome_annotation
is obtained by first concatenating the alignments of the single genes and then classifying this. If the genome (or MAG) is not complete and some marker genes are missing, then the classification will be of poor quality. In this case, it would make sense to check the annotation of the single genes. We will implement a better way to merge this two annotations soon.
The genes are predicted with Prdigal and then selected with HMMR (using hmmsearch
). We select only the genes that are above a certain threshold.
Since the marker genes are expected to be present in single copy, by default we select only one gene per marker gene (above the threshold). Note that the gene selected is chosen randomly. With the command -r
is possible to taxonomically annotate all genes above the threshold. The result will be in the result dir (-o
) under genes_predictions
.