ConDiGA (Contigs Directed Gene Annotation) is an accurate taxonomic annotation pipeline from metagenomic data to construct accurate protein sequence databases for deep metaproteomic coverage.
You can install ConDiGA from Bioconda at https://anaconda.org/bioconda/condiga. Make sure you have conda
installed.
# create conda environment and install condiga
conda create -n condiga -c conda-forge -c bioconda condiga
# activate environment
conda activate condiga
You can install ConDiGA from PyPI at https://pypi.org/project/condiga/. Make sure you have pip
installed.
pip install condiga
Note: If you use pip to setup ConDiGA, you will have to install Minimap2 and TaxonKit manually and add it to your system path. Irrespective of the package manager, if you want to use Kaiju results, you have to download and setup the NCBI taxdump database for TaxonKit.
After setting up, run the following command to ensure that condiga
is working.
condiga --help
Usage: condiga [OPTIONS]
ConDiGA: Contigs directed gene annotation for accurate protein sequence
database construction in metaproteomics.
Options:
-c, --contigs PATH path to the contigs file [required]
-ta, --taxa PATH path to the taxonomic classification results
file [required]
-g, --genes PATH path to the genes file [required]
-cov, --coverages PATH path to the contig coverages file
[required]
-as, --assembly-summary PATH path to the assembly_summary.txt file
[required]
-ra, --rel-abundance FLOAT RANGE
minimum relative abundance cut-off
[default: 0.0001; 0<=x<=1]
-gc, --genome-coverage FLOAT RANGE
minimum genome coverage cut-off [default:
0.001; 0<=x<=1]
-mt, --map-threshold FLOAT RANGE
minimum mapping length threshold cut-off
[default: 0.5; 0<=x<=1]
-t, --nthreads INTEGER number of threads to use [default: 8]
-o, --output PATH path to the output folder [required]
--help Show this message and exit.
Before running ConDiGA, you have to process your data as follows.
You have to assemble your reads into contigs using MEGAHIT as follows. Currently, ConDiGA only supports MEGAHIT assemblies.
megahit -1 Reads/reads_1.fq.gz -2 Reads/reads_2.fq.gz -o MEGAHIT_output -t 16
Next, you have to perform taxonomic annotation on your contigs. You can use any tool such as Kraken2, Kaiju or even BLAST.
As an example, let's run Kraken2 as follows. $DBNAME
is the path to your Kraken database.
kraken2 --threads 16 --db $DBNAME --use-names --output kraken_res_0.1.txt --confidence 0.1 --report kraken_report_0.1.txt MEGAHIT_output/final.contigs.fa
Now, you can run the convert
command to convert your result to a form that can be used as input to condiga
. The result will be saved to the Kraken
folder. Currently, convert
supports results from Kraken2, Kaiju and BLAST.
mkdir Kraken
convert -i kraken_res_0.1.txt -t kraken -o Kraken
NOTE: Since, different annotation tools output results in different formats, you have to format the annotation results using covert
which will output the result in a standard format readable by ConDiGA.
You can use CoverM to get the coverage values of contigs as follows.
coverm contig -1 Reads/reads_1.fastq -2 Reads/reads_2.fastq -r MEGAHIT_output/final.contigs.fa -o contig_coverage.tsv -t 16
You can predict the genes in the contigs using MetaGeneMark as follows. You will find the nucleotide and amino acid sequences of the predicted genes in a file named final.contigs.fa.lst
.
gmhmmp -m MetaGeneMark_v1.mod final.contigs.fa
You can download the assembly summary file for bacteria from NCBI as follows.
wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
Once you have preprocessed your data and obtained all the necessary files, you can run condiga
as follows.
condiga -c final.contigs.fa -ta Kraken/kraken_result.txt -g final.contigs.fa.lst -cov contig_coverages.tsv -as assembly_summary.txt -o <output_folder>
The output of ConDiGA will contain the following main files and folders.
genes.species.mapped.xlsx
contains the gene annotation resultsall_genes.fna
contains nucleotide sequences of the predicted genesall_genes.faa
contains amino acid sequences of the predicted genesall_genes.output
containsminimap2
mapping results for the predicted genesAssemblies
contains FASTA files of the downloaded reference genomes
If you have any questions, issues or suggestions, please post them under ConDiGA Issues.
Are you interested in contributing to the ConDiGA project? If so, you can check out the contributing guidelines in CONTRIBUTING.md.
The ConDiGA logo was generated using DALL·E 3 from OpenAI with the following prompt.
Create an icon that visually represents the concept of contigs directed gene annotation for a tool logo ensuring the background is completely transparent.
ConDiGA is published in Microbiome at DOI: 10.1186/s40168-024-01775-3.
If you use ConDiGA in your work, please as
Wu, E., Mallawaarachchi, V., Zhao, J. et al. Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics. Microbiome 12, 58 (2024). https://doi.org/10.1186/s40168-024-01775-3
@article{Wu2024,
author={Wu, Enhui and Mallawaarachchi, Vijini and Zhao, Jinzhi and Yang, Yi and Liu, Hebin and Wang, Xiaoqing and Shen, Chengpin and Lin, Yu and Qiao, Liang},
title={Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics},
journal={Microbiome},
year={2024},
month={Mar},
day={19},
volume={12},
number={1},
pages={58},
abstract={Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although different metagenomics-based database construction strategies have been developed, an optimization of gene taxonomic annotation has not been reported, which, however, is extremely important for accurate metaproteomic analysis.},
issn={2049-2618},
doi={10.1186/s40168-024-01775-3},
url={https://doi.org/10.1186/s40168-024-01775-3}
}
NOTE: The database created by ConDiGA is described as MD3 in the manuscript.
Also, please cite the following tools used by ConDiGA, the assembler and the relevant taxonomic annotation tool used to obtain the results.
- Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic acids research, 38 (12): 132-132 (2010). https://doi.org/10.1093/nar/gkq275
- Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100 (2018). https://doi.org/10.1093/bioinformatics/bty191
- Woodcroft BJ, Newell R, CoverM: Read coverage calculator for metagenomics (2017). https://github.com/wwood/CoverM