PLEASE NOTE: This repo was created for the benchmarking experiments and the proof of concept of YACHT. For updates on the production-level code, please follow the main YACHT repo: https://github.com/KoslickiLab/YACHT
Please install the necessary packages via Conda following the commands below:
# Clone the repo
git clone https://github.com/KoslickiLab/YACHT-reproducibles.git
cd YACHT-reproducibles
# Install Conda environment
conda env create -f env/yacht_proof_env.yml
# Activate environment
conda activate yacht_proof_env
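As an optional sanity check before running anything else, the snippet below is a minimal sketch that reports whether the expected environment is active. It relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the helper name `conda_env_is_active` is hypothetical and not part of this repo.

```python
import os

# Environment name taken from the `conda activate` command above.
EXPECTED_ENV = "yacht_proof_env"


def conda_env_is_active(expected=EXPECTED_ENV, environ=None):
    """Return True if the active Conda environment is named `expected`.

    `environ` defaults to os.environ; passing a dict makes the check easy to test.
    """
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV") == expected


if __name__ == "__main__":
    print("active" if conda_env_is_active() else "not active")
```

If this prints `not active`, re-run `conda activate yacht_proof_env` in the same shell before continuing.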
To evaluate the performance of YACHT, we use the public datasets from the Critical Assessment of Metagenome Interpretation II (CAMI II) for the tasks of taxonomic profiling and clinical pathogen detection. These datasets comprise the rhizosphere, marine, strain madness, and clinical pathogen data, and provide high-quality benchmarks for evaluating metagenomic analysis tools like YACHT. We also use OPAL, the official CAMI profiling assessment tool, to compare YACHT's performance against the state-of-the-art (SOTA) tools suggested by CAMI II (e.g., Bracken, Metalign, mOTUs, MetaPhlAn, CCMetagen, NBC++, MetaPhyler, LSHVec).
We provide bash scripts under the ./benchmark/scripts folder to reproduce our evaluation results. To run these scripts, please make sure you have installed the Conda environment described above. After that, follow these steps:
- Git clone the production-level YACHT repository:
git clone https://github.com/KoslickiLab/YACHT.git
- Download CAMI2 datasets
# download_cami2_data.sh <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/download_cami2_data.sh <path_to_YACHT-reproducibles> 50
- Run YACHT on the CAMI datasets
# run_YACHT_cami2.sh <yacht_repo_loc> <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/run_YACHT_cami2.sh <path_to_YACHT> <path_to_YACHT-reproducibles> 20
- Run OPAL on the YACHT results
# git clone OPAL
git clone https://github.com/CAMI-challenge/OPAL
# run_OPAL_cami2.sh <opal_repo_loc> <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/run_OPAL_cami2.sh <path_to_OPAL> <path_to_YACHT-reproducibles> 20
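The three benchmark steps have to run in order, and each one should stop the pipeline if it fails. The following is a rough sketch of a driver that does this. The function names `build_commands` and `run_all` are hypothetical, and the argument order for the YACHT and OPAL steps is an assumption taken from the usage comments (`<repo_loc> <benchmark_dir> <cpu_num>`); adjust it if the scripts expect something else.

```python
import shlex
import subprocess

# Script folder as used in the commands above.
SCRIPT_DIR = "./benchmark/scripts/bash_scripts"


def build_commands(yacht_repo, opal_repo, benchmark_dir, cpus=20):
    """Return the download, YACHT, and OPAL steps as argument lists, in run order.

    Assumption: the repo location comes first for the YACHT and OPAL scripts,
    as written in their usage comments.
    """
    cpus = str(cpus)
    return [
        ["bash", f"{SCRIPT_DIR}/download_cami2_data.sh", benchmark_dir, cpus],
        ["bash", f"{SCRIPT_DIR}/run_YACHT_cami2.sh", yacht_repo, benchmark_dir, cpus],
        ["bash", f"{SCRIPT_DIR}/run_OPAL_cami2.sh", opal_repo, benchmark_dir, cpus],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it too unless dry_run is True.

    With check=True, a non-zero exit status raises CalledProcessError,
    so later steps never run on top of a failed one.
    """
    lines = []
    for argv in commands:
        line = shlex.join(argv)
        print(line)
        lines.append(line)
        if not dry_run:
            subprocess.run(argv, check=True)
    return lines
```

Calling `run_all(build_commands("<path_to_YACHT>", "<path_to_OPAL>", "<path_to_YACHT-reproducibles>"))` only prints the commands; pass `dry_run=False` to execute them.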
python ref_matrix.py --ref_file '../ForSteve/ref_gtdb-rs207.genomic-reps.dna.k31.zip' --out_prefix 'test2_' --N 20
python recover_abundance.py --ref_file 'test2_ref_matrix_processed.npz' --sample_file '../ForSteve/sample.sig' --hash_file 'test2_hash_to_col_idx.csv' --org_file 'test2_processed_org_idx.csv' --w 0.01 --outfile 'test2_recovered_abundance.csv'
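The two commands are linked through the output prefix: the files passed to `recover_abundance.py` are the ones `ref_matrix.py` wrote under `--out_prefix`. The sketch below makes that relationship explicit by deriving the second command from the prefix. The helper `recover_command` is hypothetical, and the file-name suffixes are inferred from the example commands in this README, not from the scripts themselves.

```python
def recover_command(out_prefix, sample_file, w="0.01"):
    """Build the recover_abundance.py argument list from a ref_matrix.py prefix.

    Assumption: ref_matrix.py names its outputs by appending fixed suffixes
    to `out_prefix`, as in the README's example commands.
    """
    return [
        "python", "recover_abundance.py",
        "--ref_file", f"{out_prefix}ref_matrix_processed.npz",
        "--sample_file", sample_file,
        "--hash_file", f"{out_prefix}hash_to_col_idx.csv",
        "--org_file", f"{out_prefix}processed_org_idx.csv",
        "--w", w,
        "--outfile", f"{out_prefix}recovered_abundance.csv",
    ]
```

For example, `recover_command("test2_", "../ForSteve/sample.sig")` yields the same arguments as the second command above, and the list can be passed directly to `subprocess.run`.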
- Run
python ref_matrix.py --ref_file 'tests/testdata/20_genomes_sketches.zip' --out_prefix 'tests/unittest_'
This should generate 4 files in the tests folder.
- Run
python recover_abundance.py --ref_file 'tests/unittest_ref_matrix_processed.npz' --sample_file 'tests/testdata/sample.sig' --hash_file 'tests/unittest_hash_to_col_idx.csv' --org_file 'tests/unittest_processed_org_idx.csv' --w 0.01 --outfile 'tests/unittest_recovered_abundance.csv'
This should create a file tests/unittest_recovered_abundance.csv, which should be all zeros.
- Run the same command as above, but with --w 0.0001. This should overwrite tests/unittest_recovered_abundance.csv with a 6 in the 19th row.
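Checking the output file by eye is error-prone, so here is a small sketch for inspecting it. The layout of the CSV is an assumption: the helper reads the last field of each row as a number, skips rows that do not parse (such as a header), and counts data rows from 1. The name `nonzero_rows` is hypothetical.

```python
import csv


def nonzero_rows(path):
    """Return (row_number, value) pairs for data rows whose last field is nonzero.

    Assumptions: the abundance is the last field of each row, and any row whose
    last field is not numeric (e.g. a header) is not counted as a data row.
    """
    hits = []
    row_number = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # ignore blank lines
            try:
                value = float(row[-1])
            except ValueError:
                continue  # header-like row, not a data row
            row_number += 1
            if value != 0:
                hits.append((row_number, value))
    return hits
```

Under these assumptions, `nonzero_rows("tests/unittest_recovered_abundance.csv")` should return an empty list after the `--w 0.01` run and a single entry for row 19 after the `--w 0.0001` run.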