KoslickiLab/YACHT-reproducibles

YACHT (Code for Reproducibility of Experiments)

PLEASE NOTE: this repo was created for the benchmarking experiments and the proof-of-concept of YACHT. For updates on the production-level code, please follow the production-level YACHT repository (https://github.com/KoslickiLab/YACHT).

Installation

Please install the necessary packages via Conda following the commands below:

# Clone the repo
git clone https://github.com/KoslickiLab/YACHT-reproducibles.git
cd YACHT-reproducibles

# Install Conda environment
conda env create -f env/yacht_proof_env.yml

# Activate environment
conda activate yacht_proof_env

Benchmarking Experiments

To evaluate the performance of YACHT, we use the public datasets from the Critical Assessment of Metagenome Interpretation II (CAMI II) for the tasks of taxonomic profiling and clinical pathogen detection. These datasets include the rhizosphere data, marine data, strain madness data, and clinical pathogen data, and serve as high-quality benchmarks for evaluating metagenomic analysis tools like YACHT. We also use OPAL, the official CAMI profiling assessment tool, to compare YACHT's performance against the state-of-the-art (SOTA) tools suggested by CAMI II (e.g., Bracken, Metalign, mOTUs, MetaPhlAn, CCMetagen, NBC++, MetaPhyler, LSHVec).

How to reproduce the evaluation results

We provide bash scripts under the ./benchmark/scripts folder to reproduce our evaluation results. Before running them, please make sure you have installed the Conda environment described above. Then follow these steps:

  1. Git clone the production-level YACHT repository:
git clone https://github.com/KoslickiLab/YACHT.git
  2. Download the CAMI II datasets:
# download_cami2_data.sh <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/download_cami2_data.sh <path_to_YACHT-reproducibles> 50
  3. Run YACHT on the CAMI II datasets:
# run_YACHT.sh <yacht_repo_loc> <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/run_YACHT_cami2.sh <path_to_YACHT-reproducibles> 20
  4. Run OPAL on the YACHT results:
# git clone OPAL
git clone https://github.com/CAMI-challenge/OPAL
# run_OPAL.sh <opal_repo_loc> <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/run_OPAL_cami2.sh <path_to_YACHT-reproducibles> 20
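
If you prefer to drive the three script calls from a single place, the sketch below assembles them in the order listed above and, by default, only prints them. It is a convenience wrapper and not part of this repo: the function names `build_benchmark_commands` and `run_benchmark` are hypothetical, the script paths and CPU counts are copied from the commands above, and the two `git clone` steps are assumed to have been done already.

```python
import shlex
import subprocess
from pathlib import PurePosixPath

# Script folder as used in the commands above (relative to the repo root).
SCRIPT_DIR = PurePosixPath("benchmark/scripts/bash_scripts")


def build_benchmark_commands(benchmark_dir, download_cpus=50, run_cpus=20):
    """Return the three script invocations, in order, as argv lists.

    The arguments mirror the example commands in this README; nothing else
    is assumed about what the scripts accept.
    """
    benchmark_dir = str(benchmark_dir)
    return [
        ["bash", str(SCRIPT_DIR / "download_cami2_data.sh"), benchmark_dir, str(download_cpus)],
        ["bash", str(SCRIPT_DIR / "run_YACHT_cami2.sh"), benchmark_dir, str(run_cpus)],
        ["bash", str(SCRIPT_DIR / "run_OPAL_cami2.sh"), benchmark_dir, str(run_cpus)],
    ]


def run_benchmark(benchmark_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_benchmark_commands(benchmark_dir):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_benchmark("<path_to_YACHT-reproducibles>")` from the repo root prints the commands for review; pass `dry_run=False` to execute them.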

Proof-of-Concept Experiments

Creating a reference dictionary matrix (ref_matrix.py):

python ref_matrix.py --ref_file '../ForSteve/ref_gtdb-rs207.genomic-reps.dna.k31.zip' --out_prefix 'test2_' --N 20

Computing relative abundance of organisms (recover_abundance.py):

python recover_abundance.py --ref_file 'test2_ref_matrix_processed.npz' --sample_file '../ForSteve/sample.sig' --hash_file 'test2_hash_to_col_idx.csv' --org_file 'test2_processed_org_idx.csv' --w 0.01 --outfile 'test2_recovered_abundance.csv'
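
In both examples, the file names passed to recover_abundance.py are the --out_prefix given to ref_matrix.py followed by a fixed suffix. The sketch below makes that relationship explicit by building the second command from the prefix. The helper name `recover_abundance_args` is hypothetical, and the suffixes are taken from the commands in this README rather than from the scripts themselves, so treat them as an assumption.

```python
def recover_abundance_args(out_prefix, sample_file, w=0.01):
    """Build the recover_abundance.py argv from the prefix used for ref_matrix.py.

    Assumption: input and output names follow the pattern seen in the README
    examples (e.g. 'test2_' -> 'test2_ref_matrix_processed.npz').
    """
    return [
        "python", "recover_abundance.py",
        "--ref_file", f"{out_prefix}ref_matrix_processed.npz",
        "--sample_file", sample_file,
        "--hash_file", f"{out_prefix}hash_to_col_idx.csv",
        "--org_file", f"{out_prefix}processed_org_idx.csv",
        "--w", str(w),
        "--outfile", f"{out_prefix}recovered_abundance.csv",
    ]
```

For example, `recover_abundance_args("test2_", "../ForSteve/sample.sig")` yields the same arguments as the command above.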

Basic workflow

  1. Run python ref_matrix.py --ref_file 'tests/testdata/20_genomes_sketches.zip' --out_prefix 'tests/unittest_'. This should generate 4 files in the tests folder.
  2. Run python recover_abundance.py --ref_file 'tests/unittest_ref_matrix_processed.npz' --sample_file 'tests/testdata/sample.sig' --hash_file 'tests/unittest_hash_to_col_idx.csv' --org_file 'tests/unittest_processed_org_idx.csv' --w 0.01 --outfile 'tests/unittest_recovered_abundance.csv'. This should create the file tests/unittest_recovered_abundance.csv, whose entries should all be zero.
  3. Run the same command as in step 2, but with --w 0.0001. This should overwrite tests/unittest_recovered_abundance.csv with a file that has a 6 in the 19th row.
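
The expected outcomes of the workflow (all zeros, then a 6 in the 19th row) can be checked with a few lines of Python instead of by eye. The sketch below assumes the output CSV holds one numeric value in the first field of each row; the actual layout is not documented here, so adjust the column index if the file differs. The helper name `nonzero_rows` is hypothetical.

```python
import csv


def nonzero_rows(csv_path):
    """Return (1-based row number, value) for every row with a nonzero first field.

    Assumption: the abundance value is in the first column. Rows that are
    empty or whose first field is not numeric (e.g. a header) are skipped.
    """
    hits = []
    with open(csv_path, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue
            try:
                value = float(row[0])
            except ValueError:
                continue
            if value != 0.0:
                hits.append((row_number, value))
    return hits
```

Under that assumption, `nonzero_rows("tests/unittest_recovered_abundance.csv")` should return an empty list after step 2 and a single entry for row 19 after step 3.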

About

A repo to reproduce the experiments in the YACHT manuscript
