YACHT (Code for Reproducibility of Experiments)

PLEASE NOTE: this repo was created for the benchmarking experiments and the proof of concept of YACHT. For updates on the production-level code, please follow the production YACHT repo (https://github.com/KoslickiLab/YACHT).

Installation

Please install the necessary packages via Conda following the commands below:

# Clone the repo
git clone https://github.com/KoslickiLab/YACHT-reproducibles.git
cd YACHT-reproducibles

# Install Conda environment
conda env create -f env/yacht_proof_env.yml

# Activate environment
conda activate yacht_proof_env
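
Before running any of the scripts, it can help to confirm that the environment is actually active. The sketch below relies on the `CONDA_DEFAULT_ENV` variable that Conda exports on activation. The helper name `is_env_active` is hypothetical and not part of this repo, and the check assumes the environment was created by name rather than by path prefix.

```python
import os

# Environment name taken from the `conda activate` command above.
EXPECTED_ENV = "yacht_proof_env"


def is_env_active(expected=EXPECTED_ENV, environ=None):
    """Return True if Conda reports `expected` as the active environment.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to exercise without touching the shell.
    """
    environ = os.environ if environ is None else environ
    # Conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    return environ.get("CONDA_DEFAULT_ENV") == expected


if __name__ == "__main__":
    print("environment active:", is_env_active())
```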

Benchmarking Experiments

To evaluate the performance of YACHT, we use the public datasets from the Critical Assessment of Metagenome Interpretation II (CAMI II) for the tasks of taxonomic profiling and clinical pathogen detection. These datasets include the rhizosphere data, marine data, strain madness data, and clinical pathogen data, and serve as benchmarks for evaluating metagenomic analysis tools like YACHT. We also use OPAL, the official CAMI profiling assessment tool, to compare YACHT's performance against the state-of-the-art (SOTA) tools (e.g., Bracken, Metalign, mOTUs, MetaPhlAn, CCMetagen, NBC++, MetaPhyler, LSHVec) suggested by CAMI II.

How to reproduce the evaluation results

We provide bash scripts under the ./benchmark/scripts folder to reproduce our evaluation results. To run these scripts, please make sure you have installed the Conda environment described above. After that, follow the steps below:

Clone the production-level YACHT repository:

git clone https://github.com/KoslickiLab/YACHT.git

Download CAMI2 datasets

# download_cami2_data.sh <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/download_cami2_data.sh <path_to_YACHT-reproducibles> 50

Run YACHT on the CAMI datasets

# run_YACHT_cami2.sh <yacht_repo_loc> <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/run_YACHT_cami2.sh <path_to_YACHT-reproducibles> 20

Run OPAL on the YACHT results

# git clone OPAL 
git clone https://github.com/CAMI-challenge/OPAL
# run_OPAL_cami2.sh <opal_repo_loc> <benchmark_dir> <cpu_num>
bash ./benchmark/scripts/bash_scripts/run_OPAL_cami2.sh <path_to_YACHT-reproducibles> 20
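
The three benchmark steps above have to run in order (download, YACHT, OPAL), so a small driver can be convenient. The following is only a sketch: the function names `benchmark_commands` and `run_benchmark` are hypothetical, and the argument lists copy the example invocations above verbatim. The usage comments also mention repo-location arguments, which are not filled in here and may need to be added.

```python
import subprocess

# Script folder as it appears in the commands above.
SCRIPT_DIR = "./benchmark/scripts/bash_scripts"


def benchmark_commands(benchmark_dir, download_cpus=50, run_cpus=20):
    """Build the three benchmark commands in the order they must run."""
    return [
        ["bash", f"{SCRIPT_DIR}/download_cami2_data.sh", benchmark_dir, str(download_cpus)],
        ["bash", f"{SCRIPT_DIR}/run_YACHT_cami2.sh", benchmark_dir, str(run_cpus)],
        ["bash", f"{SCRIPT_DIR}/run_OPAL_cami2.sh", benchmark_dir, str(run_cpus)],
    ]


def run_benchmark(benchmark_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in benchmark_commands(benchmark_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_benchmark("<path_to_YACHT-reproducibles>")` from the repo root only prints the commands; pass `dry_run=False` to execute them.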

Proof-of-Concept Experiments

Creating a reference dictionary matrix (`ref_matrix.py`):

python ref_matrix.py --ref_file '../ForSteve/ref_gtdb-rs207.genomic-reps.dna.k31.zip' --out_prefix 'test2_' --N 20

Computing relative abundance of organisms (`recover_abundance.py`):

python recover_abundance.py --ref_file 'test2_ref_matrix_processed.npz' --sample_file '../ForSteve/sample.sig' --hash_file 'test2_hash_to_col_idx.csv' --org_file 'test2_processed_org_idx.csv' --w 0.01 --outfile 'test2_recovered_abundance.csv'
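
In the two commands above, the `--out_prefix` given to `ref_matrix.py` reappears at the start of the `--ref_file`, `--hash_file`, and `--org_file` arguments of `recover_abundance.py`. The sketch below makes that link explicit by deriving the second command from the prefix. The helper `recover_abundance_command` is hypothetical, and the three file suffixes are inferred from the example file names rather than from the scripts themselves.

```python
def recover_abundance_command(prefix, sample_file, w, outfile):
    """Build the recover_abundance.py argument list from a ref_matrix.py prefix.

    The suffixes below are assumptions read off the example commands:
    <prefix>ref_matrix_processed.npz, <prefix>hash_to_col_idx.csv,
    and <prefix>processed_org_idx.csv.
    """
    return [
        "python", "recover_abundance.py",
        "--ref_file", f"{prefix}ref_matrix_processed.npz",
        "--sample_file", sample_file,
        "--hash_file", f"{prefix}hash_to_col_idx.csv",
        "--org_file", f"{prefix}processed_org_idx.csv",
        "--w", str(w),
        "--outfile", outfile,
    ]


if __name__ == "__main__":
    cmd = recover_abundance_command(
        "test2_", "../ForSteve/sample.sig", 0.01, "test2_recovered_abundance.csv"
    )
    print(" ".join(cmd))
```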

Basic workflow

1. Run `python ref_matrix.py --ref_file 'tests/testdata/20_genomes_sketches.zip' --out_prefix 'tests/unittest_'`. This should generate 4 files in the tests folder.
2. Run `python recover_abundance.py --ref_file 'tests/unittest_ref_matrix_processed.npz' --sample_file 'tests/testdata/sample.sig' --hash_file 'tests/unittest_hash_to_col_idx.csv' --org_file 'tests/unittest_processed_org_idx.csv' --w 0.01 --outfile 'tests/unittest_recovered_abundance.csv'`. This should create the file tests/unittest_recovered_abundance.csv, whose entries should all be zero.
3. Run the same command as in step 2, but with `--w 0.0001`. This should overwrite tests/unittest_recovered_abundance.csv with a 6 in the 19th row.
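
The expected outcomes of the workflow above (all zeros, then a 6 in the 19th row) can be checked without opening the file by hand. This is a minimal sketch under stated assumptions: the helper `nonzero_rows` is hypothetical, rows are counted from 1, and the output CSV is assumed to hold only abundance values, with any non-numeric cells (for example a header) skipped. If the file also carries a numeric index column, those cells would be reported too.

```python
import csv


def nonzero_rows(path):
    """Return (row_number, value) for every nonzero numeric cell in a CSV.

    Row numbers start at 1. Cells that cannot be parsed as numbers
    (headers, labels, empty cells) are ignored.
    """
    hits = []
    with open(path, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            for cell in row:
                try:
                    value = float(cell)
                except ValueError:
                    continue  # not a number, so not an abundance value
                if value != 0:
                    hits.append((row_number, value))
    return hits


if __name__ == "__main__":
    # Path taken from the workflow above.
    print(nonzero_rows("tests/unittest_recovered_abundance.csv"))
```

An empty list corresponds to the all-zeros result of the `--w 0.01` run.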
