Partially Observable Cost-Aware Active-Learning with Large Language Models

Official code repository for the NeurIPS'24 paper Partially Observable Cost-Aware Active-Learning with Large Language Models.

Authors: Nicolás Astorga, Tennison Liu, Nabeel Seedat, Mihaela van der Schaar

Abstract

Conducting experiments and collecting data for machine learning models is a complex and expensive endeavor, particularly when confronted with limited information. Typically, extensive experiments to obtain features and labels come with a significant acquisition cost, making it impractical to carry out all of them. Therefore, it becomes crucial to strategically determine what to acquire to maximize the predictive performance while minimizing costs. To perform this task, existing data acquisition methods assume the availability of an initial dataset that is both fully observed and labeled, crucially overlooking the partial observability of features characteristic of many real-world scenarios. In response to this challenge, we present Partially Observable Cost-Aware Active-Learning (POCA), a new learning approach aimed at improving model generalization in data-scarce and data-costly scenarios through label and/or feature acquisition. Introducing $\mu$POCA as an instantiation, we maximize the uncertainty reduction in the predictive model when obtaining labels and features, considering associated costs. $\mu$POCA enhances traditional Active Learning metrics based solely on the observed features by generating the unobserved features through Generative Surrogate Models, particularly Large Language Models (LLMs). We empirically validate $\mu$POCA across diverse tabular datasets, varying data availability, and acquisition costs.


POCA Project

This repository provides code and instructions to run and reproduce the experiments described in the POCA paper. The experiments cover training the LLM, generating MC samples, and training the downstream models.


1. Setup

Step 1: Set up the Conda environment

For GSM training and sampling we use Unsloth with the following configuration. (At the time the results were produced, torch 2.3 was installed; torch 2.5 is now installed automatically.)

conda create --name poca python=3.10 pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers -y
conda activate poca

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

Clone the repository and activate the Conda environment:

git clone https://github.com/jumpynitro/POCA.git
conda activate poca

Step 2: Install dependencies

Install the required packages:

pip install -r requirements.txt
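
As an optional sanity check (not part of the repository's own instructions, and assuming a CUDA-capable machine), you can verify that the core dependencies import and that a GPU is visible:

python -c "import torch, unsloth, trl, peft, bitsandbytes; print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"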

2. Running the Custom Acquisition Process

The acquisition process has three components:

  1. Train the LLM
  2. Generate samples with the LLM (MC samples)
  3. Train the downstream model (RF or NN) and perform acquisition
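
Conceptually, the acquisition score combines a BALD/EIG-style uncertainty-reduction term with the acquisition cost, where the unobserved features of each candidate are filled in by MC samples from the generative surrogate model. The sketch below is only illustrative and is not the repository's implementation; the function name, array shapes, and cost handling are assumptions:

import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy along the last (class) axis.
    return -np.sum(p * np.log(p + eps), axis=-1)

def po_eig_score(member_probas, cost=1.0):
    # member_probas: (n_members, n_mc, n_classes) class probabilities from each
    # ensemble member (e.g. the trees of a random forest), evaluated on n_mc
    # GSM-imputed completions of one partially observed candidate point.
    mean_proba = member_probas.mean(axis=0)             # marginal predictive, (n_mc, n_classes)
    total_unc = entropy(mean_proba)                     # H[ E_theta p(y | x) ]
    expected_unc = entropy(member_probas).mean(axis=0)  # E_theta H[ p(y | x) ]
    mutual_info = total_unc - expected_unc              # BALD score per imputed completion
    return mutual_info.mean() / cost                    # average over MC samples, per unit cost

# Toy usage: 10 ensemble members, 8 MC completions, 2 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 8, 2))
probas = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(po_eig_score(probas, cost=2.0))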

Example Experiment

For instance, to train the Mistral model on the uci/magic dataset with seed 1, generate 8 MC samples, and perform acquisition using a Random Forest (RF) on historical data with the PO-EIG, EIG, Random, and EIG (full feature observation) metrics across 60 seeds, follow these steps:

Step 1: Train the LLM

Run the following to train the LLM:

python unsloth_train_llm.py \
    data="uci/magic" \
    llm_cfg.pool_set_type='Hist' \
    llm_cfg.llm_name='mistral3-unsloth' \
    llm_cfg.llm_dir="LLM_dir" \
    llm_cfg.add_name='-seed1-pdict-true' \
    llm_cfg.llm_seed=1

Step 2: Generate MC Samples

Run the following to generate MC samples:

python unsloth_eval_llm.py \
    data="uci/magic" \
    llm_cfg.pool_set_type='Hist' \
    llm_cfg.llm_name='mistral3-unsloth' \
    llm_cfg.batch_size_generation=100 \
    llm_cfg.mc_samples=8 \
    llm_cfg.add_name='-seed1-pdict-true' \
    llm_cfg.llm_dir="LLM_Hist"

Step 3: Train the RF Model and Run the Acquisition Process

  1. Acquisition with GSMs: Use the following command to train the RF model and execute the acquisition:

    python main.py --multirun 'rng.seed=range(60)' \
        data="uci/magic" \
        results_main_dir="rf_results_Hist" \
        acquisition.objective=bald-po,bald,random,full-bald \
        model=random_forest \
        trainer=random_forest \
        llm_cfg.llm_name='mistral3-unsloth' \
        llm_cfg.pool_set_type='Hist' \
        llm_cfg.add_name='-seed1-pdict-true' \
        llm_cfg.llm_dir="LLM_Hist"

    For this specific case, the results can be plotted with the following script:

    python generate_plots/create_plot_general.py --experiment_type specific
    
  2. Cost-based acquisition: To run acquisition where only a subset of features is acquired, run:

    python main.py --multirun 'rng.seed=range(60)' \
        data="uci/magic" \
        results_main_dir="rf_results_Hist" \
        acquisition.objective=bald-po-feature-0.2,bald-po-feature-0.6 \
        model=random_forest \
        trainer=random_forest \
        llm_cfg.llm_name='mistral3-unsloth' \
        llm_cfg.pool_set_type='Hist' \
        llm_cfg.add_name='-seed1-pdict-true' \
        llm_cfg.llm_dir="LLM_dir"

    and to visualize the results:

    python generate_plots/create_plot_cost.py
    

3. Reproducing Paper Results

You can reproduce most of the paper's results by following these steps:

  1. Set the path for results storage:
    Specify the path where LLMs, generated samples, and results should be saved:

    export path_mnt=...
  2. Generate samples from LLMs (GSMs):

    • GSM trained on historical data:
      bash jobs/samples_GSM_all_llms.sh
    • GSM trained on pool data:
      bash jobs/samples_GSM_mistral_pool.sh
  3. Run models using RF and acquisition metrics: PO-EIG, EIG, Random, and EIG (all features):

    1. RF training and acquisition on historical data:
    bash jobs/run_rf_eig.sh
    2. RF training and acquisition on pool data:
    bash jobs/run_rf_eig_pool.sh
  4. Run additional acquisition metrics: EPIG, MeanSTD, Marginal Entropy:

    bash jobs/run_rf_other_models.sh
  5. Run EIG with a subset of features: Experiment with acquiring only 20% or 60% of features on the magic dataset:

    bash jobs/run_rf_cost.sh
  6. Plot the results:

    • Results for EIG (step 3.1) and EPIG (step 4):
      python generate_plots/create_plot_general.py --experiment_type vary_model_main --output_file model_main.pdf
    • Results varying the LLMs:
      python generate_plots/create_plot_general.py --experiment_type vary_llm --output_file llm_main.pdf


Citation

If our paper or code helped you in your own research, please cite our work as:

@inproceedings{
    astorga2024poca,
    title={Partially Observable Cost-Aware Active-Learning with Large Language Models},
    author={Nicol{\'a}s Astorga and Tennison Liu and Nabeel Seedat and Mihaela van der Schaar},
    booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems},
    year={2024}
}
