Welcome to the PromptEval GitHub repository! Here you will find more information about our implementation of PromptEval and the datasets introduced in our paper "Efficient multi-prompt evaluation of LLMs" (arXiv:2405.17202).
Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs’ abilities and can affect the reproducibility of results on leaderboards. This repository introduces our implementation of PromptEval, a method for estimating performance across a large set of prompts by borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets.
Please check our demo to see how to use PromptEval on your own data.
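To make the idea of borrowing strength across prompts and examples concrete, here is a minimal sketch on simulated data. It is not the API of this package: the function names (`fit_rasch`, `estimate_prompt_scores`), the plain Rasch-style logistic model, the 5% evaluation budget, and all hyperparameters are assumptions chosen for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_rasch(Y, seen, n_steps=500, lr=0.5, l2=1e-2):
    """Fit P(Y[i, j] = 1) = sigmoid(theta[i] - b[j]) using observed cells only.

    Y    : (n_prompts, n_examples) array of 0/1 correctness values
    seen : boolean mask of the cells that were actually evaluated
    """
    n_prompts, n_examples = Y.shape
    theta = np.zeros(n_prompts)   # one effect per prompt
    b = np.zeros(n_examples)      # one difficulty per example
    row_counts = np.maximum(seen.sum(axis=1), 1)
    col_counts = np.maximum(seen.sum(axis=0), 1)
    for _ in range(n_steps):
        # Residuals on observed cells; unobserved cells contribute zero.
        resid = seen * (Y - sigmoid(theta[:, None] - b[None, :]))
        # Averaged log-likelihood gradient step with a small shrinkage term.
        theta += lr * (resid.sum(axis=1) / row_counts - l2 * theta)
        b += lr * (-resid.sum(axis=0) / col_counts - l2 * b)
    return theta, b


def estimate_prompt_scores(Y, seen, theta, b):
    """Per-prompt accuracy: observed outcomes where available, model predictions elsewhere."""
    pred = sigmoid(theta[:, None] - b[None, :])
    return np.where(seen, Y, pred).mean(axis=1)


# Simulated benchmark (hypothetical sizes): 100 prompt templates, 200 examples.
rng = np.random.default_rng(0)
n_prompts, n_examples = 100, 200
true_theta = rng.normal(0.5, 1.0, n_prompts)
true_b = rng.normal(0.0, 1.0, n_examples)
probs = sigmoid(true_theta[:, None] - true_b[None, :])
Y_full = (rng.random((n_prompts, n_examples)) < probs).astype(float)

# Evaluate only about 5% of all (prompt, example) pairs.
seen = rng.random((n_prompts, n_examples)) < 0.05

theta, b = fit_rasch(Y_full, seen)
est = estimate_prompt_scores(Y_full, seen, theta, b)
truth = Y_full.mean(axis=1)

# Summaries of the estimated performance distribution across prompts.
quantiles = np.quantile(est, [0.05, 0.5, 0.95])
print("estimated 5%/50%/95% quantiles:", np.round(quantiles, 3))
print("mean absolute error per prompt:", round(float(np.abs(est - truth).mean()), 3))
```

Each prompt is scored on only a handful of examples, yet the shared example difficulties let every observation inform the estimates for all prompts. The actual method and its options are documented in the paper and the demo.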
- data/: Contains the evaluation data used in the experiments.
- prompteval/: Source code for the PromptEval method and utilities.
- notebooks/: Jupyter notebooks used to create plots for the PromptEval paper.
- results/: Results from the experiments conducted in the paper.
- mmlu_data/: Contains code for gathering evaluation data.
To use the code in this repository, clone the repo and install the required dependencies:
git clone https://github.com/felipemaiapolo/prompteval.git
cd prompteval
pip install -e .
PromptEval has also been implemented in PromptBench. Please check it here.
To learn how to combine PromptEval with lm-evaluation-harness, please follow these steps:
- Git clone our version of lm-evaluation-harness into the main directory of PromptEval; we have already submitted a PR to the main version of the package.
- Git checkout the examples-arg branch of the lm-evaluation-harness repository you have just cloned.
- Inside the lm-evaluation-harness main directory, run pip install -e .
- Please check our demo. We focus on MMLU; however, the ideas we present can be used for other benchmarks as well.
To reproduce the results in our paper, please follow these steps after cloning the repo and installing the dependencies:
- Download the BBH and LMentry data, produced by the authors of "State of What Art? A Call for Multi-Prompt LLM Evaluation", from here. Place the unzipped folder "raw open-source model responses with gold and auto validation values" inside the data directory;
- Process the data by running ./prompteval/create_data.py;
- Run the main experiments by running ./prompteval/dist_evaluation.py. Example: python ./prompteval/dist_evaluation.py --bench 'BBH' --random_seeds 5;
- Run best prompt identification by running ./prompteval/bai_evaluation.py. Example: python ./prompteval/bai_evaluation.py --bench 'BBH' --random_seeds 5;
- Create plots using the notebooks in the notebooks directory.
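If you prefer to launch the processing and experiment scripts from one place, the steps above can be sketched as a small driver. This is only a convenience sketch: `build_commands` and `run_all` are hypothetical helper names, the script paths and the `--bench` and `--random_seeds` flags are taken from the examples above, and `dry_run=True` only prints the commands without executing them.

```python
import shlex
import subprocess
import sys


def build_commands(bench="BBH", random_seeds=5):
    """Return the commands for the processing and experiment steps, in order."""
    flags = ["--bench", bench, "--random_seeds", str(random_seeds)]
    return [
        [sys.executable, "./prompteval/create_data.py"],
        [sys.executable, "./prompteval/dist_evaluation.py", *flags],
        [sys.executable, "./prompteval/bai_evaluation.py", *flags],
    ]


def run_all(bench="BBH", random_seeds=5, dry_run=True):
    """Print each command and, unless dry_run is set, execute it from the repo root."""
    for cmd in build_commands(bench, random_seeds):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Call `run_all(dry_run=False)` from the repository root once the data folder is in place.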
To fine-tune BERT representations run the following:
python ./prompteval/ft_representations.py --model_name "bert-base-uncased" \
--lr 2e-05 \
--weight_decay 1e-06 \
--gamma .99995 \
--bs 96 \
--n_epochs 5 \
--warmup_steps 200 \
--bench "BBH"
Note that this requires the file ./data/Ys.pickle to contain correctness data for the respective benchmark, as created by the create_data.py script. Add --push_to_hub to automatically push the resulting model to your namespace on the Hugging Face Hub (remember to run huggingface-cli login before training).
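Before starting a long fine-tuning run, it can help to confirm that the correctness file is present and loadable. The following is a minimal sketch: `check_correctness_file` is a hypothetical helper, and because the internal layout of Ys.pickle is not described here, it only reports the top-level type (and the keys, if the object happens to be a dict).

```python
import pickle
from pathlib import Path


def check_correctness_file(path="./data/Ys.pickle"):
    """Load the pickle and return a short summary of its top-level structure."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; run ./prompteval/create_data.py first"
        )
    with path.open("rb") as f:
        obj = pickle.load(f)  # only unpickle files you created or trust
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        # Listing keys is an assumption-free way to peek at a dict payload.
        summary["keys"] = sorted(map(str, obj.keys()))
    return summary
```

Calling `check_correctness_file()` from the repository root either returns the summary or fails early with a clear message.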
To run the LLM-as-a-judge experiment, please follow these steps:
- Install AlpacaEval 2.0 using the command pip install alpaca-eval==0.6.4;
- Run python ./prompteval/generate_prompts.py to generate prompt variations. Having a GPU will accelerate this step because we use SentenceTransformers to encode texts;
- Move the directories ./prompteval/data/templates/AlpacaEval/configs and ./prompteval/data/templates/AlpacaEval/templates to your AlpacaEval evaluators_configs folder; for example, if you are using a Miniconda 3 (or Anaconda) environment, your folder should be in the directory miniconda3/envs/{ENV_NAME}/lib/python{PYTHON_VERSION}/site-packages/alpaca_eval;
- Open ./prompteval/evaluate.py and, at the top of the file, create an object called evaluators_configs_path and assign the path to the evaluators_configs directory to it; if you are using a Miniconda 3 (or Anaconda) environment, your evaluators_configs directory should be in the directory home/miniconda3/envs/{ENV_NAME}/lib/python{PYTHON_VERSION}/site-packages/alpaca_eval/evaluators_configs;
- Export your OpenAI API key following https://pypi.org/project/alpaca-eval/0.6.4/ and run ./prompteval/evaluate.py to conduct the evaluation step;
- Run the notebook ./notebooks/llm_judge_plots.ipynb to get the plots.
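Instead of typing the site-packages path by hand, you can ask Python where the installed package lives. The sketch below assumes AlpacaEval is installed in the active environment under the import name `alpaca_eval` and that its configs sit in an `evaluators_configs` subfolder, as described in the steps above; `package_dir` and `evaluators_configs_dir` are hypothetical helper names.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name):
    """Return the installation directory of an importable package."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(
            f"{package_name!r} is not installed in this environment"
        )
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent


def evaluators_configs_dir(package_name="alpaca_eval"):
    """Path of the evaluators_configs folder inside the installed package (assumed layout)."""
    return package_dir(package_name) / "evaluators_configs"
```

With AlpacaEval installed, `str(evaluators_configs_dir())` gives a value you could assign to evaluators_configs_path in ./prompteval/evaluate.py; whether that script expects a string or a Path object is an assumption to verify.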
We make the MMLU data we collected available on Hugging Face. The data include evaluations of 15 different SOTA LLMs under 100 different prompt templates.
@article{polo2024efficient,
title={Efficient multi-prompt evaluation of LLMs},
author={Polo, Felipe Maia and Xu, Ronald and Weber, Lucas and Silva, M{\'\i}rian and Bhardwaj, Onkar and Choshen, Leshem and de Oliveira, Allysson Flavio Melo and Sun, Yuekai and Yurochkin, Mikhail},
journal={arXiv preprint arXiv:2405.17202},
year={2024}
}