June 2024: A new and improved implementation of Semantic Uncertainty is available; this repo is deprecated
We're excited to share a new implementation of semantic uncertainty, which corresponds to our 2024 Nature paper Detecting Hallucinations in Large Language Models Using Semantic Entropy. Please use the new and improved version, as we are deprecating this repository. Thank you for your interest!
This repository contains the code for our 2023 ICLR paper Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation.
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
This repository contains the code used in Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation (arXiv)
run_pipeline.sh is a Slurm batch script that executes all steps of our pipeline. Running sbatch run_pipeline.sh submits the batch script.
parse_triviaqa.py and parse_coqa.py load TriviaQA and CoQA from HuggingFace, tokenize them, and store the data sets. These scripts only have to be run once.
You'll also have to set the paths where you would like to store intermediate and final results of the pipeline in config.py.
The environment.yml file lists the dependencies of the conda environment we used for our experiments.
The components of our pipeline are:
generate.py generates a number of answers for a subset of questions of a given data set. This step also evaluates the question-answering accuracy of the generated answers.
clean_generations.py post-processes the generations from the first step, mainly by removing any unwanted trailing text, e.g. in cases where the model first gives the answer to the given question and then generates an additional question.
get_semantic_similarities.py identifies semantic clusters in the generated set of answers from the previous step.
get_prompting_based_uncertainty.py computes the p(True) baseline.
compute_likelihoods.py computes the likelihoods of the generated answers under the generating model.
compute_confidence_measure.py computes a range of different confidence/uncertainty measures, such as the semantic entropy, predictive entropy, lexical similarity, and p(True).
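To make the last step more concrete, here is a minimal sketch of how an entropy over semantic clusters could be computed from the outputs of the earlier steps. It is not the code in compute_confidence_measure.py: the function name `semantic_entropy`, its arguments, and the choice to normalize likelihoods over the sampled answers are assumptions made for illustration, and the estimator used in the paper may differ in its details.

```python
import math
from collections import defaultdict


def semantic_entropy(log_likelihoods, cluster_ids):
    """Entropy over semantic clusters of sampled answers (illustrative sketch).

    log_likelihoods: sequence log-likelihood of each sampled answer
    cluster_ids:     semantic cluster label of each sampled answer
    Both names are hypothetical and not taken from the repository.
    """
    if len(log_likelihoods) != len(cluster_ids):
        raise ValueError("need one cluster id per sampled answer")
    if not log_likelihoods:
        raise ValueError("need at least one sampled answer")

    # Assumption: normalize probabilities over the sampled answers,
    # subtracting the maximum log-likelihood for numerical stability.
    max_ll = max(log_likelihoods)
    weights = [math.exp(ll - max_ll) for ll in log_likelihoods]
    total = sum(weights)

    # Answers that share a meaning pool their probability mass.
    cluster_prob = defaultdict(float)
    for weight, cluster in zip(weights, cluster_ids):
        cluster_prob[cluster] += weight / total

    # Shannon entropy (in nats) over the cluster distribution.
    return -sum(p * math.log(p) for p in cluster_prob.values() if p > 0)


# Two equally likely answers with different meanings: about 0.693 (ln 2).
print(semantic_entropy([-1.0, -1.0], ["a", "b"]))
```

Answers that are worded differently but fall into the same cluster do not add to the entropy, which is the property that separates this measure from a predictive entropy computed over individual sequences.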
After running the pipeline, use analyze_result.py to compute performance metrics, such as the AUROC.
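As an illustration of what the AUROC measures here, the sketch below scores how well an uncertainty value separates incorrect from correct answers. It is a standard-library sketch and not the code in analyze_result.py, which may rely on a library routine instead; the function name `auroc` and the label convention (1 for an incorrect answer, 0 for a correct one) are assumptions.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (illustrative sketch).

    scores: uncertainty value per question (higher means less confident)
    labels: 1 if the model's answer was incorrect, 0 if it was correct
    This label convention is an assumption made for the example.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")

    # Fraction of (incorrect, correct) pairs in which the incorrect answer
    # received the higher uncertainty; ties count as half a win.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four (incorrect, correct) pairs are ordered correctly.
print(auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```

The pairwise form is quadratic in the number of questions, which is fine for a sanity check but slower than a sort-based implementation on large evaluation sets.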
Most model runs should fit within 40GB of GPU memory. The exception is the set of experiments on OPT-30B, which we run on two 80GB A100s.
Our implementation uses PyTorch and HuggingFace. We use wandb to track our runs.