A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models

This repository contains the code to reproduce the results of the paper A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models by Stefan Hegselmann, Shannon Zejiang Shen, Florian Gierse, Monica Agrawal, David Sontag, and Xiaoyi Jiang.

We released 100 doctor-written summaries from the MIMIC-IV-Note Discharge Instructions and 100 LLM-generated patient summaries, both annotated for unsupported facts (hallucinations) by two medical experts, on PhysioNet. We also published all datasets created in our work so that our experiments can be fully reproduced.

If you find our work helpful or use our datasets, please consider citing our paper and the PhysioNet repository:

@InProceedings{pmlr-v248-hegselmann24a,
  title = 	{A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models},
  author =      {Hegselmann, Stefan and Shen, Zejiang and Gierse, Florian and Agrawal, Monica and Sontag, David and Jiang, Xiaoyi},
  booktitle = 	{Proceedings of the fifth Conference on Health, Inference, and Learning},
  pages = 	{339--379},
  year = 	{2024},
  volume = 	{248},
  series = 	{Proceedings of Machine Learning Research},
  month = 	{27--28 Jun},
  publisher =   {PMLR},
  url = 	{https://proceedings.mlr.press/v248/hegselmann24a.html},
}

@Misc{hegselmann_ann-pt-summ2024,
  title = 	{Medical Expert Annotations of Unsupported Facts in {Doctor}-{Written} and LLM-Generated Patient Summaries},
  author =      {Hegselmann, Stefan and Shen, Zejiang and Gierse, Florian and Agrawal, Monica and Sontag, David and Jiang, Xiaoyi},
  booktitle = 	{Proceedings of the fifth Conference on Health, Inference, and Learning},
  year = 	{2024},
  publisher =   {PhysioNet},
  url = 	{https://physionet.org/content/ann-pt-summ/1.0.0/},
  doi = 	{https://doi.org/10.13026/a66y-aa53},
}

Overview

Here you will find the general procedures to set up the environment, download the data, and run the code. More detailed instructions for each component of the project can be found in the respective folders.

gpt-4: All code related to the GPT-4 experiments.
hallucination_detection: All code related to the hallucination detection experiments without GPT-4.
labeling: Scripts to analyse and work with labeling data created with MedTator.
notebooks: Jupyter notebooks for different experiments, helper tasks, and analyses.
preprocess: Preprocessing pipeline as presented in the paper.
scripts: Scripts for parameter tuning of LED and Llama 2 models.
summarization: All code related to the summarization experiments with LED and Llama 2 models.

Setting Correct Paths

We assume the root path to be /root in this readme and for the code. Hence, we assume the repository is cloned to /root/patient_summaries_with_LLMs. Please adapt the paths according to your local setup.
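If you keep the repository somewhere other than /root, one way to adapt paths programmatically is a small helper like the sketch below. This is only an illustration: the function name adapt_path and the constant ASSUMED_ROOT are hypothetical and not part of the repository, and the code simply swaps the /root prefix for your own root directory.

```python
from pathlib import PurePosixPath

# The root path assumed throughout this readme and the code.
ASSUMED_ROOT = PurePosixPath("/root")


def adapt_path(path: str, local_root: str) -> str:
    """Replace the assumed /root prefix of `path` with `local_root`.

    Hypothetical helper for illustration; paths that do not live
    under /root are returned unchanged.
    """
    try:
        relative = PurePosixPath(path).relative_to(ASSUMED_ROOT)
    except ValueError:
        # The path is not located under /root, so leave it as is.
        return path
    return str(PurePosixPath(local_root) / relative)


if __name__ == "__main__":
    print(adapt_path("/root/patient_summaries_with_LLMs", "/home/me"))
```

Paths outside /root are returned unchanged, so the helper can be applied to every path in a configuration without special-casing.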

Preparing the Environment

We used conda to create the necessary virtual environments. For the ps_llms environment, we used Python 3.9.18:

conda create -n ps_llms python==3.9.18
conda activate ps_llms

Next, install the necessary requirements. When installing torch, you might need to adapt the command in the first line based on this suggestion.

pip install torch torchvision torchaudio
pip install transformers bitsandbytes sentencepiece accelerate datasets peft trl py7zr scipy wandb evaluate rouge-score sacremoses sacrebleu seqeval bert_score swifter bioc medcat plotly nervaluate nbformat kaleido
pip install -U spacy
python -m spacy download en_core_web_sm
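After running the commands above, a quick sanity check can confirm that the packages are visible inside the ps_llms environment. The following is a minimal sketch using the standard library's importlib.metadata; the helper name missing_packages is hypothetical, and the package list is copied from the pip commands above (shortened here for brevity, extend it as needed).

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the distributions installed by the pip commands above.
REQUIRED = [
    "torch", "transformers", "bitsandbytes", "sentencepiece", "accelerate",
    "datasets", "peft", "trl", "evaluate", "rouge-score", "bert_score",
    "medcat", "spacy",
]


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)  # raises PackageNotFoundError if not installed
        except PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
```

Run the script with the ps_llms environment activated. It only checks that each distribution is installed, not that its version matches the one used for the paper.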
