This repository contains source code for the paper Hospital Discharge Summarization Data Provenance. If you only want to use the annotations, see the Inclusion in Your Projects section. The unsupervised automated method code is also located in a separate repository.
This repository contains the annotations used in the paper and classes for creating discharge summary data provenance annotations. This project attempts to show where physicians copy, paste, and/or summarize content from previous medical records when writing discharge summaries.
The project also used an automated method for note matching and an automated method for note segmentation.
The purpose of this repository is to reproduce the results in the paper. If you want to use the annotations and/or use the pretrained model, please refer to the zensols.dsprov repository. This repository also provides a Docker image. If you use our annotations and/or code, please cite our paper.
The source annotation files are necessary to reproduce our results. Those can be obtained by requesting them from the authors.
Important: you must provide proof that you have MIMIC-III access in your email request for the source annotations.
Dependencies:
- A macOS machine.
- Microsoft Word (used to annotate spans across notes).
- GNU make. The default version that comes with macOS should be sufficient. However, brew might be necessary to install the GNU version of some system tools.
Steps to reproduce:
- Clone this repository and change into it:
git clone https://github.com/uic-nlp-lab/dsprov && cd dsprov
- Optionally create a virtual environment:
python -m venv <Python install dir>
- Install Python dependencies:
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True make deps
- Copy the source annotations compressed file to the current directory.
- Install the source annotation files in the corpus/completed directory:
unzip dsprov-source-annotations.zip
- Load MIMIC-III by following the Postgres instructions. Also see the zensols.mimic instructions for an SQLite alternative.
- Edit etc/db.conf using the parameters of the installed database from the previous step.
- Tell programs where to find the database configuration (assuming Bash):
export MIMICSIDRC=./etc/db.conf
- Create the corpus and matching statistics (this also confirms everything is installed and working):
./harness.py excel -o match.xlsx
- Check for errors and confirm the data in the generated file is sound:
open match.xlsx
- Run the hyperparameter optimization:
./src/bin/opthyper.py opt -e 500
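Before running the harness and optimization steps above, it can help to confirm that the earlier setup steps took effect. The following is a minimal sketch and not part of this repository: the function name check_setup is hypothetical, and it assumes only what the steps state, namely that MIMICSIDRC points to the database configuration file and that the source annotations were unpacked under corpus/completed.

```python
import os
from pathlib import Path


def check_setup(env=None, root="."):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: it only checks the paths named in the steps above.
    """
    env = os.environ if env is None else env
    root = Path(root)
    problems = []
    # The export step sets MIMICSIDRC to the database configuration path.
    rc = env.get("MIMICSIDRC")
    if not rc:
        problems.append("MIMICSIDRC is not set")
    elif not (root / rc).is_file():
        problems.append(f"database config not found: {rc}")
    # The source annotations are expected under corpus/completed.
    completed = root / "corpus" / "completed"
    if not completed.is_dir() or not any(completed.iterdir()):
        problems.append("corpus/completed is missing or empty")
    return problems


if __name__ == "__main__":
    # Run from the repository root after the export step.
    for problem in check_setup():
        print("problem:", problem)
```

Run from the repository root, the script prints nothing when both checks pass. It does not verify the database connection itself or the contents of the annotation files.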
If you use this project in your research, please use the following BibTeX entry:
@inproceedings{landesHospitalDischargeSummarization2023,
title = {Hospital {{Discharge Summarization Data Provenance}}},
booktitle = {The 22nd {{Workshop}} on {{Biomedical Natural Language Processing}} and {{BioNLP Shared Tasks}}},
author = {Landes, Paul and Chaise, Aaron and Patel, Kunal and Huang, Sean and Di Eugenio, Barbara},
date = {2023-07},
pages = {439--448},
publisher = {{Association for Computational Linguistics}},
location = {{Toronto, Canada}},
url = {https://aclanthology.org/2023.bionlp-1.41},
urldate = {2023-07-10},
eventtitle = {{{BioNLP}} 2023}
}
Also please cite the Zensols Framework:
@article{Landes_DiEugenio_Caragea_2021,
title={DeepZensols: Deep Natural Language Processing Framework},
url={http://arxiv.org/abs/2109.03383},
note={arXiv: 2109.03383},
journal={arXiv:2109.03383 [cs]},
author={Landes, Paul and Di Eugenio, Barbara and Caragea, Cornelia},
year={2021},
month={Sep}
}
Copyright (c) 2023 Paul Landes