This repository contains a baseline model for identifying sections in clinical text, from the paper A New Public Corpus for Clinical Section Identification: MedSecId. Its purpose is to provide a means to reproduce the results in the paper. If you want to include this work in your own projects, use the mimicsid package described in the Inclusion in Your Projects section, which was designed as an off-the-shelf package that can be installed with pip.
Python 3.9.9 was used with the requirements in src/requirements.txt and src/requirements-mednlp.txt.
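Before building the environment, it can help to confirm which interpreter is in use. The following is a minimal sketch and not part of the repository: the name `matches_tested_version` is hypothetical, and comparing only the major and minor version (3.9) is an assumption, since the text only states that 3.9.9 was used.

```python
import sys
from typing import Tuple

# Version reported above; only major.minor is compared (an assumption,
# the text does not say whether other 3.9.x releases behave the same).
TESTED_VERSION = (3, 9)


def matches_tested_version(
    version: Tuple[int, ...] = tuple(sys.version_info[:3]),
) -> bool:
    """Return True if the given version has the tested major.minor."""
    return tuple(version[:2]) == TESTED_VERSION


# Warn rather than fail: a different version may still work.
if not matches_tested_version():
    print(f"Warning: running Python {sys.version.split()[0]}, tested with 3.9.9")
```

The check only warns, because nothing in the text says that other versions fail.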
To train and test the models, use the run.sh script as follows:
- Copy the MIMIC-III NOTEEVENTS.csv file to the corpus directory.
- Download the annotation set and uncompress it:
pushd corpus
wget https://zenodo.org/record/7150451/files/section-id-annotations.zip
unzip section-id-annotations.zip
popd
- Remove the repo results:
rm -r results
- Create the Python environment (in pyvirenv):
./run.sh pyenv
- Install all Python libraries and models:
./run.sh pydep
- Create the features as mini-batches (takes a while):
./run.sh batch
- Test and train the models (takes a while):
./run.sh traintest
- Create the metrics used in the paper:
./run paperresults
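The run.sh steps above can also be chained from a small Python driver. This is a minimal sketch, not something the repository provides: the names `build_commands` and `run_pipeline` are hypothetical, and it assumes it is started from the repository root after the corpus file and annotations are in place. It covers only the four actions the list invokes through run.sh (pyenv, pydep, batch, traintest) and leaves the final metrics step to be run by hand.

```python
import subprocess
from typing import List, Sequence

# Actions taken from the list above, in the order they are given.
RUN_SH_ACTIONS = ("pyenv", "pydep", "batch", "traintest")


def build_commands(actions: Sequence[str] = RUN_SH_ACTIONS) -> List[List[str]]:
    """Return one argv list per run.sh action."""
    return [["./run.sh", action] for action in actions]


def run_pipeline(
    actions: Sequence[str] = RUN_SH_ACTIONS, dry_run: bool = True
) -> None:
    """Run each action in order; with dry_run, only print the commands."""
    for cmd in build_commands(actions):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status,
            # so a failed step stops the later, long-running ones.
            subprocess.run(cmd, check=True)


# Dry run by default: prints the commands without executing them.
run_pipeline()
```

Calling `run_pipeline(dry_run=False)` would execute the steps. Stopping at the first failure is a design choice here, since the batch and traintest steps take a while and depend on the earlier ones.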
At the end of this, there should be a results directory with:
- results/stats: the corpus statistics
- results/perf: the summary of the results and labels of the best model
- results/model: the models and model-specific results
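A quick way to confirm that the run produced the expected layout is to check for the three subdirectories listed above. The following is a minimal sketch under that assumption only: the function name `missing_result_dirs` is hypothetical, and it checks that the directories exist, not what they contain.

```python
import tempfile
from pathlib import Path
from typing import List

# Subdirectories the text above says should exist after the run.
EXPECTED_SUBDIRS = ("stats", "perf", "model")


def missing_result_dirs(results_dir: Path) -> List[str]:
    """Return the expected subdirectories that are absent under results_dir."""
    return [
        name for name in EXPECTED_SUBDIRS if not (results_dir / name).is_dir()
    ]


# Demonstration on a throwaway directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    (results / "stats").mkdir(parents=True)
    (results / "perf").mkdir()
    # Only "model" was not created, so this prints ['model'].
    print(missing_result_dirs(results))
```

In the repository itself, the call would be `missing_result_dirs(Path("results"))`, and an empty list means all three directories are present.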
The purpose of this repository is to reproduce the results in the paper. If you want to use the annotations or the pretrained model, please refer to the mimicsid repository.
The medical concept (CUI) plot given in the paper, along with others, is available as an interactive 3D plot here.
If you use this project in your research, please use the following BibTeX entry:
@inproceedings{landes-etal-2022-new,
title = "A New Public Corpus for Clinical Section Identification: {M}ed{S}ec{I}d",
author = "Landes, Paul and
Patel, Kunal and
Huang, Sean S. and
Webb, Adam and
Di Eugenio, Barbara and
Caragea, Cornelia",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.326",
pages = "3709--3721"
}
Also please cite the Zensols Framework:
@article{Landes_DiEugenio_Caragea_2021,
title={DeepZensols: Deep Natural Language Processing Framework},
url={http://arxiv.org/abs/2109.03383},
note={arXiv: 2109.03383},
journal={arXiv:2109.03383 [cs]},
author={Landes, Paul and Di Eugenio, Barbara and Caragea, Cornelia},
year={2021},
month={Sep}
}
Copyright (c) 2022 Paul Landes