This repository contains the steps to generate the MIMIC-IV-ICD-10-N3 dataset used in the paper "KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing". To cite the original article:
@article{Lui_Xiang_Krishnaswamy_2024,
title={KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing},
volume={38},
url={https://ojs.aaai.org/index.php/AAAI/article/view/30314},
DOI={10.1609/aaai.v38i21.30314},
number={21},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Lui, Sheng Jie and Xiang, Cheng and Krishnaswamy, Shonali},
year={2024},
month={Mar.},
pages={22797-22805}
}
The MIMIC-IV files can be obtained from PhysioNet. You can download them to the directory mimicdata/physionet.org.
The original script used to generate the MIMIC-IV-ICD-10-N3 dataset uses Python 3.11. We set all random seeds to 2023.
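
As a minimal sketch of the seeding described above: the README does not say which libraries the original script seeds, so the helper below assumes Python's `random` module and NumPy. The name `set_seeds` is hypothetical.

```python
import random

import numpy as np

SEED = 2023  # seed value stated in this README


def set_seeds(seed: int = SEED) -> None:
    """Seed the random number generators assumed to be in use."""
    random.seed(seed)      # Python's built-in generator
    np.random.seed(seed)   # NumPy's global generator
```

Calling `set_seeds()` once at the start of a script makes subsequent draws from these two generators repeatable.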
1. Download the MIMIC-IV dataset.
2. Load `diagnoses_icd.csv.gz` as a pandas DataFrame:
   - Apply the filter `icd_version==10`.
   - Perform a groupby on `subject_id` and `hadm_id`, then aggregate the `icd_code` values into lists.
3. Load `discharge.csv.gz` as a pandas DataFrame.
4. Merge (inner join) the grouped dataframe from step 2 with the discharge summaries from step 3.
5. Generate negative samples:
   - Randomly select one-third of the rows from the dataframe generated in step 4.
   - For each selected row, generate dummy ICD codes that do not belong to the same chapter as the original ICD codes.
6. Concatenate the dataframes generated in steps 4 and 5 to obtain the complete dataset.
7. Obtain the train and test datasets by performing a random split where `test_size=0.3`.
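
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script. The function names, the `label` column, and `CHAPTER_RANGES` are hypothetical. The merge keys (`subject_id`, `hadm_id`) are assumed. Dummy codes are assumed to be drawn from the pool of codes observed in the data. The chapter ranges are an approximation of the ICD-10-CM chapters and should be checked against the official tables. The parameter `test_size=0.3` suggests scikit-learn's `train_test_split`, but the sketch uses pandas sampling instead to stay dependency-light.

```python
import random

import pandas as pd

SEED = 2023  # seed value stated in this README

# Assumption: approximate ICD-10-CM chapter ranges, keyed on the first
# three characters of a code. Verify against the official tables.
CHAPTER_RANGES = [
    ("A00", "B99"), ("C00", "D49"), ("D50", "D89"), ("E00", "E89"),
    ("F01", "F99"), ("G00", "G99"), ("H00", "H59"), ("H60", "H95"),
    ("I00", "I99"), ("J00", "J99"), ("K00", "K95"), ("L00", "L99"),
    ("M00", "M99"), ("N00", "N99"), ("O00", "O9A"), ("P00", "P96"),
    ("Q00", "Q99"), ("R00", "R99"), ("S00", "T88"), ("U00", "U85"),
    ("V00", "Y99"), ("Z00", "Z99"),
]


def chapter_of(code: str) -> int:
    """Return the index of the chapter range containing `code`, or -1."""
    prefix = code[:3].upper()
    for i, (low, high) in enumerate(CHAPTER_RANGES):
        if low <= prefix <= high:
            return i
    return -1


def group_icd10(diagnoses: pd.DataFrame) -> pd.DataFrame:
    """Step 2: keep ICD-10 rows and collect the codes of each admission."""
    icd10 = diagnoses[diagnoses["icd_version"] == 10]
    return (
        icd10.groupby(["subject_id", "hadm_id"])["icd_code"]
        .agg(list)
        .reset_index()
    )


def make_negatives(merged: pd.DataFrame, rng: random.Random) -> pd.DataFrame:
    """Step 5: replace the codes of one-third of the rows with dummy codes."""
    # Assumption: dummy codes come from the codes observed in the data.
    pool = sorted({c for codes in merged["icd_code"] for c in codes})
    selected = merged.sample(frac=1 / 3, random_state=SEED).copy()

    def dummy(codes: list) -> list:
        used = {chapter_of(c) for c in codes}
        candidates = [c for c in pool if chapter_of(c) not in used]
        return rng.sample(candidates, min(len(codes), len(candidates)))

    selected["icd_code"] = selected["icd_code"].apply(dummy)
    selected["label"] = 0  # hypothetical marker for negative samples
    return selected


def build_dataset(diagnoses: pd.DataFrame, discharge: pd.DataFrame):
    """Steps 2 to 7: return a (train, test) pair of DataFrames."""
    rng = random.Random(SEED)
    grouped = group_icd10(diagnoses)
    # Step 4: inner join on the assumed keys.
    merged = grouped.merge(discharge, on=["subject_id", "hadm_id"], how="inner")
    merged["label"] = 1  # hypothetical marker for original samples
    negatives = make_negatives(merged, rng)
    # Step 6: stack the original and negative samples.
    full = pd.concat([merged, negatives], ignore_index=True)
    # Step 7: hold out 30% of the rows as the test set.
    test = full.sample(frac=0.3, random_state=SEED)
    train = full.drop(test.index)
    return train, test
```

To run this on the real files, the two inputs could be read with `pd.read_csv`, which infers gzip compression from the `.gz` extension. Passing `dtype={"icd_code": str}` keeps the codes as strings.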