From b28726ee7b3d135d0be601f4cb0ad12928b9412f Mon Sep 17 00:00:00 2001 From: kamilest Date: Fri, 30 Aug 2024 16:52:50 -0400 Subject: [PATCH 01/20] Example workflow draft --- README.md | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/README.md b/README.md index 5f6d098..0b8d881 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,45 @@ configuration files, training recipes, results, etc. for the MEDS-DEV benchmarki often come from other repositories, with suitable permalinks being present in the various configuration files or commit messages for associated contributions to this repository. +## Example workflow + +```bash +# Create and enter a MEDS project directory +mkdir +cd + +# Locate the MEDS data root directory and + +# Create a new python environment +conda create -n python=3.10 +conda activate + +# In , install MEDS-DEV files and dependencies +# TODO: this will be probably be replaced with `pip install MEDS-DEV` in the future +git clone https://github.com/mmcdermott/MEDS-DEV.git +pip install -e ./MEDS-DEV +# TODO: consider the other dependencies that have not been deployed yet and are not in MEDS-DEV dependencies yet, e.g.: +# git clone https://github.com/kamilest/meds-evaluation.git +# pip install -e ./meds-evaluation +# etc. 
+ +# Install any model-specific dependencies + +# TODO: locate and process task predicates in ./MEDS-DEV/tasks/, defining the unknown codes using predicates in +# ./MEDS-DEV/datasets/ + +aces-cli data.path='', data.standard='meds', cohort_dir=TODO, cohort_name=TODO + +# TODO Figure out how ACES processes the cohort and where is the output stored: + +# TODO Train model on , place the outputs in the MEDS prediction format in +# /predictions + +# Evaluate model +meds-evaluation-cli predictions_path='/predictions', \ + output_dir='/evaluation' +``` + ## Contributing to MEDS-DEV ### To Add a Model From 19ae1d9b4914b0a9e4c3132e4f930b35bc2436cd Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 3 Sep 2024 17:46:09 -0400 Subject: [PATCH 02/20] Update the instructions --- README.md | 91 ++++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 73 insertions(+), 18 deletions(-) diff --git a/README.md b/README.md index 993e0ee..0d036c8 100644 --- a/README.md +++ b/README.md @@ -10,35 +10,90 @@ or commit messages for associated contributions to this repository. ## Example workflow +### (Optional) Set up the MEDS project with environment + ```bash # Create and enter a MEDS project directory -mkdir -cd +mkdir $MY_MEDS_PROJECT_ROOT +cd $MY_MEDS_PROJECT_ROOT + +conda create -n $MY_MEDS_CONDA_ENV python=3.10 +conda activate $MY_MEDS_CONDA_ENV +``` + +Additionally install any model-related dependencies. -# Locate the MEDS data root directory and +### Install MEDS-DEV -# Create a new python environment -conda create -n python=3.10 -conda activate +Clone the MEDS-DEV GitHub repo and install it locally. 
+This will additionally install some MEDS data processing dependencies: -# In , install MEDS-DEV files and dependencies -# TODO: this will be probably be replaced with `pip install MEDS-DEV` in the future +```bash git clone https://github.com/mmcdermott/MEDS-DEV.git -pip install -e ./MEDS-DEV -# TODO: consider the other dependencies that have not been deployed yet and are not in MEDS-DEV dependencies yet, e.g.: -# git clone https://github.com/kamilest/meds-evaluation.git -# pip install -e ./meds-evaluation -# etc. +cd ./MEDS-DEV +pip install -e . +``` + +Install the MEDS evaluation package: +```bash +git clone https://github.com/kamilest/meds-evaluation.git +pip install -e ./meds-evaluation +``` + +Additionally, make sure any model-related dependencies are installed. + +### Extract a task from the MEDS dataset + +This step prepares the MEDS dataset for a task by extracting a cohort using inclusion/exclusion criteria and +processing the data to create the label files. + +### Find the task configuration file + +Task-related information is stored in Hydra configuration files (in `.yaml` format) under +`MEDS-DEV/src/MEDS_DEV/tasks/criteria`. -# Install any model-specific dependencies +Task names are defined in a way that corresponds to the path to their configuration, +starting from the `MEDS-DEV/src/MEDS_DEV/tasks/criteria` directory. +For example, +`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of +`mortality/in_icu/first_24h`. -# TODO: locate and process task predicates in ./MEDS-DEV/tasks/, defining the unknown codes using predicates in -# ./MEDS-DEV/datasets/ +**To add a task** -aces-cli data.path='', data.standard='meds', cohort_dir=TODO, cohort_name=TODO +If your task is not supported, you will need to add a directory and define an appropriate configuration file in +a corresponding location. 
-# TODO Figure out how ACES processes the cohort and where is the output stored: +### Dataset configuration file +Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific +way (e.g. `icu_admission` in `mortality/in_icu/first_24h`). + +These dataset-specific predicate definitions are found in +`MEDS-DEV/src/MEDS_DEV/datasets/$DATASET_NAME/predicates.yaml` Hydra configuration files. + +In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e. +`$MEDS_ROOT_DIR`). + +**To add a dataset configuration file** + +If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in +a corresponding location. + +### Run the MEDS task extraction helper + +From your project directory (`$MY_MEDS_PROJECT_ROOT`) where `MEDS-DEV` is located, run + +```bash +./MEDS-DEV/src/MEDS_DEV/helpers/extract_task.sh $MEDS_ROOT_DIR $DATASET_NAME $TASK_NAME +``` + +This will use information from task and dataset-specific predicate configs to extract cohorts and labels from +`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same +sharded structure [??? TODO check] as the `$MEDS_ROOT_DIR/data` directory. + +### TODO: train and evaluate the model + +``` # TODO Train model on , place the outputs in the MEDS prediction format in # /predictions From 8e9acab069457a29efdb073f62380295c2bae9a9 Mon Sep 17 00:00:00 2001 From: kamilest Date: Mon, 9 Sep 2024 17:44:54 -0400 Subject: [PATCH 03/20] Helper for generating random binary classification labels. 
--- README.md | 47 +++++++------- src/MEDS_DEV/configs/predictions.yaml | 6 ++ src/MEDS_DEV/helpers/generate_predictions.sh | 16 +++++ .../helpers/generate_random_predictions.py | 65 +++++++++++++++++++ 4 files changed, 112 insertions(+), 22 deletions(-) create mode 100644 src/MEDS_DEV/configs/predictions.yaml create mode 100755 src/MEDS_DEV/helpers/generate_predictions.sh create mode 100644 src/MEDS_DEV/helpers/generate_random_predictions.py diff --git a/README.md b/README.md index 0d036c8..c460214 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ or commit messages for associated contributions to this repository. ### (Optional) Set up the MEDS project with environment ```bash -# Create and enter a MEDS project directory +# Create and enter a MEDS project directory mkdir $MY_MEDS_PROJECT_ROOT cd $MY_MEDS_PROJECT_ROOT @@ -34,7 +34,8 @@ cd ./MEDS-DEV pip install -e . ``` -Install the MEDS evaluation package: +Install the MEDS evaluation package: + ```bash git clone https://github.com/kamilest/meds-evaluation.git pip install -e ./meds-evaluation @@ -44,39 +45,39 @@ Additionally, make sure any model-related dependencies are installed. ### Extract a task from the MEDS dataset -This step prepares the MEDS dataset for a task by extracting a cohort using inclusion/exclusion criteria and -processing the data to create the label files. +This step prepares the MEDS dataset for a task by extracting a cohort using inclusion/exclusion criteria and +processing the data to create the label files. ### Find the task configuration file -Task-related information is stored in Hydra configuration files (in `.yaml` format) under +Task-related information is stored in Hydra configuration files (in `.yaml` format) under `MEDS-DEV/src/MEDS_DEV/tasks/criteria`. 
-Task names are defined in a way that corresponds to the path to their configuration, +Task names are defined in a way that corresponds to the path to their configuration, starting from the `MEDS-DEV/src/MEDS_DEV/tasks/criteria` directory. -For example, -`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of +For example, +`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of `mortality/in_icu/first_24h`. **To add a task** -If your task is not supported, you will need to add a directory and define an appropriate configuration file in +If your task is not supported, you will need to add a directory and define an appropriate configuration file in a corresponding location. ### Dataset configuration file -Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific +Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific way (e.g. `icu_admission` in `mortality/in_icu/first_24h`). -These dataset-specific predicate definitions are found in +These dataset-specific predicate definitions are found in `MEDS-DEV/src/MEDS_DEV/datasets/$DATASET_NAME/predicates.yaml` Hydra configuration files. -In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e. +In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e. `$MEDS_ROOT_DIR`). **To add a dataset configuration file** -If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in +If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in a corresponding location. 
### Run the MEDS task extraction helper @@ -88,20 +89,22 @@ From your project directory (`$MY_MEDS_PROJECT_ROOT`) where `MEDS-DEV` is locate ``` This will use information from task and dataset-specific predicate configs to extract cohorts and labels from -`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same -sharded structure [??? TODO check] as the `$MEDS_ROOT_DIR/data` directory. +`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same +sharded structure as the `$MEDS_ROOT_DIR/data` directory. -### TODO: train and evaluate the model +### Train the model -``` -# TODO Train model on , place the outputs in the MEDS prediction format in -# /predictions +This step depends on the API of your particular model. + +For example, the command below will call a helper script that will generate random outputs for binary classification, +conforming to MEDS binary classification prediction schema: -# Evaluate model -meds-evaluation-cli predictions_path='/predictions', \ - output_dir='/evaluation' +```bash +./MEDS-DEV/src/MEDS_DEV/helpers/generate_predictions.sh $MEDS_ROOT_DIR $TASK_NAME ``` +### TODO evaluate the model + ## Contributing to MEDS-DEV ### To Add a Model diff --git a/src/MEDS_DEV/configs/predictions.yaml b/src/MEDS_DEV/configs/predictions.yaml new file mode 100644 index 0000000..736ae72 --- /dev/null +++ b/src/MEDS_DEV/configs/predictions.yaml @@ -0,0 +1,6 @@ +defaults: + - _ACES_MD + - _self_ + - override hydra/hydra_logging: disabled + +cohort_predictions_dir: "${oc.env:MEDS_ROOT_DIR}/task_predictions" diff --git a/src/MEDS_DEV/helpers/generate_predictions.sh b/src/MEDS_DEV/helpers/generate_predictions.sh new file mode 100755 index 0000000..606d0f4 --- /dev/null +++ b/src/MEDS_DEV/helpers/generate_predictions.sh @@ -0,0 +1,16 @@ +#!/bin/bash + +export MEDS_ROOT_DIR=$1 +export MEDS_DATASET_NAME=$2 +export MEDS_TASK_NAME=$3 + +shift 3 + 
+MEDS_DEV_REPO_DIR=$(python -c "from importlib.resources import files; print(files(\"MEDS_DEV\"))") +export MEDS_DEV_REPO_DIR + +# TODO improve efficiency of prediction generator by using this +# SHARDS=$(expand_shards "$MEDS_ROOT_DIR"/data) + +python -m MEDS_DEV.helpers.generate_random_predictions --config-path="$MEDS_DEV_REPO_DIR"/configs \ +--config-name="predictions" "hydra.searchpath=[pkg://aces.configs]" "$@" diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py new file mode 100644 index 0000000..732bc34 --- /dev/null +++ b/src/MEDS_DEV/helpers/generate_random_predictions.py @@ -0,0 +1,65 @@ +import os +from importlib.resources import files +from pathlib import Path + +import hydra +import numpy as np +import polars as pl +from omegaconf import DictConfig + +SUBJECT_ID = "subject_id" + +BOOLEAN_VALUE_COLUMN = "boolean_value" +PREDICTED_BOOLEAN_VALUE_COLUMN = "predicted_boolean_value" +PREDICTED_BOOLEAN_PROBABILITY_COLUMN = "predicted_boolean_probability" + +CONFIG = files("MEDS_DEV").joinpath("configs/predictions.yaml") + + +@hydra.main(version_base=None, config_path=str(CONFIG.parent.resolve()), config_name=CONFIG.stem) +def generate_random_predictions(cfg: DictConfig) -> None: + cohort_dir = cfg.cohort_dir # cohort_dir: "${oc.env:MEDS_ROOT_DIR}/task_labels" + cohort_name = cfg.cohort_name # cohort_name: ${task_name}; task_name: ${oc.env:MEDS_TASK_NAME} + + cohort_dir = Path(cohort_dir) / cohort_name + cohort_predictions_dir = ( + cfg.cohort_predictions_dir + ) # cohort_predictions_dir: "${oc.env:MEDS_ROOT_DIR}/task_predictions" + + # TODO: use expand_shards helper from the script to access sharded dataframes directly + for split in cohort_dir.iterdir(): + if split.is_dir() and split.name in {"train", "tuning", "held_out"}: # train | tuning | held_out + for file in split.iterdir(): + if file.is_file(): + dataframe = pl.read_parquet(file) + predictions = _generate_random_predictions(dataframe) 
# sharded dataframes + + # $MEDS_ROOT_DIR/task_predictions/$TASK_NAME//.parquet + predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name + os.makedirs(predictions_path, exist_ok=True) + + predictions.write_parquet(predictions_path / file.name) + elif split.is_file(): + dataframe = pl.read_parquet(split) + predictions = _generate_random_predictions(dataframe) + + predictions_path = Path(cohort_predictions_dir) / cohort_name + os.makedirs(predictions_path, exist_ok=True) + + predictions.write_parquet(predictions_path / split.name) + + +def _generate_random_predictions(dataframe: pl.DataFrame) -> pl.DataFrame: + """Creates a new dataframe with the same subject_id and boolean_value columns as in the input dataframe, + along with predictions.""" + + output = dataframe.select([SUBJECT_ID, BOOLEAN_VALUE_COLUMN]) + probabilities = np.random.uniform(0, 1, len(dataframe)) + output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round())) + output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities)) + + return output + + +if __name__ == "__main__": + generate_random_predictions() From 3afe7bf6cbf1bdbf2a84a10484cee5fb5a8fb85c Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 10 Sep 2024 17:18:47 -0400 Subject: [PATCH 04/20] Add prediction time --- src/MEDS_DEV/helpers/generate_random_predictions.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py index 732bc34..1c5a01a 100644 --- a/src/MEDS_DEV/helpers/generate_random_predictions.py +++ b/src/MEDS_DEV/helpers/generate_random_predictions.py @@ -8,6 +8,7 @@ from omegaconf import DictConfig SUBJECT_ID = "subject_id" +PREDICTION_TIME = "prediction_time" BOOLEAN_VALUE_COLUMN = "boolean_value" PREDICTED_BOOLEAN_VALUE_COLUMN = "predicted_boolean_value" @@ -53,7 +54,7 @@ def _generate_random_predictions(dataframe: pl.DataFrame) -> 
pl.DataFrame: """Creates a new dataframe with the same subject_id and boolean_value columns as in the input dataframe, along with predictions.""" - output = dataframe.select([SUBJECT_ID, BOOLEAN_VALUE_COLUMN]) + output = dataframe.select([SUBJECT_ID, PREDICTION_TIME, BOOLEAN_VALUE_COLUMN]) probabilities = np.random.uniform(0, 1, len(dataframe)) output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round())) output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities)) From 3811f40783906342ad4b3d78390f7dcad8a4a25f Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 10 Sep 2024 17:18:58 -0400 Subject: [PATCH 05/20] Fix prediction positions and types --- src/MEDS_DEV/helpers/generate_random_predictions.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py index 1c5a01a..a07a26f 100644 --- a/src/MEDS_DEV/helpers/generate_random_predictions.py +++ b/src/MEDS_DEV/helpers/generate_random_predictions.py @@ -56,8 +56,10 @@ def _generate_random_predictions(dataframe: pl.DataFrame) -> pl.DataFrame: output = dataframe.select([SUBJECT_ID, PREDICTION_TIME, BOOLEAN_VALUE_COLUMN]) probabilities = np.random.uniform(0, 1, len(dataframe)) - output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round())) - output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities)) + # TODO: meds-evaluation currently cares about the order of columns and types, so the new columns have to + # be inserted at the correct position and cast to the correct type + output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean)) + output.insert_column(4, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities)) return output From 9e0c0e432187fc1191922e2b06159231da11c6fe Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 10 Sep 2024 
17:19:10 -0400 Subject: [PATCH 06/20] Add evaluation example --- README.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index c460214..1066ff6 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,15 @@ conforming to MEDS binary classification prediction schema: ./MEDS-DEV/src/MEDS_DEV/helpers/generate_predictions.sh $MEDS_ROOT_DIR $TASK_NAME ``` -### TODO evaluate the model +### Evaluate the model + +You can use the `meds-evaluation` package by running `meds-evaluation-cli` and providing the path to predictions +dataframe as well as the output directory. For example, + +```bash +meds-evaluation-cli predictions_path="./meds_dataset/task_predictions/mortality/in_icu/first_24h/train/0.parquet" +\ output_dir="./meds_dataset/task_evaluation/mortality/in_icu/first_24h/train/" +``` ## Contributing to MEDS-DEV From 791d18f19a1ab25f2659cf3dde127a87e9616e42 Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 10 Sep 2024 17:19:18 -0400 Subject: [PATCH 07/20] Pre-commit fixes --- .pre-commit-config.yaml | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5c5591c..6140fff 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -125,8 +125,4 @@ repos: - id: nbqa-isort args: ["--profile=black"] - id: nbqa-flake8 - args: - [ - "--extend-ignore=E203,E402,E501,F401,F841", - "--exclude=logs/*,data/*", - ] + args: ["--extend-ignore=E203,E402,E501,F401,F841", "--exclude=logs/*,data/*"] From 5ffef7d9053c873d823bacd19a83125732219c4d Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 10 Sep 2024 17:23:18 -0400 Subject: [PATCH 08/20] Clarify output format. 
--- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 1066ff6..f49d442 100644 --- a/README.md +++ b/README.md @@ -113,6 +113,10 @@ meds-evaluation-cli predictions_path="./meds_dataset/task_predictions/mortality/ \ output_dir="./meds_dataset/task_evaluation/mortality/in_icu/first_24h/train/" ``` +This will create a JSON file with the results in the directory provided by the `output_dir` argument. + +Note this package currently supports binary classification only. + ## Contributing to MEDS-DEV ### To Add a Model From 5756038e44acfc674cb7cfbfa9dac0b82b5afb7a Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 10 Sep 2024 17:29:40 -0400 Subject: [PATCH 09/20] Change arguments to more generic ones. --- README.md | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index f49d442..0a39a06 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,8 @@ This repository contains the dataset, task, model training recipes, and results effort for EHR machine learning. Note that this repository is _not_ a place where functional code is stored. Rather, this repository stores -configuration files, training recipes, results, etc. for the MEDS-DEV benchmarking effort -- runnable code will +configuration files, training recipes, results, etc. for the MEDS-DEV benchmarking effort -- runnable code +will often come from other repositories, with suitable permalinks being present in the various configuration files or commit messages for associated contributions to this repository. @@ -56,28 +57,33 @@ Task-related information is stored in Hydra configuration files (in `.yaml` form Task names are defined in a way that corresponds to the path to their configuration, starting from the `MEDS-DEV/src/MEDS_DEV/tasks/criteria` directory. 
For example, -`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of +`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` +of `mortality/in_icu/first_24h`. **To add a task** -If your task is not supported, you will need to add a directory and define an appropriate configuration file in +If your task is not supported, you will need to add a directory and define an appropriate configuration file +in a corresponding location. ### Dataset configuration file -Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific +Task configuration files are incomplete, because some concepts (predicates) have to be defined in a +dataset-specific way (e.g. `icu_admission` in `mortality/in_icu/first_24h`). These dataset-specific predicate definitions are found in `MEDS-DEV/src/MEDS_DEV/datasets/$DATASET_NAME/predicates.yaml` Hydra configuration files. -In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e. +In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory +ready (i.e. `$MEDS_ROOT_DIR`). **To add a dataset configuration file** -If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in +If your dataset is not supported, you will need to add a directory and define an appropriate configuration +file in a corresponding location. 
### Run the MEDS task extraction helper @@ -89,14 +95,16 @@ From your project directory (`$MY_MEDS_PROJECT_ROOT`) where `MEDS-DEV` is locate ``` This will use information from task and dataset-specific predicate configs to extract cohorts and labels from -`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same +`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining +the same sharded structure as the `$MEDS_ROOT_DIR/data` directory. ### Train the model This step depends on the API of your particular model. -For example, the command below will call a helper script that will generate random outputs for binary classification, +For example, the command below will call a helper script that will generate random outputs for binary +classification, conforming to MEDS binary classification prediction schema: ```bash @@ -105,12 +113,14 @@ conforming to MEDS binary classification prediction schema: ### Evaluate the model -You can use the `meds-evaluation` package by running `meds-evaluation-cli` and providing the path to predictions +You can use the `meds-evaluation` package by running `meds-evaluation-cli` and providing the path to +predictions dataframe as well as the output directory. For example, ```bash -meds-evaluation-cli predictions_path="./meds_dataset/task_predictions/mortality/in_icu/first_24h/train/0.parquet" -\ output_dir="./meds_dataset/task_evaluation/mortality/in_icu/first_24h/train/" +meds-evaluation-cli \ +predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \ +output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..." ``` This will create a JSON file with the results in the directory provided by the `output_dir` argument. 
From c2df0738f43266a1c5d489929c44312ff4bdce2f Mon Sep 17 00:00:00 2001 From: kamilest Date: Tue, 8 Oct 2024 16:52:26 -0400 Subject: [PATCH 10/20] Spacing --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 0a39a06..14f166a 100644 --- a/README.md +++ b/README.md @@ -119,8 +119,8 @@ dataframe as well as the output directory. For example, ```bash meds-evaluation-cli \ -predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \ -output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..." + predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \ + output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..." ``` This will create a JSON file with the results in the directory provided by the `output_dir` argument. From b2affacdf9a0d5072a2b15c2307cf72a654a51fe Mon Sep 17 00:00:00 2001 From: Matthew McDermott Date: Thu, 10 Oct 2024 09:42:46 -0400 Subject: [PATCH 11/20] Updated pre-commit-config --- .pre-commit-config.yaml | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 6140fff..5c5591c 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -125,4 +125,8 @@ repos: - id: nbqa-isort args: ["--profile=black"] - id: nbqa-flake8 - args: ["--extend-ignore=E203,E402,E501,F401,F841", "--exclude=logs/*,data/*"] + args: + [ + "--extend-ignore=E203,E402,E501,F401,F841", + "--exclude=logs/*,data/*", + ] From 26b439f862bb5120f136d7915720390f0a6cf362 Mon Sep 17 00:00:00 2001 From: Matthew McDermott Date: Thu, 10 Oct 2024 09:47:39 -0400 Subject: [PATCH 12/20] Added simple doctest --- .../helpers/generate_random_predictions.py | 34 ++++++++++++++++--- 1 file changed, 30 insertions(+), 4 deletions(-) diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py index a07a26f..5502438 100644 --- 
a/src/MEDS_DEV/helpers/generate_random_predictions.py +++ b/src/MEDS_DEV/helpers/generate_random_predictions.py @@ -50,12 +50,38 @@ def generate_random_predictions(cfg: DictConfig) -> None: predictions.write_parquet(predictions_path / split.name) -def _generate_random_predictions(dataframe: pl.DataFrame) -> pl.DataFrame: - """Creates a new dataframe with the same subject_id and boolean_value columns as in the input dataframe, - along with predictions.""" +def _generate_random_predictions(dataframe: pl.DataFrame, seed: int = 1) -> pl.DataFrame: + """Augments the input dataframe with random predictions. + + Args: + dataframe: Input dataframe with at least the columns: [subject_id, prediction_time, boolean_value] + seed: Seed for the random number generator. + + Returns: + An augmented dataframe with the boolean value and probability columns. + + Example: + >>> df = pl.DataFrame({ + ... "subject_id": [1, 2, 3], + ... "prediction_time": [0, 1, 2], + ... "boolean_value": [True, False, True] + ... 
}) + >>> _generate_random_predictions(df).drop(["prediction_time", "boolean_value"]) + shape: (3, 3) + ┌────────────┬─────────────────────────┬───────────────────────────────┐ + │ subject_id ┆ predicted_boolean_value ┆ predicted_boolean_probability │ + │ --- ┆ --- ┆ --- │ + │ i64 ┆ bool ┆ f64 │ + ╞════════════╪═════════════════════════╪═══════════════════════════════╡ + │ 1 ┆ true ┆ 0.511822 │ + │ 2 ┆ true ┆ 0.950464 │ + │ 3 ┆ false ┆ 0.14416 │ + └────────────┴─────────────────────────┴───────────────────────────────┘ + """ output = dataframe.select([SUBJECT_ID, PREDICTION_TIME, BOOLEAN_VALUE_COLUMN]) - probabilities = np.random.uniform(0, 1, len(dataframe)) + rng = np.random.default_rng(seed) + probabilities = rng.uniform(0, 1, len(dataframe)) # TODO: meds-evaluation currently cares about the order of columns and types, so the new columns have to # be inserted at the correct position and cast to the correct type output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean)) From 07ab03ccd5057f9e2566b549289033851b9c5fd8 Mon Sep 17 00:00:00 2001 From: Matthew McDermott Date: Thu, 10 Oct 2024 09:48:44 -0400 Subject: [PATCH 13/20] Updated pre-commit-config --- .pre-commit-config.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5c5591c..38d66f1 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -5,7 +5,7 @@ exclude: "docs/index.md" repos: - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v4.4.0 + rev: v5.0.0 hooks: # list of supported hooks: https://pre-commit.com/hooks.html - id: trailing-whitespace From e974c1fa1e9294db04f058a5f3446a507a61aa0a Mon Sep 17 00:00:00 2001 From: Matthew McDermott Date: Thu, 10 Oct 2024 09:59:42 -0400 Subject: [PATCH 14/20] Freeze pre-commit version until docformatter pushes a new release. 
--- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index cc51b8b..649eef6 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -33,7 +33,7 @@ dependencies = ["meds==0.3.3", "es-aces==0.5.0"] [tool.setuptools_scm] [project.optional-dependencies] -dev = ["pre-commit"] +dev = ["pre-commit<4"] tests = ["pytest", "pytest-cov", "rootutils"] docs = [ "mkdocs==1.6.0", "mkdocs-material==9.5.31", "mkdocstrings[python,shell]==0.25.2", "mkdocs-gen-files==0.5.0", From f66880d98f9c7c587c42a6ea11ceae3c038e7bff Mon Sep 17 00:00:00 2001 From: Matthew McDermott Date: Thu, 10 Oct 2024 10:03:03 -0400 Subject: [PATCH 15/20] Update workflow files to install correct pre-commit version. --- .github/workflows/code-quality-main.yaml | 4 ++++ .github/workflows/code-quality-pr.yaml | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/.github/workflows/code-quality-main.yaml b/.github/workflows/code-quality-main.yaml index ba2caf4..fd3195f 100644 --- a/.github/workflows/code-quality-main.yaml +++ b/.github/workflows/code-quality-main.yaml @@ -20,5 +20,9 @@ jobs: with: python-version: "3.10" + - name: Install packages + run: | + pip install -e .[dev] + - name: Run pre-commits uses: pre-commit/action@v3.0.1 diff --git a/.github/workflows/code-quality-pr.yaml b/.github/workflows/code-quality-pr.yaml index 9a33678..f94fc04 100644 --- a/.github/workflows/code-quality-pr.yaml +++ b/.github/workflows/code-quality-pr.yaml @@ -23,6 +23,10 @@ jobs: with: python-version: "3.10" + - name: Install packages + run: | + pip install -e .[dev] + - name: Find modified files id: file_changes uses: trilom/file-changes-action@v1.2.4 From beba3ee7b894707ff41bfa557db9f8f8239e1159 Mon Sep 17 00:00:00 2001 From: Matthew McDermott Date: Thu, 10 Oct 2024 10:05:08 -0400 Subject: [PATCH 16/20] Updating workflows
.github/workflows/tests.yaml | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/.github/workflows/code-quality-main.yaml b/.github/workflows/code-quality-main.yaml index fd3195f..c79a12b 100644 --- a/.github/workflows/code-quality-main.yaml +++ b/.github/workflows/code-quality-main.yaml @@ -13,10 +13,10 @@ jobs: steps: - name: Checkout - uses: actions/checkout@v3 + uses: actions/checkout@v4 - - name: Set up Python 3.10 - uses: actions/setup-python@v3 + - name: Set up Python + uses: actions/setup-python@v5 with: python-version: "3.10" diff --git a/.github/workflows/code-quality-pr.yaml b/.github/workflows/code-quality-pr.yaml index f94fc04..8e2e4f1 100644 --- a/.github/workflows/code-quality-pr.yaml +++ b/.github/workflows/code-quality-pr.yaml @@ -16,10 +16,10 @@ jobs: steps: - name: Checkout - uses: actions/checkout@v3 + uses: actions/checkout@v4 - name: Set up Python 3.10 - uses: actions/setup-python@v3 + uses: actions/setup-python@v5 with: python-version: "3.10" diff --git a/.github/workflows/python-build.yaml b/.github/workflows/python-build.yaml index 3f3c96d..b22ff87 100644 --- a/.github/workflows/python-build.yaml +++ b/.github/workflows/python-build.yaml @@ -10,7 +10,7 @@ jobs: steps: - uses: actions/checkout@v4 - name: Set up Python - uses: actions/setup-python@v4 + uses: actions/setup-python@v5 with: python-version: "3.10" - name: Install pypa/build diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml index 3faf789..31a46da 100644 --- a/.github/workflows/tests.yaml +++ b/.github/workflows/tests.yaml @@ -17,10 +17,10 @@ jobs: steps: - name: Checkout - uses: actions/checkout@v3 + uses: actions/checkout@v4 - name: Set up Python 3.10 - uses: actions/setup-python@v3 + uses: actions/setup-python@v5 with: python-version: "3.10" From ae6f1d479d210c04532881a6e3e4f3217f0f380d Mon Sep 17 00:00:00 2001 From: Matthew McDermott Date: Thu, 10 Oct 2024 10:12:27 -0400 Subject: [PATCH 17/20] Updating workflows --- 
 .github/workflows/code-quality-main.yaml | 2 +-
 .github/workflows/code-quality-pr.yaml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/code-quality-main.yaml b/.github/workflows/code-quality-main.yaml
index c79a12b..874da12 100644
--- a/.github/workflows/code-quality-main.yaml
+++ b/.github/workflows/code-quality-main.yaml
@@ -22,7 +22,7 @@ jobs:
 
       - name: Install packages
         run: |
-          pip install -e .[dev]
+          pip install .[dev]
 
       - name: Run pre-commits
         uses: pre-commit/action@v3.0.1
diff --git a/.github/workflows/code-quality-pr.yaml b/.github/workflows/code-quality-pr.yaml
index 8e2e4f1..bee2e11 100644
--- a/.github/workflows/code-quality-pr.yaml
+++ b/.github/workflows/code-quality-pr.yaml
@@ -25,7 +25,7 @@ jobs:
 
       - name: Install packages
         run: |
-          pip install -e .[dev]
+          pip install .[dev]
 
       - name: Find modified files
         id: file_changes

From ae744c60d5ec212b1121453602cf5414f9a7dae8 Mon Sep 17 00:00:00 2001
From: Matthew McDermott
Date: Thu, 10 Oct 2024 10:15:15 -0400
Subject: [PATCH 18/20] Updating README for pre-commit nonsense.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 14f166a..afd925f 100644
--- a/README.md
+++ b/README.md
@@ -119,8 +119,8 @@ dataframe as well as the output directory. For example,
 
 ```bash
 meds-evaluation-cli \
-    predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \
-    output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..."
+    predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \
+    output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..."
 ```
 
 This will create a JSON file with the results in the directory provided by the `output_dir` argument.

From 94cf21677365f1182ee7d87e26b9ff2bcd6f7418 Mon Sep 17 00:00:00 2001
From: Matthew McDermott
Date: Thu, 10 Oct 2024 10:17:54 -0400
Subject: [PATCH 19/20] Updating README for pre-commit nonsense.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index afd925f..b0ec788 100644
--- a/README.md
+++ b/README.md
@@ -118,7 +118,7 @@ predictions
 dataframe as well as the output directory. For example,
 
 ```bash
-meds-evaluation-cli \
+meds-evaluation-cli \
     predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \
     output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..."
 ```

From 96c9bacf78297d2d90cdfbe8231140fc3f0265e3 Mon Sep 17 00:00:00 2001
From: Matthew McDermott
Date: Thu, 10 Oct 2024 10:22:08 -0400
Subject: [PATCH 20/20] Updating README for pre-commit nonsense.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index b0ec788..0d9c209 100644
--- a/README.md
+++ b/README.md
@@ -119,8 +119,8 @@ dataframe as well as the output directory. For example,
 
 ```bash
 meds-evaluation-cli \
-    predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \
-    output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..."
+    predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME//*.parquet" \
+    output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME//..."
 ```
 
 This will create a JSON file with the results in the directory provided by the `output_dir` argument.