From b28726ee7b3d135d0be601f4cb0ad12928b9412f Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Fri, 30 Aug 2024 16:52:50 -0400
Subject: [PATCH 01/20] Example workflow draft

---
 README.md | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/README.md b/README.md
index 5f6d098..0b8d881 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,45 @@ configuration files, training recipes, results, etc. for the MEDS-DEV benchmarki
 often come from other repositories, with suitable permalinks being present in the various configuration files
 or commit messages for associated contributions to this repository.
 
+## Example workflow
+
+```bash
+# Create and enter a MEDS project directory 
+mkdir <my-meds-project-root>
+cd <my-meds-project-root>
+
+# Locate the MEDS data root directory <my-meds-dataset-path> and <dataset-name> 
+
+# Create a new python environment
+conda create -n <my-meds-env> python=3.10
+conda activate <my-meds-env>
+
+# In <my-meds-project-root>, install MEDS-DEV files and dependencies
+# TODO: this will be probably be replaced with `pip install MEDS-DEV` in the future
+git clone https://github.com/mmcdermott/MEDS-DEV.git
+pip install -e ./MEDS-DEV
+# TODO: consider the other dependencies that have not been deployed yet and are not in MEDS-DEV dependencies yet, e.g.:
+# git clone https://github.com/kamilest/meds-evaluation.git
+# pip install -e ./meds-evaluation
+# etc.
+
+# Install any model-specific dependencies
+
+# TODO: locate and process task predicates in ./MEDS-DEV/tasks/, defining the unknown codes using predicates in
+#   ./MEDS-DEV/datasets/<dataset-name>
+
+aces-cli data.path='<my-meds-dataset-path>', data.standard='meds', cohort_dir=TODO, cohort_name=TODO
+
+# TODO Figure out how ACES processes the cohort and where is the output stored: <aces-output>
+
+# TODO Train model on <aces-output>, place the outputs in the MEDS prediction format in 
+#   <my-meds-project-root>/predictions
+
+# Evaluate model
+meds-evaluation-cli predictions_path='<my-meds-project-root>/predictions', \ 
+  output_dir='<my-meds-project-root>/evaluation'
+```
+
 ## Contributing to MEDS-DEV
 
 ### To Add a Model

From 19ae1d9b4914b0a9e4c3132e4f930b35bc2436cd Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 3 Sep 2024 17:46:09 -0400
Subject: [PATCH 02/20] Update the instructions

---
 README.md | 91 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 73 insertions(+), 18 deletions(-)
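
Reviewer note: the README text added in this patch says that a `$TASK_NAME` is the path of its configuration file relative to `MEDS-DEV/src/MEDS_DEV/tasks/criteria`, without the `.yaml` suffix. A minimal sketch of that mapping follows. The helper names `task_name_from_config` and `config_from_task_name` are hypothetical and are not part of MEDS-DEV.

```python
from pathlib import PurePosixPath

# Root under which task criteria live, as named in the README text of this patch.
CRITERIA_ROOT = PurePosixPath("MEDS-DEV/src/MEDS_DEV/tasks/criteria")


def task_name_from_config(config_path: str) -> str:
    """Hypothetical helper: drop the criteria root and the .yaml suffix."""
    relative = PurePosixPath(config_path).relative_to(CRITERIA_ROOT)
    return relative.with_suffix("").as_posix()


def config_from_task_name(task_name: str) -> str:
    """Hypothetical inverse: rebuild the configuration file path from a task name."""
    return (CRITERIA_ROOT / f"{task_name}.yaml").as_posix()


example = "MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml"
print(task_name_from_config(example))  # mortality/in_icu/first_24h
```

Note that the path in the README example names a `.yaml` file, not a directory.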

diff --git a/README.md b/README.md
index 993e0ee..0d036c8 100644
--- a/README.md
+++ b/README.md
@@ -10,35 +10,90 @@ or commit messages for associated contributions to this repository.
 
 ## Example workflow
 
+### (Optional) Set up the MEDS project with environment
+
 ```bash
 # Create and enter a MEDS project directory 
-mkdir <my-meds-project-root>
-cd <my-meds-project-root>
+mkdir $MY_MEDS_PROJECT_ROOT
+cd $MY_MEDS_PROJECT_ROOT
+
+conda create -n $MY_MEDS_CONDA_ENV python=3.10
+conda activate $MY_MEDS_CONDA_ENV
+```
+
+Additionally install any model-related dependencies.
 
-# Locate the MEDS data root directory <my-meds-dataset-path> and <dataset-name> 
+### Install MEDS-DEV
 
-# Create a new python environment
-conda create -n <my-meds-env> python=3.10
-conda activate <my-meds-env>
+Clone the MEDS-DEV GitHub repo and install it locally.
+This will additionally install some MEDS data processing dependencies:
 
-# In <my-meds-project-root>, install MEDS-DEV files and dependencies
-# TODO: this will be probably be replaced with `pip install MEDS-DEV` in the future
+```bash
 git clone https://github.com/mmcdermott/MEDS-DEV.git
-pip install -e ./MEDS-DEV
-# TODO: consider the other dependencies that have not been deployed yet and are not in MEDS-DEV dependencies yet, e.g.:
-# git clone https://github.com/kamilest/meds-evaluation.git
-# pip install -e ./meds-evaluation
-# etc.
+cd ./MEDS-DEV
+pip install -e .
+```
+
+Install the MEDS evaluation package: 
+```bash
+git clone https://github.com/kamilest/meds-evaluation.git
+pip install -e ./meds-evaluation
+```
+
+Additionally, make sure any model-related dependencies are installed.
+
+### Extract a task from the MEDS dataset
+
+This step prepares the MEDS dataset for a task by extracting a cohort using inclusion/exclusion criteria and 
+processing the data to create the label files. 
+
+### Find the task configuration file
+
+Task-related information is stored in Hydra configuration files (in `.yaml` format) under 
+`MEDS-DEV/src/MEDS_DEV/tasks/criteria`.
 
-# Install any model-specific dependencies
+Task names are defined in a way that corresponds to the path to their configuration,  
+starting from the `MEDS-DEV/src/MEDS_DEV/tasks/criteria` directory.
+For example, 
+`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of 
+`mortality/in_icu/first_24h`.
 
-# TODO: locate and process task predicates in ./MEDS-DEV/tasks/, defining the unknown codes using predicates in
-#   ./MEDS-DEV/datasets/<dataset-name>
+**To add a task**
 
-aces-cli data.path='<my-meds-dataset-path>', data.standard='meds', cohort_dir=TODO, cohort_name=TODO
+If your task is not supported, you will need to add a directory and define an appropriate configuration file in 
+a corresponding location.
 
-# TODO Figure out how ACES processes the cohort and where is the output stored: <aces-output>
+### Dataset configuration file
 
+Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific 
+way (e.g. `icu_admission` in `mortality/in_icu/first_24h`).
+
+These dataset-specific predicate definitions are found in 
+`MEDS-DEV/src/MEDS_DEV/datasets/$DATASET_NAME/predicates.yaml` Hydra configuration files.
+
+In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e. 
+`$MEDS_ROOT_DIR`).
+
+**To add a dataset configuration file**
+
+If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in 
+a corresponding location.
+
+### Run the MEDS task extraction helper
+
+From your project directory (`$MY_MEDS_PROJECT_ROOT`) where `MEDS-DEV` is located, run
+
+```bash
+./MEDS-DEV/src/MEDS_DEV/helpers/extract_task.sh $MEDS_ROOT_DIR $DATASET_NAME $TASK_NAME
+```
+
+This will use information from task and dataset-specific predicate configs to extract cohorts and labels from
+`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same 
+sharded structure [??? TODO check] as the `$MEDS_ROOT_DIR/data` directory.
+
+### TODO: train and evaluate the model 
+
+```
 # TODO Train model on <aces-output>, place the outputs in the MEDS prediction format in 
 #   <my-meds-project-root>/predictions
 

From 8e9acab069457a29efdb073f62380295c2bae9a9 Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Mon, 9 Sep 2024 17:44:54 -0400
Subject: [PATCH 03/20] Helper for generating random binary classification
 predictions.

---
 README.md                                     | 47 +++++++-------
 src/MEDS_DEV/configs/predictions.yaml         |  6 ++
 src/MEDS_DEV/helpers/generate_predictions.sh  | 16 +++++
 .../helpers/generate_random_predictions.py    | 65 +++++++++++++++++++
 4 files changed, 112 insertions(+), 22 deletions(-)
 create mode 100644 src/MEDS_DEV/configs/predictions.yaml
 create mode 100755 src/MEDS_DEV/helpers/generate_predictions.sh
 create mode 100644 src/MEDS_DEV/helpers/generate_random_predictions.py
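
Reviewer note: `generate_random_predictions.py` in this patch walks `$MEDS_ROOT_DIR/task_labels/$TASK_NAME` and writes each shard to the same relative position under `$MEDS_ROOT_DIR/task_predictions/$TASK_NAME`. The sketch below reproduces only that path mirroring, using the standard library and empty placeholder files. The function names `iter_label_files`, `prediction_path` and `demo` are illustrative assumptions, and no parquet data is read or written.

```python
import tempfile
from pathlib import Path

# Split directory names accepted by the helper in this patch.
SPLITS = {"train", "tuning", "held_out"}


def iter_label_files(labels_dir: Path):
    """Yield label files: shards inside split directories, or files at the top level."""
    for entry in sorted(labels_dir.iterdir()):
        if entry.is_dir() and entry.name in SPLITS:
            yield from sorted(f for f in entry.iterdir() if f.is_file())
        elif entry.is_file():
            yield entry


def prediction_path(root: Path, task_name: str, label_file: Path) -> Path:
    """Mirror a label file's position under task_labels into task_predictions."""
    relative = label_file.relative_to(root / "task_labels" / task_name)
    return root / "task_predictions" / task_name / relative


def demo() -> list[str]:
    """Build a throwaway layout and return the mirrored paths relative to the root."""
    task_name = "mortality/in_icu/first_24h"
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for split in ("train", "held_out"):
            shard = root / "task_labels" / task_name / split / "0.parquet"
            shard.parent.mkdir(parents=True)
            shard.touch()  # empty placeholder, not a real parquet file
        labels_dir = root / "task_labels" / task_name
        return [
            prediction_path(root, task_name, f).relative_to(root).as_posix()
            for f in iter_label_files(labels_dir)
        ]


print(demo())
```

Directories whose names are not in `SPLITS` are skipped silently, which matches the behaviour of the loop in the patch.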

diff --git a/README.md b/README.md
index 0d036c8..c460214 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ or commit messages for associated contributions to this repository.
 ### (Optional) Set up the MEDS project with environment
 
 ```bash
-# Create and enter a MEDS project directory 
+# Create and enter a MEDS project directory
 mkdir $MY_MEDS_PROJECT_ROOT
 cd $MY_MEDS_PROJECT_ROOT
 
@@ -34,7 +34,8 @@ cd ./MEDS-DEV
 pip install -e .
 ```
 
-Install the MEDS evaluation package: 
+Install the MEDS evaluation package:
+
 ```bash
 git clone https://github.com/kamilest/meds-evaluation.git
 pip install -e ./meds-evaluation
@@ -44,39 +45,39 @@ Additionally, make sure any model-related dependencies are installed.
 
 ### Extract a task from the MEDS dataset
 
-This step prepares the MEDS dataset for a task by extracting a cohort using inclusion/exclusion criteria and 
-processing the data to create the label files. 
+This step prepares the MEDS dataset for a task by extracting a cohort using inclusion/exclusion criteria and
+processing the data to create the label files.
 
 ### Find the task configuration file
 
-Task-related information is stored in Hydra configuration files (in `.yaml` format) under 
+Task-related information is stored in Hydra configuration files (in `.yaml` format) under
 `MEDS-DEV/src/MEDS_DEV/tasks/criteria`.
 
-Task names are defined in a way that corresponds to the path to their configuration,  
+Task names are defined in a way that corresponds to the path to their configuration,
 starting from the `MEDS-DEV/src/MEDS_DEV/tasks/criteria` directory.
-For example, 
-`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of 
+For example,
+`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of
 `mortality/in_icu/first_24h`.
 
 **To add a task**
 
-If your task is not supported, you will need to add a directory and define an appropriate configuration file in 
+If your task is not supported, you will need to add a directory and define an appropriate configuration file in
 a corresponding location.
 
 ### Dataset configuration file
 
-Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific 
+Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific
 way (e.g. `icu_admission` in `mortality/in_icu/first_24h`).
 
-These dataset-specific predicate definitions are found in 
+These dataset-specific predicate definitions are found in
 `MEDS-DEV/src/MEDS_DEV/datasets/$DATASET_NAME/predicates.yaml` Hydra configuration files.
 
-In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e. 
+In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e.
 `$MEDS_ROOT_DIR`).
 
 **To add a dataset configuration file**
 
-If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in 
+If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in
 a corresponding location.
 
 ### Run the MEDS task extraction helper
@@ -88,20 +89,22 @@ From your project directory (`$MY_MEDS_PROJECT_ROOT`) where `MEDS-DEV` is locate
 ```
 
 This will use information from task and dataset-specific predicate configs to extract cohorts and labels from
-`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same 
-sharded structure [??? TODO check] as the `$MEDS_ROOT_DIR/data` directory.
+`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same
+sharded structure as the `$MEDS_ROOT_DIR/data` directory.
 
-### TODO: train and evaluate the model 
+### Train the model
 
-```
-# TODO Train model on <aces-output>, place the outputs in the MEDS prediction format in 
-#   <my-meds-project-root>/predictions
+This step depends on the API of your particular model.
+
+For example, the command below will call a helper script that will generate random outputs for binary classification,
+conforming to MEDS binary classification prediction schema:
 
-# Evaluate model
-meds-evaluation-cli predictions_path='<my-meds-project-root>/predictions', \ 
-  output_dir='<my-meds-project-root>/evaluation'
+```bash
+./MEDS-DEV/src/MEDS_DEV/helpers/generate_predictions.sh $MEDS_ROOT_DIR $TASK_NAME
 ```
 
+### TODO evaluate the model
+
 ## Contributing to MEDS-DEV
 
 ### To Add a Model
diff --git a/src/MEDS_DEV/configs/predictions.yaml b/src/MEDS_DEV/configs/predictions.yaml
new file mode 100644
index 0000000..736ae72
--- /dev/null
+++ b/src/MEDS_DEV/configs/predictions.yaml
@@ -0,0 +1,6 @@
+defaults:
+  - _ACES_MD
+  - _self_
+  - override hydra/hydra_logging: disabled
+
+cohort_predictions_dir: "${oc.env:MEDS_ROOT_DIR}/task_predictions"
diff --git a/src/MEDS_DEV/helpers/generate_predictions.sh b/src/MEDS_DEV/helpers/generate_predictions.sh
new file mode 100755
index 0000000..606d0f4
--- /dev/null
+++ b/src/MEDS_DEV/helpers/generate_predictions.sh
@@ -0,0 +1,16 @@
+#!/bin/bash
+
+export MEDS_ROOT_DIR=$1
+export MEDS_DATASET_NAME=$2
+export MEDS_TASK_NAME=$3
+
+shift 3
+
+MEDS_DEV_REPO_DIR=$(python -c "from importlib.resources import files; print(files(\"MEDS_DEV\"))")
+export MEDS_DEV_REPO_DIR
+
+# TODO improve efficiency of prediction generator by using this
+# SHARDS=$(expand_shards "$MEDS_ROOT_DIR"/data)
+
+python -m MEDS_DEV.helpers.generate_random_predictions  --config-path="$MEDS_DEV_REPO_DIR"/configs \
+--config-name="predictions" "hydra.searchpath=[pkg://aces.configs]" "$@"
diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py
new file mode 100644
index 0000000..732bc34
--- /dev/null
+++ b/src/MEDS_DEV/helpers/generate_random_predictions.py
@@ -0,0 +1,65 @@
+import os
+from importlib.resources import files
+from pathlib import Path
+
+import hydra
+import numpy as np
+import polars as pl
+from omegaconf import DictConfig
+
+SUBJECT_ID = "subject_id"
+
+BOOLEAN_VALUE_COLUMN = "boolean_value"
+PREDICTED_BOOLEAN_VALUE_COLUMN = "predicted_boolean_value"
+PREDICTED_BOOLEAN_PROBABILITY_COLUMN = "predicted_boolean_probability"
+
+CONFIG = files("MEDS_DEV").joinpath("configs/predictions.yaml")
+
+
+@hydra.main(version_base=None, config_path=str(CONFIG.parent.resolve()), config_name=CONFIG.stem)
+def generate_random_predictions(cfg: DictConfig) -> None:
+    cohort_dir = cfg.cohort_dir  # cohort_dir: "${oc.env:MEDS_ROOT_DIR}/task_labels"
+    cohort_name = cfg.cohort_name  # cohort_name: ${task_name}; task_name: ${oc.env:MEDS_TASK_NAME}
+
+    cohort_dir = Path(cohort_dir) / cohort_name
+    cohort_predictions_dir = (
+        cfg.cohort_predictions_dir
+    )  # cohort_predictions_dir: "${oc.env:MEDS_ROOT_DIR}/task_predictions"
+
+    # TODO: use expand_shards helper from the script to access sharded dataframes directly
+    for split in cohort_dir.iterdir():
+        if split.is_dir() and split.name in {"train", "tuning", "held_out"}:  # train | tuning | held_out
+            for file in split.iterdir():
+                if file.is_file():
+                    dataframe = pl.read_parquet(file)
+                    predictions = _generate_random_predictions(dataframe)  # sharded dataframes
+
+                    # $MEDS_ROOT_DIR/task_predictions/$TASK_NAME/<split>/<file>.parquet
+                    predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
+                    os.makedirs(predictions_path, exist_ok=True)
+
+                    predictions.write_parquet(predictions_path / file.name)
+        elif split.is_file():
+            dataframe = pl.read_parquet(split)
+            predictions = _generate_random_predictions(dataframe)
+
+            predictions_path = Path(cohort_predictions_dir) / cohort_name
+            os.makedirs(predictions_path, exist_ok=True)
+
+            predictions.write_parquet(predictions_path / split.name)
+
+
+def _generate_random_predictions(dataframe: pl.DataFrame) -> pl.DataFrame:
+    """Creates a new dataframe with the same subject_id and boolean_value columns as in the input dataframe,
+    along with predictions."""
+
+    output = dataframe.select([SUBJECT_ID, BOOLEAN_VALUE_COLUMN])
+    probabilities = np.random.uniform(0, 1, len(dataframe))
+    output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()))
+    output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities))
+
+    return output
+
+
+if __name__ == "__main__":
+    generate_random_predictions()

From 3afe7bf6cbf1bdbf2a84a10484cee5fb5a8fb85c Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 10 Sep 2024 17:18:47 -0400
Subject: [PATCH 04/20] Add prediction time

---
 src/MEDS_DEV/helpers/generate_random_predictions.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py
index 732bc34..1c5a01a 100644
--- a/src/MEDS_DEV/helpers/generate_random_predictions.py
+++ b/src/MEDS_DEV/helpers/generate_random_predictions.py
@@ -8,6 +8,7 @@
 from omegaconf import DictConfig
 
 SUBJECT_ID = "subject_id"
+PREDICTION_TIME = "prediction_time"
 
 BOOLEAN_VALUE_COLUMN = "boolean_value"
 PREDICTED_BOOLEAN_VALUE_COLUMN = "predicted_boolean_value"
@@ -53,7 +54,7 @@ def _generate_random_predictions(dataframe: pl.DataFrame) -> pl.DataFrame:
     """Creates a new dataframe with the same subject_id and boolean_value columns as in the input dataframe,
     along with predictions."""
 
-    output = dataframe.select([SUBJECT_ID, BOOLEAN_VALUE_COLUMN])
+    output = dataframe.select([SUBJECT_ID, PREDICTION_TIME, BOOLEAN_VALUE_COLUMN])
     probabilities = np.random.uniform(0, 1, len(dataframe))
     output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()))
     output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities))

From 3811f40783906342ad4b3d78390f7dcad8a4a25f Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 10 Sep 2024 17:18:58 -0400
Subject: [PATCH 05/20] Fix prediction positions and types

---
 src/MEDS_DEV/helpers/generate_random_predictions.py | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
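
Reviewer note: the earlier `insert_column(-1, ...)` calls placed the new columns before the last existing column rather than after it. The sketch below illustrates the effect with plain Python lists. It assumes that a negative index in polars' `insert_column` behaves like `list.insert`, which is consistent with the fix in this patch but is not checked against polars here. The function names are hypothetical.

```python
def columns_before_fix(columns: list[str]) -> list[str]:
    """Column order produced by inserting at index -1 twice (list.insert semantics)."""
    out = list(columns)
    out.insert(-1, "predicted_boolean_value")
    out.insert(-1, "predicted_boolean_probability")
    return out


def columns_after_fix(columns: list[str]) -> list[str]:
    """Column order produced by the explicit indices 3 and 4 used in this patch."""
    out = list(columns)
    out.insert(3, "predicted_boolean_value")
    out.insert(4, "predicted_boolean_probability")
    return out


def to_boolean(probability: float) -> bool:
    """Round a probability to 0 or 1 and cast it to bool, mirroring round().cast(Boolean)."""
    return bool(round(probability))


base = ["subject_id", "prediction_time", "boolean_value"]
print(columns_before_fix(base))  # boolean_value ends up last
print(columns_after_fix(base))   # predictions follow boolean_value
```

With the explicit indices the two prediction columns come after `boolean_value`, which is the order the TODO comment says meds-evaluation currently expects.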

diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py
index 1c5a01a..a07a26f 100644
--- a/src/MEDS_DEV/helpers/generate_random_predictions.py
+++ b/src/MEDS_DEV/helpers/generate_random_predictions.py
@@ -56,8 +56,10 @@ def _generate_random_predictions(dataframe: pl.DataFrame) -> pl.DataFrame:
 
     output = dataframe.select([SUBJECT_ID, PREDICTION_TIME, BOOLEAN_VALUE_COLUMN])
     probabilities = np.random.uniform(0, 1, len(dataframe))
-    output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()))
-    output.insert_column(-1, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities))
+    # TODO: meds-evaluation currently cares about the order of columns and types, so the new columns have to
+    #  be inserted at the correct position and cast to the correct type
+    output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean))
+    output.insert_column(4, pl.Series(PREDICTED_BOOLEAN_PROBABILITY_COLUMN, probabilities))
 
     return output
 

From 9e0c0e432187fc1191922e2b06159231da11c6fe Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 10 Sep 2024 17:19:10 -0400
Subject: [PATCH 06/20] Add evaluation example

---
 README.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index c460214..1066ff6 100644
--- a/README.md
+++ b/README.md
@@ -103,7 +103,15 @@ conforming to MEDS binary classification prediction schema:
 ./MEDS-DEV/src/MEDS_DEV/helpers/generate_predictions.sh $MEDS_ROOT_DIR $TASK_NAME
 ```
 
-### TODO evaluate the model
+### Evaluate the model
+
+You can use the `meds-evaluation` package by running `meds-evaluation-cli` and providing the path to predictions
+dataframe as well as the output directory. For example,
+
+```bash
+meds-evaluation-cli predictions_path="./meds_dataset/task_predictions/mortality/in_icu/first_24h/train/0.parquet"
+\ output_dir="./meds_dataset/task_evaluation/mortality/in_icu/first_24h/train/"
+```
 
 ## Contributing to MEDS-DEV
 

From 791d18f19a1ab25f2659cf3dde127a87e9616e42 Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 10 Sep 2024 17:19:18 -0400
Subject: [PATCH 07/20] Pre-commit fixes

---
 .pre-commit-config.yaml | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 5c5591c..6140fff 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -125,8 +125,4 @@ repos:
       - id: nbqa-isort
         args: ["--profile=black"]
       - id: nbqa-flake8
-        args:
-          [
-            "--extend-ignore=E203,E402,E501,F401,F841",
-            "--exclude=logs/*,data/*",
-          ]
+        args: ["--extend-ignore=E203,E402,E501,F401,F841", "--exclude=logs/*,data/*"]

From 5ffef7d9053c873d823bacd19a83125732219c4d Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 10 Sep 2024 17:23:18 -0400
Subject: [PATCH 08/20] Clarify output format.

---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 1066ff6..f49d442 100644
--- a/README.md
+++ b/README.md
@@ -113,6 +113,10 @@ meds-evaluation-cli predictions_path="./meds_dataset/task_predictions/mortality/
 \ output_dir="./meds_dataset/task_evaluation/mortality/in_icu/first_24h/train/"
 ```
 
+This will create a JSON file with the results in the directory provided by the `output_dir` argument.
+
+Note this package currently supports binary classification only.
+
 ## Contributing to MEDS-DEV
 
 ### To Add a Model

From 5756038e44acfc674cb7cfbfa9dac0b82b5afb7a Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 10 Sep 2024 17:29:40 -0400
Subject: [PATCH 09/20] Change arguments to more generic ones.

---
 README.md | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)
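
Reviewer note: the generic README command in this patch uses placeholders such as `<train|tuning|held_out>`. The sketch below shows one way to fill them in for a single shard. Only the `predictions_path` and `output_dir` arguments come from the README. The function name `evaluation_command` and the default shard name `0.parquet` (borrowed from the earlier concrete example) are assumptions, and the sketch only builds the command without running it.

```python
import shlex


def evaluation_command(meds_root_dir: str, task_name: str, split: str, shard: str = "0.parquet") -> list[str]:
    """Hypothetical helper: assemble the meds-evaluation-cli call for one shard."""
    predictions_path = f"{meds_root_dir}/task_predictions/{task_name}/{split}/{shard}"
    output_dir = f"{meds_root_dir}/task_evaluation/{task_name}/{split}/"
    return [
        "meds-evaluation-cli",
        f"predictions_path={predictions_path}",
        f"output_dir={output_dir}",
    ]


command = evaluation_command("./meds_dataset", "mortality/in_icu/first_24h", "train")
print(shlex.join(command))
```

Passing the list to `subprocess.run` avoids the line-continuation pitfalls of a hand-written multi-line shell command.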

diff --git a/README.md b/README.md
index f49d442..0a39a06 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,8 @@ This repository contains the dataset, task, model training recipes, and results
 effort for EHR machine learning.
 
 Note that this repository is _not_ a place where functional code is stored. Rather, this repository stores
-configuration files, training recipes, results, etc. for the MEDS-DEV benchmarking effort -- runnable code will
+configuration files, training recipes, results, etc. for the MEDS-DEV benchmarking effort -- runnable code
+will
 often come from other repositories, with suitable permalinks being present in the various configuration files
 or commit messages for associated contributions to this repository.
 
@@ -56,28 +57,33 @@ Task-related information is stored in Hydra configuration files (in `.yaml` form
 Task names are defined in a way that corresponds to the path to their configuration,
 starting from the `MEDS-DEV/src/MEDS_DEV/tasks/criteria` directory.
 For example,
-`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME` of
+`MEDS-DEV/src/MEDS_DEV/tasks/criteria/mortality/in_icu/first_24h.yaml` directory corresponds to a `$TASK_NAME`
+of
 `mortality/in_icu/first_24h`.
 
 **To add a task**
 
-If your task is not supported, you will need to add a directory and define an appropriate configuration file in
+If your task is not supported, you will need to add a directory and define an appropriate configuration file
+in
 a corresponding location.
 
 ### Dataset configuration file
 
-Task configuration files are incomplete, because some concepts (predicates) have to be defined in a dataset-specific
+Task configuration files are incomplete, because some concepts (predicates) have to be defined in a
+dataset-specific
 way (e.g. `icu_admission` in `mortality/in_icu/first_24h`).
 
 These dataset-specific predicate definitions are found in
 `MEDS-DEV/src/MEDS_DEV/datasets/$DATASET_NAME/predicates.yaml` Hydra configuration files.
 
-In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory ready (i.e.
+In addition to `$DATASET_NAME` (e.g. `MIMIC-IV`), you will also need to have your MEDS dataset directory
+ready (i.e.
 `$MEDS_ROOT_DIR`).
 
 **To add a dataset configuration file**
 
-If your dataset is not supported, you will need to add a directory and define an appropriate configuration file in
+If your dataset is not supported, you will need to add a directory and define an appropriate configuration
+file in
 a corresponding location.
 
 ### Run the MEDS task extraction helper
@@ -89,14 +95,16 @@ From your project directory (`$MY_MEDS_PROJECT_ROOT`) where `MEDS-DEV` is locate
 ```
 
 This will use information from task and dataset-specific predicate configs to extract cohorts and labels from
-`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining the same
+`$MEDS_ROOT_DIR/data`, and place them in `$MEDS_ROOT_DIR/task_labels/$TASK_NAME/` subdirectories, retaining
+the same
 sharded structure as the `$MEDS_ROOT_DIR/data` directory.
 
 ### Train the model
 
 This step depends on the API of your particular model.
 
-For example, the command below will call a helper script that will generate random outputs for binary classification,
+For example, the command below will call a helper script that will generate random outputs for binary
+classification,
 conforming to MEDS binary classification prediction schema:
 
 ```bash
@@ -105,12 +113,14 @@ conforming to MEDS binary classification prediction schema:
 
 ### Evaluate the model
 
-You can use the `meds-evaluation` package by running `meds-evaluation-cli` and providing the path to predictions
+You can use the `meds-evaluation` package by running `meds-evaluation-cli` and providing the path to
+predictions
 dataframe as well as the output directory. For example,
 
 ```bash
-meds-evaluation-cli predictions_path="./meds_dataset/task_predictions/mortality/in_icu/first_24h/train/0.parquet"
-\ output_dir="./meds_dataset/task_evaluation/mortality/in_icu/first_24h/train/"
+meds-evaluation-cli  \
+predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
+output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
 ```
 
 This will create a JSON file with the results in the directory provided by the `output_dir` argument.

From c2df0738f43266a1c5d489929c44312ff4bdce2f Mon Sep 17 00:00:00 2001
From: kamilest <stankeviciute.kamile@gmail.com>
Date: Tue, 8 Oct 2024 16:52:26 -0400
Subject: [PATCH 10/20] Spacing

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 0a39a06..14f166a 100644
--- a/README.md
+++ b/README.md
@@ -119,8 +119,8 @@ dataframe as well as the output directory. For example,
 
 ```bash
 meds-evaluation-cli  \
-predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
-output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
+      predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
+      output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
 ```
 
 This will create a JSON file with the results in the directory provided by the `output_dir` argument.

From b2affacdf9a0d5072a2b15c2307cf72a654a51fe Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 09:42:46 -0400
Subject: [PATCH 11/20] Updated pre-commit-config

---
 .pre-commit-config.yaml | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 6140fff..5c5591c 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -125,4 +125,8 @@ repos:
       - id: nbqa-isort
         args: ["--profile=black"]
       - id: nbqa-flake8
-        args: ["--extend-ignore=E203,E402,E501,F401,F841", "--exclude=logs/*,data/*"]
+        args:
+          [
+            "--extend-ignore=E203,E402,E501,F401,F841",
+            "--exclude=logs/*,data/*",
+          ]

From 26b439f862bb5120f136d7915720390f0a6cf362 Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 09:47:39 -0400
Subject: [PATCH 12/20] Added simple doctest

---
 .../helpers/generate_random_predictions.py    | 34 ++++++++++++++++---
 1 file changed, 30 insertions(+), 4 deletions(-)
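
Reviewer note: with `seed: int = 1` as the default and the caller not passing a seed, every shard is generated from the same random stream, so shards of equal length receive identical probabilities. That is fine for a doctest, but it may be worth knowing when the helper is used as a baseline. The sketch below illustrates the effect. It uses the standard library's `random.Random` as a stand-in for `np.random.default_rng` to stay dependency-free, and the function names are hypothetical.

```python
import random


def probabilities_fixed_seed(n: int, seed: int = 1) -> list[float]:
    """Create a fresh generator on every call, as the patched helper does."""
    rng = random.Random(seed)  # stand-in for np.random.default_rng(seed)
    return [rng.uniform(0, 1) for _ in range(n)]


def probabilities_shared_rng(n: int, rng: random.Random) -> list[float]:
    """Alternative: draw from one generator that the caller creates once."""
    return [rng.uniform(0, 1) for _ in range(n)]


# Two "shards" generated with the default seed are identical.
shard_a = probabilities_fixed_seed(3)
shard_b = probabilities_fixed_seed(3)
print(shard_a == shard_b)  # True

# Sharing one generator keeps the run reproducible while the shards differ.
shared = random.Random(1)
shard_c = probabilities_shared_rng(3, shared)
shard_d = probabilities_shared_rng(3, shared)
print(shard_c == shard_d)  # False
```

Either approach keeps the doctest deterministic. The second one only changes how the generator is threaded through the per-shard loop.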

diff --git a/src/MEDS_DEV/helpers/generate_random_predictions.py b/src/MEDS_DEV/helpers/generate_random_predictions.py
index a07a26f..5502438 100644
--- a/src/MEDS_DEV/helpers/generate_random_predictions.py
+++ b/src/MEDS_DEV/helpers/generate_random_predictions.py
@@ -50,12 +50,38 @@ def generate_random_predictions(cfg: DictConfig) -> None:
             predictions.write_parquet(predictions_path / split.name)
 
 
-def _generate_random_predictions(dataframe: pl.DataFrame) -> pl.DataFrame:
-    """Creates a new dataframe with the same subject_id and boolean_value columns as in the input dataframe,
-    along with predictions."""
+def _generate_random_predictions(dataframe: pl.DataFrame, seed: int = 1) -> pl.DataFrame:
+    """Augments the input dataframe with random predictions.
+
+    Args:
+        dataframe: Input dataframe with at least the columns: [subject_id, prediction_time, boolean_value]
+        seed: Seed for the random number generator.
+
+    Returns:
+        An augmented dataframe with the boolean value and probability columns.
+
+    Example:
+        >>> df = pl.DataFrame({
+        ...     "subject_id": [1, 2, 3],
+        ...     "prediction_time": [0, 1, 2],
+        ...     "boolean_value": [True, False, True]
+        ... })
+        >>> _generate_random_predictions(df).drop(["prediction_time", "boolean_value"])
+        shape: (3, 3)
+        ┌────────────┬─────────────────────────┬───────────────────────────────┐
+        │ subject_id ┆ predicted_boolean_value ┆ predicted_boolean_probability │
+        │ ---        ┆ ---                     ┆ ---                           │
+        │ i64        ┆ bool                    ┆ f64                           │
+        ╞════════════╪═════════════════════════╪═══════════════════════════════╡
+        │ 1          ┆ true                    ┆ 0.511822                      │
+        │ 2          ┆ true                    ┆ 0.950464                      │
+        │ 3          ┆ false                   ┆ 0.14416                       │
+        └────────────┴─────────────────────────┴───────────────────────────────┘
+    """
 
     output = dataframe.select([SUBJECT_ID, PREDICTION_TIME, BOOLEAN_VALUE_COLUMN])
-    probabilities = np.random.uniform(0, 1, len(dataframe))
+    rng = np.random.default_rng(seed)
+    probabilities = rng.uniform(0, 1, len(dataframe))
     # TODO: meds-evaluation currently cares about the order of columns and types, so the new columns have to
     #  be inserted at the correct position and cast to the correct type
     output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean))

From 07ab03ccd5057f9e2566b549289033851b9c5fd8 Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 09:48:44 -0400
Subject: [PATCH 13/20] Updated pre-commit-config

---
 .pre-commit-config.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 5c5591c..38d66f1 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -5,7 +5,7 @@ exclude: "docs/index.md"
 
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.4.0
+    rev: v5.0.0
     hooks:
       # list of supported hooks: https://pre-commit.com/hooks.html
       - id: trailing-whitespace

From e974c1fa1e9294db04f058a5f3446a507a61aa0a Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 09:59:42 -0400
Subject: [PATCH 14/20] Freeze pre-commit version until docformatter pushes a
 new release.

---
 pyproject.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pyproject.toml b/pyproject.toml
index cc51b8b..649eef6 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -33,7 +33,7 @@ dependencies = ["meds==0.3.3", "es-aces==0.5.0"]
 [tool.setuptools_scm]
 
 [project.optional-dependencies]
-dev = ["pre-commit"]
+dev = ["pre-commit<4"]
 tests = ["pytest", "pytest-cov", "rootutils"]
 docs = [
   "mkdocs==1.6.0", "mkdocs-material==9.5.31", "mkdocstrings[python,shell]==0.25.2", "mkdocs-gen-files==0.5.0",

From f66880d98f9c7c587c42a6ea11ceae3c038e7bff Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 10:03:03 -0400
Subject: [PATCH 15/20] Update workflow files to install correct pre-commit
 version.

---
 .github/workflows/code-quality-main.yaml | 4 ++++
 .github/workflows/code-quality-pr.yaml   | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/.github/workflows/code-quality-main.yaml b/.github/workflows/code-quality-main.yaml
index ba2caf4..fd3195f 100644
--- a/.github/workflows/code-quality-main.yaml
+++ b/.github/workflows/code-quality-main.yaml
@@ -20,5 +20,9 @@ jobs:
         with:
           python-version: "3.10"
 
+      - name: Install packages
+        run: |
+          pip install -e .[dev]
+
       - name: Run pre-commits
         uses: pre-commit/action@v3.0.1
diff --git a/.github/workflows/code-quality-pr.yaml b/.github/workflows/code-quality-pr.yaml
index 9a33678..f94fc04 100644
--- a/.github/workflows/code-quality-pr.yaml
+++ b/.github/workflows/code-quality-pr.yaml
@@ -23,6 +23,10 @@ jobs:
         with:
           python-version: "3.10"
 
+      - name: Install packages
+        run: |
+          pip install -e .[dev]
+
       - name: Find modified files
         id: file_changes
         uses: trilom/file-changes-action@v1.2.4

From beba3ee7b894707ff41bfa557db9f8f8239e1159 Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 10:05:08 -0400
Subject: [PATCH 16/20] Updating workflows

---
 .github/workflows/code-quality-main.yaml | 6 +++---
 .github/workflows/code-quality-pr.yaml   | 4 ++--
 .github/workflows/python-build.yaml      | 2 +-
 .github/workflows/tests.yaml             | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/code-quality-main.yaml b/.github/workflows/code-quality-main.yaml
index fd3195f..c79a12b 100644
--- a/.github/workflows/code-quality-main.yaml
+++ b/.github/workflows/code-quality-main.yaml
@@ -13,10 +13,10 @@ jobs:
 
     steps:
       - name: Checkout
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
 
-      - name: Set up Python 3.10
-        uses: actions/setup-python@v3
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: "3.10"
 
diff --git a/.github/workflows/code-quality-pr.yaml b/.github/workflows/code-quality-pr.yaml
index f94fc04..8e2e4f1 100644
--- a/.github/workflows/code-quality-pr.yaml
+++ b/.github/workflows/code-quality-pr.yaml
@@ -16,10 +16,10 @@ jobs:
 
     steps:
       - name: Checkout
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
 
       - name: Set up Python 3.10
-        uses: actions/setup-python@v3
+        uses: actions/setup-python@v5
         with:
           python-version: "3.10"
 
diff --git a/.github/workflows/python-build.yaml b/.github/workflows/python-build.yaml
index 3f3c96d..b22ff87 100644
--- a/.github/workflows/python-build.yaml
+++ b/.github/workflows/python-build.yaml
@@ -10,7 +10,7 @@ jobs:
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: "3.10"
       - name: Install pypa/build
diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml
index 3faf789..31a46da 100644
--- a/.github/workflows/tests.yaml
+++ b/.github/workflows/tests.yaml
@@ -17,10 +17,10 @@ jobs:
 
     steps:
       - name: Checkout
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
 
       - name: Set up Python 3.10
-        uses: actions/setup-python@v3
+        uses: actions/setup-python@v5
         with:
           python-version: "3.10"
 

From ae6f1d479d210c04532881a6e3e4f3217f0f380d Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 10:12:27 -0400
Subject: [PATCH 17/20] Updating workflows

---
 .github/workflows/code-quality-main.yaml | 2 +-
 .github/workflows/code-quality-pr.yaml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/code-quality-main.yaml b/.github/workflows/code-quality-main.yaml
index c79a12b..874da12 100644
--- a/.github/workflows/code-quality-main.yaml
+++ b/.github/workflows/code-quality-main.yaml
@@ -22,7 +22,7 @@ jobs:
 
       - name: Install packages
         run: |
-          pip install -e .[dev]
+          pip install .[dev]
 
       - name: Run pre-commits
         uses: pre-commit/action@v3.0.1
diff --git a/.github/workflows/code-quality-pr.yaml b/.github/workflows/code-quality-pr.yaml
index 8e2e4f1..bee2e11 100644
--- a/.github/workflows/code-quality-pr.yaml
+++ b/.github/workflows/code-quality-pr.yaml
@@ -25,7 +25,7 @@ jobs:
 
       - name: Install packages
         run: |
-          pip install -e .[dev]
+          pip install .[dev]
 
       - name: Find modified files
         id: file_changes

From ae744c60d5ec212b1121453602cf5414f9a7dae8 Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 10:15:15 -0400
Subject: [PATCH 18/20] Updating README for pre-commit nonsense.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 14f166a..afd925f 100644
--- a/README.md
+++ b/README.md
@@ -119,8 +119,8 @@ dataframe as well as the output directory. For example,
 
 ```bash
 meds-evaluation-cli  \
-      predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
-      output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
+       predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
+       output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
 ```
 
 This will create a JSON file with the results in the directory provided by the `output_dir` argument.

From 94cf21677365f1182ee7d87e26b9ff2bcd6f7418 Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 10:17:54 -0400
Subject: [PATCH 19/20] Updating README for pre-commit nonsense.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index afd925f..b0ec788 100644
--- a/README.md
+++ b/README.md
@@ -118,7 +118,7 @@ predictions
 dataframe as well as the output directory. For example,
 
 ```bash
-meds-evaluation-cli  \
+meds-evaluation-cli \
        predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
        output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
 ```

From 96c9bacf78297d2d90cdfbe8231140fc3f0265e3 Mon Sep 17 00:00:00 2001
From: Matthew McDermott <mattmcdermott8@gmail.com>
Date: Thu, 10 Oct 2024 10:22:08 -0400
Subject: [PATCH 20/20] Updating README for pre-commit nonsense.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index b0ec788..0d9c209 100644
--- a/README.md
+++ b/README.md
@@ -119,8 +119,8 @@ dataframe as well as the output directory. For example,
 
 ```bash
 meds-evaluation-cli \
-       predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
-       output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
+	predictions_path="./<$MEDS_ROOT_DIR>/task_predictions/$TASK_NAME/<train|tuning|held_out>/*.parquet" \
+	output_dir="./<$MEDS_ROOT_DIR>/task_evaluation/$TASK_NAME/<train|tuning|held_out>/..."
 ```
 
 This will create a JSON file with the results in the directory provided by the `output_dir` argument.