Add a basic example workflow. #16
Conversation
Walkthrough

The recent changes encompass updates to the README, the pre-commit configuration, and new helper scripts and configuration files for generating random predictions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ShellScript
    participant PythonModule
    participant Config
    User->>ShellScript: Execute generate_predictions.sh
    ShellScript->>Config: Load environment variables
    ShellScript->>PythonModule: Call generate_random_predictions
    PythonModule->>Config: Read configuration settings
    PythonModule->>Dataframe: Load datasets
    Dataframe->>PythonModule: Provide data
    PythonModule->>Dataframe: Generate random predictions
    PythonModule->>Output: Save predictions
```
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (2)
src/MEDS_DEV/helpers/generate_random_predictions.py (1)

1-68: Suggestion: Enhance error handling and logging.

Consider adding the following enhancements, sketched below, to improve the robustness and debuggability of the script:

- Add error handling using `try`/`except` blocks to gracefully handle potential exceptions, such as file I/O errors or invalid configurations.
- Use the `logging` module to add informative log messages at different levels (e.g., `debug`, `info`, `warning`, `error`) to help with debugging and monitoring.
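As a rough illustration only — not the PR's actual code — here is one way those two suggestions might combine, assuming the per-shard `pl.read_parquet` loop shown later in this review:

```python
import logging

import polars as pl

logger = logging.getLogger(__name__)


def read_shard(file) -> pl.DataFrame | None:
    """Read one parquet shard, logging failures instead of crashing."""
    try:
        dataframe = pl.read_parquet(file)
        logger.debug("Loaded %d rows from %s", len(dataframe), file)
        return dataframe
    except Exception as e:  # broad catch, mirroring the review's own suggestion
        logger.error("Could not read %s: %s", file, e)
        return None
```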
README.md (1)
24-24: Suggestion: Minor formatting and punctuation adjustments.

Consider making the following minor adjustments based on the static analysis hints:

- Line 24: Add a comma after "Additionally" to improve readability.
- Line 72: Remove the comma before "because" since the clause is essential to the meaning.

Also applies to: 72-72
Tools
LanguageTool
[uncategorized] ~24-~24: Possible missing comma found.
Context: ... conda activate $MY_MEDS_CONDA_ENV ``` Additionally install any model-related dependencies....(AI_HYDRA_LEO_MISSING_COMMA)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- .pre-commit-config.yaml (1 hunks)
- README.md (1 hunks)
- src/MEDS_DEV/configs/predictions.yaml (1 hunks)
- src/MEDS_DEV/helpers/generate_predictions.sh (1 hunks)
- src/MEDS_DEV/helpers/generate_random_predictions.py (1 hunks)
Files skipped from review due to trivial changes (2)
- .pre-commit-config.yaml
- src/MEDS_DEV/configs/predictions.yaml
Additional context used
LanguageTool
README.md
[uncategorized] ~24-~24: Possible missing comma found.
Context: ... conda activate $MY_MEDS_CONDA_ENV ``` Additionally install any model-related dependencies....(AI_HYDRA_LEO_MISSING_COMMA)
[formatting] ~72-~72: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause.
Context: ... Task configuration files are incomplete, because some concepts (predicates) have to be d...(COMMA_BEFORE_BECAUSE)
Markdownlint
README.md
64-64: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
83-83: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
Additional comments not posted (3)
src/MEDS_DEV/helpers/generate_predictions.sh (1)

1-16: LGTM!

The bash script follows best practices and is well-structured. The use of environment variables for configuration is a good practice.

src/MEDS_DEV/helpers/generate_random_predictions.py (1)

1-68: LGTM overall!

The Python script is well-structured and modular. The use of `hydra` for configuration management is a good practice. The script handles different split types and file structures correctly.

README.md (1)
Line range hint 1-129: Significant improvements in usability and clarity!

The enhancements to the README.md file greatly improve the usability and clarity of the project. The detailed instructions, examples, and well-organized content will help users get started quickly and follow the workflow smoothly. Great job!

Tools
LanguageTool
[uncategorized] ~24-~24: Possible missing comma found.
Context: ... conda activate $MY_MEDS_CONDA_ENV ``` Additionally install any model-related dependencies....(AI_HYDRA_LEO_MISSING_COMMA)
[formatting] ~72-~72: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause.
Context: ... Task configuration files are incomplete, because some concepts (predicates) have to be d...(COMMA_BEFORE_BECAUSE)
Markdownlint
64-64: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
83-83: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
@kamilest can you fix the code quality issues that are causing that workflow to fail? See here for details: https://github.com/mmcdermott/MEDS-DEV/actions/runs/10800915362/job/29959946112?pr=16
```bash
MEDS_DEV_REPO_DIR=$(python -c "from importlib.resources import files; print(files(\"MEDS_DEV\"))")
export MEDS_DEV_REPO_DIR

# TODO improve efficiency of prediction generator by using this
```
What is this TODO for? I'm not sure I understand.
At the moment I'm reimplementing recursive search through the directory to find shards for which to generate the predictions, but I think this could be improved with the `expand_shards` helper you have implemented.
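For context, a minimal sketch of the hand-rolled recursive search being described (the `expand_shards` helper's actual signature isn't shown in this thread, so it is not reproduced here):

```python
from pathlib import Path

import polars as pl


def iter_shards(cohort_dir: Path):
    """Yield (shard_path, dataframe) for every parquet shard under cohort_dir.

    This mirrors the recursive search reimplemented in the script; swapping in
    the repo's expand_shards helper would replace this directory walk.
    """
    for shard in sorted(cohort_dir.rglob("*.parquet")):
        yield shard, pl.read_parquet(shard)
```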
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (7)
README.md (7)
12-26: Great addition: Example workflow section

The new "Example workflow" section is a valuable addition to the README. It provides clear instructions for setting up the MEDS project environment, including helpful command-line examples in code blocks.

Consider adding a brief explanation of what `$MY_MEDS_PROJECT_ROOT` and `$MY_MEDS_CONDA_ENV` represent, along the lines of the sketch below, to ensure users understand how to customize these variables for their setup.
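For instance (the values here are illustrative placeholders, not part of the PR):

```bash
# Hypothetical values -- both variables are user-chosen:
export MY_MEDS_PROJECT_ROOT="$HOME/meds-project"  # where MEDS data and outputs will live
export MY_MEDS_CONDA_ENV="meds-dev"               # name of the conda environment to create

conda create -n "$MY_MEDS_CONDA_ENV" python=3.10
conda activate "$MY_MEDS_CONDA_ENV"
```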
27-43: Excellent: Clear installation instructions

The installation instructions for MEDS-DEV and the MEDS evaluation package are clear, concise, and well-formatted. The use of code blocks for command-line examples enhances usability.

Consider adding a note about potential system requirements or dependencies that users should have installed before proceeding with these steps.
47-69: Good addition: Task extraction and configuration information

This section provides valuable information about extracting tasks from the MEDS dataset, including details on configuration files and naming conventions. The guidelines for adding a new task are a helpful inclusion.

Consider the following improvements:

- Expand the section on adding a new task with more specific steps or a brief example.
- Add a note about any potential limitations or considerations when creating custom tasks.
🧰 Tools
🪛 Markdownlint
64-64: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
70-88: Good addition: Dataset configuration information

This section provides useful information about dataset configuration files and how to add new ones. The explanation of dataset-specific predicate definitions is clear and helpful.

Consider the following improvements:

- Expand the instructions for adding a new dataset configuration file with more specific steps or a brief example.
- In line 72, remove the comma before "because" as it's not necessary in this context: "Task configuration files are incomplete because some concepts (predicates) have to be defined in a dataset-specific way".
- Add a brief explanation of what predicates are in this context, as it might not be clear to all users.
🧰 Tools
🪛 LanguageTool
[formatting] ~72-~72: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause.
Context: ... Task configuration files are incomplete, because some concepts (predicates) have to be d...(COMMA_BEFORE_BECAUSE)
🪛 Markdownlint
83-83: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
89-101: Excellent: Clear instructions for task extraction

This section provides clear and detailed instructions for running the MEDS task extraction helper. The command-line example and explanation of the extraction process and output location are particularly helpful.

Consider adding a note about any prerequisites or dependencies required for running the task extraction helper, such as having the MEDS dataset already downloaded and properly structured in the $MEDS_ROOT_DIR.
114-128: Excellent: Clear model evaluation instructions

This section provides clear and detailed instructions for evaluating the model using the meds-evaluation package. The command-line example and explanation of the output are particularly helpful.

Consider the following improvements:

- Add a brief explanation of what the evaluation metrics are and how to interpret the results in the JSON file.
- In line 127, add a comma after "Note" for better readability: "This will create a JSON file with the results in the directory provided by the `output_dir` argument. Note, this package currently supports binary classification only."
- Consider mentioning any plans for supporting other types of classification or regression tasks in future versions of the package.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~127-~127: Possible missing comma found.
Context: ...provided by theoutput_dir
argument. Note this package currently supports binary ...(AI_HYDRA_LEO_MISSING_COMMA)
Line range hint 1-128: Overall: Excellent improvements to documentation

The changes to the README.md file significantly enhance the documentation for the MEDS-DEV project. The new sections provide a clear, structured workflow that covers installation, task extraction, model training, and evaluation. These additions will greatly improve the user experience and make it easier for new contributors to get started with the project.

To further improve the documentation, consider adding a table of contents at the beginning of the README, along the lines of the sketch below. This would help users quickly navigate to specific sections of interest, especially as the document grows with more detailed instructions and examples.
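A possible shape for such a table of contents (section names are guesses based on the sections this review describes, not the README's actual headings):

```markdown
## Table of Contents

- [Example workflow](#example-workflow)
- [Installation](#installation)
- [Task extraction](#task-extraction)
- [Model training](#model-training)
- [Evaluation](#evaluation)
```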
🧰 Tools
🪛 LanguageTool
[formatting] ~72-~72: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause.
Context: ... Task configuration files are incomplete, because some concepts (predicates) have to be d...(COMMA_BEFORE_BECAUSE)
[uncategorized] ~127-~127: Possible missing comma found.
Context: ...provided by theoutput_dir
argument. Note this package currently supports binary ...(AI_HYDRA_LEO_MISSING_COMMA)
🪛 Markdownlint
64-64: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
83-83: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md
[formatting] ~72-~72: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause.
Context: ... Task configuration files are incomplete, because some concepts (predicates) have to be d...(COMMA_BEFORE_BECAUSE)
[uncategorized] ~127-~127: Possible missing comma found.
Context: ...provided by theoutput_dir
argument. Note this package currently supports binary ...(AI_HYDRA_LEO_MISSING_COMMA)
🪛 Markdownlint
README.md
64-64: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
83-83: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
🔇 Additional comments (2)
README.md (2)
7-8: LGTM: Improved readability

The text split improves readability without changing the content. This is a good formatting adjustment.
102-112: Good addition: Model training section

This section provides a helpful starting point for model training, acknowledging that the process may vary depending on the specific model's API. The example command for generating random outputs is useful for testing purposes.

Could you please clarify the purpose of the random output generation step? Is this intended as a placeholder example, or is it a legitimate part of the workflow for testing or baseline comparison? Consider adding a brief explanation of when and why users might want to use this random output generation.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (1)
src/MEDS_DEV/helpers/generate_random_predictions.py (1)

30-30: Implement the `expand_shards` helper to access sharded data directly.

The TODO comment at line 30 indicates the need to use the `expand_shards` helper for accessing sharded dataframes directly. Implementing this will enhance code efficiency and maintainability.

Would you like assistance in implementing the `expand_shards` helper or opening a GitHub issue to track this task?
```python
dataframe = pl.read_parquet(file)
predictions = _generate_random_predictions(dataframe)  # sharded dataframes
```
Add error handling for file I/O operations.
Reading parquet files without error handling may cause the script to crash if a file is missing or corrupted. Incorporate try-except blocks around file I/O operations to handle exceptions gracefully and provide informative error messages.
Apply this diff to add error handling:
```diff
 for split in cohort_dir.iterdir():
     if split.is_dir() and split.name in {"train", "tuning", "held_out"}:
         for file in split.iterdir():
             if file.is_file():
-                dataframe = pl.read_parquet(file)
-                predictions = _generate_random_predictions(dataframe)
+                try:
+                    dataframe = pl.read_parquet(file)
+                    predictions = _generate_random_predictions(dataframe)
+                except Exception as e:
+                    print(f"Error processing {file}: {e}")
+                    continue
                 # Rest of the code...
     elif split.is_file():
-        dataframe = pl.read_parquet(split)
-        predictions = _generate_random_predictions(dataframe)
+        try:
+            dataframe = pl.read_parquet(split)
+            predictions = _generate_random_predictions(dataframe)
+        except Exception as e:
+            print(f"Error processing {split}: {e}")
+            return
```
Also applies to: 44-45
```python
probabilities = rng.uniform(0, 1, len(dataframe))
# TODO: meds-evaluation currently cares about the order of columns and types, so the new columns have to
# be inserted at the correct position and cast to the correct type
output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean))
```
🛠️ Refactor suggestion
Generate boolean predictions using a threshold comparison.

Rounding uniform probabilities may not produce an unbiased random boolean outcome. Using a threshold comparison ensures a fair distribution of `True` and `False` values.
Apply this diff to update the boolean value generation:

```diff
-output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean))
+predicted_values = probabilities > 0.5
+output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, predicted_values))
```

Alternatively, use `rng.choice` for direct boolean generation:

```diff
-probabilities = rng.uniform(0, 1, len(dataframe))
+predicted_values = rng.choice([True, False], size=len(dataframe))
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
predicted_values = probabilities > 0.5
output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, predicted_values))
```
```python
for split in cohort_dir.iterdir():
    if split.is_dir() and split.name in {"train", "tuning", "held_out"}:  # train | tuning | held_out
        for file in split.iterdir():
            if file.is_file():
                dataframe = pl.read_parquet(file)
                predictions = _generate_random_predictions(dataframe)  # sharded dataframes

                # $MEDS_ROOT_DIR/task_predictions/$TASK_NAME/<split>/<file>.parquet
                predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
                os.makedirs(predictions_path, exist_ok=True)

                predictions.write_parquet(predictions_path / file.name)
    elif split.is_file():
        dataframe = pl.read_parquet(split)
        predictions = _generate_random_predictions(dataframe)

        predictions_path = Path(cohort_predictions_dir) / cohort_name
        os.makedirs(predictions_path, exist_ok=True)

        predictions.write_parquet(predictions_path / split.name)
```
🛠️ Refactor suggestion
Refactor duplicated code to improve maintainability.
The code segments handling directory and file inputs contain duplicated logic, particularly in reading dataframes and writing predictions. Refactoring this section by extracting common functionality into helper functions will reduce code duplication and enhance readability.
Apply this refactor to consolidate duplicated code:
```diff
 def generate_random_predictions(cfg: DictConfig) -> None:
     # Existing code above...
+    def process_file(file, predictions_path):
+        dataframe = pl.read_parquet(file)
+        predictions = _generate_random_predictions(dataframe)
+        os.makedirs(predictions_path, exist_ok=True)
+        predictions.write_parquet(predictions_path / file.name)
+
     for split in cohort_dir.iterdir():
         if split.is_dir() and split.name in {"train", "tuning", "held_out"}:
             for file in split.iterdir():
                 if file.is_file():
-                    dataframe = pl.read_parquet(file)
-                    predictions = _generate_random_predictions(dataframe)
-
-                    predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
-                    os.makedirs(predictions_path, exist_ok=True)
-
-                    predictions.write_parquet(predictions_path / file.name)
+                    predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
+                    process_file(file, predictions_path)
         elif split.is_file():
-            dataframe = pl.read_parquet(split)
-            predictions = _generate_random_predictions(dataframe)
-
             predictions_path = Path(cohort_predictions_dir) / cohort_name
-            os.makedirs(predictions_path, exist_ok=True)
-
-            predictions.write_parquet(predictions_path / split.name)
+            process_file(split, predictions_path)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def process_file(file, predictions_path):
    dataframe = pl.read_parquet(file)
    predictions = _generate_random_predictions(dataframe)
    os.makedirs(predictions_path, exist_ok=True)
    predictions.write_parquet(predictions_path / file.name)


for split in cohort_dir.iterdir():
    if split.is_dir() and split.name in {"train", "tuning", "held_out"}:  # train | tuning | held_out
        for file in split.iterdir():
            if file.is_file():
                predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
                process_file(file, predictions_path)
    elif split.is_file():
        predictions_path = Path(cohort_predictions_dir) / cohort_name
        process_file(split, predictions_path)
```
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
.pre-commit-config.yaml (1)

Python 3.10 is not available in the environment.

The pre-commit configuration specifies Python 3.10, but it is not installed. This may cause pre-commit hooks to fail.

- Ensure Python 3.10 is installed in all development and CI environments.
- Alternatively, update `.pre-commit-config.yaml` to use an available Python version.

🔗 Analysis chain
Line range hint 1-2: Approve Python version specification and suggest verification.

Explicitly setting the default Python version to 3.10 is a good practice for ensuring consistency across different environments.
To ensure project-wide compatibility and clarity:

- Verify that all project dependencies are compatible with Python 3.10.
- Update the project's README or documentation to clearly state the Python version requirement.
- Ensure that all development and production environments can support Python 3.10.

You can use the following command to check your current Python version and compare it with the specified version. Additionally, review your `requirements.txt` or `setup.py` file to ensure all dependencies support Python 3.10. If you're using a virtual environment, make sure it's created with Python 3.10 (see the scripts below).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
python --version
```

Length of output: 92

Script:

```shell
cat requirements.txt # or cat setup.py
```

Length of output: 127

Script:

```shell
python3.10 -m venv venv
```

Length of output: 103

Script:

```shell
#!/bin/bash
# Check for alternative Python executables
which python3
which python3.10

# Search for alternative dependency files
fd --extension toml --type f pyproject.toml
fd --extension toml --type f Pipfile

# Check for Python scripts in the repository
fd --extension py --type f
```

Length of output: 474
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- .pre-commit-config.yaml (1 hunks)
🧰 Additional context used
🔇 Additional comments (2)
.pre-commit-config.yaml (2)
Line range hint 1-124: Overall configuration looks comprehensive and well-maintained.

The `.pre-commit-config.yaml` file contains a robust set of hooks for ensuring code quality, including formatting, linting, and various checks. The recent updates to the pre-commit-hooks version and the explicit Python version specification contribute to keeping the project up-to-date and consistent.

As a final note, consider periodically reviewing and updating the versions of other hooks in this file to ensure you're benefiting from the latest improvements and security updates across all tools.
8-8: Approve version update and suggest verification.

The update of pre-commit-hooks from v4.4.0 to v5.0.0 is a good practice to keep the project up-to-date with the latest features and improvements.
To ensure compatibility and take advantage of any new features:
- Please review the changelog for pre-commit-hooks v5.0.0 to understand any breaking changes or new features.
- Verify that all existing hooks still function as expected with this new version.
You can use the following command to check the hooks:
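The command itself was omitted from the page; the standard pre-commit invocation matching the description below is:

```shell
# Run every configured hook against the full repository
pre-commit run --all-files
```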
This will run all the hooks against all files in the repository, allowing you to catch any potential issues introduced by the version update.
This is a very rough version of the workflow for installing MEDS-DEV, setting up the task and dataset configurations, running a model (with a helper script for generating random binary classification predictions), and evaluating the results.
Of course this will need to be improved in terms of usability, functionality, and robustness.
Summary by CodeRabbit

New Features

- Updated the README.md to include comprehensive setup and usage instructions for the MEDS-DEV project.

Documentation

- Enhanced the README.md with detailed workflows, setup instructions, and contribution guidelines.