
Add a basic example workflow. #16

Merged
merged 21 commits into from
Oct 10, 2024

Conversation

@kamilest (Collaborator) commented on Sep 10, 2024

This is a very rough version of the workflow for installing MEDS-DEV, setting up the task and dataset configurations, running a model (with a helper script for generating random binary classification predictions), and evaluating the results.

Of course this will need to be improved in terms of usability, functionality, and robustness.

Summary by CodeRabbit

  • New Features

    • Expanded the README.md to include comprehensive setup and usage instructions for the MEDS-DEV project.
    • Introduced a new configuration file for predictions, enhancing flexibility in managing prediction directories.
    • Added a shell script to automate the generation of random predictions within the MEDS_DEV framework.
    • Implemented functionality for generating random predictions based on input datasets, streamlining the evaluation process.
  • Documentation

    • Enhanced README.md with detailed workflows, setup instructions, and contribution guidelines.
    • Clarified the repository's purpose and improved formatting for better readability.

@coderabbitai bot (Contributor) commented on Sep 10, 2024

Caution

Review failed

The head commit changed during the review from 94cf216 to 96c9bac.

Walkthrough

The recent changes encompass updates to the .pre-commit-config.yaml, README.md, and several files within the src/MEDS_DEV directory. Key modifications include the update of the pre-commit-hooks version, enhancements to the README for user guidance, and the introduction of new scripts and configuration files aimed at generating random predictions and managing prediction settings. These updates collectively enhance the structure and usability of the MEDS-DEV project.

Changes

| File | Change Summary |
| --- | --- |
| `.pre-commit-config.yaml` | Updated pre-commit-hooks version from v4.4.0 to v5.0.0 and set default Python version to python3.10. Consolidated args key format. |
| `README.md` | Expanded content with detailed setup instructions, workflow steps, and configuration guidance. |
| `src/MEDS_DEV/configs/predictions.yaml` | Introduced configuration for prediction settings, including directory management and logging. |
| `src/MEDS_DEV/helpers/generate_predictions.sh` | Added a script for generating random predictions, setting up environment variables, and invoking Python functionality. |
| `src/MEDS_DEV/helpers/generate_random_predictions.py` | Introduced a module for generating random predictions from dataframes, utilizing Hydra for configuration. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ShellScript
    participant PythonModule
    participant Config

    User->>ShellScript: Execute generate_predictions.sh
    ShellScript->>Config: Load environment variables
    ShellScript->>PythonModule: Call generate_random_predictions
    PythonModule->>Config: Read configuration settings
    PythonModule->>Dataframe: Load datasets
    Dataframe->>PythonModule: Provide data
    PythonModule->>Dataframe: Generate random predictions
    PythonModule->>Output: Save predictions
```

🐇 In a world of code, so bright and new,
I hop with joy, for changes are due!
From README to scripts, all polished and neat,
The MEDS-DEV project is now quite a treat!
With predictions to make and guidance so clear,
Let’s celebrate progress, give a cheer! 🎉



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (2)
src/MEDS_DEV/helpers/generate_random_predictions.py (1)

1-68: Suggestion: Enhance error handling and logging.

Consider adding the following enhancements to improve the robustness and debuggability of the script:

  • Add error handling using try/except blocks to gracefully handle potential exceptions, such as file I/O errors or invalid configurations.
  • Use the logging module to add informative log messages at different levels (e.g., debug, info, warning, error) to help with debugging and monitoring.
README.md (1)

24-24: Suggestion: Minor formatting and punctuation adjustments.

Consider making the following minor adjustments based on the static analysis hints:

  • Line 24: Add a comma after "Additionally" to improve readability.
  • Line 72: Remove the comma before "because" since the clause is essential to the meaning.

Also applies to: 72-72

Tools
LanguageTool

[uncategorized] ~24-~24: Possible missing comma found.
Context: ... conda activate $MY_MEDS_CONDA_ENV ``` Additionally install any model-related dependencies....

(AI_HYDRA_LEO_MISSING_COMMA)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e7119d1 and 5756038.

Files selected for processing (5)
  • .pre-commit-config.yaml (1 hunks)
  • README.md (1 hunks)
  • src/MEDS_DEV/configs/predictions.yaml (1 hunks)
  • src/MEDS_DEV/helpers/generate_predictions.sh (1 hunks)
  • src/MEDS_DEV/helpers/generate_random_predictions.py (1 hunks)
Files skipped from review due to trivial changes (2)
  • .pre-commit-config.yaml
  • src/MEDS_DEV/configs/predictions.yaml
Additional context used
LanguageTool
README.md

[uncategorized] ~24-~24: Possible missing comma found.
Context: ... conda activate $MY_MEDS_CONDA_ENV ``` Additionally install any model-related dependencies....

(AI_HYDRA_LEO_MISSING_COMMA)


[formatting] ~72-~72: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause.
Context: ... Task configuration files are incomplete, because some concepts (predicates) have to be d...

(COMMA_BEFORE_BECAUSE)

Markdownlint
README.md

64-64: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


83-83: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

Additional comments not posted (3)
src/MEDS_DEV/helpers/generate_predictions.sh (1)

1-16: LGTM!

The bash script follows best practices and is well-structured. The use of environment variables for configuration is a good practice.

src/MEDS_DEV/helpers/generate_random_predictions.py (1)

1-68: LGTM overall!

The Python script is well-structured and modular. The use of hydra for configuration management is a good practice. The script handles different split types and file structures correctly.

README.md (1)

Line range hint 1-129: Significant improvements in usability and clarity!

The enhancements to the README.md file greatly improve the usability and clarity of the project. The detailed instructions, examples, and well-organized content will help users get started quickly and follow the workflow smoothly. Great job!


@mmcdermott (Owner) commented:

@kamilest can you fix the code quality issues that are causing that workflow to fail? See here for details: https://github.com/mmcdermott/MEDS-DEV/actions/runs/10800915362/job/29959946112?pr=16
I'm happy to have a quick call about how to get it set up so that your terminal automatically checks for these issues on any commit, if that would be of interest.

```bash
MEDS_DEV_REPO_DIR=$(python -c "from importlib.resources import files; print(files(\"MEDS_DEV\"))")
export MEDS_DEV_REPO_DIR

# TODO improve efficiency of prediction generator by using this
```
@mmcdermott (Owner) commented:

What is this TODO for? I'm not sure I understand.

@kamilest (Collaborator, Author) replied:

At the moment I'm reimplementing a recursive search through the directory to find the shards for which to generate predictions, but I think this could be simplified with the expand_shards helper you have implemented.
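For reference, the recursive search being described can be collapsed into a few lines with `Path.rglob`. This is only a sketch of the current approach, not the `expand_shards` helper mentioned above:

```python
from pathlib import Path


def find_shards(cohort_dir: Path):
    """Recursively collect parquet shard files under a cohort directory."""
    return sorted(p for p in cohort_dir.rglob("*.parquet") if p.is_file())
```

This returns shards in a deterministic order regardless of whether they sit in split subdirectories or at the cohort root.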

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (7)
README.md (7)

12-26: Great addition: Example workflow section

The new "Example workflow" section is a valuable addition to the README. It provides clear instructions for setting up the MEDS project environment, including helpful command-line examples in code blocks.

Consider adding a brief explanation of what $MY_MEDS_PROJECT_ROOT and $MY_MEDS_CONDA_ENV represent to ensure users understand how to customize these variables for their setup.


27-43: Excellent: Clear installation instructions

The installation instructions for MEDS-DEV and the MEDS evaluation package are clear, concise, and well-formatted. The use of code blocks for command-line examples enhances usability.

Consider adding a note about potential system requirements or dependencies that users should have installed before proceeding with these steps.


47-69: Good addition: Task extraction and configuration information

This section provides valuable information about extracting tasks from the MEDS dataset, including details on configuration files and naming conventions. The guidelines for adding a new task are a helpful inclusion.

Consider the following improvements:

  1. Expand the section on adding a new task with more specific steps or a brief example.
  2. Add a note about any potential limitations or considerations when creating custom tasks.


70-88: Good addition: Dataset configuration information

This section provides useful information about dataset configuration files and how to add new ones. The explanation of dataset-specific predicate definitions is clear and helpful.

Consider the following improvements:

  1. Expand the instructions for adding a new dataset configuration file with more specific steps or a brief example.
  2. In line 72, remove the comma before "because" as it's not necessary in this context: "Task configuration files are incomplete because some concepts (predicates) have to be defined in a dataset-specific way".
  3. Add a brief explanation of what predicates are in this context, as it might not be clear to all users.


89-101: Excellent: Clear instructions for task extraction

This section provides clear and detailed instructions for running the MEDS task extraction helper. The command-line example and explanation of the extraction process and output location are particularly helpful.

Consider adding a note about any prerequisites or dependencies required for running the task extraction helper, such as having the MEDS dataset already downloaded and properly structured in the $MEDS_ROOT_DIR.


114-128: Excellent: Clear model evaluation instructions

This section provides clear and detailed instructions for evaluating the model using the meds-evaluation package. The command-line example and explanation of the output are particularly helpful.

Consider the following improvements:

  1. Add a brief explanation of what the evaluation metrics are and how to interpret the results in the JSON file.
  2. In line 127, add a comma after "argument" for better readability: "This will create a JSON file with the results in the directory provided by the output_dir argument, Note this package currently supports binary classification only."
  3. Consider mentioning any plans for supporting other types of classification or regression tasks in future versions of the package.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~127-~127: Possible missing comma found.
Context: ...provided by the output_dir argument. Note this package currently supports binary ...

(AI_HYDRA_LEO_MISSING_COMMA)


Line range hint 1-128: Overall: Excellent improvements to documentation

The changes to the README.md file significantly enhance the documentation for the MEDS-DEV project. The new sections provide a clear, structured workflow that covers installation, task extraction, model training, and evaluation. These additions will greatly improve the user experience and make it easier for new contributors to get started with the project.

To further improve the documentation, consider adding a table of contents at the beginning of the README. This would help users quickly navigate to specific sections of interest, especially as the document grows with more detailed instructions and examples.


📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 5756038 and c2df073.

📒 Files selected for processing (1)
  • README.md (1 hunks)

🔇 Additional comments (2)
README.md (2)

7-8: LGTM: Improved readability

The text split improves readability without changing the content. This is a good formatting adjustment.


102-112: Good addition: Model training section

This section provides a helpful starting point for model training, acknowledging that the process may vary depending on the specific model's API. The example command for generating random outputs is useful for testing purposes.

Could you please clarify the purpose of the random output generation step? Is this intended as a placeholder example, or is it a legitimate part of the workflow for testing or baseline comparison? Consider adding a brief explanation of when and why users might want to use this random output generation.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (1)
src/MEDS_DEV/helpers/generate_random_predictions.py (1)

30-30: Implement the expand_shards helper to access sharded data directly.

The TODO comment at line 30 indicates the need to use the expand_shards helper for accessing sharded dataframes directly. Implementing this will enhance code efficiency and maintainability.

Would you like assistance in implementing the expand_shards helper or opening a GitHub issue to track this task?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c2df073 and 26b439f.

📒 Files selected for processing (1)
  • src/MEDS_DEV/helpers/generate_random_predictions.py (1 hunks)
🧰 Additional context used

Comment on lines +35 to +36

```python
dataframe = pl.read_parquet(file)
predictions = _generate_random_predictions(dataframe)  # sharded dataframes
```

@coderabbitai bot (Contributor):

⚠️ Potential issue

Add error handling for file I/O operations.

Reading parquet files without error handling may cause the script to crash if a file is missing or corrupted. Incorporate try-except blocks around file I/O operations to handle exceptions gracefully and provide informative error messages.

Apply this diff to add error handling:

```diff
 for split in cohort_dir.iterdir():
     if split.is_dir() and split.name in {"train", "tuning", "held_out"}:
         for file in split.iterdir():
             if file.is_file():
-                dataframe = pl.read_parquet(file)
-                predictions = _generate_random_predictions(dataframe)
+                try:
+                    dataframe = pl.read_parquet(file)
+                    predictions = _generate_random_predictions(dataframe)
+                except Exception as e:
+                    print(f"Error processing {file}: {e}")
+                    continue
             # Rest of the code...

 elif split.is_file():
-    dataframe = pl.read_parquet(split)
-    predictions = _generate_random_predictions(dataframe)
+    try:
+        dataframe = pl.read_parquet(split)
+        predictions = _generate_random_predictions(dataframe)
+    except Exception as e:
+        print(f"Error processing {split}: {e}")
+        return
```

Also applies to: 44-45

```python
    probabilities = rng.uniform(0, 1, len(dataframe))
    # TODO: meds-evaluation currently cares about the order of columns and types, so the new columns have to
    # be inserted at the correct position and cast to the correct type
    output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean))
```

@coderabbitai bot (Contributor):

🛠️ Refactor suggestion

Generate boolean predictions using a threshold comparison.

Rounding uniform probabilities may not produce an unbiased random boolean outcome. Using a threshold comparison ensures a fair distribution of True and False values.

Apply this diff to update the boolean value generation:

```diff
-    output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, probabilities.round()).cast(pl.Boolean))
+    predicted_values = probabilities > 0.5
+    output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, predicted_values))
```

Alternatively, use rng.choice for direct boolean generation:

```diff
-    probabilities = rng.uniform(0, 1, len(dataframe))
+    predicted_values = rng.choice([True, False], size=len(dataframe))
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    predicted_values = probabilities > 0.5
    output.insert_column(3, pl.Series(PREDICTED_BOOLEAN_VALUE_COLUMN, predicted_values))
```
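For what it's worth, rounding a continuous uniform draw and thresholding it at 0.5 produce the same booleans (`np.round` only differs at exactly 0.5, which a continuous draw essentially never hits), so the suggestion is mainly a readability improvement rather than a bias fix. A quick NumPy check:

```python
import numpy as np

# Compare the two ways of turning uniform probabilities into booleans.
rng = np.random.default_rng(0)
probabilities = rng.uniform(0, 1, 10_000)

rounded = probabilities.round().astype(bool)  # original approach
thresholded = probabilities > 0.5             # suggested approach

agreement = (rounded == thresholded).mean()
print(agreement)  # 1.0 for these draws
```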

Comment on lines +31 to +51

```python
    for split in cohort_dir.iterdir():
        if split.is_dir() and split.name in {"train", "tuning", "held_out"}:  # train | tuning | held_out
            for file in split.iterdir():
                if file.is_file():
                    dataframe = pl.read_parquet(file)
                    predictions = _generate_random_predictions(dataframe)  # sharded dataframes

                    # $MEDS_ROOT_DIR/task_predictions/$TASK_NAME/<split>/<file>.parquet
                    predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
                    os.makedirs(predictions_path, exist_ok=True)

                    predictions.write_parquet(predictions_path / file.name)
        elif split.is_file():
            dataframe = pl.read_parquet(split)
            predictions = _generate_random_predictions(dataframe)

            predictions_path = Path(cohort_predictions_dir) / cohort_name
            os.makedirs(predictions_path, exist_ok=True)

            predictions.write_parquet(predictions_path / split.name)
```

@coderabbitai bot (Contributor):

🛠️ Refactor suggestion

Refactor duplicated code to improve maintainability.

The code segments handling directory and file inputs contain duplicated logic, particularly in reading dataframes and writing predictions. Refactoring this section by extracting common functionality into helper functions will reduce code duplication and enhance readability.

Apply this refactor to consolidate duplicated code:

```diff
 def generate_random_predictions(cfg: DictConfig) -> None:
     # Existing code above...

+    def process_file(file, predictions_path):
+        dataframe = pl.read_parquet(file)
+        predictions = _generate_random_predictions(dataframe)
+        os.makedirs(predictions_path, exist_ok=True)
+        predictions.write_parquet(predictions_path / file.name)
+
     for split in cohort_dir.iterdir():
         if split.is_dir() and split.name in {"train", "tuning", "held_out"}:
             for file in split.iterdir():
                 if file.is_file():
-                    dataframe = pl.read_parquet(file)
-                    predictions = _generate_random_predictions(dataframe)
-
-                    predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
-                    os.makedirs(predictions_path, exist_ok=True)
-
-                    predictions.write_parquet(predictions_path / file.name)
+                    predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
+                    process_file(file, predictions_path)
         elif split.is_file():
-            dataframe = pl.read_parquet(split)
-            predictions = _generate_random_predictions(dataframe)
-
             predictions_path = Path(cohort_predictions_dir) / cohort_name
-            os.makedirs(predictions_path, exist_ok=True)
-
-            predictions.write_parquet(predictions_path / split.name)
+            process_file(split, predictions_path)
```
📝 Committable suggestion

```python
    def process_file(file, predictions_path):
        dataframe = pl.read_parquet(file)
        predictions = _generate_random_predictions(dataframe)
        os.makedirs(predictions_path, exist_ok=True)
        predictions.write_parquet(predictions_path / file.name)

    for split in cohort_dir.iterdir():
        if split.is_dir() and split.name in {"train", "tuning", "held_out"}:  # train | tuning | held_out
            for file in split.iterdir():
                if file.is_file():
                    predictions_path = Path(cohort_predictions_dir) / cohort_name / split.name
                    process_file(file, predictions_path)
        elif split.is_file():
            predictions_path = Path(cohort_predictions_dir) / cohort_name
            process_file(split, predictions_path)
```

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.pre-commit-config.yaml (1)

Python 3.10 is not available in the environment.

The pre-commit configuration specifies Python 3.10, but it is not installed. This may cause pre-commit hooks to fail.

  • Ensure Python 3.10 is installed in all development and CI environments.
  • Alternatively, update .pre-commit-config.yaml to use an available Python version.
🔗 Analysis chain

Line range hint 1-2: Approve Python version specification and suggest verification.

Explicitly setting the default Python version to 3.10 is a good practice for ensuring consistency across different environments.

To ensure project-wide compatibility and clarity:

  1. Verify that all project dependencies are compatible with Python 3.10.
  2. Update the project's README or documentation to clearly state the Python version requirement.
  3. Ensure that all development and production environments can support Python 3.10.

You can use the following command to check your current Python version and compare it with the specified version:

Additionally, review your requirements.txt or setup.py file to ensure all dependencies support Python 3.10:

If you're using a virtual environment, make sure it's created with Python 3.10:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
python --version
```

Length of output: 92

Script:

```shell
cat requirements.txt
# or
cat setup.py
```

Length of output: 127

Script:

```shell
python3.10 -m venv venv
```

Length of output: 103

Script:

```shell
#!/bin/bash
# Check for alternative Python executables
which python3
which python3.10

# Search for alternative dependency files
fd --extension toml --type f pyproject.toml
fd --extension toml --type f Pipfile

# Check for Python scripts in the repository
fd --extension py --type f
```

Length of output: 474

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 26b439f and 07ab03c.

📒 Files selected for processing (1)
  • .pre-commit-config.yaml (1 hunks)
🧰 Additional context used
🔇 Additional comments (2)
.pre-commit-config.yaml (2)

Line range hint 1-124: Overall configuration looks comprehensive and well-maintained.

The .pre-commit-config.yaml file contains a robust set of hooks for ensuring code quality, including formatting, linting, and various checks. The recent updates to the pre-commit-hooks version and the explicit Python version specification contribute to keeping the project up-to-date and consistent.

As a final note, consider periodically reviewing and updating the versions of other hooks in this file to ensure you're benefiting from the latest improvements and security updates across all tools.


8-8: Approve version update and suggest verification.

The update of pre-commit-hooks from v4.4.0 to v5.0.0 is a good practice to keep the project up-to-date with the latest features and improvements.

To ensure compatibility and take advantage of any new features:

  1. Please review the changelog for pre-commit-hooks v5.0.0 to understand any breaking changes or new features.
  2. Verify that all existing hooks still function as expected with this new version.

You can use `pre-commit run --all-files` to check the hooks. This will run all the hooks against all files in the repository, allowing you to catch any potential issues introduced by the version update.

@kamilest kamilest merged commit 30d5631 into main Oct 10, 2024
5 checks passed