Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCLA DATA Predicates file #23

Merged
merged 9 commits into from
Nov 7, 2024
Merged

UCLA DATA Predicates file #23

merged 9 commits into from
Nov 7, 2024

Conversation

Simonlee711
Copy link
Collaborator

@Simonlee711 Simonlee711 commented Oct 15, 2024

Summary by CodeRabbit

  • New Features

    • Added a README.md file detailing datasets in the MEDS-DEV project.
    • Introduced a new section in predicates.yaml with regex patterns for hospital and emergency department operations.
  • Chores

    • Enhanced the .gitignore file to include entries for various development environments, improving version control management.

Copy link
Contributor

coderabbitai bot commented Oct 15, 2024

Warning

Rate limit exceeded

@Simonlee711 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 14 minutes and 52 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 065b20b and f0b3af5.

Walkthrough

This pull request introduces several changes, including the addition of a new README.md file in the src/MEDS_DEV/datasets/UCLA/ directory, detailing datasets and associated tasks. It also updates the .gitignore file to include entries for macOS, PyCharm, Visual Studio Code, and Windows environments, enhancing its scope for ignored files. Additionally, a new section titled predicates is added to the predicates.yaml file, defining regex patterns for various hospital and emergency department operations.

Changes

File Path Change Summary
src/MEDS_DEV/datasets/UCLA/README.md New README.md file added outlining datasets and tasks related to the MEDS-DEV project.
.gitignore Updated to include rules for ignoring files and directories specific to macOS, PyCharm, Visual Studio Code, and Windows.
src/MEDS_DEV/datasets/UCLA/predicates.yaml New section added defining predicates with regex patterns for hospital and emergency department operations.

Possibly related PRs

  • Update Datasets README.md to clarify dataset contribution workflow #26: The changes in this PR also involve updates to the README.md file in the src/MEDS_DEV/datasets/ directory, which relates to dataset contributions, similar to the new README added in the main PR for the UCLA datasets.
  • Update dataset.md #31: This PR updates the dataset template in src/MEDS_DEV/templates/dataset.md, which aligns with the main PR's focus on enhancing documentation for datasets, including the addition of a new README file for UCLA datasets.

Suggested reviewers

  • mmcdermott
  • kamilest
  • Oufattole

🐰 In the garden of code, we hop and play,
With README and predicates brightening the day.
Ignoring the clutter, we tidy our space,
For datasets and patterns, we quicken our pace.
Let's celebrate changes, both big and small,
In the realm of MEDS, we’re having a ball! 🎉


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
src/MEDS_DEV/datasets/UCLA/anemia.yaml (3)

1-4: LGTM! Minor formatting fix needed.

The description clearly explains the task's purpose and inclusion criteria. However, there's a trailing space on line 2 that should be removed.

Apply this diff to fix the formatting issue:

-  This task predicts whether the patient will have abnormal _hemoglobin_ values within a 24-hour window for a given patient. 
+  This task predicts whether the patient will have abnormal _hemoglobin_ values within a 24-hour window for a given patient.
🧰 Tools
🪛 yamllint

[error] 2-2: trailing spaces

(trailing-spaces)


6-31: LGTM! Consider adding a comment for clarity.

The predicates are well-defined and logically structured. The use of different codes for hemoglobin (LAB//220228//g/dl and LAB//50811//g/dL) is noted and might indicate different measurement methods or units.

Consider adding a brief comment explaining the reason for using two different hemoglobin codes. This would improve clarity for future maintainers. For example:

  # Hemoglobin codes for different measurement methods or units
  hemoglobin_1:
    code: LAB//220228//g/dl
  hemoglobin_2:
    code: LAB//50811//g/dL

35-51: LGTM! Consider adding a comment for consistency.

The windows are well-defined and align perfectly with the task description. The input and target windows are appropriately set to 24 hours each, with correct inclusivity settings to prevent overlap.

For consistency with the predicates section, consider adding a brief comment explaining the purpose of each window. For example:

windows:
  # Input window: 24-hour period for initial hemoglobin measurement
  input:
    # ... (existing code)

  # Target window: subsequent 24-hour period for predicting abnormal hemoglobin
  target:
    # ... (existing code)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 30d5631 and 999e209.

📒 Files selected for processing (1)
  • src/MEDS_DEV/datasets/UCLA/anemia.yaml (1 hunks)
🧰 Additional context used
🪛 yamllint
src/MEDS_DEV/datasets/UCLA/anemia.yaml

[error] 2-2: trailing spaces

(trailing-spaces)

🔇 Additional comments (2)
src/MEDS_DEV/datasets/UCLA/anemia.yaml (2)

33-33: LGTM!

The trigger lab_start is appropriately defined and aligns with the task description of initiating the prediction when a lab measurement is taken.


1-51: Great job! File approved with minor suggestions.

This YAML file is well-structured and clearly defines the task for predicting abnormal hemoglobin values. The predicates, trigger, and windows are logically organized and align with the task description. The minor suggestions provided earlier (removing a trailing space, adding clarifying comments) will further enhance the file's readability and maintainability.

🧰 Tools
🪛 yamllint

[error] 2-2: trailing spaces

(trailing-spaces)

@mmcdermott
Copy link
Owner

@Simonlee711 is this ready for review, or should we convert it to a draft for now? I'm going through outstanding PRs to make sure I'm not blocking anything that needs my review. If it is ready for review, go ahead and add me as a reviewer and try to fix the PR check that is failing. I'm happy to have a call to help you get that set up to run automatically on commit for you.

@mmcdermott
Copy link
Owner

@Simonlee711 if you fix the PR workflow errors this is likely ready to go in. Note you can set up pre-commit to run locally and automatically fix issues by installing pre-commit then running pre-commit install in the root dir.

@kamilest
Copy link
Collaborator

@Simonlee711 @mmcdermott I fixed the pre-commit errors.

Maybe we can merge this and then update later as per #31 (@Simonlee711 maybe you could also take a look at that PR to see if you have any suggestions for what information new dataset PRs should have, and see if UCLA dataset has that information available).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
src/MEDS_DEV/datasets/UCLA/anemia.yaml (1)

1-4: Enhance description with specific clinical criteria.

The description should explicitly mention the abnormal hemoglobin threshold (13 g/dL) to make the clinical criteria immediately clear to readers.

 description: >-
   This task predicts whether the patient will have abnormal _hemoglobin_ values within a 24-hour window for a given patient.
   To be included, patients must have at least one measurement for hemoglobin within a given 24-hour period in the labs table
-  at UCLA.
+  at UCLA. Hemoglobin values above 13 g/dL are considered abnormal for this task.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 08a191a and 15fcf4a.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • src/MEDS_DEV/datasets/UCLA/README.md (1 hunks)
  • src/MEDS_DEV/datasets/UCLA/anemia.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/MEDS_DEV/datasets/UCLA/README.md
🔇 Additional comments (4)
src/MEDS_DEV/datasets/UCLA/anemia.yaml (4)

28-31: LGTM!

The combined predicates are well-organized and correctly use OR logic to handle both hemoglobin codes.


11-15: ⚠️ Potential issue

Standardize unit formatting across hemoglobin codes.

The units are inconsistently formatted: g/dl vs g/dL. This could lead to confusion or potential issues with data processing.

   hemoglobin_1:
-    code: LAB//220228//g/dl
+    code: LAB//220228//g/dL
   hemoglobin_2:
     code: LAB//50811//g/dL
#!/bin/bash
# Search for other hemoglobin-related lab codes to ensure completeness
rg -i "hemoglobin.*g/d[lL]" --type yaml

33-51: LGTM! Verify window configuration with existing tasks.

The window configuration is well-structured with appropriate time ranges and non-overlapping windows. The labeling strategy is correct.

#!/bin/bash
# Search for similar window configurations in other tasks for consistency
rg -A 10 "windows:" --type yaml

7-9: Verify the lab code regex pattern.

The pattern ^LABS//.* appears overly broad and might match unintended lab codes. Consider making it more specific to match only hemoglobin-related lab codes.

Comment on lines 16 to 26
abnormal_hemoglobin_1:
code: LAB//220228//g/dl
value_min: null
value_max: 13
value_max_inclusive: False
abnormal_hemoglobin_2:
code: LAB//50811//g/dL
value_min: null
value_max: 13
value_max_inclusive: False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider using YAML anchors to reduce predicate duplication.

The abnormal hemoglobin predicates share identical configuration except for the code. Consider using YAML anchors to reduce duplication and maintain consistency.

+  # Define the base abnormal configuration
+  abnormal_config: &abnormal_base
+    value_min: null
+    value_max: 13
+    value_max_inclusive: False
+
   abnormal_hemoglobin_1:
     code: LAB//220228//g/dl
-    value_min: null
-    value_max: 13
-    value_max_inclusive: False
+    <<: *abnormal_base
   abnormal_hemoglobin_2:
     code: LAB//50811//g/dL
-    value_min: null
-    value_max: 13
-    value_max_inclusive: False
+    <<: *abnormal_base
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
abnormal_hemoglobin_1:
code: LAB//220228//g/dl
value_min: null
value_max: 13
value_max_inclusive: False
abnormal_hemoglobin_2:
code: LAB//50811//g/dL
value_min: null
value_max: 13
value_max_inclusive: False
# Define the base abnormal configuration
abnormal_config: &abnormal_base
value_min: null
value_max: 13
value_max_inclusive: False
abnormal_hemoglobin_1:
code: LAB//220228//g/dl
<<: *abnormal_base
abnormal_hemoglobin_2:
code: LAB//50811//g/dL
<<: *abnormal_base

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
src/MEDS_DEV/datasets/UCLA/predicates.yaml (2)

12-15: Maintain consistent abbreviation documentation

Similar to the ED predicates, consider documenting that ICU stands for Intensive Care Unit for clarity.

The regex patterns are consistent with other predicates, which is good.


1-18: Add file-level documentation

Consider adding a comment section at the top of the file to document:

  1. The general structure of predicates
  2. When to use regex patterns vs direct codes
  3. Expected format and validation rules for data after "//"
  4. Examples of valid values for each predicate type

This will help maintain consistency as new predicates are added.

Would you like me to help draft this documentation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 15fcf4a and 065b20b.

📒 Files selected for processing (1)
  • src/MEDS_DEV/datasets/UCLA/predicates.yaml (1 hunks)
🔇 Additional comments (2)
src/MEDS_DEV/datasets/UCLA/predicates.yaml (2)

17-18: Verify the different pattern for death predicate

The death predicate uses a direct code "MORTALITY" instead of a regex pattern with prefix like other predicates. Please verify:

  1. Is this intentional?
  2. Should it follow the same pattern format as others (e.g., "^MORTALITY//")?
  3. Are there any specific requirements that necessitate this different format?

Let's check for similar usage patterns:

#!/bin/bash
# Check for other direct code usage vs regex patterns
echo "Checking for direct code usage:"
ast-grep --pattern 'code: $VALUE' 
echo -e "\nChecking for regex pattern usage:"
ast-grep --pattern 'code: { regex: $_}'

7-10: Consider standardizing naming convention

The abbreviation "ED" might not be immediately clear to all developers. Consider either:

  1. Using the full form "EMERGENCY_DEPARTMENT" to match the verbosity of other predicates
  2. Adding a comment documenting that ED stands for Emergency Department

Let's check if "ED" abbreviation is consistently used across the codebase:

Comment on lines +2 to +5
hospital_admission:
code: { regex: "^HOSPITAL_ADMISSION//.*" }
hospital_discharge:
code: { regex: "^HOSPITAL_DISCHARGE//.*" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Documentation needed for predicate pattern format

The regex patterns for hospital events are consistently implemented across multiple datasets (UCLA, MIMIC-IV), but there's no documentation explaining:

  • The expected format after the "//" delimiter
  • Example valid strings
  • Purpose of this pattern structure

This documentation would be valuable since:

  • The README.md is currently minimal with just TODOs
  • The pattern is used across multiple datasets
  • No examples or explanations exist in the codebase
🔗 Analysis chain

Consider documenting the expected format after "//"

The regex patterns for hospital events look good, but it would be helpful to document:

  1. The expected format of data after "//"
  2. Examples of valid strings

Let's check if there are any existing examples or documentation:


The previous search revealed that this pattern is consistently used across multiple datasets (MIMIC-IV, UCLA, AUMCdb). Let's check for documentation and example usage to make a more informed decision about whether documentation is really needed.


Let's check the README content and actual usage examples to make a final determination about whether documentation is needed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for examples or documentation of these patterns
rg -i "HOSPITAL_(ADMISSION|DISCHARGE)//" --type md

Length of output: 6849


Script:

#!/bin/bash
# Let's check for any documentation files in the UCLA dataset directory
fd -e md -e txt . src/MEDS_DEV/datasets/UCLA/

# Also check for any comments or documentation in the predicates file
cat src/MEDS_DEV/datasets/UCLA/predicates.yaml

# Check if there are any test files that might show example usage
fd test src/MEDS_DEV/datasets/UCLA/

Length of output: 591


Script:

#!/bin/bash
# Let's check the README for any documentation about these patterns
cat src/MEDS_DEV/datasets/UCLA/README.md

# Also check if there are any examples in the codebase showing how these patterns are used
rg "HOSPITAL_ADMISSION//" -A 1 -B 1

Length of output: 807

@Simonlee711 Simonlee711 merged commit 6f97fa0 into main Nov 7, 2024
5 checks passed
@mmcdermott mmcdermott deleted the ucla-data branch January 21, 2025 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants