-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjusted join in flat reps to account for different timestamps with t… #107
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThis update introduces several enhancements and optimizations to the EventStream project. New parameters were added to improve the handling of caching, error messages, and data processing, including Changes
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Out of diff range and nitpick comments (2)
EventStream/baseline/FT_task_baseline.py (2)
49-49
: Add a docstring entry for the new parameteroverwrite_cache_filtered_task
.It's good practice to document all parameters in the function's docstring to improve code maintainability and readability.
Line range hint
243-243
: Useisinstance()
for type checking instead of comparing types directly.- if type(window_sizes) is not list: + if not isinstance(window_sizes, list):This change uses the more Pythonic
isinstance()
for type checking, which is generally preferred over direct type comparison.
df = filter_join_df.join_asof(df, by='subject_id', on='timestamp', | ||
strategy='forward' if '-' in window_size else 'backward') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor the join_asof
method to handle potential discrepancies in data alignment.
- df = filter_join_df.join_asof(df, by='subject_id', on='timestamp',
- strategy='forward' if '-' in window_size else 'backward')
+ strategy = 'forward' if '-' in window_size else 'backward'
+ try:
+ df = filter_join_df.join_asof(df, by='subject_id', on='timestamp', strategy=strategy)
+ except Exception as e:
+ # Handle or log the exception
+ raise e
This refactoring adds error handling around the join_asof
method, which can be crucial when dealing with large datasets and complex joins.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
df = filter_join_df.join_asof(df, by='subject_id', on='timestamp', | |
strategy='forward' if '-' in window_size else 'backward') | |
strategy = 'forward' if '-' in window_size else 'backward' | |
try: | |
df = filter_join_df.join_asof(df, by='subject_id', on='timestamp', strategy=strategy) | |
except Exception as e: | |
# Handle or log the exception | |
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Out of diff range and nitpick comments (1)
EventStream/baseline/FT_task_baseline.py (1)
Line range hint
247-247
: Useisinstance()
for type checking instead of comparing types directly. This is more Pythonic and handles subclasses properly.- if type(window_sizes) is not list: + if not isinstance(window_sizes, list):
… to cast subject_id to str (pl.Utf8) to account for errors in Inovalon
…rmott/EventStreamGPT into join_flat_reps_with_taskdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- EventStream/data/dataset_polars.py (3 hunks)
- scripts/build_flat_reps.py (1 hunks)
Additional context used
Ruff
scripts/build_flat_reps.py
13-13:
omegaconf.OmegaConf
imported but unused (F401)Remove unused import:
omegaconf.OmegaConf
EventStream/data/dataset_polars.py
204-204: Do not compare types, use
isinstance()
(E721)
1618-1618: Function definition does not bind loop variable
m
(B023)
1618-1618: Function definition does not bind loop variable
key_prefix
(B023)
1622-1622: Function definition does not bind loop variable
m
(B023)
1622-1622: Function definition does not bind loop variable
val_prefix
(B023)
1624-1624: Function definition does not bind loop variable
m
(B023)
1624-1624: Function definition does not bind loop variable
val_prefix
(B023)
1627-1627: Function definition does not bind loop variable
m
(B023)
1627-1627: Function definition does not bind loop variable
val_prefix
(B023)
1628-1628: Function definition does not bind loop variable
m
(B023)
1628-1628: Function definition does not bind loop variable
val_prefix
(B023)
1629-1629: Function definition does not bind loop variable
m
(B023)
1629-1629: Function definition does not bind loop variable
val_prefix
(B023)
1639-1639: Function definition does not bind loop variable
m
(B023)
Additional comments not posted (6)
scripts/build_flat_reps.py (3)
4-9
: Good handling of optional dependencies with graceful degradation.
25-41
: Well-structured data processing logic with clear separation of configuration and execution.
26-41
: Effective use of logging to track process flow and completion.EventStream/data/dataset_polars.py (3)
1420-1421
: The changes to castsubject_id
to string before filtering align with the PR objectives and improve consistency in ID handling.
1481-1482
: The modifications to castsubject_id
to string in_summarize_time_dependent_measurements
are consistent with the changes in other methods, ensuring uniform handling of IDs.
1545-1547
: The changes in_summarize_dynamic_measurements
to castsubject_id
to string and filter before joining are consistent with similar modifications in other methods, ensuring uniform ID handling.
scripts/build_flat_reps.py
Outdated
save_dir = Path(cfg.pop("save_dir")) | ||
window_sizes = cfg.pop("window_sizes") | ||
subjects_per_output_file = cfg.pop("subjects_per_output_file") if 'subjects_per_output_file' in cfg else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding error handling for missing configuration items to prevent runtime errors.
# Example of adding default values and error handling
save_dir = Path(cfg.pop("save_dir", "default/path"))
window_sizes = cfg.pop("window_sizes", [default_window_size])
subjects_per_output_file = cfg.pop("subjects_per_output_file", default_subjects_per_output_file)
scripts/build_flat_reps.py
Outdated
|
||
from pathlib import Path | ||
import hydra | ||
from omegaconf import DictConfig, OmegaConf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused import to clean up the code.
- from omegaconf import DictConfig, OmegaConf
+ from omegaconf import DictConfig
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from omegaconf import DictConfig, OmegaConf | |
from omegaconf import DictConfig |
Tools
Ruff
13-13:
omegaconf.OmegaConf
imported but unused (F401)Remove unused import:
omegaconf.OmegaConf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range and nitpick comments (3)
EventStream/data/dataset_base.py (1)
Line range hint
226-269
: Consider optimizing the exception handling for better clarity and maintainability.The current implementation of exception handling within the loop (lines 232 and 269) could be improved for clarity and maintainability. Specifically, the generic
Exception
is caught, which might obscure the source of errors. It's generally a good practice to catch more specific exceptions. Additionally, the error messages could provide more context about the failure. Here's a proposed refactor:try: df = cls._load_input_df(df_name, all_columns, subject_id_col, subject_ids_map, subject_id_dtype) except SpecificExceptionType as e: # Replace SpecificExceptionType with the actual expected exception types raise ValueError(f"Error loading data from {df_name}: {str(e)}") from e ... try: new_events = cls._inc_df_col(events, "event_id", running_event_id_max) except SpecificExceptionType as e: # Replace SpecificExceptionType with the actual expected exception types raise ValueError(f"Error incrementing event_id for {event_type}: {str(e)}") from eThis change not only makes the code easier to understand and maintain but also aids in debugging by providing clearer, more descriptive error messages.
EventStream/data/dataset_polars.py (2)
Line range hint
204-204
: Useisinstance()
instead of direct type comparison for better practice and readability.- if type(qq) is Path: + if isinstance(qq, Path):
Line range hint
1618-1618
: Bind loop variables correctly in function definitions to avoid late binding issues.- .map_alias(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"), + .map_alias(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),Also applies to: 1622-1622, 1624-1624, 1627-1627, 1628-1628, 1629-1629, 1639-1639
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- EventStream/data/dataset_base.py (2 hunks)
- EventStream/data/dataset_polars.py (6 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py
204-204: Do not compare types, use
isinstance()
(E721)
1618-1618: Function definition does not bind loop variable
m
(B023)
1618-1618: Function definition does not bind loop variable
key_prefix
(B023)
1622-1622: Function definition does not bind loop variable
m
(B023)
1622-1622: Function definition does not bind loop variable
val_prefix
(B023)
1624-1624: Function definition does not bind loop variable
m
(B023)
1624-1624: Function definition does not bind loop variable
val_prefix
(B023)
1627-1627: Function definition does not bind loop variable
m
(B023)
1627-1627: Function definition does not bind loop variable
val_prefix
(B023)
1628-1628: Function definition does not bind loop variable
m
(B023)
1628-1628: Function definition does not bind loop variable
val_prefix
(B023)
1629-1629: Function definition does not bind loop variable
m
(B023)
1629-1629: Function definition does not bind loop variable
val_prefix
(B023)
1639-1639: Function definition does not bind loop variable
m
(B023)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Outside diff range and nitpick comments (2)
EventStream/data/dataset_polars.py (2)
Line range hint
204-204
: Useisinstance()
instead of comparing types directly.Direct type comparisons are generally less flexible and can lead to errors if subtypes are involved. Replace the type comparison with
isinstance()
.- if type(qq) is Path: + if isinstance(qq, Path):
Line range hint
1620-1620
: Address loop variable binding issues in lambda functions.Python's late binding behavior can lead to unexpected results when using loop variables inside lambda functions. To fix this, you can use default arguments in the lambda functions to bind the current value of the loop variable at each iteration.
- .map_alias(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"), + .map_alias(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),Apply this pattern to all similar occurrences in the code to ensure that the lambda functions capture the correct values of
m
andkey_prefix
orval_prefix
at each iteration.Also applies to: 1624-1624, 1626-1626, 1629-1629, 1630-1630, 1631-1631, 1641-1641
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- EventStream/data/dataset_base.py (2 hunks)
- EventStream/data/dataset_polars.py (5 hunks)
- scripts/build_flat_reps.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
- EventStream/data/dataset_base.py
- scripts/build_flat_reps.py
Additional context used
Ruff
EventStream/data/dataset_polars.py
204-204: Do not compare types, use
isinstance()
(E721)
1620-1620: Function definition does not bind loop variable
m
(B023)
1620-1620: Function definition does not bind loop variable
key_prefix
(B023)
1624-1624: Function definition does not bind loop variable
m
(B023)
1624-1624: Function definition does not bind loop variable
val_prefix
(B023)
1626-1626: Function definition does not bind loop variable
m
(B023)
1626-1626: Function definition does not bind loop variable
val_prefix
(B023)
1629-1629: Function definition does not bind loop variable
m
(B023)
1629-1629: Function definition does not bind loop variable
val_prefix
(B023)
1630-1630: Function definition does not bind loop variable
m
(B023)
1630-1630: Function definition does not bind loop variable
val_prefix
(B023)
1631-1631: Function definition does not bind loop variable
m
(B023)
1631-1631: Function definition does not bind loop variable
val_prefix
(B023)
1641-1641: Function definition does not bind loop variable
m
(B023)
self.subjects_df = self.subjects_df.with_columns(pl.col("subject_id").cast(pl.Utf8)) | ||
df = self.subjects_df.filter( | ||
pl.col("subject_id").is_in([str(id) for id in include_only_subjects]) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure consistent data type handling for subject_id
across methods.
It's a good practice to maintain consistency in how data types are handled across different methods. Here, the subject_id
is cast to string in several filtering operations. Consider creating a helper function to perform this casting and filtering operation, which can then be reused across these methods, enhancing code reusability and maintainability.
- self.subjects_df = self.subjects_df.with_columns(pl.col("subject_id").cast(pl.Utf8))
- df = self.subjects_df.filter(pl.col("subject_id").is_in([str(id) for id in include_only_subjects]))
+ df = self._filter_subjects_by_id(include_only_subjects)
And add a new method in the class:
def _filter_subjects_by_id(self, include_only_subjects):
self.subjects_df = self.subjects_df.with_columns(pl.col("subject_id").cast(pl.Utf8))
return self.subjects_df.filter(pl.col("subject_id").is_in([str(id) for id in include_only_subjects]))
Also applies to: 1483-1484, 1547-1551
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -46,6 +46,7 @@ def load_flat_rep( | |||
do_update_if_missing: bool = True, | |||
task_df_name: str | None = None, | |||
do_cache_filtered_task: bool = True, | |||
overwrite_cache_filtered_task: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure proper documentation for the new parameter.
The new parameter overwrite_cache_filtered_task
should be included in the function's docstring to maintain comprehensive documentation.
+ overwrite_cache_filtered_task: If `True`, the flat representations will be regenerated. If `False`, the cached file will be loaded if exists.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
overwrite_cache_filtered_task: bool = False, | |
overwrite_cache_filtered_task: bool = False, |
overwrite_cache_filtered_task: If `True`, the flat representations will be regenerated. If `False`, | ||
the cached file will be loaded if exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tip
Codebase Verification
Ensure all calls to load_flat_rep
include the new parameter overwrite_cache_filtered_task
.
The following calls to load_flat_rep
in sample_data/examine_synthetic_data.ipynb
do not include the new parameter and should be updated:
flat_reps = load_flat_rep(ESD, window_sizes=['7d'])
flat_reps = load_flat_rep(ESD, window_sizes=['FULL'], feature_inclusion_frequency=0.001)
flat_reps = load_flat_rep(ESD, window_sizes=['1d', '7d', 'FULL'])
flat_reps = load_flat_rep(ESD, window_sizes=['2d'], do_update_if_missing=False)
flat_reps = load_flat_rep(ESD, window_sizes=['1d', '7d', 'FULL'], subjects_included={'train': {0, 1, 2}})
flat_reps = load_flat_rep(ESD, window_sizes=['7d'], task_df_name='single_label_binary_classification')
flat_reps = load_flat_rep(ESD, window_sizes=['FULL', '1d'], task_df_name='multi_class_classification', do_cache_filtered_task=False)
flat_reps = load_flat_rep(ESD, window_sizes=['FULL', '1d'], task_df_name='single_label_binary_classification', subjects_included={'train': {0, 1, 2}})
Please update these calls to include the overwrite_cache_filtered_task
parameter as needed.
Analysis chain
LGTM! But verify the new parameter usage in the codebase.
The code changes are approved.
However, ensure that all function calls to load_flat_rep
are updated to use the new parameter if necessary.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all function calls to `load_flat_rep` match the new signature.
# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type python $'load_flat_rep'
Length of output: 63
Script:
#!/bin/bash
# Description: Verify all function calls to `load_flat_rep` match the new signature.
# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg 'load_flat_rep'
Length of output: 1215
Script:
#!/bin/bash
# Description: Extract lines around each call to `load_flat_rep` to verify the usage of the new parameter.
# Extract lines around the calls in the Jupyter notebook
rg -A 3 -B 3 'load_flat_rep' sample_data/examine_synthetic_data.ipynb
# Extract lines around the calls in the Python file
rg -A 3 -B 3 'load_flat_rep' EventStream/baseline/FT_task_baseline.py
Length of output: 3273
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (1)
EventStream/data/config.py (1)
806-809
: Add Documentation for New Parameters.The new parameters
tuning_subset_size
andtuning_subset_seed
should be documented in the docstring of the class.+ tuning_subset_size: If the tuning data should be subsampled randomly, this specifies the size of the + tuning subset. If `None` or "FULL", then the full tuning data is used. + tuning_subset_seed: If the tuning data should be subsampled randomly, this specifies the seed for + that random subsampling.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- EventStream/data/config.py (3 hunks)
Additional comments not posted (2)
EventStream/data/config.py (2)
880-881
: Add New Parameters to Class Definition.The new parameters
tuning_subset_size
andtuning_subset_seed
are correctly added to the class definition.
916-931
: Verify Validation Logic for New Parameters.The validation logic for the new parameters
tuning_subset_size
andtuning_subset_seed
in the__post_init__
method looks good. It ensures that the parameters are within the required range and sets a default seed if not provided.
No description provided.