Test cases and (eventually) fixes for #114 #117
base: dev
Conversation
…tt/EventStreamGPT into fix_114_typing_with_subject_IDs
Walkthrough

The updates primarily focus on ensuring consistent handling of subject IDs across the codebase by aligning data types and refining data processing. Additionally, enhancements to the caching mechanisms and shard creation process for deep learning representations have been implemented. A new script for building flat representation datasets has been added, and various test updates ensure compatibility and verify output formats.
Sequence Diagram(s)

sequenceDiagram
participant User
participant Script as build_flat_reps.py
participant Dataset as EventStream.data.dataset_polars.Dataset
participant CacheRep as cache_flat_representation
User->>Script: Execute script with config
Script->>Dataset: Load dataset
Dataset->>CacheRep: Resolve cache parameters
CacheRep->>Dataset: Cache flat representation
Dataset-->>Script: Processed Dataset
Script-->>User: Execution complete
This diagram outlines the flow of execution when building a flat representation dataset using the new script.
Actionable comments posted: 1
Outside diff range and nitpick comments (4)
tests/test_e2e_runs.py (1)
Line range hint 5-14: Reorder imports to comply with PEP 8

The imports are scattered below the setup code, which violates PEP 8's guideline of having all module-level imports at the top of the file. This should be corrected to improve readability and maintainability.

+ import json
+ import os
+ import subprocess
+ import unittest
+ from itertools import product
+ from pathlib import Path
+ from tempfile import TemporaryDirectory
+ from typing import Any
+ import polars as pl

  import rootutils

  root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

- import json
- import os
- import subprocess
- import unittest
- from itertools import product
- from pathlib import Path
- from tempfile import TemporaryDirectory
- from typing import Any
- import polars as pl

Tools
Ruff
5-5: Module level import not at top of file (E402)
6-6: Module level import not at top of file (E402)
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
EventStream/data/dataset_polars.py (3)
Line range hint 203-203: Use isinstance() for type checking instead of direct type comparison

Direct comparison of types via type() can lead to unexpected behavior and is not considered a best practice. Use isinstance() for checking object types, as it is clearer and more reliable.

- if type(qq) is Path:
+ if isinstance(qq, Path):
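To make the pitfall concrete, here is a minimal standalone illustration (values are hypothetical, not from the PR): pathlib.Path() actually instantiates a platform-specific subclass, so an exact type() comparison can fail precisely where isinstance() succeeds.

```python
from pathlib import Path

p = Path("sample_data/raw")   # concretely a PosixPath or WindowsPath instance
print(type(p) is Path)        # False: type() does not account for subclassing
print(isinstance(p, Path))    # True: isinstance() matches Path and its subclasses
```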
Line range hint 1664-1664: Ensure loop variables are correctly scoped in lambda functions

Using loop variables directly inside lambda functions can lead to unexpected behavior because the lambda may capture the variable by reference, not by value. This is especially problematic in loops where the variable is bound to change. To fix this, you can use a default argument in the lambda to bind the current value of the loop variable at each iteration.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
+ .name.map(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),

Also applies to: 1668-1668, 1670-1670, 1673-1673, 1674-1674, 1675-1675, 1682-1682
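A short standalone demonstration of the late-binding behavior and the default-argument fix (toy names, not the project's code):

```python
# Late binding: each lambda looks up m when it is called, after the loop has finished.
fns = [lambda c: f"dynamic/{m}/{c}" for m in ("age", "weight")]
print([f("count") for f in fns])  # ['dynamic/weight/count', 'dynamic/weight/count']

# Default arguments are evaluated at definition time, freezing each iteration's value.
fns = [lambda c, m=m: f"dynamic/{m}/{c}" for m in ("age", "weight")]
print([f("count") for f in fns])  # ['dynamic/age/count', 'dynamic/weight/count']
```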
Line range hint 2029-2029: Optimize dictionary key check

Use key in dict instead of key in dict.keys() for checking if a key is in a dictionary. The former is more Pythonic and slightly more efficient because it avoids constructing an intermediate key view.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
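Both spellings return the same result; the hypothetical index map below just makes the comparison concrete:

```python
idxmap = {"lab//HR": 0, "lab//WBC": 1}  # hypothetical vocabulary index map

print("lab//HR" in idxmap)         # True: probes the dict's hash table directly
print("lab//HR" in idxmap.keys())  # True as well, but through a needless key view
```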
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (6)
- sample_data/dataset.yaml is excluded by !**/*.yaml
- sample_data/dataset_parquet.yaml is excluded by !**/*.yaml
- sample_data/raw_parquet/admit_vitals.parquet is excluded by !**/*.parquet
- sample_data/raw_parquet/labs.parquet is excluded by !**/*.parquet
- sample_data/raw_parquet/medications.parquet is excluded by !**/*.parquet
- sample_data/raw_parquet/subjects.parquet is excluded by !**/*.parquet
Files selected for processing (2)
- EventStream/data/dataset_polars.py (3 hunks)
- tests/test_e2e_runs.py (5 hunks)
Additional context used
Ruff
tests/test_e2e_runs.py
5-5: Module level import not at top of file (E402)
6-6: Module level import not at top of file (E402)
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
9-9: Module level import not at top of file (E402)
10-10: Module level import not at top of file (E402)
11-11: Module level import not at top of file (E402)
12-12: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
16-16: Module level import not at top of file (E402)
EventStream/data/dataset_polars.py
203-203: Do not compare types, use
isinstance()
(E721)
1664-1664: Function definition does not bind loop variable
m
(B023)
1664-1664: Function definition does not bind loop variable
key_prefix
(B023)
1668-1668: Function definition does not bind loop variable
m
(B023)
1668-1668: Function definition does not bind loop variable
val_prefix
(B023)
1670-1670: Function definition does not bind loop variable
m
(B023)
1670-1670: Function definition does not bind loop variable
val_prefix
(B023)
1673-1673: Function definition does not bind loop variable
m
(B023)
1673-1673: Function definition does not bind loop variable
val_prefix
(B023)
1674-1674: Function definition does not bind loop variable
m
(B023)
1674-1674: Function definition does not bind loop variable
val_prefix
(B023)
1675-1675: Function definition does not bind loop variable
m
(B023)
1675-1675: Function definition does not bind loop variable
val_prefix
(B023)
1682-1682: Function definition does not bind loop variable
m
(B023)
2029-2029: Use key in dict instead of key in dict.keys() (SIM118). Remove .keys()
Additional comments not posted (2)
tests/test_e2e_runs.py (2)
38-38: Approved addition of Parquet dataset paths

The addition of paths for Parquet datasets in the setUp method aligns with the PR's objective to enhance dataset handling capabilities.
121-131: Approved modifications to handle Parquet datasets in the build_dataset method

The updates to the build_dataset method to include handling for both standard and Parquet dataset formats are appropriate and align with the PR's objectives.
Actionable comments posted: 2
Outside diff range and nitpick comments (5)
EventStream/baseline/FT_task_baseline.py (1)
Line range hint 369-369: Consider defining abstract methods in BaseSklearnModuleConfig

BaseSklearnModuleConfig is an abstract base class without any abstract methods. This might be intentional, but if there are common methods that should be implemented by all subclasses, consider declaring them as abstract.

EventStream/data/dataset_polars.py (4)
Line range hint 693-715: Update subject event properties with enhanced data handling

The method _update_subject_event_properties has been updated to handle subject_id data types more robustly, aligning with the changes in the schema. The logic for calculating n_events_per_subject is well-implemented using group_by and agg. However, consider adding error handling or logging for potential exceptions during these operations, especially when casting types or accessing dataframe columns that might not exist.
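For orientation, a minimal polars sketch of the group_by/agg counting pattern this comment refers to (toy data; in the project the frame is self.events_df):

```python
import polars as pl

events_df = pl.DataFrame({"subject_id": [1, 1, 2], "event_type": ["ADMIT", "LAB", "ADMIT"]})

# Count events per subject, then cast IDs to strings so they survive JSON serialization.
n_events = events_df.group_by("subject_id").agg(pl.len().alias("count"))
n_events_per_subject = dict(n_events.with_columns(pl.col("subject_id").cast(pl.Utf8)).iter_rows())
print(n_events_per_subject)  # {'1': 2, '2': 1} (group_by order may vary)
```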
Line range hint 203-203: Replace type comparison with isinstance()

Direct type comparison using type() is generally discouraged in favor of isinstance(), which supports inheritance and is more flexible.

- if type(qq) is Path:
+ if isinstance(qq, Path):
Line range hint 1669-1687: Ensure loop variables are correctly bound in lambda expressions

Several lambda expressions inside loops do not bind the loop variables m, key_prefix, and val_prefix correctly, which might lead to unexpected behavior due to late binding in Python. Consider using additional arguments in the lambda to correctly bind these variables at each iteration.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count")
+ .name.map(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count")
Line range hint 2034-2034: Optimize dictionary key existence check

Using key in dict.keys() is redundant and less efficient than key in dict. Simplifying this can improve both readability and performance.

- if mod_col not in source_df.keys():
+ if mod_col not in source_df:
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
- configs/dataset_base.yaml is excluded by !**/*.yaml
Files selected for processing (4)
- EventStream/baseline/FT_task_baseline.py (3 hunks)
- EventStream/data/dataset_base.py (1 hunks)
- EventStream/data/dataset_polars.py (7 hunks)
- tests/test_e2e_runs.py (5 hunks)
Additional context used
Ruff
tests/test_e2e_runs.py
5-5: Module level import not at top of file (E402)
6-6: Module level import not at top of file (E402)
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
9-9: Module level import not at top of file (E402)
10-10: Module level import not at top of file (E402)
11-11: Module level import not at top of file (E402)
12-12: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
16-16: Module level import not at top of file (E402)
EventStream/baseline/FT_task_baseline.py
246-246: Do not compare types, use
isinstance()
(E721)
369-369:
BaseSklearnModuleConfig
is an abstract base class, but it has no abstract methods (B024)
397-397: No explicit
stacklevel
keyword argument found (B028)
545-545: Yoda conditions are discouraged, use frac_subjects > 0 instead (SIM300). Replace Yoda condition with frac_subjects > 0
EventStream/data/dataset_base.py
1488-1488: Loop control variable m not used within loop body (B007). Rename unused m to _m
EventStream/data/dataset_polars.py
203-203: Do not compare types, use
isinstance()
(E721)
1669-1669: Function definition does not bind loop variable
m
(B023)
1669-1669: Function definition does not bind loop variable
key_prefix
(B023)
1673-1673: Function definition does not bind loop variable
m
(B023)
1673-1673: Function definition does not bind loop variable
val_prefix
(B023)
1675-1675: Function definition does not bind loop variable
m
(B023)
1675-1675: Function definition does not bind loop variable
val_prefix
(B023)
1678-1678: Function definition does not bind loop variable
m
(B023)
1678-1678: Function definition does not bind loop variable
val_prefix
(B023)
1679-1679: Function definition does not bind loop variable
m
(B023)
1679-1679: Function definition does not bind loop variable
val_prefix
(B023)
1680-1680: Function definition does not bind loop variable
m
(B023)
1680-1680: Function definition does not bind loop variable
val_prefix
(B023)
1687-1687: Function definition does not bind loop variable
m
(B023)
2034-2034: Use key in dict instead of key in dict.keys() (SIM118). Remove .keys()
Additional comments not posted (10)
tests/test_e2e_runs.py (3)
Line range hint 5-14: Reorder imports to conform to PEP 8 guidelines

Imports should be at the top of the file unless there's a specific reason (e.g., avoiding circular dependencies). If the setup requires imports to be delayed, please document the reason to maintain clarity.

+ import json
+ import os
+ import subprocess
+ import unittest
+ from itertools import product
+ from pathlib import Path
+ from tempfile import TemporaryDirectory
+ from typing import Any
+ import polars as pl

  import rootutils

  root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

Tools
Ruff
5-5: Module level import not at top of file (E402)
6-6: Module level import not at top of file (E402)
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
121-131: Ensure consistency in dataset building for Parquet format

The method now handles building datasets from Parquet files. Verify that the configurations and paths used align with the expected dataset structures and that the Parquet dataset is correctly handled in subsequent tests.

Verification successful

The usage of dataset_from_parquet appears to be correctly handled in the tests. The path is used in assertions and commands to ensure the dataset is built and verified properly.

tests/test_e2e_runs.py:
"dataset_from_parquet",
f"save_dir={self.paths['dataset_from_parquet']}",
self._test_dataset_output((root / "sample_data" / "raw"), self.paths["dataset_from_parquet"])

Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that the Parquet dataset is correctly handled.
# Test: Search for usage of the "dataset_from_parquet" path in test assertions.
rg --type python $'dataset_from_parquet'
Length of output: 70

Script:
#!/bin/bash
# Description: Verify that the Parquet dataset is correctly handled.
# Test: Search for usage of the "dataset_from_parquet" path in test assertions.
rg --type py $'dataset_from_parquet'
Length of output: 304
38-38: Ensure the new dataset directory is utilized correctly

The addition of "dataset_from_parquet" suggests handling a new data format. Confirm that this directory is properly integrated into the test workflows.

Verification successful

The new dataset directory is utilized correctly. The search results show that the "dataset_from_parquet" directory is referenced multiple times in the tests/test_e2e_runs.py file, indicating its integration into the test workflows.

tests/test_e2e_runs.py:
- Line: "dataset_from_parquet",
- Line: f"save_dir={self.paths['dataset_from_parquet']}"
- Line: self._test_dataset_output((root / "sample_data" / "raw"), self.paths["dataset_from_parquet"])

Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that the "dataset_from_parquet" directory is used in relevant test cases.
# Test: Search for usage of the "dataset_from_parquet" directory.
rg --type python $'dataset_from_parquet'
Length of output: 70

Script:
#!/bin/bash
# Description: Verify that the "dataset_from_parquet" directory is used in relevant test cases.
# Test: Search for usage of the "dataset_from_parquet" directory.
rg --type py 'dataset_from_parquet'
Length of output: 304
EventStream/baseline/FT_task_baseline.py (3)
Line range hint 246-246: Use isinstance for type checks

Replace direct type comparisons with isinstance for better style and reliability.

- if type(window_sizes) is not list:
+ if not isinstance(window_sizes, list):
Line range hint 397-397: Specify stacklevel in warning calls

When issuing warnings, it's helpful to specify the stacklevel so that the warning points to the caller's location in the stack, making it easier to locate the source of the issue.

- warnings.warn(f"Parameter {k} not in signature of {cls.__name__}. Dropping")
+ warnings.warn(f"Parameter {k} not in signature of {cls.__name__}. Dropping", stacklevel=2)
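A tiny self-contained illustration of what stacklevel changes (function and parameter names are made up):

```python
import warnings

def drop_unknown_param(k):
    # With stacklevel=2 the warning is attributed to drop_unknown_param's caller,
    # i.e. the line that actually passed the bad parameter.
    warnings.warn(f"Parameter {k} not in signature. Dropping", stacklevel=2)

drop_unknown_param("n_estimatorz")  # the reported location is this call site
```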
Line range hint 545-545: Correct the order in the conditional expression

It's more readable to place the variable on the left side of the comparison.

- case float() as frac_subjects if 0 < frac_subjects and frac_subjects < 1:
+ case float() as frac_subjects if frac_subjects > 0 and frac_subjects < 1:

EventStream/data/dataset_base.py (4)
1370-1370: Approval of the do_overwrite handling in data caching

The addition of the do_overwrite condition in the cache_deep_learning_representation method is a good practice. It allows better control over data management by providing an option to avoid overwriting existing files, which can prevent data loss.
1370-1370: Proper management of shard file existence and overwriting logic

The logic to check the existence of the shard configuration file (shards_fp) and handle the do_overwrite flag is well-implemented. This ensures that data integrity is maintained by not accidentally overwriting data unless explicitly instructed.
1370-1370: Well-structured method with robust error handling

The cache_deep_learning_representation method is well-structured and includes comprehensive error handling and logging. This design enhances the robustness and maintainability of the method, making it easier to manage complex data operations and debug issues.
1370-1370: Consider renaming unused loop variable

The static analysis tool flagged that the variable m in the loop on line 1488 is not used within the loop body. It's good practice to replace unused loop variables with _ to indicate that they are intentionally unused.

- for m, cfg in self.measurement_configs.items():
+ for _, cfg in self.measurement_configs.items():

Likely invalid or redundant comment.
EventStream/data/dataset_polars.py
Outdated
logger.debug(
    f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
    f"{col} to {df.schema[col]}"
)
if isinstance(list(incl_targets)[0], str):
    incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
else:
    incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])

incl_list = incl_list.cast(df.schema[col])

logger.debug(f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}")
Enhance filtering logic with dynamic type casting.

The _filter_col_inclusion method introduces dynamic type casting based on the dataframe's schema, which is a robust approach for handling various data types. However, the error handling could be improved by providing more specific error messages related to the failure of type conversion, and possibly logging the values that caused the error.
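A minimal sketch of the cast-then-filter idea, with made-up data and dtypes (the real method derives the target dtype from df.schema):

```python
import polars as pl

df = pl.DataFrame({"subject_id": pl.Series([1, 2, 3], dtype=pl.UInt32)})

# Inclusion targets may arrive as strings (e.g. from JSON); casting them to the
# column's schema dtype lets the is_in filter compare like with like.
incl_targets = ["1", "3"]
incl_list = pl.Series(incl_targets, dtype=pl.Utf8).cast(df.schema["subject_id"])
print(df.filter(pl.col("subject_id").is_in(incl_list)))
```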
tests/test_e2e_runs.py
Outdated
def _test_dataset_output(self, raw_data_root: Path, dataset_save_dir: Path):
    DL_save_dir = dataset_save_dir / "DL_reps"

    train_files = list((DL_save_dir / "train").glob("*.parquet"))
    tuning_files = list((DL_save_dir / "tuning").glob("*.parquet"))
    held_out_files = list((DL_save_dir / "held_out").glob("*.parquet"))

    self.assertTrue(len(train_files) > 0)
    self.assertTrue(len(tuning_files) > 0)
    self.assertTrue(len(held_out_files) > 0)

    train_DL_reps = pl.concat([pl.read_parquet(f, use_pyarrow=False) for f in train_files])
    tuning_DL_reps = pl.concat([pl.read_parquet(f, use_pyarrow=False) for f in tuning_files])
    held_out_DL_reps = pl.concat([pl.read_parquet(f, use_pyarrow=False) for f in held_out_files])

    DL_shards = json.loads((dataset_save_dir / "DL_shards.json").read_text())

    ESD_subjects = pl.read_parquet(dataset_save_dir / "subjects_df.parquet", use_pyarrow=False)

    # Check that the DL shards are correctly partitioned.
    all_subjects = set(ESD_subjects["subject_id"].unique().to_list())

    self.assertEqual(len(all_subjects), len(ESD_subjects))

    all_subj_in_DL_shards = set().union(*DL_shards.values())

    self.assertEqual(all_subjects, all_subj_in_DL_shards)

    train_DL_subjects = set(train_DL_reps["subject_id"].unique().to_list())
    tuning_DL_subjects = set(tuning_DL_reps["subject_id"].unique().to_list())
    held_out_DL_subjects = set(held_out_DL_reps["subject_id"].unique().to_list())

    self.assertEqual(len(train_DL_subjects & tuning_DL_subjects), 0)
    self.assertEqual(len(train_DL_subjects & held_out_DL_subjects), 0)
    self.assertEqual(len(tuning_DL_subjects & held_out_DL_subjects), 0)

    self.assertTrue(len(train_DL_subjects) > len(tuning_DL_subjects))
    self.assertTrue(len(train_DL_subjects) > len(held_out_DL_subjects))

    all_DL_subjects = train_DL_subjects | tuning_DL_subjects | held_out_DL_subjects

    self.assertEqual(all_DL_subjects, all_subjects)
Enhance documentation for _test_dataset_output.

The method checks the integrity of dataset outputs in Parquet format. Adding more detailed comments explaining the checks performed (e.g., why specific sets of subjects are compared) would enhance maintainability and readability.
+ # Check that the DL shards are correctly partitioned.
+ # Ensure no subjects overlap between training, tuning, and held-out datasets.
+ # Verify that the number of subjects in training is greater than in other datasets.
Committable suggestion was skipped due to low confidence.
Actionable comments posted: 0
Outside diff range and nitpick comments (4)
EventStream/data/dataset_polars.py (4)
Line range hint 203-203: Replace type comparison with isinstance() for Pythonic code style

Direct type comparisons are not recommended. Use isinstance() for checking object types to ensure compatibility with Python's type system.

- if type(qq) is Path:
+ if isinstance(qq, Path):
Line range hint 693-715: Consider enhancing the handling of subjects with no events

The current implementation efficiently identifies subjects without events and assigns them a count of 0. However, this could be further optimized or simplified using direct polars operations, which might improve performance and readability.

- subjects_with_no_events = set(subjects_df_subjects["subject_id"].to_list()) - self.subject_ids
- for sid in subjects_with_no_events:
-     self.n_events_per_subject[sid] = 0
- self.subject_ids.update(subjects_with_no_events)
+ # Assuming `subjects_df_subjects` and `self.subject_ids` are both sets of subject IDs
+ self.n_events_per_subject.update({sid: 0 for sid in subjects_df_subjects - self.subject_ids})
+ self.subject_ids.update(subjects_df_subjects - self.subject_ids)
Line range hint 2033-2033: Use direct dictionary membership test instead of .keys() for efficiency

Directly checking membership in a dictionary is more efficient and Pythonic than using .keys().

- if mod_col not in source_df.keys():
+ if mod_col not in source_df:
Line range hint 1668-1686: Ensure loop variables are correctly bound in closures

Function definitions within loops should explicitly bind loop variables to avoid referencing the wrong variable due to late binding.

- key_col = cs.starts_with(key_prefix)
- val_col = cs.starts_with(val_prefix).drop_nans().cast(pl.Float32)
+ key_col = cs.starts_with(key_prefix[:])
+ val_col = cs.starts_with(val_prefix[:]).drop_nans().cast(pl.Float32)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- EventStream/data/dataset_polars.py (7 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py
203-203: Do not compare types, use
isinstance()
(E721)
1668-1668: Function definition does not bind loop variable
m
(B023)
1668-1668: Function definition does not bind loop variable
key_prefix
(B023)
1672-1672: Function definition does not bind loop variable
m
(B023)
1672-1672: Function definition does not bind loop variable
val_prefix
(B023)
1674-1674: Function definition does not bind loop variable
m
(B023)
1674-1674: Function definition does not bind loop variable
val_prefix
(B023)
1677-1677: Function definition does not bind loop variable
m
(B023)
1677-1677: Function definition does not bind loop variable
val_prefix
(B023)
1678-1678: Function definition does not bind loop variable
m
(B023)
1678-1678: Function definition does not bind loop variable
val_prefix
(B023)
1679-1679: Function definition does not bind loop variable
m
(B023)
1679-1679: Function definition does not bind loop variable
val_prefix
(B023)
1686-1686: Function definition does not bind loop variable
m
(B023)
2033-2033: Use
key in dict
instead ofkey in dict.keys()
(SIM118)Remove
.keys()
Additional comments not posted (1)
EventStream/data/dataset_polars.py (1)
731-742: Enhance filtering logic with dynamic type casting

The _filter_col_inclusion method introduces dynamic type casting based on the dataframe's schema, which is a robust approach for handling various data types. However, the error handling could be improved by providing more specific error messages related to the failure of type conversion, and possibly logging the values that caused the error.
Actionable comments posted: 7
Outside diff range and nitpick comments (4)
EventStream/data/pytorch_dataset.py (1)
Line range hint 209-209: Avoid shadowing the loop variable

The variable n_events is used both as a loop control variable and an iterable, which can lead to confusing or erroneous behavior.

- for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)):
+ for i, (subj, num_events) in enumerate(zip(subject_ids, n_events)):

EventStream/data/dataset_base.py (1)
Line range hint 1485-1485: Rename unused loop variable

The loop variable m is not used within the loop body. It can be renamed to _ to indicate that it is intentionally unused.

- for m, cfg in self.measurement_configs.items():
+ for _, cfg in self.measurement_configs.items():

EventStream/data/dataset_polars.py (2)
Line range hint 203-203: Use isinstance() for type checking instead of direct type comparison

Using isinstance() is the recommended way to check types in Python, as it supports inheritance.

- if type(qq) is Path:
+ if isinstance(qq, Path):
Line range hint 2039-2039: Simplify dictionary key existence check

Use direct key checking in dictionaries without calling .keys().

- if mod_col not in source_df.keys():
+ if mod_col not in source_df:
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- EventStream/data/dataset_base.py (2 hunks)
- EventStream/data/dataset_polars.py (7 hunks)
- EventStream/data/pytorch_dataset.py (2 hunks)
- tests/test_e2e_runs.py (5 hunks)
Additional context used
Ruff
tests/test_e2e_runs.py
5-5: Module level import not at top of file (E402)
6-6: Module level import not at top of file (E402)
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
9-9: Module level import not at top of file (E402)
10-10: Module level import not at top of file (E402)
11-11: Module level import not at top of file (E402)
12-12: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
16-16: Module level import not at top of file (E402)
EventStream/data/pytorch_dataset.py
209-209: Loop control variable n_events overrides iterable it iterates (B020)

EventStream/data/dataset_base.py

1485-1485: Loop control variable m not used within loop body (B007). Rename unused m to _m
EventStream/data/dataset_polars.py
203-203: Do not compare types, use
isinstance()
(E721)
1674-1674: Function definition does not bind loop variable
m
(B023)
1674-1674: Function definition does not bind loop variable
key_prefix
(B023)
1678-1678: Function definition does not bind loop variable
m
(B023)
1678-1678: Function definition does not bind loop variable
val_prefix
(B023)
1680-1680: Function definition does not bind loop variable
m
(B023)
1680-1680: Function definition does not bind loop variable
val_prefix
(B023)
1683-1683: Function definition does not bind loop variable
m
(B023)
1683-1683: Function definition does not bind loop variable
val_prefix
(B023)
1684-1684: Function definition does not bind loop variable
m
(B023)
1684-1684: Function definition does not bind loop variable
val_prefix
(B023)
1685-1685: Function definition does not bind loop variable
m
(B023)
1685-1685: Function definition does not bind loop variable
val_prefix
(B023)
1692-1692: Function definition does not bind loop variable
m
(B023)
2039-2039: Use key in dict instead of key in dict.keys() (SIM118). Remove .keys()
Additional comments not posted (4)
tests/test_e2e_runs.py (2)
52-108: Review of new method _test_dataset_output

This method effectively tests the integrity of dataset outputs in Parquet format. It checks that the deep learning representations are correctly partitioned across training, tuning, and held-out datasets, ensures that no subjects overlap between these partitions, and checks the counts of subjects in each partition.

The detailed comments added in the method enhance the readability and maintainability of the code. However, consider adding a comment explaining the purpose of casting subject IDs to a consistent type in line 79, as this is a subtle yet important step for ensuring data consistency.
135-145: Ensure consistent handling of dataset paths

The methods build_dataset and build_dataset_from_parquet both use the root variable to resolve paths. Ensure that the root variable is consistently set up across different environments to avoid path resolution issues.

Verification successful

The root variable is consistently set up using rootutils.setup_root across different environments, as seen in the following files:

- sample_data/generate_synthetic_data.py
- tests/test_e2e_runs.py
- sample_data/examine_synthetic_data.ipynb

This setup ensures that the root variable is correctly configured, addressing the path resolution concerns.

Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that the `root` variable is set up correctly across different environments.
# Test: Check the setup configuration for `root`.
cat rootutils.py | grep 'setup_root'
Length of output: 81

Script:
#!/bin/bash
# Description: Search for the definition and setup of the `root` variable across the codebase.
# Search for the `root` variable definition
rg 'root\s*='
# Search for functions or methods that might setup `root`
rg 'def\s+\w+\(.*root.*\)'
Length of output: 505
EventStream/data/pytorch_dataset.py (1)
474-474: Include subject ID conditionally

The code correctly includes the subject_id in the output dictionary if configured to do so. This approach is good for flexibility and data privacy, allowing the inclusion of sensitive identifiers only when necessary.

EventStream/data/dataset_polars.py (1)
714-715: Potential performance improvement in subject event count collection

Consider using a more direct approach to compute subjects_with_no_events by utilizing set operations on series directly from polars, which might be more efficient.

- subjects_df_subjects = self.subjects_df.select(pl.col("subject_id").cast(pl.Utf8))
- subjects_with_no_events = set(subjects_df_subjects["subject_id"].to_list()) - self.subject_ids
+ subjects_with_no_events = set(self.subjects_df.get_column("subject_id").cast(pl.Utf8).to_list()) - self.subject_ids
@@ -1367,7 +1367,7 @@ def cache_deep_learning_representation(
      NRT_dir = self.config.save_dir / "NRT_reps"

      shards_fp = self.config.save_dir / "DL_shards.json"
-     if shards_fp.exists():
+     if shards_fp.exists() and not do_overwrite:
Consider using json.load() instead of json.loads() with read_text().

The current implementation reads the entire file content into a string and then parses it. Passing an open file handle to json.load() avoids the intermediate read_text() call.
- shards = json.loads(shards_fp.read_text())
+ with open(shards_fp, 'r') as file:
+ shards = json.load(file)
@@ -2,6 +2,7 @@

  root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

  import json
Reorder module-level imports to the top of the file.
To comply with PEP 8, module-level imports should be at the top of the file.
+import json
Committable suggestion was skipped due to low confidence.
Tools
Ruff
5-5: Module level import not at top of file (E402)
@@ -461,7 +461,7 @@ def _seeded_getitem(self, idx: int) -> dict[str, list[float]]:

      subject_id, st, end = self.index[idx]

-     shard = self.subj_map[subject_id]
+     shard = self.subj_map[str(subject_id)]
Correct the handling of subject ID.

The subject_id is being accessed as a string, which might not be consistent with other parts of the code where it could be treated as an integer or another type. This inconsistency can lead to errors or unexpected behavior.
- shard = self.subj_map[str(subject_id)]
+ shard = self.subj_map[subject_id] # Assuming subject_id is consistently used in its native type across the codebase.
EventStream/data/dataset_polars.py
Outdated
n_events = self.events_df.group_by("subject_id").agg(pl.len().alias("count"))
# here we cast to str to avoid issues with the subject_id column being various other types as we
# will eventually JSON serialize it.
n_events = n_events.with_columns(pl.col("subject_id").cast(pl.Utf8))
self.n_events_per_subject = {
    subject_id: count for subject_id, count in zip(n_events["subject_id"], n_events["count"])
}
Refactor subject event properties calculation for clarity and efficiency.

The calculation of n_events_per_subject can be streamlined by constructing the dictionary directly from the rows of the n_events dataframe, which avoids zipping the two series by hand.

- n_events = n_events.with_columns(pl.col("subject_id").cast(pl.Utf8))
- self.n_events_per_subject = {
-     subject_id: count for subject_id, count in zip(n_events["subject_id"], n_events["count"])
- }
+ self.n_events_per_subject = dict(
+     n_events.with_columns(pl.col("subject_id").cast(pl.Utf8)).iter_rows()
+ )
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- EventStream/data/pytorch_dataset.py (2 hunks)
- tests/data/test_pytorch_dataset.py (1 hunks)
Additional context used
Ruff
EventStream/data/pytorch_dataset.py
209-209: Loop control variable
n_events
overrides iterable it iterates (B020)
Additional comments not posted (2)
tests/data/test_pytorch_dataset.py (1)

323-323: Convert subject IDs to strings in shards dictionary

The conversion of subject_id to a string ensures consistency with the expected data types in other parts of the application. This change aligns with modifications in other files where subject_id is treated as a string to maintain uniformity across the codebase.

EventStream/data/pytorch_dataset.py (1)
464-473: Ensure consistent handling of subject_id as a string

The conversion of subject_id to a string before checking in subj_map ensures consistency with the new data type expectations. This change is crucial to avoid type mismatches which can lead to runtime errors. However, ensure that all parts of the application that interact with subject_id are updated to handle it as a string to maintain uniformity.
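The reason string keys keep appearing here is a JSON property worth seeing once (toy values):

```python
import json

# JSON object keys are always strings, so integer keys do not survive a round-trip.
subj_map = json.loads(json.dumps({10: "train/0"}))
print(subj_map)  # {'10': 'train/0'}

# Any mapping that has passed through JSON must therefore be queried with str(subject_id).
subject_id = 10
print(subj_map[str(subject_id)])  # 'train/0'
```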
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- EventStream/data/dataset_base.py (1 hunks)
- tests/test_e2e_runs.py (5 hunks)
Files skipped from review as they are similar to previous changes (2)
- EventStream/data/dataset_base.py
- tests/test_e2e_runs.py
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## dev #117 +/- ##
==========================================
- Coverage 86.56% 86.06% -0.50%
==========================================
Files 29 29
Lines 6556 6387 -169
==========================================
- Hits 5675 5497 -178
- Misses 881 890 +9

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Outside diff range and nitpick comments (7)
EventStream/baseline/FT_task_baseline.py (4)
Line range hint 50-81: Refactor the load_flat_rep function to improve readability and maintainability

The function has complex logic that could benefit from being broken down into smaller, more manageable functions. This would improve readability and maintainability.

+ def _load_params(self, flat_dir):
+     """Helper function to load parameters."""
+     params_fp = flat_dir / "params.json"
+     if not params_fp.is_file():
+         raise FileNotFoundError("Flat representation files haven't been written!")
+     with open(params_fp) as f:
+         return json.load(f)

- params_fp = flat_dir / "params.json"
- if not params_fp.is_file():
-     raise FileNotFoundError("Flat representation files haven't been written!")
- with open(params_fp) as f:
-     params = json.load(f)
Line range hint 254-254: Use isinstance() instead of comparing types directly

Direct type comparison is not recommended. Use isinstance() for checking object types to ensure compatibility with inheritance.

- if type(window_sizes) is not list:
+ if not isinstance(window_sizes, list):
Line range hint 377-377: Consider defining abstract methods in BaseSklearnModuleConfig

As BaseSklearnModuleConfig is an abstract base class, it should define at least one abstract method to justify its abstract nature.

+ @abstractmethod
+ def example_method(self):
+     """Example abstract method."""
+     pass
Line range hint 553-553: Adjust conditional expression for clarity

Replace the Yoda condition with a more conventional expression to improve readability.

- case float() as frac_subjects if 0 < frac_subjects and frac_subjects < 1:
+ case float() as frac_subjects if frac_subjects > 0 and frac_subjects < 1:

EventStream/data/dataset_polars.py (3)
Line range hint 203-203: Use isinstance() instead of comparing types directly

Direct type comparison using type() is not recommended, as it can lead to errors if subclassing is involved. Use isinstance() for type checks to ensure proper handling of polymorphism.

- if type(qq) is Path:
+ if isinstance(qq, Path):
Line range hint 1683-1694: Ensure loop variables are correctly bound in lambda functions

Several lambda functions in the code do not bind the loop variables correctly, leading to potential bugs where the lambda captures the variable from the outer scope instead of the loop iteration. This can be fixed by defaulting the loop variable in the lambda's argument list.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
+ .name.map(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),

Also applies to: 1701-1701
Line range hint 2048-2048: Optimize dictionary key access

Use direct dictionary key checks instead of calling the .keys() method, which is unnecessary and less efficient.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- EventStream/baseline/FT_task_baseline.py (5 hunks)
- EventStream/data/dataset_polars.py (7 hunks)
- scripts/build_flat_reps.py (1 hunks)
Additional context used
Ruff
EventStream/baseline/FT_task_baseline.py
254-254: Do not compare types, use
isinstance()
(E721)
377-377:
BaseSklearnModuleConfig
is an abstract base class, but it has no abstract methods (B024)
405-405: No explicit
stacklevel
keyword argument found (B028)
553-553: Yoda conditions are discouraged, use frac_subjects > 0 instead (SIM300). Replace Yoda condition with frac_subjects > 0
EventStream/data/dataset_polars.py
203-203: Do not compare types, use
isinstance()
(E721)
1683-1683: Function definition does not bind loop variable
m
(B023)
1683-1683: Function definition does not bind loop variable
key_prefix
(B023)
1687-1687: Function definition does not bind loop variable
m
(B023)
1687-1687: Function definition does not bind loop variable
val_prefix
(B023)
1689-1689: Function definition does not bind loop variable
m
(B023)
1689-1689: Function definition does not bind loop variable
val_prefix
(B023)
1692-1692: Function definition does not bind loop variable
m
(B023)
1692-1692: Function definition does not bind loop variable
val_prefix
(B023)
1693-1693: Function definition does not bind loop variable
m
(B023)
1693-1693: Function definition does not bind loop variable
val_prefix
(B023)
1694-1694: Function definition does not bind loop variable
m
(B023)
1694-1694: Function definition does not bind loop variable
val_prefix
(B023)
1701-1701: Function definition does not bind loop variable
m
(B023)
2048-2048: Use key in dict instead of key in dict.keys() (SIM118). Remove .keys()
Additional comments not posted (2)
EventStream/data/dataset_polars.py (2)
693-693: Set subject_id_dtype after checking that events_df is not None

The assignment of self.subject_id_dtype should be inside the conditional check to ensure events_df is not None before accessing its schema.

- self.subject_id_dtype = self.events_df.schema["subject_id"]
- if self.events_df is not None:
+ if self.events_df is not None:
+     self.subject_id_dtype = self.events_df.schema["subject_id"]
740-753: Improve type conversion and error handling in _filter_col_inclusion

The conversion of inclusion targets to the appropriate data type can be simplified by using polars built-in functions, and the error message can be enhanced for better clarity.

- logger.debug(
-     f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
-     f"{col} to {df.schema[col]}"
- )
- if isinstance(list(incl_targets)[0], str):
-     incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
- else:
-     incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])
- incl_list = incl_list.cast(df.schema[col])
- logger.debug(
-     f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}"
- )
+ try:
+     incl_list = pl.Series(list(incl_targets)).cast(df.schema[col])
+ except TypeError as e:
+     raise TypeError(f"Failed to cast inclusion targets to column '{col}' schema type.") from e
scripts/build_flat_reps.py
Outdated
cfg = hydra.utils.instantiate(cfg, _convert_="all")
save_dir = Path(cfg.pop("save_dir"))
window_sizes = cfg.pop("window_sizes")
subjects_per_output_file = (
    cfg.pop("subjects_per_output_file") if "subjects_per_output_file" in cfg else None
)

# Build flat reps for specified task and window sizes
ESD = Dataset.load(save_dir)
feature_inclusion_frequency, include_only_measurements = ESD._resolve_flat_rep_cache_params(
    feature_inclusion_frequency=None, include_only_measurements=None
)
cache_kwargs = dict(
    subjects_per_output_file=subjects_per_output_file,
    feature_inclusion_frequency=feature_inclusion_frequency,  # 0.1
    window_sizes=window_sizes,
    include_only_measurements=include_only_measurements,
    do_overwrite=False,
    do_update=True,
)
ESD.cache_flat_representation(**cache_kwargs)


if __name__ == "__main__":
    main()
Ensure proper error handling in the main function.

The function uses several pop operations on the configuration dictionary without checking if the keys exist, which could lead to a KeyError if they are missing. Consider adding default values or error handling to prevent runtime errors.
- save_dir = Path(cfg.pop("save_dir"))
- window_sizes = cfg.pop("window_sizes")
+ save_dir = Path(cfg.get("save_dir", "default_directory"))
+ window_sizes = cfg.get("window_sizes", [])
@hydra.main(version_base=None, config_path="../configs", config_name="dataset_base")
def main(cfg: DictConfig):
Add documentation for the main function.

The new main function lacks a docstring. Adding a docstring would help other developers understand the purpose and usage of this function.
+ """
+ Main function to build flat representation datasets.
+ Args:
+ cfg (DictConfig): Configuration object containing parameters for dataset processing.
+ """
Actionable comments posted: 2
Outside diff range and nitpick comments (2)
EventStream/data/dataset_polars.py (2)
Line range hint 1686-1697: Bind loop variables m and val_prefix inside the lambda function

The loop variables m and val_prefix are not bound inside the lambda function, which may lead to unexpected behavior if the function is executed after the loop has finished iterating. Use a default argument to bind the current value of the variable at each iteration.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
+ .name.map(lambda c, m=m, val_prefix=val_prefix: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
Line range hint 2049-2049: Optimize dictionary key existence check

Instead of using key in dict.keys(), it's more efficient and Pythonic to use key in dict.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- EventStream/data/dataset_polars.py (14 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py
203-203: Do not compare types, use
isinstance()
(E721)
1686-1686: Function definition does not bind loop variable
m
(B023)
1686-1686: Function definition does not bind loop variable
key_prefix
(B023)
1690-1690: Function definition does not bind loop variable
m
(B023)
1690-1690: Function definition does not bind loop variable
val_prefix
(B023)
1692-1692: Function definition does not bind loop variable
m
(B023)
1692-1692: Function definition does not bind loop variable
val_prefix
(B023)
1695-1695: Function definition does not bind loop variable
m
(B023)
1695-1695: Function definition does not bind loop variable
val_prefix
(B023)
1696-1696: Function definition does not bind loop variable
m
(B023)
1696-1696: Function definition does not bind loop variable
val_prefix
(B023)
1697-1697: Function definition does not bind loop variable
m
(B023)
1697-1697: Function definition does not bind loop variable
val_prefix
(B023)
1704-1704: Function definition does not bind loop variable
m
(B023)
2049-2049: Use key in dict instead of key in dict.keys() (SIM118). Remove .keys()
@@ -690,6 +690,7 @@ def _agg_by_time(self):
      )

  def _update_subject_event_properties(self):
+     self.subject_id_dtype = self.events_df.schema["subject_id"]
Ensure events_df is not None before accessing its schema.

The assignment of self.subject_id_dtype should be inside the conditional check to ensure events_df is not None before accessing its schema.

- self.subject_id_dtype = self.events_df.schema["subject_id"]
- if self.events_df is not None:
+ if self.events_df is not None:
+     self.subject_id_dtype = self.events_df.schema["subject_id"]
logger.debug(
    f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
    f"{col} to {df.schema[col]}"
)
if isinstance(list(incl_targets)[0], str):
    incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
else:
    incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])

incl_list = incl_list.cast(df.schema[col])

logger.debug(
    f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}"
)
Improve type conversion and error handling in _filter_col_inclusion.

The conversion of inclusion targets to the appropriate data type can be simplified by using polars built-in functions, and the error message can be enhanced for better clarity.
- logger.debug(
- f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
- f"{col} to {df.schema[col]}"
- )
- if isinstance(list(incl_targets)[0], str):
- incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
- else:
- incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])
- incl_list = incl_list.cast(df.schema[col])
- logger.debug(
- f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}"
- )
+ try:
+ incl_list = pl.Series(list(incl_targets)).cast(df.schema[col])
+ except TypeError as e:
+ raise TypeError(f"Failed to cast inclusion targets to column '{col}' schema type.") from e
Actionable comments posted: 0
Outside diff range and nitpick comments (3)
EventStream/data/dataset_polars.py (3)
Line range hint 203-203: Refactor to use `isinstance` for type checking.

Using `isinstance()` instead of comparing types directly is more Pythonic and supports subclass checking.

- if type(list(incl_targets)[0]) is str:
+ if isinstance(list(incl_targets)[0], str):

Line range hint 1683-1694: Function definition does not bind loop variables `m` and `val_prefix`.

The loop variables `m` and `val_prefix` are not bound in the lambda function, which might lead to unexpected behavior due to late binding (a standalone illustration follows after this list).

- .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/count"),
+ .name.map(lambda c, m=m, val_prefix=val_prefix: f"dynamic/{m}/{c.replace(val_prefix, '')}/count"),

Line range hint 2046-2046: Use `key in dict` instead of `key in dict.keys()`.

It is more efficient and Pythonic to check for key existence directly in the dictionary.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
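To make the late-binding pitfall from the second nitpick concrete, here is a minimal standalone sketch; the names are illustrative and not taken from the project:

```python
# Each lambda closes over the loop variable itself, not its value at the
# time the lambda was defined, so every call sees the final value of m.
fns = [lambda c: f"dynamic/{m}/{c}" for m in ("age", "lab", "vital")]
print([f("count") for f in fns])
# ['dynamic/vital/count', 'dynamic/vital/count', 'dynamic/vital/count']

# Binding m as a default argument freezes its value per iteration.
fns = [lambda c, m=m: f"dynamic/{m}/{c}" for m in ("age", "lab", "vital")]
print([f("count") for f in fns])
# ['dynamic/age/count', 'dynamic/lab/count', 'dynamic/vital/count']
```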
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- EventStream/data/dataset_polars.py (14 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py
203-203: Do not compare types, use `isinstance()` (E721)
1683-1683: Function definition does not bind loop variable `m` (B023)
1683-1683: Function definition does not bind loop variable `key_prefix` (B023)
1687-1687: Function definition does not bind loop variable `m` (B023)
1687-1687: Function definition does not bind loop variable `val_prefix` (B023)
1689-1689: Function definition does not bind loop variable `m` (B023)
1689-1689: Function definition does not bind loop variable `val_prefix` (B023)
1692-1692: Function definition does not bind loop variable `m` (B023)
1692-1692: Function definition does not bind loop variable `val_prefix` (B023)
1693-1693: Function definition does not bind loop variable `m` (B023)
1693-1693: Function definition does not bind loop variable `val_prefix` (B023)
1694-1694: Function definition does not bind loop variable `m` (B023)
1694-1694: Function definition does not bind loop variable `val_prefix` (B023)
1701-1701: Function definition does not bind loop variable `m` (B023)
2046-2046: Use `key in dict` instead of `key in dict.keys()` (SIM118); remove `.keys()`
Additional comments not posted (2)
EventStream/data/dataset_polars.py (2)
693-693: Ensure `events_df` is not `None` before accessing its schema.
The assignment of `self.subject_id_dtype` should be inside the conditional check to ensure `events_df` is not `None` before accessing its schema.

740-753: Improve type conversion and error handling in `_filter_col_inclusion`.
The conversion of inclusion targets to the appropriate data type can be simplified by using polars built-in functions, and the error message can be enhanced for better clarity.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
EventStream/data/dataset_base.py (1)
Line range hint 1488-1488: Rename unused loop variable.

The loop control variable `m` is not used within the loop body. To avoid confusion, consider renaming it to `_m`.

- for m, cfg in self.measurement_configs.items():
+ for _m, cfg in self.measurement_configs.items():
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
`configs/dataset_base.yaml` is excluded by `!**/*.yaml`
- EventStream/data/dataset_base.py (2 hunks)
- scripts/build_flat_reps.py (1 hunks)
Additional context used
Ruff
EventStream/data/dataset_base.py
1488-1488: Loop control variable
m
not used within loop bodyRename unused
m
to_m
(B007)
Additional comments not posted (9)
scripts/build_flat_reps.py (5)
4-17: Imports LGTM!
The imports for `stackprinter`, `hydra`, `DictConfig`, and `Dataset` are correctly included and necessary for the script's functionality.
19-20: Add documentation for the main function.
The new `main` function lacks a docstring. Adding one would help other developers understand its purpose and usage.
function lacks a docstring. Adding a docstring would help other developers understand the purpose and usage of this function.+ """ + Main function to build flat representation datasets. + Args: + cfg (DictConfig): Configuration object containing parameters for dataset processing. + """[EXISTING_COMMENT]
22-23: Ensure proper error handling in the main function.
The function uses several `pop` operations on the configuration dictionary without checking whether the keys exist, which could raise a `KeyError` if they are missing. Consider adding default values or error handling to prevent runtime errors.
if they are missing. Consider adding default values or error handling to prevent runtime errors.- save_dir = Path(cfg.pop("save_dir")) - window_sizes = cfg.pop("window_sizes") + save_dir = Path(cfg.get("save_dir", "default_directory")) + window_sizes = cfg.get("window_sizes", [])[EXISTING_COMMENT]
41-41: Function call LGTM!
The `cache_flat_representation` function is called with the necessary keyword arguments. Ensure that the function implementation handles these arguments correctly.
44-45: Main function call LGTM!
The main function is correctly called within the `if __name__ == "__main__":` block.
block.EventStream/data/dataset_base.py (4)
1242-1247: Changes LGTM!
The changes to handle parameter files more robustly in the `cache_flat_representation` function are correctly implemented. Ensure that the function handles potential errors and includes necessary documentation.
1373-1390: Optimize JSON file handling.
Consider using `json.load()` on an open file object instead of `json.loads()` with `read_text()`; this parses the file directly rather than first materializing its contents as a separate string.
, which streams the file content.- shards = json.loads(shards_fp.read_text()) + with open(shards_fp, 'r') as file: + shards = json.load(file)[EXISTING_COMMENT]
Line range hint 114-116: Changes LGTM!
The assignment for `self.subject_id_dtype` in the `_update_subject_event_properties` function is correctly implemented.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- EventStream/data/dataset_base.py (5 hunks)
Files skipped from review as they are similar to previous changes (1)
- EventStream/data/dataset_base.py
STILL IN PROGRESS. We just have failing test cases for now.