fix(dataloader): add support in dataloader for csv and parquet #1451

ArslanSaleem · 2024-12-06T17:20:31Z

Important

Add CSV and Parquet format support to DatasetLoader, update schema handling, and enhance YAML template metadata.

Behavior:
- Add support for csv and parquet formats in DatasetLoader in loader.py.
- Remove _read_cache() and integrate its functionality into _read_csv_or_parquet().
- Update _load_from_source() to handle csv and parquet source types.

Schema Handling:
- Use get() for table and update_frequency in loader.py to handle missing keys.
- Update _create_yml_template() in base.py to include destination metadata.

Misc:
- Remove redundant metadata variable in _create_yml_template() in base.py.

This description was created by Ellipsis for 9dfcea4. It will automatically update as commits are pushed.

ellipsis-dev

👍 Looks good to me! Reviewed everything up to 9dfcea4 in 1 minute and 3 seconds

More details

1. pandasai/data_loader/loader.py:132

Draft comment:
The _read_csv_or_parquet method is a good refactor to handle both CSV and Parquet formats. It replaces the previous _read_cache method, reducing code duplication. Ensure that the format parameter is always correctly derived from the schema to avoid runtime errors.
Reason this comment was not posted:
Confidence changes required: 33%
The PR introduces a new method _read_csv_or_parquet to handle reading both CSV and Parquet files. This method is used in multiple places, replacing the previous _read_cache method. The _read_cache method has been removed, which is a good refactor to avoid code duplication. However, the _read_csv_or_parquet method is called with a format parameter, which is derived from the schema. This parameter is used to determine the file format to read. The method should handle cases where the format is not supported, which it does by raising a ValueError. This is a good practice to ensure that unsupported formats are not processed. The refactor seems to be well-implemented, and the logic is consistent with the intended functionality.
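The validate-then-dispatch pattern described in this comment can be sketched roughly as follows. This is an illustration only, assuming pandas as the reader backend; the function name, signature, and error message are assumptions and not the actual code in loader.py.

```python
SUPPORTED_FORMATS = ("csv", "parquet")


def read_csv_or_parquet(file_path: str, file_format: str):
    """Read a CSV or Parquet file into a DataFrame, rejecting other formats."""
    # Validate first so an unsupported format fails before any I/O happens.
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file format: {file_format}")

    # Imported lazily here so the validation path does not need pandas
    # (assumption: the real module likely imports pandas at the top).
    import pandas as pd

    if file_format == "parquet":
        # read_parquet requires a Parquet engine such as pyarrow.
        return pd.read_parquet(file_path)
    return pd.read_csv(file_path)
```

With this shape, a schema that declares a format such as "json" fails with a ValueError instead of silently falling through to one of the readers.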

2. pandasai/data_loader/loader.py:156

Draft comment:
The _load_from_source method now directly handles CSV and Parquet files, which aligns with the PR's intent. Ensure the schema includes correct file paths and types for this logic to work as expected.
Reason this comment was not posted:
Confidence changes required: 33%
The _load_from_source method now includes logic to handle CSV and Parquet files directly, which aligns with the PR's intent to support these formats. The method constructs the file path and calls _read_csv_or_parquet. This is a straightforward and efficient way to handle these file types. The logic is clear and should work as intended, provided the schema is correctly set up with the necessary file paths and types.
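As a rough sketch of the path construction this comment refers to: the schema layout (`source`, `type`, `path`) and the helper name `resolve_source_file` are hypothetical and chosen only for illustration; the real schema keys may differ.

```python
import os


def resolve_source_file(schema: dict, dataset_dir: str) -> tuple[str, str]:
    """Return (file path, format) for a file-based source in the schema."""
    source = schema["source"]          # hypothetical key layout
    source_type = source["type"]
    if source_type not in ("csv", "parquet"):
        raise ValueError(f"Not a file-based source type: {source_type}")
    # Join the dataset directory with the path declared in the schema.
    return os.path.join(dataset_dir, source["path"]), source_type
```

The returned pair would then be handed to the CSV/Parquet reader, which is why the schema needs both a valid path and a matching type.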

3. pandasai/data_loader/loader.py:38

Draft comment:
The load method now uses get with a default value for table, avoiding KeyErrors. This change is applied consistently in the method.
Reason this comment was not posted:
Confidence changes required: 33%
The load method in loader.py has been updated to handle cases where the table key might not be present in the schema. It uses the get method with a default value, which is a good practice to avoid KeyErrors. This change is applied in two places, ensuring consistency in how the table name is derived.
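A small sketch of the difference between indexing and `get` for an optional key. Where the `table` key lives in the schema and what the fallback value is are assumptions here; the helper name is hypothetical.

```python
def resolve_table_name(schema: dict, fallback: str) -> str:
    """Return the configured table name, or a fallback when it is absent."""
    # schema["source"]["table"] would raise KeyError for file-based sources
    # that declare no table; .get() returns the fallback instead.
    source = schema.get("source", {})
    return source.get("table", fallback)
```

For example, a CSV source with no `table` entry resolves to the fallback rather than raising.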

4. pandasai/data_loader/loader.py:95

Draft comment:
The _is_cache_valid method now safely retrieves update_frequency using get, preventing KeyErrors if the key is missing.
Reason this comment was not posted:
Confidence changes required: 33%
The _is_cache_valid method now uses get to retrieve update_frequency, which is a safer approach to avoid KeyErrors if the key is missing. This change ensures that the method can handle cases where update_frequency is not specified in the schema.
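One way such a check could look, as a sketch under stated assumptions: the frequency names, the mapping to time spans, and the choice to treat a missing `update_frequency` as "cache not valid" are all illustrative and not confirmed by the PR.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical mapping from schema values to a maximum cache age.
FREQUENCY_TO_MAX_AGE = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def is_cache_valid(schema: dict, cached_at: datetime, now: datetime) -> bool:
    """Return True if the cache is younger than the schema's update frequency."""
    # .get() returns None instead of raising KeyError when the key is missing.
    frequency: Optional[str] = schema.get("update_frequency")
    if frequency is None:
        return False  # assumption: no frequency means always reload
    max_age = FREQUENCY_TO_MAX_AGE.get(frequency)
    if max_age is None:
        return False  # unknown frequency value: treat the cache as stale
    return now - cached_at < max_age
```

Passing `now` explicitly keeps the function deterministic, which makes the missing-key path easy to test.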

Workflow ID: wflow_15zevemLVA1BTR2r

fix(dataloader): add support in dataloader for csv and parquet

9dfcea4

ArslanSaleem requested a review from gventuri December 6, 2024 17:20

dosubot bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) Dec 6, 2024

ellipsis-dev bot reviewed Dec 6, 2024

gventuri merged commit 091ea8e into release/v3 Dec 6, 2024
0 of 6 checks passed
