fix(dataloader): add support in dataloader for csv and parquet #1451

Merged: 1 commit, Dec 6, 2024

@ArslanSaleem (Collaborator) commented Dec 6, 2024

Important

Add CSV and Parquet format support to DatasetLoader, update schema handling, and enhance YAML template metadata.

  • Behavior:
    • Add support for csv and parquet formats in DatasetLoader in loader.py.
    • Remove _read_cache() and integrate its functionality into _read_csv_or_parquet().
    • Update _load_from_source() to handle csv and parquet source types.
  • Schema Handling:
    • Use get() for table and update_frequency in loader.py to handle missing keys.
    • Update _create_yml_template() in base.py to include destination metadata.
  • Misc:
    • Remove redundant metadata variable in _create_yml_template() in base.py.
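
The format dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `read_csv_or_parquet` and `load_from_source`, the `SUPPORTED_FORMATS` tuple, and the schema keys `source.type` and `source.path` are assumptions made for the example. Only `pandas.read_csv` and `pandas.read_parquet` are real library calls.

```python
from pathlib import Path

# Formats this sketch accepts; the PR names csv and parquet.
SUPPORTED_FORMATS = ("csv", "parquet")


def read_csv_or_parquet(file_path, file_format):
    """Hypothetical stand-in for the PR's _read_csv_or_parquet()."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file format: {file_format}")

    # Imported lazily so the format check above works without pandas installed.
    import pandas as pd

    if file_format == "parquet":
        # pandas.read_parquet needs a Parquet engine such as pyarrow.
        return pd.read_parquet(file_path)
    return pd.read_csv(file_path)


def load_from_source(schema, dataset_dir):
    """Resolve a file-based source from an assumed schema layout."""
    source = schema["source"]
    # Assumed keys: "type" holds the format, "path" the file name.
    file_path = Path(dataset_dir) / source["path"]
    return read_csv_or_parquet(file_path, source["type"])
```

Rejecting unknown formats before any file is touched keeps the failure close to the misconfigured schema rather than deep inside a reader.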

This description was created by Ellipsis for 9dfcea4. It will automatically update as commits are pushed.

@ArslanSaleem ArslanSaleem requested a review from gventuri December 6, 2024 17:20
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Dec 6, 2024
@ellipsis-dev ellipsis-dev bot left a comment

👍 Looks good to me! Reviewed everything up to 9dfcea4 in 1 minute and 3 seconds

More details
  • Looked at 145 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. pandasai/data_loader/loader.py:132
  • Draft comment:
    The _read_csv_or_parquet method is a good refactor to handle both CSV and Parquet formats. It replaces the previous _read_cache method, reducing code duplication. Ensure that the format parameter is always correctly derived from the schema to avoid runtime errors.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The PR introduces _read_csv_or_parquet to read both CSV and Parquet files. It replaces the removed _read_cache method in every place that method was used, which avoids code duplication. Its format parameter is derived from the schema and selects the file format to read; for an unsupported format the method raises a ValueError, so such files are never processed. The refactor is consistent with the intended functionality.
2. pandasai/data_loader/loader.py:156
  • Draft comment:
    The _load_from_source method now directly handles CSV and Parquet files, which aligns with the PR's intent. Ensure the schema includes correct file paths and types for this logic to work as expected.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The _load_from_source method now includes logic to handle CSV and Parquet files directly, which aligns with the PR's intent to support these formats. The method constructs the file path and calls _read_csv_or_parquet. This is a straightforward and efficient way to handle these file types. The logic is clear and should work as intended, provided the schema is correctly set up with the necessary file paths and types.
3. pandasai/data_loader/loader.py:38
  • Draft comment:
    The load method now uses get with a default value for table, avoiding KeyErrors. This change is applied consistently in the method.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The load method in loader.py has been updated to handle cases where the table key might not be present in the schema. It uses the get method with a default value, which is a good practice to avoid KeyErrors. This change is applied in two places, ensuring consistency in how the table name is derived.
4. pandasai/data_loader/loader.py:95
  • Draft comment:
    The _is_cache_valid method now safely retrieves update_frequency using get, preventing KeyErrors if the key is missing.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The _is_cache_valid method now uses get to retrieve update_frequency, which is a safer approach to avoid KeyErrors if the key is missing. This change ensures that the method can handle cases where update_frequency is not specified in the schema.
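
The `get()`-based lookups mentioned in items 3 and 4 might look something like the sketch below. It is an illustration under stated assumptions, not the code in loader.py: the helper names, the fallback to the schema's `name`, the `SECONDS_PER_FREQUENCY` table, and the choice to treat a missing `update_frequency` as a stale cache are all hypothetical.

```python
import time

# Hypothetical mapping from update_frequency values to a maximum cache age.
SECONDS_PER_FREQUENCY = {"hourly": 3600, "daily": 86400, "weekly": 604800}


def table_name(schema):
    """Return source.table if present, else fall back to the dataset name."""
    source = schema.get("source", {})
    # schema["source"]["table"] would raise KeyError for file-based sources.
    return source.get("table") or schema.get("name")


def is_cache_valid(schema, cache_mtime, now=None):
    """Check cache freshness without assuming update_frequency exists."""
    frequency = schema.get("update_frequency")
    max_age = SECONDS_PER_FREQUENCY.get(frequency)
    if max_age is None:
        # Assumption: a missing or unknown frequency means "reload".
        return False
    now = time.time() if now is None else now
    return (now - cache_mtime) < max_age
```

For example, a schema for a CSV source with no `table` key resolves to its `name`, and a schema without `update_frequency` is reported as needing a reload instead of raising a KeyError.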

Workflow ID: wflow_15zevemLVA1BTR2r


@gventuri gventuri merged commit 091ea8e into release/v3 Dec 6, 2024
0 of 6 checks passed
2 participants