
Version 1.2 #535

Merged · 19 commits · Sep 13, 2023
Conversation

gventuri (Collaborator) commented Sep 7, 2023

Summary by CodeRabbit

  • New Feature: Introduced SQLConnector, MySQLConnector, and PostgreSQLConnector classes for connecting to SQL databases (see the usage sketch after this list).
  • New Feature: Added YahooFinanceConnector class for retrieving stock data from Yahoo Finance.
  • Refactor: Updated SmartDataframe class to improve modularity and maintainability. Introduced a new SmartDataframeCore class that handles the loading and management of the dataframe.
  • Refactor: Enhanced CodeManager class with additional methods for better code execution and filter extraction.
  • Test: Expanded test coverage for SQLConnector, CodeManager, and SmartDataframe classes.
  • Chore: Updated logging and error handling across various classes for improved debugging and user experience.
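
For orientation, here is a minimal sketch of how the new connectors are meant to be plugged into a SmartDataframe. The constructor shape and config keys are inferred from the ConnectorConfig usage quoted later in this review and may differ from the released API:

    from pandasai import SmartDataframe
    from pandasai.connectors import PostgreSQLConnector

    # Hypothetical configuration; field names follow the ConnectorConfig
    # fields referenced in the review comments below.
    connector = PostgreSQLConnector(config={
        "host": "localhost",
        "port": 5432,
        "database": "mydb",
        "username": "user",
        "password": "password",
        "table": "loans",
        "where": [["loan_status", "=", "PAIDOFF"]],
    })

    df = SmartDataframe(connector)
    df.chat("How many loans are fully paid off?")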

gventuri and others added 13 commits August 7, 2023 13:57
* fix: setters were not overriding some properties in some cases

* fix: remove multiple df overwrites (#436)

* improve the _is_df_overwrite method to check all targets

* test: add test for multiple df overwrite

---------

Co-authored-by: Gabriele Venturi <[email protected]>

* fix: prevent to_* pandas methods from running

* Release v0.8.3

* fix: environment for executing code (#419)

* fix: environment for executing code

* (fix): set the `matplotlib.pyplot` package under the "plt" alias in
  `_get_environment()`

* feat: Try to import a package when NameError (#430)

* (refactor): split calling of `exec()` for code and handling of errors
  into separate methods;
* (feat): try to import `exc.name` and add it to the context
  when `exc` is an instance of `NameError`;
* (tests): add tests for new methods;

* fix: add logging for `NameError` handling (#430)

* (fix): log a general exception raised when trying to fix a
  `NameError`

* fix: compatibility with python 3.9 (#430)

* (fix): update the `handle_error()` method; parse `exc.args[0]` for
  `NameError` cases where the exception object has no "name"
  attribute.

* fix: update for `handle_error()`(#430)

* (fix): add a check that the library to be imported is present in the
  whitelist of third-party packages;
* (fix): remove "plt" when forming environment for code to be run;
* (fix): add raising an exception that was caught during fixing code
  with the error correction framework;
* (tests): add test-case for retrying to fix code with the error
  correction framework unsuccessfully;
* (tests): fix issue with inappropriate instantiation of numpy's `array`
  object within several test methods;

* Release v0.8.4

* Release v1.0b1

* Release v1.0

* fix: convert polars df to pandas and back to polars after executing the code

* docs: update docs for v1

* Release v1.0.1

* fix: remove shortcuts from backwards-compatible version

* Release v1.0.2

* refactor: minor updates (#446)

* (refactor): inappropriate type hint for `_has_run` in base class for
  middlewares;
* (style): remove built-ins name shadowing in `__init__()` method of
  `Prompt` class;

* docs: update LLMs documentation code to match the last version (#445)

I have updated OpenAI, Starcoder, Falcon, and AzureOpenAI.

But I haven't edited GooglePalm, because the latest version still uses api_key rather than api_token.

Co-authored-by: Gabriele Venturi <[email protected]>

* fix: do not override setters and getters with __getattr__

* Release v1.0.3

* fix: huggingface models (#454)

* Release v1.0.4

* fix: update code manager to convert prompt id to string (#462)

* test: increase test coverage (#460)

* Update LLMs documentation code to match the last version

I have updated OpenAI, Starcoder, Falcon, and AzureOpenAI.

But I haven't edited GooglePalm, because the latest version still uses api_key rather than api_token.

* Create test_google_vertexai.py

Created test cases for the GoogleVertexai class. I tested the initialization function with both the default model and a custom one,
then built test cases validating the class with and without a model provided.

* Update test_smartdataframe.py

Created test cases for _load_df() and _import_from_file() functions

* test: remove verbose test

* test: unskip skipped test

* test: test multiple invalid file formats

* test: fix tests

---------

Co-authored-by: Gabriele Venturi <[email protected]>

* fix: recursive error on df last_result property (#468) (#469)

* fix: recursive error on df last_result property (#468)

* fix: change last_result type to str

* test: add tests

---------

Co-authored-by: Gabriele Venturi <[email protected]>

* Release v1.0.5

* fix(llms): AzureOpenAI call method signature (#476)

* fix: added last_error property reference in smart_datalake (#475)

* fix: temporarily disable multi-turn capability, which was causing hallucinations

* Release v1.0.6

* fix: create required folders as the datalake is initialized

* Release v1.0.7

* fix: create required folders in the project folder

* Release v1.0.8

* chore: fix typing in the logger logs (#464)

* docs: add video explanation for SmartDataframe

* feat: extracting filters (#451)

* (feat): add `_extract_filters()`, `_extract_comparisons()`,
  `_tokenize_operand()` methods to `CodeManager`;
* (tests): add basic tests;

* test: extracting filters (#451)

* (tests): rename `test_extract_filters()` to
  `test_extract_filters_col_index()`, update assertions according to
  task requirements;
* (tests): add `test_extract_filters_col_index_multiple_df()` for
  multiple dfs;

* fix: update according to requirements (#451)

* (feat): add module `node_visitor.py`, add `AssignmentVisitor` in
  `node_visitors.py`;
* (fix): `_extract_filters()` now returns dictionary;
* (feat): add grouping by df number when forming;
* (tests): update naming in test cases;
* (tests): add test method with non default (not `df`) variable name;

* feat: extracting filters (#451)

* (feat): add `CallVisitor` to `node_visitors.py` module;
* (feat): add extracting comparisons from `filter()` calls made as an
  attribute of an object named `pl` or `polars`;
* (tests): test methods for extracting polars-like filters;

* refactor: extracting filters (#451)

* (refactor): add handling exceptions in `_extract_filters()` method;

* docs: extracting filters (#451)

* (docs): add documentation for implemented functions;
* (style): type hinting;

* test: merge similar tests

---------

Co-authored-by: Gabriele Venturi <[email protected]>
Co-authored-by: Aymane Hachcham <[email protected]>
Co-authored-by: Omar Elsherif <[email protected]>
Co-authored-by: Hanchung Lee <[email protected]>
Co-authored-by: Sanchit Bhavsar <[email protected]>
Co-authored-by: Massimiliano Pronesti <[email protected]>
coderabbitai bot (Contributor) commented Sep 7, 2023

Walkthrough

The changes introduce a new architecture for managing dataframes and connectors, improve code execution logic, enhance error handling, and add comprehensive unit tests. The updates also include better logging, caching, and memory management. The modifications aim to increase modularity, maintainability, and test coverage.

Changes

File(s) Summary
pandasai/connectors/... Introduced various connectors (SQL, MySQL, PostgreSQL, Yahoo Finance) for connecting to databases and external APIs.
pandasai/helpers/code_manager.py Enhanced code execution logic with methods for extracting filters, validating imports, and tracking executed code.
pandasai/smart_dataframe/... Implemented a new architecture for managing dataframes with a core class that handles loading and management.
pandasai/smart_datalake/... Improved logging, caching, and memory management. Updated attribute names for clarity.
tests/... Added comprehensive unit tests for the new classes and methods. Updated existing tests to reflect changes in the codebase.

"In the land of code where the shadows lie,
A rabbit hopped forth under the cloudless sky.
With a carrot of wisdom and a lettuce leaf of wit,
He weaved magic into every commit. 🥕🍃💻
Through fields of variables and streams of data flow,
He danced with functions, high and low.
In the realm of bugs, he was the knight,
Turning darkness of errors into light. 🐞🛡️✨
So here's to the rabbit, our hero unsung,
Celebrating the code that he has spun!" 🐇🎉🏆


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • Mention @coderabbitai in any review comment for bot assistance.
  • Note: Review comments are made on code diffs or files, not on the PR overview.

Pause Incremental Reviews

  • Insert @coderabbitai: ignore in the PR description to halt the bot's ongoing reviews. Remove the line to resume.

codecov-commenter commented Sep 7, 2023

Codecov Report

Merging #535 (f5c4be0) into main (9e6a71a) will decrease coverage by 3.75%.
The diff coverage is 70.45%.


@@            Coverage Diff             @@
##             main     #535      +/-   ##
==========================================
- Coverage   87.14%   83.39%   -3.75%     
==========================================
  Files          47       52       +5     
  Lines        2030     2584     +554     
==========================================
+ Hits         1769     2155     +386     
- Misses        261      429     +168     
Files Changed Coverage Δ
pandasai/prompts/correct_error_prompt.py 100.00% <ø> (ø)
pandasai/smart_dataframe/abstract_df.py 52.29% <7.14%> (+0.86%) ⬆️
pandasai/prompts/base.py 93.33% <50.00%> (ø)
pandasai/connectors/sql.py 59.63% <59.63%> (ø)
pandasai/smart_dataframe/__init__.py 80.64% <73.06%> (-12.45%) ⬇️
pandasai/helpers/df_config_manager.py 86.66% <75.00%> (-3.75%) ⬇️
pandasai/connectors/yahoo_finance.py 79.10% <79.10%> (ø)
pandasai/connectors/base.py 86.00% <86.00%> (ø)
pandasai/smart_datalake/__init__.py 89.32% <86.36%> (-0.05%) ⬇️
pandasai/helpers/code_manager.py 90.76% <90.32%> (-0.33%) ⬇️
... and 4 more

... and 2 files with indirect coverage changes


coderabbitai bot (Contributor) left a comment

Review Status

Actionable comments generated: 34

Commits: files that changed from the base of the PR, between commits 9e6a71a and 9955000.
Files ignored due to filter (2)
  • poetry.lock
  • pyproject.toml
Files selected for processing (23)
  • examples/from_sql.py (1 hunks)
  • pandasai/connectors/__init__.py (1 hunks)
  • pandasai/connectors/base.py (1 hunks)
  • pandasai/connectors/sql.py (1 hunks)
  • pandasai/connectors/yahoo_finance.py (1 hunks)
  • pandasai/helpers/__init__.py (2 hunks)
  • pandasai/helpers/code_manager.py (9 hunks)
  • pandasai/helpers/df_config_manager.py (7 hunks)
  • pandasai/helpers/df_info.py (1 hunks)
  • pandasai/helpers/node_visitors.py (1 hunks)
  • pandasai/prompts/base.py (1 hunks)
  • pandasai/prompts/correct_error_prompt.py (3 hunks)
  • pandasai/schemas/df_config.py (1 hunks)
  • pandasai/smart_dataframe/__init__.py (8 hunks)
  • pandasai/smart_dataframe/abstract_df.py (1 hunks)
  • pandasai/smart_datalake/__init__.py (17 hunks)
  • tests/connectors/test_base.py (1 hunks)
  • tests/connectors/test_sql.py (1 hunks)
  • tests/prompts/test_correct_error_prompt.py (2 hunks)
  • tests/test_codemanager.py (3 hunks)
  • tests/test_pandasai.py (1 hunks)
  • tests/test_smartdataframe.py (10 hunks)
  • tests/test_smartdatalake.py (4 hunks)
Files skipped from review due to trivial changes (6)
  • pandasai/connectors/__init__.py
  • pandasai/helpers/df_info.py
  • pandasai/prompts/base.py
  • pandasai/prompts/correct_error_prompt.py
  • pandasai/schemas/df_config.py
  • tests/test_smartdataframe.py
Additional comments (Suppressed): 50
examples/from_sql.py (1)
  • 43-43: The question "How many people from the United states?" seems unrelated to the data being queried (loans and payments). Ensure that the chat query aligns with the data context.
pandasai/helpers/__init__.py (2)
  • 5-11: The import for Logger has been added, but the corresponding import for get_openai_callback seems to be missing in this hunk. Please ensure that it's imported correctly.
+ from .logger import Logger
- from .get_openai_callback import get_openai_callback
  • 16-20: The addition of Logger to the __all__ list is correct and aligns with the new import.
pandasai/helpers/node_visitors.py (1)
  • 1-19: The AssignmentVisitor and CallVisitor classes are well implemented. They both inherit from ast.NodeVisitor and override the appropriate visit methods to collect assignment and call nodes respectively. The use of self.generic_visit(node) ensures that all child nodes are visited.
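
For readers unfamiliar with the pattern, a minimal sketch of such an ast.NodeVisitor subclass (attribute names here are illustrative, not necessarily those in node_visitors.py):

    import ast

    class AssignmentVisitor(ast.NodeVisitor):
        """Collects every assignment node encountered while walking a module's AST."""

        def __init__(self):
            self.assignment_nodes = []

        def visit_Assign(self, node):
            self.assignment_nodes.append(node)
            self.generic_visit(node)  # continue into child nodes

    tree = ast.parse("df = dfs[0]\nresult = df[df['age'] > 30]")
    visitor = AssignmentVisitor()
    visitor.visit(tree)
    print(len(visitor.assignment_nodes))  # 2
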
pandasai/connectors/sql.py (4)
  • 41-48: The password is directly embedded into the connection string, which could potentially expose sensitive information in logs or error messages. Consider using URL encoding for the password or use SQLAlchemy's create_engine method that accepts the username and password as separate parameters.

  • 165-176: The cache invalidation strategy based on time might not work correctly if the data in the database changes frequently. Consider using a more robust invalidation strategy like listening for changes in the database or using a hash of the data.

  • 296-313: The hash of the columns is calculated by concatenating the column names and conditions. This can lead to collisions if different columns have similar names. Consider using a more robust hashing algorithm that takes into account the order and number of columns.

  • 345-354: Environment variables are used as fallbacks for the configuration parameters. This can lead to unexpected behavior if the environment variables are set but the corresponding parameters are not included in the configuration. Make sure to document this behavior clearly.

pandasai/connectors/base.py (1)
  • 63-70: The method set_additional_filters accepts a dictionary as filters but the class variable _additional_filters is a list of lists. There seems to be a mismatch in types here. Please verify if this is intended or if there's an error.
pandasai/helpers/df_config_manager.py (9)
  • 36-40: The change from self.name to self._sdf.table_name is noted. Ensure that all instances of self.name have been replaced with self._sdf.table_name throughout the codebase.

  • 47-51: The change from self.name to self._sdf.table_name in the _check_for_duplicates method is noted. Make sure that this change does not affect other parts of the codebase where this method is called.

  • 58-73: The addition of handling connectors and checking for .parquet files and Google Sheets links is a good improvement. However, ensure that the new checks are compatible with the rest of the codebase and that they do not introduce any unexpected behavior.

  • 79-88: The change from self._sdf.original to self._sdf.dataframe in the _get_import_path method is noted. Make sure that this change does not affect other parts of the codebase where this method is called.

  • 92-101: The addition of the name argument in the save method is a good improvement. However, ensure that this change is reflected wherever this method is called in the codebase.

  • 118-124: The changes in the dictionary keys from self.description and self.head_csv to self._sdf.table_description and self._sdf.head_csv respectively are noted. Ensure that these changes are reflected wherever this dictionary is used in the codebase.

  • 156-162: The change from self._sdf.name to self._sdf.table_name in the name property is noted. Make sure that this change does not affect other parts of the codebase where this property is accessed.

  • 164-166: The change from self._sdf.description to self._sdf.table_description in the description property is noted. Make sure that this change does not affect other parts of the codebase where this property is accessed.

  • 168-170: The change from self._sdf.original_import to self._sdf._original_import in the original_import property is noted. Make sure that this change does not affect other parts of the codebase where this property is accessed.

pandasai/helpers/code_manager.py (9)
  • 1-11: The import of defaultdict from the collections module and the AssignmentVisitor, CallVisitor classes from the .node_visitors module are new additions. Ensure that these modules exist and are accessible in the project's structure.

  • 15-21: The addition of Generator to the imported types from the typing module is a new change. Make sure that this type is used appropriately in the code.

  • 28-45: The _ast_comparatos_map dictionary has been added as a class attribute. This dictionary maps AST comparison nodes to their string equivalents, which will be useful for parsing comparisons from the code. The _current_code_executed attribute has also been added to keep track of the current code being executed.

  • 158-181: A new method _required_dfs has been introduced. This method determines which dataframes are required to execute a given piece of code by checking if the dataframe index is referenced in the code. This can help avoid unnecessary dataframe loading if the dataframe isn't needed for the code execution.

  • 200-206: The _current_code_executed attribute is now set to the code being executed before the middlewares are applied. This allows the current code being executed to be tracked throughout the execution process.

  • 225-233: The _required_dfs method is now called before executing the code to determine which dataframes are needed. The resulting list of required dataframes is then passed to the _get_samples method to get sample dataframes for execution. This helps optimize the code execution by only loading the necessary dataframes.

  • 242-289: The _get_samples and _get_originals methods have been introduced. These methods return samples and original versions of the dataframes respectively. In _get_originals, if a dataframe has a connector, additional filters extracted from the current code are set on the connector and the connector is loaded. This ensures that only the necessary data is loaded from the connector based on the filters.

  • 295-302: The environment now includes the pandas library (pd) and any additional dependencies that have been imported. The dfs variable is no longer included in the environment here, as it is now handled separately with the _get_samples method.

  • 467-688: Several new methods and a staticmethod have been added to handle the extraction of filters from the code. These include _get_nearest_func_call, _tokenize_operand, _get_df_id_by_nearest_assignment, _extract_comparisons, and _extract_filters. These methods work together to parse the code, identify filter operations on the dataframes, and extract these filters into a structured format. This allows for more efficient filtering of the dataframes based on the code being executed.
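
To make the extraction concrete, here is a hypothetical input and the kind of structure these methods could produce; the exact dictionary shape is an assumption based on the description above (comparisons grouped per dataframe):

    code = """
    df = dfs[0]
    filtered = df[(df['loan_status'] == 'PAIDOFF') & (df['Gender'] == 'male')]
    """

    # A plausible (illustrative) result of extracting filters from `code`:
    # {"dfs[0]": [("loan_status", "=", "PAIDOFF"), ("Gender", "=", "male")]}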

pandasai/smart_dataframe/__init__.py (3)
  • 24-28: The functools.cached_property import is not used in the new hunks. Please ensure that it's necessary and remove if it's not.

  • 114-115: The _import_from_file method should handle different file formats such as CSV, Excel, Parquet etc. The current implementation does not specify what file formats are supported. Please ensure that all necessary file formats are handled.

  • 443-444: The validate method is mentioned but its implementation is not provided in the new hunks. Ensure that it's implemented correctly and handles potential errors.

tests/connectors/test_sql.py (1)
  • 19-29: Hardcoding credentials in the code is a bad practice and poses a security risk. It's better to use environment variables or secure vaults to store sensitive data like usernames, passwords, hostnames etc. Please ensure that this is only for testing purposes and no real credentials are used here.
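
A common mitigation, sketched here with hypothetical environment variable names, is to read test credentials from the environment with harmless defaults:

    import os

    # Hypothetical variable names; real secrets should come from the
    # CI secret store, never from the repository.
    test_config = {
        "host": os.environ.get("PANDASAI_TEST_DB_HOST", "localhost"),
        "port": int(os.environ.get("PANDASAI_TEST_DB_PORT", "5432")),
        "username": os.environ.get("PANDASAI_TEST_DB_USER", "test_user"),
        "password": os.environ.get("PANDASAI_TEST_DB_PASSWORD", ""),
        "database": os.environ.get("PANDASAI_TEST_DB_NAME", "test_db"),
    }
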
tests/test_pandasai.py (1)
  • 64-69: Ensure that all calls to the run method throughout the codebase have been updated to match the new input argument from None to 0.
pandasai/smart_datalake/__init__.py (12)
  • 76-83: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 92-96: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 129-133: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 187-192: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 277-281: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 300-309: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 343-345: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 354-354: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 432-432: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 400-403: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 466-467: The logger attribute has been renamed from _logger to logger. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 470-475: The logger attribute has been renamed from _logger to logger. This change also introduces a setter for the logger attribute. Ensure that all references to this attribute throughout the codebase have been updated to match the new name, and that the setter is used appropriately where necessary.

tests/test_codemanager.py (2)
  • 71-72: The reference to the CodeManager instance has been updated from smart_dataframe._dl._code_manager to smart_dataframe.lake._code_manager. Ensure that this change is consistent with the updates made in the SmartDataframe class.

  • 174-179: The assertion assert smart_dataframe.equals(code_manager._get_environment()["dfs"][0]) has been removed. If this assertion is no longer necessary due to changes in the codebase, then this is fine. However, if it was removed by mistake, it should be added back to ensure the correctness of the test.

tests/test_smartdatalake.py (4)
  • 73-74: The attribute name has been changed from datalake to lake. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 118-124: The attribute name has been changed from _dl to lake. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.

  • 134-142: The method name has been changed from _get_head_csv to _get_sample_head. Also, the return type of the method has been changed from a string to a DataFrame. Ensure that all calls to this method throughout the codebase have been updated to match the new signature and handle the new return type correctly.

  • 149-163: The format of the last_prompt string has been changed. This might affect any code that relies on the specific format of this string. Verify that these changes do not break any existing functionality.

coderabbitai bot (Contributor) left a comment

Review Status

Actionable comments generated: 9

Commits: files that changed from the base of the PR, between commits 9955000 and 0223cdf.
Files selected for processing (4)
  • pandasai/connectors/sql.py (1 hunks)
  • pandasai/helpers/code_manager.py (9 hunks)
  • tests/connectors/test_sql.py (1 hunks)
  • tests/test_codemanager.py (3 hunks)
Additional comments (Suppressed): 13
tests/connectors/test_sql.py (1)
  • 51-58: The _build_query method seems to be building a SQL query string directly from input parameters. This can lead to SQL injection attacks if the inputs are not properly sanitized. Consider using parameterized queries or ORM tools to prevent this.
pandasai/connectors/sql.py (1)
  • 57-57: The __del__ method is not a reliable way to close database connections because it's not guaranteed to be called when the object is garbage collected. It would be better to use a context manager or explicitly close the connection when you're done with it.
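
One way to make cleanup deterministic, as the comment suggests, is a context-manager protocol; a sketch assuming the connector keeps an SQLAlchemy connection:

    from contextlib import AbstractContextManager

    class ManagedConnection(AbstractContextManager):
        """Illustrative wrapper: closes an SQLAlchemy connection deterministically."""

        def __init__(self, engine):
            self._connection = engine.connect()

        def __exit__(self, exc_type, exc_value, traceback):
            # Close explicitly instead of relying on __del__
            self._connection.close()
            return False  # never swallow exceptions
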
tests/test_codemanager.py (2)
  • 71-72: The reference to the CodeManager instance has been updated from smart_dataframe._dl._code_manager to smart_dataframe.lake._code_manager. Ensure that the lake attribute of SmartDataframe correctly points to an instance of SmartDatalake, which contains the _code_manager attribute.

  • 174-179: The test case is checking for the presence of "pd", "plt", and "np" in the environment. However, it seems like the assertion for the dataframe equality with the environment has been removed. If this was intentional and the dataframe check is no longer relevant, then this change is fine. Otherwise, consider adding it back.

+         assert smart_dataframe.equals(code_manager._get_environment()["dfs"][0])
          assert "pd" in code_manager._get_environment()
          assert "plt" in code_manager._get_environment()
          assert "np" in code_manager._get_environment()
pandasai/helpers/code_manager.py (9)
  • 1-11: The import of defaultdict from the collections module and the import of AssignmentVisitor and CallVisitor from .node_visitors are new. Ensure that these modules exist and are accessible from this location.

  • 15-21: The addition of Generator to the imported types from the typing module is new. Ensure that it's used appropriately in the code.

  • 28-45: The addition of _ast_comparatos_map and _current_code_executed attributes are new. Make sure they are used correctly throughout the class.

  • 158-180: The new method _required_dfs is introduced to determine the required DataFrames for executing the code. This is a good approach to avoid unnecessary execution of connectors when not needed.

  • 200-206: The assignment of self._current_code_executed = code is new. Ensure that this attribute is used correctly in the subsequent code.

  • 225-235: The introduction of the _required_dfs method in the execute_code method is a good practice as it helps to avoid unnecessary execution of connectors when not needed.

  • 242-263: The new methods _get_samples and _get_originals are introduced to get samples and original dataframes respectively. These methods seem to be well implemented.

  • 295-302: The removal of "dfs" from the environment dictionary might affect the code execution if "dfs" is referenced in the executed code. Please verify this change.

  • 467-688: The addition of several utility methods and properties such as _get_nearest_func_call, _tokenize_operand, _get_df_id_by_nearest_assignment, _extract_comparisons, _extract_filters and the property middlewares are new. These additions seem to enhance the functionality of the CodeManager class by providing more granular control over code execution, error handling, and AST node processing. However, ensure that these new methods and properties are used correctly in the class.

coderabbitai bot (Contributor) left a comment

Review Status

Actionable comments generated: 16

Commits: files that changed from the base of the PR, between commits 0223cdf and f5c4be0.
Files selected for processing (4)
  • pandasai/connectors/sql.py (1 hunks)
  • pandasai/smart_dataframe/__init__.py (8 hunks)
  • tests/connectors/test_yahoo_finance.py (1 hunks)
  • tests/test_codemanager.py (3 hunks)
Files skipped from review due to trivial changes (1)
  • tests/connectors/test_yahoo_finance.py
Additional comments (Suppressed): 6
tests/test_codemanager.py (3)
  • 69-75: The fixture code_manager has been updated to reflect changes in the structure of the SmartDataframe class. Ensure that all tests using this fixture have been updated accordingly.

  • 174-179: The assertion assert smart_dataframe.equals(code_manager._get_environment()["dfs"][0]) has been removed. If this is intentional and the test case still covers the intended functionality, then it's fine. Otherwise, consider adding it back or replacing it with an equivalent assertion.

  • 289-504: These new tests are designed to verify the extraction of filters from code snippets. They seem comprehensive, covering different scenarios including multiple dataframes and non-default dataframe names. However, ensure that these tests cover all possible edge cases and that they fail when expected (i.e., they aren't false positives).

pandasai/connectors/sql.py (1)
  • 76-78: This regular expression only allows alphanumeric characters and underscores in column names. This might be too restrictive for some databases that allow other special characters in column names. Please verify this against your target SQL dialects.
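
For reference, the kind of strict identifier check the comment refers to looks roughly like this (the actual pattern in sql.py may differ, e.g. to admit quoted identifiers):

    import re

    def validate_column_name(column_name: str) -> None:
        # Rejects anything but ASCII letters, digits, and underscores.
        if not re.match(r"^[A-Za-z0-9_]+$", column_name):
            raise ValueError(f"Invalid column name: {column_name}")
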
pandasai/smart_dataframe/__init__.py (2)
  • 25-28: The functools.cached_property import is not used in the new hunks. Please ensure that it's necessary and remove if it's not.

  • 355-364: The load_connector method is delegating the call to self._core.load_connector, but there's no check to ensure that self._core is not None. This could lead to a NoneType error if self._core is None. Consider adding a check that self._core is not None before delegating.

Comment on lines +22 to +51
    _engine = None
    _connection: int = None
    _rows_count: int = None
    _columns_count: int = None
    _cache_interval: int = 600  # 10 minutes

    def __init__(self, config: Union[ConnectorConfig, dict], cache_interval: int = 600):
        """
        Initialize the SQL connector with the given configuration.

        Args:
            config (ConnectorConfig): The configuration for the SQL connector.
        """
        config = ConnectorConfig(**config)
        super().__init__(config)

        if config.dialect is None:
            raise Exception("SQL dialect must be specified")

        if config.driver:
            self._engine = create_engine(
                f"{config.dialect}+{config.driver}://{config.username}:{config.password}"
                f"@{config.host}:{str(config.port)}/{config.database}"
            )
        else:
            self._engine = create_engine(
                f"{config.dialect}://{config.username}:{config.password}@{config.host}"
                f":{str(config.port)}/{config.database}"
            )
        self._connection = self._engine.connect()

The password is being included in the connection string which could lead to potential security issues if logs are not properly sanitized. Consider using a more secure method of passing credentials to the create_engine function.

-                f"{config.dialect}+{config.driver}://{config.username}:{config.password}"
-                f"@{config.host}:{str(config.port)}/{config.database}"
+                f"{config.dialect}+{config.driver}://",
+                username=config.username,
+                password=config.password,
+                host=config.host,
+                port=str(config.port),
+                database=config.database
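
Note that the suggested diff above is only schematic: create_engine does not accept username/password keyword arguments directly. A working variant, sketched here using SQLAlchemy's URL.create (available since SQLAlchemy 1.4), keeps the password out of the string entirely:

    from sqlalchemy import create_engine
    from sqlalchemy.engine import URL

    url = URL.create(
        drivername=f"{config.dialect}+{config.driver}",
        username=config.username,
        password=config.password,  # escaped by SQLAlchemy, not embedded raw
        host=config.host,
        port=config.port,
        database=config.database,
    )
    engine = create_engine(url)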

Comment on lines +80 to +116
    def _build_query(self, limit=None, order=None):
        base_query = select("*").select_from(text(self._config.table))
        valid_operators = ["=", ">", "<", ">=", "<=", "LIKE", "!=", "IN", "NOT IN"]

        if self._config.where or self._additional_filters:
            # conditions is the list of where clauses + additional filters
            conditions = []
            if self._config.where:
                conditions += self._config.where
            if self._additional_filters:
                conditions += self._additional_filters

            query_params = {}
            condition_strings = []

            for i, condition in enumerate(conditions):
                if len(condition) == 3:
                    column_name, operator, value = condition
                    if operator in valid_operators:
                        self._validate_column_name(column_name)

                        condition_strings.append(f"{column_name} {operator} :value_{i}")
                        query_params[f"value_{i}"] = value

            if condition_strings:
                where_clause = " AND ".join(condition_strings)
                base_query = base_query.where(
                    text(where_clause).bindparams(**query_params)
                )

        if order:
            base_query = base_query.order_by(asc(text(order)))

        if limit:
            base_query = base_query.limit(limit)

        return base_query

The _build_query method constructs a SQL query by concatenating strings, which can potentially lead to SQL injection attacks if user-supplied input is not properly sanitized. Although you're validating column names with a regex, it's still safer to use parameterized queries or ORM tools to construct your SQL queries.
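
A fully structured alternative, sketched under the assumption that an engine and the (column_name, value) pair are in scope at this point, is to reflect the table so that SQLAlchemy quotes all identifiers itself:

    from sqlalchemy import MetaData, Table, select

    metadata = MetaData()
    # Reflection lets SQLAlchemy quote table and column identifiers,
    # removing the need for manual column-name validation.
    table = Table(config.table, metadata, autoload_with=engine)
    query = select(table).where(table.c[column_name] == value)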

Comment on lines +170 to +181
        filename = self._get_cache_path(
            include_additional_filters=include_additional_filters
        )
        if not os.path.exists(filename):
            return False

        # If the file is older than the cache interval, delete it
        if os.path.getmtime(filename) < time.time() - self._cache_interval:
            if self.logger:
                self.logger.log(f"Deleting expired cached data from {filename}")
            os.remove(filename)
            return False

The cache invalidation strategy based on time interval might not work well if the underlying data changes frequently within the cache interval. Consider using a more sophisticated invalidation strategy, such as listening for changes in the underlying data.

Comment on lines +253 to +261
        # Run a SQL query to get the number of rows
        query = sql.text(
            "SELECT COUNT(*) FROM information_schema.columns "
            "WHERE table_name = :table_name"
        ).bindparams(table_name=self._config.table)

        # Return the number of rows
        self._rows_count = self._connection.execute(query).fetchone()[0]
        return self._rows_count

The query used to get the number of rows in the table seems incorrect. It's querying the information_schema.columns table, which will return the number of columns, not rows. The correct query should be something like SELECT COUNT(*) FROM {table_name}.

-        query = sql.text(
-            "SELECT COUNT(*) FROM information_schema.columns "
-            "WHERE table_name = :table_name"
-        ).bindparams(table_name=self._config.table)
+        query = sql.text(f"SELECT COUNT(*) FROM {self._config.table}")

Comment on lines +282 to +290
        # Run a SQL query to get the number of columns
        query = sql.text(
            "SELECT COUNT(*) FROM information_schema.columns "
            f"WHERE table_name = '{self._config.table}'"
        )

        # Return the number of columns
        self._columns_count = self._connection.execute(query).fetchone()[0]
        return self._columns_count

Similar to the previous comment, the query used to get the number of columns in the table seems incorrect. It's querying the information_schema.columns table without specifying the database name, which will return the number of columns across all databases, not just the specified one. The correct query should be something like SELECT COUNT(*) FROM information_schema.columns WHERE table_schema = {database_name} AND table_name = {table_name}.

-        query = sql.text(
-            "SELECT COUNT(*) FROM information_schema.columns "
-            f"WHERE table_name = '{self._config.table}'"
-        )
+        query = sql.text(
+            "SELECT COUNT(*) FROM information_schema.columns "
+            f"WHERE table_schema = '{self._config.database}' AND table_name = '{self._config.table}'"
+        )

Comment on lines +211 to +217
    def connector(self):
        return self._connector

    @connector.setter
    def connector(self, connector: BaseConnector):
        self._connector = connector

The connector property setter is not validating the input before assigning it to self._connector. This could lead to unexpected behavior if an invalid connector is passed. Consider adding validation logic to ensure that connector is a valid instance of BaseConnector before assigning it to self._connector.
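
A minimal version of the validation the comment asks for might look like this (illustrative; the error wording is not from the codebase):

    @connector.setter
    def connector(self, connector: BaseConnector):
        # Illustrative type check before assignment.
        if not isinstance(connector, BaseConnector):
            raise TypeError(
                f"Expected a BaseConnector instance, got {type(connector).__name__}"
            )
        self._connector = connector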

Comment on lines 313 to 317
            *middlewares: A list of middlewares

        """
-       self._dl.add_middlewares(*middlewares)
+       self.lake.add_middlewares(*middlewares)

The add_middlewares method is delegating the call to self.lake.add_middlewares, but there's no check to ensure that self.lake is not None. This could lead to a NoneType error if self.lake is None. Consider adding a check to ensure that self.lake is not None before calling its add_middlewares method.
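
A simple guard of the kind suggested, applicable to add_middlewares, chat, and the other delegating methods alike (a sketch, not the project's code):

    def add_middlewares(self, *middlewares):
        if self.lake is None:
            raise RuntimeError(
                "SmartDatalake is not initialized; cannot add middlewares"
            )
        self.lake.add_middlewares(*middlewares)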

Comment on lines 325 to +328
Raises:
ValueError: If the query is empty
"""
return self._dl.chat(query)
return self.lake.chat(query)

The chat method is delegating the call to self.lake.chat, but there's no check to ensure that self.lake is not None. This could lead to a NoneType error if self.lake is None. Consider adding a check to ensure that self.lake is not None before calling its chat method.

Comment on lines 330 to 342
    def column_hash(self) -> str:
        """
        Get the hash of the columns of the dataframe.

        Returns:
            str: Hash of the columns of the dataframe
        """
        if not self._core._df_loaded and self.connector:
            return self.connector.column_hash

        columns_str = "".join(self.dataframe.columns)
        hash_object = hashlib.sha256(columns_str.encode())
        return hash_object.hexdigest()

The column_hash method accesses self.connector.column_hash only when self.connector is truthy, so an AttributeError is unlikely there; however, consider whether the fallback to self.dataframe.columns is safe when no dataframe is loaded and no connector is set.

Comment on lines 344 to 354
    def save(self, name: str = None):
        """
        Saves the dataframe configuration to be used for later

        Args:
            name (str, optional): Name of the dataframe configuration. Defaults to None.
        """
        config_manager = DfConfigManager(self)
        config_manager.save(name)

The save method is creating a new instance of DfConfigManager every time it's called. If creating a DfConfigManager is expensive in terms of resources, consider reusing the same instance across multiple calls to save.

nautics889 (Contributor) left a comment

I'm looking forward to the new release!

Hrishikesh007788 commented Sep 10, 2023

When is it going to be released? Excited for it!

* refactor: `to_dict()` parameters

* (refactor): update the behaviour of the `to_dict()` method so it
  accepts `into` and `as_series` parameters (the latter is for
  polars dataframes).
* (tests): add tests for casting the dataframe to a dictionary, add
  tests for passing parameters in the proxy-call to dataframe's
  `to_dict()` method.

* docs: docstrings for `DataframeAbstract`

* (docs): add docstrings for proxy-methods in `DataframeAbstract` class

* chore: type hint issue

* (chore): update return type for `df_type` function
coderabbitai bot (Contributor) left a comment

Review Status

Actionable comments generated: 2

Commits: files that changed from the base of the PR, between commits f5c4be0 and c05733b.
Files selected for processing (3)
  • pandasai/helpers/df_info.py (1 hunks)
  • pandasai/smart_dataframe/abstract_df.py (1 hunks)
  • tests/test_smartdataframe.py (13 hunks)
Additional comments (Suppressed): 11
pandasai/helpers/df_info.py (1)
  • 14-17: The function signature has been changed from returning str to Union[str, None]. Ensure that all calls to this function throughout the codebase can handle a None return value.
pandasai/smart_dataframe/abstract_df.py (1)
  • 1-4: The class DataframeAbstract now inherits from the ABC (Abstract Base Class) module. This is a good practice for defining abstract base classes in Python, as it provides a mechanism to define methods that must be created within any child classes built from the abstract class.
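
As a reminder of the mechanism, a toy example of the ABC pattern (not the actual DataframeAbstract definition):

    from abc import ABC, abstractmethod

    class DataframeLike(ABC):  # hypothetical name, for illustration only
        @abstractmethod
        def to_dict(self):
            """Subclasses must provide their own conversion."""

    # DataframeLike() raises TypeError: can't instantiate abstract class
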
tests/test_smartdataframe.py (9)
  • 108-119: The new fixture smart_dataframe_mocked_df is a good addition as it allows for testing of the SmartDataframe with a mocked dataframe. This can help isolate tests and make them more reliable.

  • 130-137: The test test_init has been updated to reflect changes in the SmartDataframe class. The assertions now check for _table_name and _table_description instead of _name and _description. Also, engine and dataframe are checked instead of _engine and _df. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 223-262: The new tests test_to_dict and test_to_dict_passing_parameters are well written and cover important functionality of the to_dict method. They ensure that the method returns a dictionary with expected keys and that it correctly passes parameters based on the engine type.

  • 380-386: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [380-413]

The test test_error_prompt_replacement has been updated to reflect changes in the SmartDataframe class. The dataframe is now initialized with actual data and the lake._retry_run_code method is called instead of _dl._retry_run_code. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 420-432: The test test_logs has been updated to reflect changes in the SmartDataframe class. The lake.logger.log method is called instead of _dl._logger.log. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 439-453: The test test_updates_verbose_config_with_setters has been updated to reflect changes in the SmartDataframe class. The lake._logger.verbose property is checked instead of _dl._logger.verbose. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 460-471: The test test_updates_save_logs_config_with_setters has been updated to reflect changes in the SmartDataframe class. The lake._logger.save_logs property is checked instead of _dl._logger.save_logs. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 477-487: The test test_updates_enable_cache_config_with_setters has been updated to reflect changes in the SmartDataframe class. The lake.enable_cache property and lake._cache object are checked instead of _dl.enable_cache and _dl._cache. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 537-557: The tests for loading dataframes (test_load_dataframe_from_list_of_dicts, test_load_dataframe_from_dict, test_load_dataframe_from_pandas_dataframe, test_load_dataframe_from_saved_dfs, test_load_dataframe_from_other_dataframe_type) have been updated to reflect changes in the SmartDataframe class. The _load_dataframe method is called instead of _load_df and the dataframe property is checked instead of _df. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

Comment on lines +5 to +6
_engine: str


The _engine attribute has been added to the class. It seems to be used to differentiate between different DataFrame implementations (e.g., pandas, polars). However, there's no default value provided for this attribute, and nothing documents which values are expected. It would be better to provide a default value or at least a comment listing the expected values.

-    _engine: str
+    _engine: str = None  # Expected values: 'pandas', 'polars'

Comment on lines +296 to +307
    def to_dict(self, orient="dict", into=dict, as_series=True):
        """
        A proxy-call to the dataframe's `.to_dict()`.
        """
        if self._engine == "pandas":
            return self.dataframe.to_dict(orient=orient, into=into)
        elif self._engine == "polars":
            return self.dataframe.to_dict(as_series=as_series)
        raise RuntimeError(
            f"{self.__class__} object has unknown engine type. "
            f"Possible engines: 'pandas', 'polars'. Actual '{self._engine}'."
        )

The to_dict method has been updated to handle different DataFrame engines. This is a good approach to make the class more flexible and adaptable to different DataFrame implementations. However, the error message in the RuntimeError could be tightened, as suggested below.

-        raise RuntimeError(
-            f"{self.__class__} object has unknown engine type. "
-            f"Possible engines: 'pandas', 'polars'. Actual '{self._engine}'."
-        )
+        raise RuntimeError(f"Unknown engine type '{self._engine}' for {self.__class__}. "
+                           "Expected engine types are 'pandas' or 'polars'.")

Overall, the changes made in this PR seem to improve the flexibility and adaptability of the DataframeAbstract class. The addition of the _engine attribute allows the class to handle different DataFrame implementations, and the use of proxy-calls to DataFrame methods ensures that the class can work with any DataFrame that implements these methods. The use of the ABC module is also a good practice for defining abstract base classes in Python.

coderabbitai bot (Contributor) left a comment

Review Status

Actionable comments generated: 7

Commits: files that changed from the base of the PR, between commits c05733b and aabd42c.
Files ignored due to filter (1)
  • pyproject.toml
Files selected for processing (10)
  • pandasai/helpers/code_manager.py (9 hunks)
  • pandasai/helpers/memory.py (1 hunks)
  • pandasai/prompts/base.py (1 hunks)
  • pandasai/prompts/generate_python_code.py (2 hunks)
  • pandasai/smart_dataframe/__init__.py (8 hunks)
  • pandasai/smart_datalake/__init__.py (16 hunks)
  • tests/prompts/test_correct_error_prompt.py (2 hunks)
  • tests/prompts/test_generate_python_code_prompt.py (2 hunks)
  • tests/test_smartdataframe.py (14 hunks)
  • tests/test_smartdatalake.py (4 hunks)
Files skipped from review due to trivial changes (3)
  • pandasai/prompts/base.py
  • pandasai/prompts/generate_python_code.py
  • tests/prompts/test_generate_python_code_prompt.py
Additional comments (Suppressed): 50
tests/prompts/test_correct_error_prompt.py (1)
  • 15-27: The FakeLLM and SmartDataframe objects are created but not used in the test. If they are necessary for the test, please ensure that they are utilized correctly. Otherwise, consider removing them to reduce unnecessary complexity.
tests/test_smartdatalake.py (4)
  • 72-77: The fixture smart_datalake has been updated to use the new attribute lake instead of the old datalake. Ensure that all references to this fixture in the test suite have been updated accordingly.

  • 118-124: The middleware setup and assertion have been updated to use the new lake attribute. This change is consistent with the update in the smart_datalake fixture.

  • 131-142: The mock method _get_sample_head returns a DataFrame directly, replacing the previous _get_head_csv which returned a string representation of a CSV. This change simplifies the code and makes it more efficient by avoiding unnecessary conversions between formats.

  • 149-164: The format of last_prompt has been updated to include additional information about the DataFrame(s) being used. This change improves the clarity of the prompt for users. However, ensure that this new format is correctly handled wherever last_prompt is used.

pandasai/smart_datalake/__init__.py (10)
  • 76-83: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 92-96: Similar to the previous comment, the logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 129-133: Again, the logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 189-192: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 267-268: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 280-281: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 303-309: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 345-346: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 400-402: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

  • 464-473: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.

pandasai/helpers/code_manager.py (4)
  • 1-11: The import statements are well organized and necessary for the functionality of the code. The addition of collections and node_visitors is justified by their usage in the new hunks.

  • 15-21: The addition of Generator from the typing module is justified as it's used in the function _tokenize_operand.

  • 28-45: The addition of _ast_comparatos_map and _current_code_executed attributes to the class seems to be necessary for the new functionality added to the class. However, please ensure that these attributes are properly initialized and cleaned up where necessary.

  • 295-302: The removal of "dfs" from the environment dictionary is justified as it's replaced by the use of _get_samples method in the execute_code method.

tests/test_smartdataframe.py (13)
  • 2-8: The import of collections module is not used in the new hunks. Please ensure that it's used elsewhere in the code or remove it if it's unnecessary.
- from collections import defaultdict
  • 108-119: The fixture smart_dataframe_mocked_df has been added. Ensure that this fixture is being used in the tests and that it provides the necessary mock setup for the tests.

  • 130-137: The test test_init and test_init_without_llm have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 181-217: The expected prompt in the test test_last_prompt_with_privacy_enforced has been updated. Make sure that this change reflects the actual output of the function under test.

  • 221-264: The test test_to_dict_passing_parameters has been added. This test checks whether the correct parameters are passed to the to_dict method depending on the engine type. Ensure that this test covers all possible scenarios and edge cases.

  • 381-387: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [381-414]

The test test_retry_run_code_with_custom_prompt has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 421-433: The test test_log has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 438-454: The test test_updates_verbose_config_with_setters has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 460-474: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [460-488]

The test test_updates_enable_cache_config_with_setters has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 490-491: The test test_updates_configs_with_setters has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 538-559: The tests test_load_dataframe_from_list_of_dicts, test_load_dataframe_from_dict, test_load_dataframe_from_pandas_dataframe, and test_load_dataframe_from_saved_dfs have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 575-591: The tests test_load_dataframe_from_other_dataframe_type and test_import_csv_file have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.

  • 637-647: The tests test_import_invalid_file and test_import_pandas_series have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the SmartDataframe class.
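
For test_to_dict_passing_parameters, a hypothetical shape for such a test; the fixture name, the use of patch, and the per-engine argument names (orient for pandas, as_series for polars) are assumptions, not copied from the suite:

```python
from unittest.mock import patch

def test_to_dict_passing_parameters(smart_dataframe):
    # pandas and polars accept different to_dict() arguments, so the
    # wrapper must forward the right ones for the active engine.
    with patch.object(smart_dataframe.dataframe, "to_dict") as mocked:
        smart_dataframe.to_dict()
        if smart_dataframe.engine == "pandas":
            mocked.assert_called_once_with(orient="dict")
        else:  # polars
            mocked.assert_called_once_with(as_series=True)
```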

pandasai/smart_dataframe/__init__.py (18)
  • 22-28: The import of cached_property from functools is new but it's not used anywhere in the provided hunks. Please ensure that it's used somewhere else in the code or remove it if it's unnecessary.

  • 41-59: The class SmartDataframeCore has been introduced which seems to handle the core functionality of loading and managing dataframes. This is a good practice as it separates concerns and makes the code more maintainable.

  • 65-93: The _load_dataframe method has been refactored to handle different types of inputs, including connectors, strings (file paths), pandas Series, lists, and dictionaries. The error handling for invalid input data is also well implemented; a sketch of this dispatch follows this list.

  • 41-95: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [94-115]

The _import_from_file method is missing in the new hunk. If this method is defined elsewhere in the code, please ensure that it correctly handles file imports and raises an appropriate error for invalid file formats.

  • 129-151: The _validate_and_convert_dataframe method is a good addition as it encapsulates the logic for validating and converting the dataframe input into a single function. This improves the readability and maintainability of the code.

  • 153-174: The methods load_connector and _unload_connector are well implemented to manage the loading and unloading of connectors. Good use of instance variables to track the state of the loaded dataframe.

  • 176-192: The getter and setter for the dataframe property have been well implemented. They handle the loading and unloading of the connector when necessary and validate and convert the dataframe before setting it.

  • 207-220: The properties engine, connector, and has_connector are well implemented and provide a clean interface for accessing these attributes.

  • 223-305: The SmartDataframe class has been refactored to use the SmartDataframeCore for handling the core functionality. This is a good practice as it separates concerns and makes the code more maintainable.

  • 313-317: The add_middlewares method now delegates to the lake attribute instead of the previously used _dl attribute. Please ensure that the lake attribute is correctly initialized and has the add_middlewares method.

  • 325-328: The chat method now delegates to the lake attribute instead of the previously used _dl attribute. Please ensure that the lake attribute is correctly initialized and has the chat method.

  • 330-342: The column_hash method has been updated to handle the case where the dataframe is not loaded and a connector is present. This is a good improvement as it ensures that the column hash can still be obtained in this scenario.

  • 344-354: The save method remains unchanged and continues to use the DfConfigManager for saving the dataframe configuration. This is consistent with the previous implementation.

  • 355-364: The load_connector method now delegates to the _core attribute. This is consistent with the new design of using SmartDataframeCore for handling the core functionality.

  • 365-403: The _truncate_head_columns method is a new addition and it seems to handle the truncation of column values well for both pandas and polars dataframes. However, please ensure that the polars package is installed if it's required by your project.

  • 405-419: The _get_sample_head method has been updated to handle the case where the dataframe is not loaded and a connector is present. This is a good improvement as it ensures that a sample head can still be obtained in this scenario.

  • 325-423: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [421-427]

The _load_from_config method remains unchanged and continues to use the DfConfigManager for loading the saved dataframe configuration. This is consistent with the previous implementation.

  • 429-445: The properties dataframe, engine, connector, and the setter for connector now delegate to the _core attribute. This is consistent with the new design of using SmartDataframeCore for handling the core functionality.
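
As referenced in the _load_dataframe bullet above, a minimal sketch of that kind of type dispatch; the BaseConnector import path and the helper names are assumptions based on the review context, not the library's exact code:

```python
import pandas as pd
from pandasai.connectors.base import BaseConnector  # import path assumed

def _load_dataframe(self, df):
    # Dispatch on input type, deferring connector loading until needed.
    if isinstance(df, BaseConnector):
        self.connector = df
    elif isinstance(df, str):
        self.dataframe = self._import_from_file(df)  # treat as a file path
    elif isinstance(df, pd.Series):
        self.dataframe = df.to_frame()
    elif isinstance(df, (list, dict)):
        self.dataframe = pd.DataFrame(df)
    elif isinstance(df, pd.DataFrame):
        self.dataframe = df
    else:
        raise ValueError("Invalid input data: cannot convert it to a dataframe")
```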

Comment on lines +29 to 30
prompt.to_string()
== """

The method to_string() is called on the prompt object, but it seems like this should be __str__(), based on the old hunk and the description of the test (test_str_with_args). Please verify if this is a typo or an intended change.

-            prompt.to_string()
+            str(prompt)
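
For context, __str__ is the hook that str(), print(), and f-strings all invoke, so testing it exercises the same path callers use; an illustrative minimal class:

```python
class Prompt:
    text = "Instructions: {instruction}"

    def __str__(self) -> str:
        # str(prompt), f"{prompt}", and print(prompt) all route through here.
        return self.text
```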

Comment on lines 24 to 32
 def get_conversation(self, limit: int = 1) -> str:
     return "\n".join(
         [
-            f"{'User' if message['is_user'] else 'Bot'}: {message['message']}"
-            for message in self._messages[-limit:]
+            f"{f'User {i+1}' if message['is_user'] else f'Assistant {i}'}: "
+            f"{message['message']}"
+            for i, message in enumerate(self._messages[-limit:])
         ]
     )


The new implementation of get_conversation method introduces numbering for the User and Assistant messages. However, there's a potential issue with the numbering logic. The enumeration starts from 0, so the first user message will be "User 1" but the first assistant message will be "Assistant 0". To keep consistency, both should start from 1.

- f"{f'User {i+1}' if message['is_user'] else f'Assistant {i}'}: "
+ f"{f'User {i+1}' if message['is_user'] else f'Assistant {i+1}'}: "

Comment on lines 158 to 182
if not use_error_correction_framework:
raise exc

def _required_dfs(self, code: str) -> List:
"""
List the DataFrames that are needed to execute the code, so that
connectors are not run for dataframes the code does not use.

Args:
code (str): Python code to execute

Returns:
List: A list aligned with self._dfs, containing each DataFrame the
code references (as dfs[i]) and None for the ones it does not.
"""

required_dfs = []
for i, df in enumerate(self._dfs):
if f"dfs[{i}]" in code:
required_dfs.append(df)
else:
required_dfs.append(None)
return required_dfs

def execute_code(

The addition of the _required_dfs method is a good improvement: it avoids running connectors for dataframes the generated code never references, which can improve performance.
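
An illustrative call, assuming a CodeManager holding three dataframes (the instance and variable names are hypothetical):

```python
code = "result = dfs[0].merge(dfs[2], on='id')"
required = code_manager._required_dfs(code)
# -> [df0, None, df2]: dfs[1] is never referenced in the code,
#    so its connector will not be run.
```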

Comment on lines 200 to 206
on the generated code.

"""
self._current_code_executed = code

for middleware in self._middlewares:
code = middleware(code)

Assigning self._current_code_executed = code before the middleware loop is a good practice: it lets the currently executing code be tracked and reused in other methods (for example, when extracting filters in _get_originals).

Comment on lines 225 to 235
```"""
)

# List the required dfs, so we can avoid running the connectors
# if the code does not need them
dfs = self._required_dfs(code_to_run)
environment: dict = self._get_environment()
environment["dfs"] = self._get_samples(dfs)

caught_error = self._execute_catching_errors(code_to_run, environment)
if caught_error is not None:

Wiring _required_dfs into execute_code is a good improvement for the same reason noted above: connectors are only run for the dataframes the generated code actually references, which can improve performance.

Comment on lines 242 to +289

 analyze_data = environment.get("analyze_data", None)

-        return analyze_data(self._get_original_dfs())
+        return analyze_data(self._get_originals(dfs))

-    def _get_original_dfs(self):
-        dfs = []
-        for df in self._dfs:
-            if df.engine == "polars":
-                dfs.append(df.original.to_pandas())
-            else:
-                dfs.append(df.original)
-        return dfs
+    def _get_samples(self, dfs):
+        """
+        Get samples from the dfs
+
+        Args:
+            dfs (list): List of dfs
+
+        Returns:
+            list: List of samples
+        """
+        samples = []
+        for df in dfs:
+            if df is not None:
+                samples.append(df.head_df)
+            else:
+                samples.append(None)
+        return samples
+
+    def _get_originals(self, dfs):
+        """
+        Get original dfs
+
+        Args:
+            dfs (list): List of dfs
+
+        Returns:
+            list: List of dfs
+        """
+        original_dfs = []
+        for index, df in enumerate(dfs):
+            if df is None:
+                original_dfs.append(None)
+                continue
+
+            if df.has_connector:
+                extracted_filters = self._extract_filters(self._current_code_executed)
+                filters = extracted_filters.get(f"dfs[{index}]", [])
+                df.connector.set_additional_filters(filters)
+                df.load_connector(temporary=len(filters) > 0)
+
+            original_dfs.append(df.dataframe)
+
+        return original_dfs

The addition of the _get_samples and _get_originals methods is a good improvement: _get_samples exposes only each dataframe's head for the initial execution, while _get_originals loads the full data, pushing any extracted filters down to the connectors. This improves readability and keeps connector work to a minimum.

Comment on lines 467 to 688
}
"""
comparisons = defaultdict(list)
current_df = "dfs[0]"

visitor = AssignmentVisitor()
visitor.visit(tree)
assignments = visitor.assignment_nodes

call_visitor = CallVisitor()
call_visitor.visit(tree)
calls = call_visitor.call_nodes

for node in ast.walk(tree):
if isinstance(node, ast.Compare):
is_call_on_left = isinstance(node.left, ast.Call)
is_polars = False
is_calling_col = False
try:
is_polars = node.left.func.value.id in ("pl", "polars")
is_calling_col = node.left.func.attr == "col"
except AttributeError:
pass

if is_call_on_left and is_polars and is_calling_col:
df_name = self._get_nearest_func_call(
node.lineno, calls, "filter"
).func.value.id
current_df = self._get_df_id_by_nearest_assignment(
node.lineno, assignments, df_name
)
left_str = node.left.args[0].value

for op, right in zip(node.ops, node.comparators):
op_str = self._ast_comparatos_map.get(type(op), "Unknown")
right_str = right.value

comparisons[current_df].append((left_str, op_str, right_str))
elif isinstance(node.left, ast.Subscript):
name, *slices = self._tokenize_operand(node.left)
current_df = (
self._get_df_id_by_nearest_assignment(
node.lineno, assignments, name
)
or current_df
)
left_str = name if not slices else slices[-1]

for op, right in zip(node.ops, node.comparators):
op_str = self._ast_comparatos_map.get(type(op), "Unknown")
name, *slices = self._tokenize_operand(right)
right_str = name if not slices else slices[-1]

comparisons[current_df].append((left_str, op_str, right_str))
return comparisons

def _extract_filters(self, code) -> dict[str, list]:
"""
Extract filters to be applied to the dataframe from passed code.

Args:
code (str): A snippet of code to be parsed.

Returns:
dict: The dictionary containing all filters parsed from
the passed code. The dictionary has the following structure:
{
"<df_number>": [
("<left_operand>", "<operator>", "<right_operand>")
]
}

Raises:
SyntaxError: If the code is unable to be parsed by `ast.parse()`.
Exception: If any exception is raised during working with nodes
of the code tree.
"""
try:
parsed_tree = ast.parse(code)
except SyntaxError:
self._logger.log(
"Invalid code passed for extracting filters", level=logging.ERROR
)
self._logger.log(f"{traceback.format_exc()}", level=logging.DEBUG)
raise

try:
filters = self._extract_comparisons(parsed_tree)
except Exception:
self._logger.log(
"Unable to extract filters for passed code", level=logging.ERROR
)
self._logger.log(f"{traceback.format_exc()}", level=logging.DEBUG)
raise

return filters

@property
def middlewares(self):
return self._middlewares

The addition of utility methods and properties such as _get_nearest_func_call, _tokenize_operand, _get_df_id_by_nearest_assignment, _extract_comparisons, and _extract_filters is a good improvement: together they walk the AST of the generated code to recover the filters applied to each dataframe, which improves readability and maintainability.
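
To make the filter extraction concrete, an illustrative call based on the output structure documented in the docstring above (the code snippet, instance name, and values are hypothetical):

```python
code = """
df = dfs[0]
filtered = df[(df['loan_status'] == 'PAIDOFF') & (df['Gender'] == 'male')]
"""
filters = code_manager._extract_filters(code)
# Expected shape, per the docstring:
# {"dfs[0]": [("loan_status", "==", "PAIDOFF"), ("Gender", "==", "male")]}
```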

@gventuri gventuri merged commit db85725 into main Sep 13, 2023
9 checks passed