Version 1.2 #535
Conversation
* fix: setters were not overriding some properties in some cases
* fix: remove multiple df overwrites (#436)
  * improved on `_is_df_overwrite` method to check all targets
  * test: add test for multiple df overwrite
  * Co-authored-by: Gabriele Venturi <[email protected]>
* fix: prevent `to_*` pandas methods from running
* Release v0.8.3
* fix: environment for executing code (#419)
  * (fix): set `matplotlib.pyplot` package under "plt" alias in `_get_environment()`
* feat: try to import a package when NameError (#430)
  * (refactor): split calling of `exec()` for code and handling of errors into separate methods
  * (feat): add trying to import the `exc.name` and add one to context when `exc` is an instance of `NameError`
  * (tests): add tests for new methods
* fix: add logging for `NameError` handling (#430)
  * (fix): log a general exception raised when trying to fix `NameError`
* fix: compatibility with Python 3.9 (#430)
  * (fix): update `handle_error()` method, add parsing of `exc.args[0]` for NameError cases where the exception object has no "name" attribute
* fix: update for `handle_error()` (#430)
  * (fix): add checking if the library to be imported is present in the whitelist of third-party packages
  * (fix): remove "plt" when forming environment for code to be run
  * (fix): add raising an exception that was caught during fixing code with the error correction framework
  * (tests): add test case for retrying to fix code with the error correction framework unsuccessfully
  * (tests): fix issue with inappropriate instantiation of numpy's `array` object within several test methods
* Release v0.8.4
* Release v1.0b1
* Release v1.0
* fix: convert polars to pandas df and back to polars after executing the code
* docs: update docs for v1
* Release v1.0.1
* fix: remove shortcuts from backwards-compatible version
* Release v1.0.2
* refactor: minor updates (#446)
  * (refactor): fix inappropriate type hint for `_has_run` in base class for middlewares
  * (style): remove built-ins name shadowing in `__init__()` method of `Prompt` class
* docs: update LLMs documentation code to match the last version (#445). OpenAI, Starcoder, Falcon, and AzureOpenAI have been updated, but not GooglePalm, because it still uses `api_key` in the last version, not `api_token`.
  * Co-authored-by: Gabriele Venturi <[email protected]>
* fix: do not override setters and getters with `__getattr__`
* Release v1.0.3
* fix: huggingface models (#454)
* Release v1.0.4
* fix: update code manager to convert prompt id to string (#462)
* test: increase test coverage (#460)
  * Update LLMs documentation code to match the last version (OpenAI, Starcoder, Falcon, and AzureOpenAI; GooglePalm still uses `api_key`, not `api_token`)
  * Create test_google_vertexai.py: test cases for the GoogleVertexai class, covering initialization with the default model and a custom one, plus validation of the class with and without a model
  * Update test_smartdataframe.py: test cases for the `_load_df()` and `_import_from_file()` functions
  * test: remove verbose test
  * test: unskip skipped test
  * test: test multiple invalid file formats
  * test: fix tests
  * Co-authored-by: Gabriele Venturi <[email protected]>
* fix: recursive error on df `last_result` property (#468) (#469)
  * fix: change `last_result` type to str
  * test: add tests
  * Co-authored-by: Gabriele Venturi <[email protected]>
* Release v1.0.5
* fix(llms): AzureOpenAI call method signature (#476)
* fix: added `last_error` property reference in smart_datalake (#475)
* fix: temporarily disable multi-turn capacity, causing hallucinations
* Release v1.0.6
* fix: create required folders as the datalake is initialized
* Release v1.0.7
* fix: create required folders in the project folder
* Release v1.0.8
* chore: fix typing in the logger logs (#464)
* docs: add video explanation for SmartDataframe
* feat: extracting filters (#451)
  * (feat): add `_extract_filters()`, `_extract_comparisons()`, `_tokenize_operand()` methods to `CodeManager`
  * (tests): add basic tests
* test: extracting filters (#451)
  * (tests): rename `test_extract_filters()` to `test_extract_filters_col_index()`, update assertions according to task requirements
  * (tests): add `test_extract_filters_col_index_multiple_df()` for multiple dfs
* fix: update according to requirements (#451)
  * (feat): add module `node_visitors.py`, add `AssignmentVisitor` in `node_visitors.py`
  * (fix): `_extract_filters()` now returns a dictionary
  * (feat): add grouping by df number when forming
  * (tests): update naming in test cases
  * (tests): add test method with non-default (not `df`) variable name
* feat: extracting filters (#451)
  * (feat): add `CallVisitor` to `node_visitors.py` module
  * (feat): add extracting comparisons in `filter()` method being called as an attribute of an object named `pl` or `polars`
  * (tests): test methods for extracting polars-like filters
* refactor: extracting filters (#451)
  * (refactor): add handling of exceptions in `_extract_filters()` method
* docs: extracting filters (#451)
  * (docs): add documentation for implemented functions
  * (style): type hinting
* test: merge similar tests
* Co-authored-by: Gabriele Venturi <[email protected]>, Aymane Hachcham <[email protected]>, Omar Elsherif <[email protected]>, Hanchung Lee <[email protected]>, Sanchit Bhavsar <[email protected]>, Massimiliano Pronesti <[email protected]>
Walkthrough
The changes introduce a new architecture for managing dataframes and connectors, improve code execution logic, enhance error handling, and add comprehensive unit tests. The updates also include better logging, caching, and memory management. The modifications aim to increase modularity, maintainability, and test coverage.
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

```
@@            Coverage Diff             @@
##             main     #535      +/-  ##
==========================================
- Coverage   87.14%   83.39%   -3.75%
==========================================
  Files          47       52       +5
  Lines        2030     2584     +554
==========================================
+ Hits         1769     2155     +386
- Misses        261      429     +168
```
... and 2 files with indirect coverage changes
Review Status
Actionable comments generated: 34
Files ignored due to filter (2)
- poetry.lock
- pyproject.toml
Files selected for processing (23)
- examples/from_sql.py (1 hunks)
- pandasai/connectors/__init__.py (1 hunks)
- pandasai/connectors/base.py (1 hunks)
- pandasai/connectors/sql.py (1 hunks)
- pandasai/connectors/yahoo_finance.py (1 hunks)
- pandasai/helpers/__init__.py (2 hunks)
- pandasai/helpers/code_manager.py (9 hunks)
- pandasai/helpers/df_config_manager.py (7 hunks)
- pandasai/helpers/df_info.py (1 hunks)
- pandasai/helpers/node_visitors.py (1 hunks)
- pandasai/prompts/base.py (1 hunks)
- pandasai/prompts/correct_error_prompt.py (3 hunks)
- pandasai/schemas/df_config.py (1 hunks)
- pandasai/smart_dataframe/__init__.py (8 hunks)
- pandasai/smart_dataframe/abstract_df.py (1 hunks)
- pandasai/smart_datalake/__init__.py (17 hunks)
- tests/connectors/test_base.py (1 hunks)
- tests/connectors/test_sql.py (1 hunks)
- tests/prompts/test_correct_error_prompt.py (2 hunks)
- tests/test_codemanager.py (3 hunks)
- tests/test_pandasai.py (1 hunks)
- tests/test_smartdataframe.py (10 hunks)
- tests/test_smartdatalake.py (4 hunks)
Files skipped from review due to trivial changes (6)
- pandasai/connectors/__init__.py
- pandasai/helpers/df_info.py
- pandasai/prompts/base.py
- pandasai/prompts/correct_error_prompt.py
- pandasai/schemas/df_config.py
- tests/test_smartdataframe.py
Additional comments (Suppressed): 50
examples/from_sql.py (1)
- 43-43: The question "How many people from the United states?" seems unrelated to the data being queried (loans and payments). Ensure that the chat query aligns with the data context.
pandasai/helpers/__init__.py (2)
- 5-11: The import for `Logger` has been added, but the corresponding import for `get_openai_callback` seems to be missing in this hunk. Please ensure that it's imported correctly.
  ```diff
  + from .logger import Logger
  - from .get_openai_callback import get_openai_callback
  ```
- 16-20: The addition of `Logger` to the `__all__` list is correct and aligns with the new import.
pandasai/helpers/node_visitors.py (1)
- 1-19: The `AssignmentVisitor` and `CallVisitor` classes are well implemented. They both inherit from `ast.NodeVisitor` and override the appropriate visit methods to collect assignment and call nodes respectively. The use of `self.generic_visit(node)` ensures that all child nodes are visited.
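For readers unfamiliar with the pattern, here is a minimal sketch of an `ast.NodeVisitor` subclass in the same spirit (illustrative only, not the project's actual classes; `AssignmentCollector` is a made-up name):

```python
import ast

class AssignmentCollector(ast.NodeVisitor):
    """Collects every assignment node found in a parsed module."""

    def __init__(self):
        self.assignments = []

    def visit_Assign(self, node):
        self.assignments.append(node)
        # Keep descending so assignments nested inside other statements are found
        self.generic_visit(node)

tree = ast.parse("df = dfs[0]\nresult = df[df['loan_status'] == 'PAIDOFF']")
collector = AssignmentCollector()
collector.visit(tree)
print(len(collector.assignments))  # 2
```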
pandasai/connectors/sql.py (4)
- 41-48: The password is directly embedded into the connection string, which could potentially expose sensitive information in logs or error messages. Consider using URL encoding for the password, or SQLAlchemy's `create_engine` method with the username and password passed as separate parameters.
- 165-176: The cache invalidation strategy based on time might not work correctly if the data in the database changes frequently. Consider using a more robust invalidation strategy, like listening for changes in the database or using a hash of the data.
- 296-313: The hash of the columns is calculated by concatenating the column names and conditions. This can lead to collisions if different columns have similar names. Consider using a more robust hashing algorithm that takes into account the order and number of columns.
- 345-354: Environment variables are used as fallbacks for the configuration parameters. This can lead to unexpected behavior if the environment variables are set but the corresponding parameters are not included in the configuration. Make sure to document this behavior clearly.
pandasai/connectors/base.py (1)
- 63-70: The method `set_additional_filters` accepts a dictionary as filters, but the class variable `_additional_filters` is a list of lists. There seems to be a type mismatch here. Please verify whether this is intended or an error.
pandasai/helpers/df_config_manager.py (9)
- 36-40: The change from `self.name` to `self._sdf.table_name` is noted. Ensure that all instances of `self.name` have been replaced with `self._sdf.table_name` throughout the codebase.
- 47-51: The change from `self.name` to `self._sdf.table_name` in the `_check_for_duplicates` method is noted. Make sure that this change does not affect other parts of the codebase where this method is called.
- 58-73: The addition of handling connectors and checking for `.parquet` files and Google Sheets links is a good improvement. However, ensure that the new checks are compatible with the rest of the codebase and that they do not introduce any unexpected behavior.
- 79-88: The change from `self._sdf.original` to `self._sdf.dataframe` in the `_get_import_path` method is noted. Make sure that this change does not affect other parts of the codebase where this method is called.
- 92-101: The addition of the `name` argument in the `save` method is a good improvement. However, ensure that this change is reflected wherever this method is called in the codebase.
- 118-124: The changes in the dictionary keys from `self.description` and `self.head_csv` to `self._sdf.table_description` and `self._sdf.head_csv` respectively are noted. Ensure that these changes are reflected wherever this dictionary is used in the codebase.
- 156-162: The change from `self._sdf.name` to `self._sdf.table_name` in the `name` property is noted. Make sure that this change does not affect other parts of the codebase where this property is accessed.
- 164-166: The change from `self._sdf.description` to `self._sdf.table_description` in the `description` property is noted. Make sure that this change does not affect other parts of the codebase where this property is accessed.
- 168-170: The change from `self._sdf.original_import` to `self._sdf._original_import` in the `original_import` property is noted. Make sure that this change does not affect other parts of the codebase where this property is accessed.
pandasai/helpers/code_manager.py (9)
- 1-11: The import of `defaultdict` from the `collections` module and the `AssignmentVisitor`, `CallVisitor` classes from the `.node_visitors` module are new additions. Ensure that these modules exist and are accessible in the project's structure.
- 15-21: The addition of `Generator` to the imported types from the `typing` module is a new change. Make sure that this type is used appropriately in the code.
- 28-45: The `_ast_comparatos_map` dictionary has been added as a class attribute. This dictionary maps AST comparison nodes to their string equivalents, which will be useful for parsing comparisons from the code. The `_current_code_executed` attribute has also been added to keep track of the current code being executed.
- 158-181: A new method `_required_dfs` has been introduced. This method determines which dataframes are required to execute a given piece of code by checking if the dataframe index is referenced in the code. This can help avoid unnecessary dataframe loading if the dataframe isn't needed for the code execution.
- 200-206: The `_current_code_executed` attribute is now set to the code being executed before the middlewares are applied. This allows the current code being executed to be tracked throughout the execution process.
- 225-233: The `_required_dfs` method is now called before executing the code to determine which dataframes are needed. The resulting list of required dataframes is then passed to the `_get_samples` method to get sample dataframes for execution. This helps optimize the code execution by only loading the necessary dataframes.
- 242-289: The `_get_samples` and `_get_originals` methods have been introduced. These methods return samples and original versions of the dataframes respectively. In `_get_originals`, if a dataframe has a connector, additional filters extracted from the current code are set on the connector and the connector is loaded. This ensures that only the necessary data is loaded from the connector based on the filters.
- 295-302: The environment now includes the pandas library (`pd`) and any additional dependencies that have been imported. The `dfs` variable is no longer included in the environment here, as it is now handled separately with the `_get_samples` method.
- 467-688: Several new methods and a staticmethod have been added to handle the extraction of filters from the code. These include `_get_nearest_func_call`, `_tokenize_operand`, `_get_df_id_by_nearest_assignment`, `_extract_comparisons`, and `_extract_filters`. These methods work together to parse the code, identify filter operations on the dataframes, and extract these filters into a structured format. This allows for more efficient filtering of the dataframes based on the code being executed.
pandasai/smart_dataframe/__init__.py (3)
- 24-28: The `functools.cached_property` import is not used in the new hunks. Please ensure that it's necessary and remove it if it's not.
- 114-115: The `_import_from_file` method should handle different file formats such as CSV, Excel, Parquet etc. The current implementation does not specify what file formats are supported. Please ensure that all necessary file formats are handled.
- 443-444: The `validate` method is mentioned but its implementation is not provided in the new hunks. Ensure that it's implemented correctly and handles potential errors.
tests/connectors/test_sql.py (1)
- 19-29: Hardcoding credentials in the code is a bad practice and poses a security risk. It's better to use environment variables or secure vaults to store sensitive data like usernames, passwords, hostnames etc. Please ensure that this is only for testing purposes and no real credentials are used here.
tests/test_pandasai.py (1)
- 64-69: Ensure that all calls to the `run` method throughout the codebase have been updated to match the new input argument from `None` to `0`.
pandasai/smart_datalake/__init__.py (12)
- 76-83: The logger attribute has been renamed from `_logger` to `logger`. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.
- 92-96, 129-133, 187-192, 277-281, 300-309, 343-345, 354-354, 432-432, 400-403, 466-467: Same comment as above: the logger attribute has been renamed from `_logger` to `logger`; ensure that all references throughout the codebase have been updated to match the new name.
- 470-475: The logger attribute has been renamed from `_logger` to `logger`. This change also introduces a setter for the logger attribute. Ensure that all references to this attribute throughout the codebase have been updated to match the new name, and that the setter is used appropriately where necessary.
tests/test_codemanager.py (2)
- 71-72: The reference to the `CodeManager` instance has been updated from `smart_dataframe._dl._code_manager` to `smart_dataframe.lake._code_manager`. Ensure that this change is consistent with the updates made in the `SmartDataframe` class.
- 174-179: The assertion `assert smart_dataframe.equals(code_manager._get_environment()["dfs"][0])` has been removed. If this assertion is no longer necessary due to changes in the codebase, then this is fine. However, if it was removed by mistake, it should be added back to ensure the correctness of the test.
tests/test_smartdatalake.py (4)
- 73-74: The attribute name has been changed from `datalake` to `lake`. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.
- 118-124: The attribute name has been changed from `_dl` to `lake`. Ensure that all references to this attribute throughout the codebase have been updated to match the new name.
- 134-142: The method name has been changed from `_get_head_csv` to `_get_sample_head`. Also, the return type of the method has been changed from a string to a DataFrame. Ensure that all calls to this method throughout the codebase have been updated to match the new signature and handle the new return type correctly.
- 149-163: The format of the `last_prompt` string has been changed. This might affect any code that relies on the specific format of this string. Verify that these changes do not break any existing functionality.
Commits 8951ebe to 0223cdf
Review Status
Actionable comments generated: 9
Files selected for processing (4)
- pandasai/connectors/sql.py (1 hunks)
- pandasai/helpers/code_manager.py (9 hunks)
- tests/connectors/test_sql.py (1 hunks)
- tests/test_codemanager.py (3 hunks)
Additional comments (Suppressed): 13
tests/connectors/test_sql.py (1)
- 51-58: The `_build_query` method seems to be building a SQL query string directly from input parameters. This can lead to SQL injection attacks if the inputs are not properly sanitized. Consider using parameterized queries or ORM tools to prevent this.
pandasai/connectors/sql.py (1)
- 57-57: The `__del__` method is not a reliable way to close database connections because it's not guaranteed to be called when the object is garbage collected. It would be better to use a context manager or explicitly close the connection when you're done with it.
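For reference, a minimal sketch of the context-manager approach with SQLAlchemy (assuming SQLAlchemy 1.4+; the in-memory SQLite URL is just a placeholder):

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")

# The connection is closed when the block exits, even on exceptions,
# which __del__ cannot guarantee
with engine.connect() as connection:
    result = connection.execute(text("SELECT 1"))
    print(result.scalar())  # 1
```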
tests/test_codemanager.py (2)
- 71-72: The reference to the `CodeManager` instance has been updated from `smart_dataframe._dl._code_manager` to `smart_dataframe.lake._code_manager`. Ensure that the `lake` attribute of `SmartDataframe` correctly points to an instance of `SmartDatalake`, which contains the `_code_manager` attribute.
- 174-179: The test case is checking for the presence of "pd", "plt", and "np" in the environment. However, it seems like the assertion for the dataframe equality with the environment has been removed. If this was intentional and the dataframe check is no longer relevant, then this change is fine. Otherwise, consider adding it back.
  ```diff
  + assert smart_dataframe.equals(code_manager._get_environment()["dfs"][0])
    assert "pd" in code_manager._get_environment()
    assert "plt" in code_manager._get_environment()
    assert "np" in code_manager._get_environment()
  ```
pandasai/helpers/code_manager.py (9)
- 1-11: The import of `defaultdict` from the `collections` module and the import of `AssignmentVisitor` and `CallVisitor` from `.node_visitors` are new. Ensure that these modules exist and are accessible from this location.
- 15-21: The addition of `Generator` to the imported types from the `typing` module is new. Ensure that it's used appropriately in the code.
- 28-45: The addition of the `_ast_comparatos_map` and `_current_code_executed` attributes is new. Make sure they are used correctly throughout the class.
- 158-180: The new method `_required_dfs` is introduced to determine the required DataFrames for executing the code. This is a good approach to avoid unnecessary execution of connectors when not needed.
- 200-206: The assignment of `self._current_code_executed = code` is new. Ensure that this attribute is used correctly in the subsequent code.
- 225-235: The introduction of the `_required_dfs` method in the `execute_code` method is a good practice as it helps to avoid unnecessary execution of connectors when not needed.
- 242-263: The new methods `_get_samples` and `_get_originals` are introduced to get samples and original dataframes respectively. These methods seem to be well implemented.
- 295-302: The removal of "dfs" from the environment dictionary might affect the code execution if "dfs" is referenced in the executed code. Please verify this change.
- 467-688: The addition of several utility methods and properties such as `_get_nearest_func_call`, `_tokenize_operand`, `_get_df_id_by_nearest_assignment`, `_extract_comparisons`, `_extract_filters` and the property `middlewares` are new. These additions seem to enhance the functionality of the `CodeManager` class by providing more granular control over code execution, error handling, and AST node processing. However, ensure that these new methods and properties are used correctly in the class.
Review Status
Actionable comments generated: 16
Files selected for processing (4)
- pandasai/connectors/sql.py (1 hunks)
- pandasai/smart_dataframe/__init__.py (8 hunks)
- tests/connectors/test_yahoo_finance.py (1 hunks)
- tests/test_codemanager.py (3 hunks)
Files skipped from review due to trivial changes (1)
- tests/connectors/test_yahoo_finance.py
Additional comments (Suppressed): 6
tests/test_codemanager.py (3)
- 69-75: The fixture `code_manager` has been updated to reflect changes in the structure of the `SmartDataframe` class. Ensure that all tests using this fixture have been updated accordingly.
- 174-179: The assertion `assert smart_dataframe.equals(code_manager._get_environment()["dfs"][0])` has been removed. If this is intentional and the test case still covers the intended functionality, then it's fine. Otherwise, consider adding it back or replacing it with an equivalent assertion.
- 289-504: These new tests are designed to verify the extraction of filters from code snippets. They seem comprehensive, covering different scenarios including multiple dataframes and non-default dataframe names. However, ensure that these tests cover all possible edge cases and that they fail when expected (i.e., they aren't false positives).
pandasai/connectors/sql.py (1)
- 76-78: This regular expression only allows alphanumeric characters and underscores in column names. This might be too restrictive for some databases that allow other special characters in column names. Please verify this against your target SQL dialects.
pandasai/smart_dataframe/__init__.py (2)
- 25-28: The `functools.cached_property` import is not used in the new hunks. Please ensure that it's necessary and remove it if it's not.
- 355-364: The `load_connector` method is delegating the call to `self._core.load_connector`, but there's no check to ensure that `self._core` is not None. This could lead to a `NoneType` error if `self._core` is None. Consider adding a check to ensure that `self._core` is not None before delegating the call.
```python
_engine = None
_connection: int = None
_rows_count: int = None
_columns_count: int = None
_cache_interval: int = 600  # 10 minutes

def __init__(self, config: Union[ConnectorConfig, dict], cache_interval: int = 600):
    """
    Initialize the SQL connector with the given configuration.

    Args:
        config (ConnectorConfig): The configuration for the SQL connector.
    """
    config = ConnectorConfig(**config)
    super().__init__(config)

    if config.dialect is None:
        raise Exception("SQL dialect must be specified")

    if config.driver:
        self._engine = create_engine(
            f"{config.dialect}+{config.driver}://{config.username}:{config.password}"
            f"@{config.host}:{str(config.port)}/{config.database}"
        )
    else:
        self._engine = create_engine(
            f"{config.dialect}://{config.username}:{config.password}@{config.host}"
            f":{str(config.port)}/{config.database}"
        )
    self._connection = self._engine.connect()
```
The password is being included in the connection string, which could lead to potential security issues if logs are not properly sanitized. Consider using a more secure method of passing credentials to the `create_engine` function.
```diff
- f"{config.dialect}+{config.driver}://{config.username}:{config.password}"
- f"@{config.host}:{str(config.port)}/{config.database}"
+ f"{config.dialect}+{config.driver}://",
+ username=config.username,
+ password=config.password,
+ host=config.host,
+ port=str(config.port),
+ database=config.database
```
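One way to avoid interpolating credentials into the URL string, assuming SQLAlchemy 1.4+, is `URL.create`, which handles escaping of special characters. The connection details below are placeholders:

```python
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

url = URL.create(
    drivername="postgresql+psycopg2",  # dialect+driver, as in the config above
    username="user",
    password="s3cr3t/with@special:chars",  # escaped safely by URL.create
    host="localhost",
    port=5432,
    database="mydb",
)
engine = create_engine(url)
```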
```python
def _build_query(self, limit=None, order=None):
    base_query = select("*").select_from(text(self._config.table))
    valid_operators = ["=", ">", "<", ">=", "<=", "LIKE", "!=", "IN", "NOT IN"]

    if self._config.where or self._additional_filters:
        # conditions is the list of where + additional filters
        conditions = []
        if self._config.where:
            conditions += self._config.where
        if self._additional_filters:
            conditions += self._additional_filters

        query_params = {}
        condition_strings = []

        for i, condition in enumerate(conditions):
            if len(condition) == 3:
                column_name, operator, value = condition
                if operator in valid_operators:
                    self._validate_column_name(column_name)

                    condition_strings.append(f"{column_name} {operator} :value_{i}")
                    query_params[f"value_{i}"] = value

        if condition_strings:
            where_clause = " AND ".join(condition_strings)
            base_query = base_query.where(
                text(where_clause).bindparams(**query_params)
            )

    if order:
        base_query = base_query.order_by(asc(text(order)))

    if limit:
        base_query = base_query.limit(limit)

    return base_query
```
The `_build_query` method constructs a SQL query by concatenating strings, which can potentially lead to SQL injection attacks if user-supplied input is not properly sanitized. Although you're validating column names with a regex, it's still safer to use parameterized queries or ORM tools to construct your SQL queries.
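As a point of comparison, here is a sketch of building the same kind of filtered query entirely through SQLAlchemy column objects, so neither column names nor values are concatenated into the SQL string. The table and columns are hypothetical stand-ins for the connector's configuration:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
loans = Table(
    "loans",
    metadata,
    Column("loan_id", Integer),
    Column("loan_status", String),
)
metadata.create_all(engine)

# Column objects and bound parameters replace string concatenation entirely
query = select(loans).where(loans.c.loan_status == "PAIDOFF").limit(5)
with engine.connect() as conn:
    rows = conn.execute(query).fetchall()
```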
```python
filename = self._get_cache_path(
    include_additional_filters=include_additional_filters
)
if not os.path.exists(filename):
    return False

# If the file is older than the cache interval, delete it
if os.path.getmtime(filename) < time.time() - self._cache_interval:
    if self.logger:
        self.logger.log(f"Deleting expired cached data from {filename}")
    os.remove(filename)
    return False
```
The cache invalidation strategy based on time interval might not work well if the underlying data changes frequently within the cache interval. Consider using a more sophisticated invalidation strategy, such as listening for changes in the underlying data.
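A sketch of one such content-aware strategy: fingerprint the table and invalidate the cache when the fingerprint changes. The `updated_at` column is an assumption about the schema, not something the connector requires, and the table name is assumed to be a validated identifier:

```python
import hashlib

from sqlalchemy import text

def data_fingerprint(connection, table: str) -> str:
    # Row count plus the latest modification timestamp; if either changes,
    # the cached extract is stale regardless of its age
    row = connection.execute(
        text(f"SELECT COUNT(*), MAX(updated_at) FROM {table}")
    ).fetchone()
    return hashlib.sha256(str(tuple(row)).encode()).hexdigest()
```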
```python
# Run a SQL query to get the number of rows
query = sql.text(
    "SELECT COUNT(*) FROM information_schema.columns "
    "WHERE table_name = :table_name"
).bindparams(table_name=self._config.table)

# Return the number of rows
self._rows_count = self._connection.execute(query).fetchone()[0]
return self._rows_count
```
The query used to get the number of rows in the table seems incorrect. It's querying the `information_schema.columns` table, which will return the number of columns, not rows. The correct query should be something like `SELECT COUNT(*) FROM {table_name}`.
```diff
- query = sql.text(
-     "SELECT COUNT(*) FROM information_schema.columns "
-     "WHERE table_name = :table_name"
- ).bindparams(table_name=self._config.table)
+ query = sql.text(f"SELECT COUNT(*) FROM {self._config.table}")
```
```python
# Run a SQL query to get the number of columns
query = sql.text(
    "SELECT COUNT(*) FROM information_schema.columns "
    f"WHERE table_name = '{self._config.table}'"
)

# Return the number of columns
self._columns_count = self._connection.execute(query).fetchone()[0]
return self._columns_count
```
Similar to the previous comment, the query used to get the number of columns in the table seems incorrect. It's querying the `information_schema.columns` table without specifying the database name, which will return the number of columns across all databases, not just the specified one. The correct query should be something like `SELECT COUNT(*) FROM information_schema.columns WHERE table_schema = {database_name} AND table_name = {table_name}`.
```diff
- query = sql.text(
-     "SELECT COUNT(*) FROM information_schema.columns "
-     f"WHERE table_name = '{self._config.table}'"
- )
+ query = sql.text(
+     "SELECT COUNT(*) FROM information_schema.columns "
+     f"WHERE table_schema = '{self._config.database}' AND table_name = '{self._config.table}'"
+ )
```
```python
def connector(self):
    return self._connector

@connector.setter
def connector(self, connector: BaseConnector):
    self._connector = connector
```
The `connector` property setter is not validating the input before assigning it to `self._connector`. This could lead to unexpected behavior if an invalid connector is passed. Consider adding validation logic to ensure that `connector` is a valid instance of `BaseConnector` before assigning it to `self._connector`.
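A minimal sketch of the guard the reviewer is asking for, with a stand-in `BaseConnector` and a made-up `Core` class (not the project's actual implementation):

```python
class BaseConnector:
    """Stand-in for pandasai.connectors.base.BaseConnector."""

class Core:
    def __init__(self):
        self._connector = None

    @property
    def connector(self):
        return self._connector

    @connector.setter
    def connector(self, connector):
        # Fail fast instead of silently storing an arbitrary object
        if not isinstance(connector, BaseConnector):
            raise TypeError(
                f"Expected a BaseConnector, got {type(connector).__name__}"
            )
        self._connector = connector
```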
pandasai/smart_dataframe/__init__.py
Outdated
```diff
         *middlewares: A list of middlewares

     """
-    self._dl.add_middlewares(*middlewares)
+    self.lake.add_middlewares(*middlewares)
```
The `add_middlewares` method is delegating the call to `self.lake.add_middlewares`, but there's no check to ensure that `self.lake` is not None. This could lead to a `NoneType` error if `self.lake` is None. Consider adding a check to ensure that `self.lake` is not None before calling its `add_middlewares` method.
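A sketch of the suggested guard, using stand-in class names; the same pattern would apply to the `chat` and `load_connector` delegations flagged elsewhere in this review:

```python
class SmartDataframeSketch:
    """Illustrates guarding a delegated call against an unset attribute."""

    def __init__(self, lake=None):
        self.lake = lake

    def add_middlewares(self, *middlewares):
        if self.lake is None:
            raise RuntimeError("No datalake attached to this dataframe")
        self.lake.add_middlewares(*middlewares)
```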
```diff
     Raises:
         ValueError: If the query is empty
     """
-    return self._dl.chat(query)
+    return self.lake.chat(query)
```
The `chat` method is delegating the call to `self.lake.chat`, but there's no check to ensure that `self.lake` is not None. This could lead to a `NoneType` error if `self.lake` is None. Consider adding a check to ensure that `self.lake` is not None before calling its `chat` method.
```diff
 def column_hash(self) -> str:
-    columns_str = "".join(self._df.columns)
+    """
+    Get the hash of the columns of the dataframe.
+
+    Returns:
+        str: Hash of the columns of the dataframe
+    """
+    if not self._core._df_loaded and self.connector:
+        return self.connector.column_hash
+
+    columns_str = "".join(self.dataframe.columns)
     hash_object = hashlib.sha256(columns_str.encode())
     return hash_object.hexdigest()
```
The `column_hash` method is not handling the case where `self.connector` is None, which could lead to an AttributeError when trying to access `self.connector.column_hash`. Consider adding a check to ensure that `self.connector` is not None before accessing its `column_hash` attribute.
```python
def save(self, name: str = None):
    """
    Saves the dataframe configuration to be used for later

    Args:
        name (str, optional): Name of the dataframe configuration. Defaults to None.
    """

    config_manager = DfConfigManager(self)
    config_manager.save(name)
```
The `save` method is creating a new instance of `DfConfigManager` every time it's called. If creating a `DfConfigManager` is expensive in terms of resources, consider reusing the same instance across multiple calls to `save`.
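If construction cost ever matters, `functools.cached_property` (an import this PR already adds elsewhere) would build the manager once per instance. A sketch with a stub manager standing in for `DfConfigManager`:

```python
from functools import cached_property

class DfConfigManagerStub:
    """Stand-in for pandasai.helpers.df_config_manager.DfConfigManager."""

    def __init__(self, sdf):
        self._sdf = sdf

    def save(self, name=None):
        print(f"saving config as {name!r}")

class SmartDataframeSketch:
    @cached_property
    def _config_manager(self):
        # Created on first access, reused by every later save()
        return DfConfigManagerStub(self)

    def save(self, name=None):
        self._config_manager.save(name)
```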
I'm looking forward to the new release!
When is it going to be released? Excited for it!
* refactor: `to_dict()` parameters
  * (refactor): update behaviour of `to_dict()` method, make it possible to accept `into` and `as_series` parameters (the latter is for polars dataframes)
  * (tests): add tests for casting the dataframe to a dictionary, add tests for passing parameters in the proxy-call to the dataframe's `to_dict()` method
* docs: docstrings for `DataframeAbstract`
  * (docs): add docstrings for proxy-methods in `DataframeAbstract` class
* chore: type hint issue
  * (chore): update return type for `df_type` function
Review Status
Actionable comments generated: 2
Files selected for processing (3)
- pandasai/helpers/df_info.py (1 hunks)
- pandasai/smart_dataframe/abstract_df.py (1 hunks)
- tests/test_smartdataframe.py (13 hunks)
Additional comments (Suppressed): 11
pandasai/helpers/df_info.py (1)
- 14-17: The function signature has been changed from returning `str` to `Union[str, None]`. Ensure that all calls to this function throughout the codebase can handle a `None` return value.
pandasai/smart_dataframe/abstract_df.py (1)
- 1-4: The class `DataframeAbstract` now inherits from `ABC` (Abstract Base Class). This is a good practice for defining abstract base classes in Python, as it provides a mechanism to define methods that must be created within any child classes built from the abstract class.
tests/test_smartdataframe.py (9)
- 108-119: The new fixture `smart_dataframe_mocked_df` is a good addition as it allows for testing of the `SmartDataframe` with a mocked dataframe. This can help isolate tests and make them more reliable.
- 130-137: The test `test_init` has been updated to reflect changes in the `SmartDataframe` class. The assertions now check for `_table_name` and `_table_description` instead of `_name` and `_description`. Also, `engine` and `dataframe` are checked instead of `_engine` and `_df`. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 223-262: The new tests `test_to_dict` and `test_to_dict_passing_parameters` are well written and cover important functionality of the `to_dict` method. They ensure that the method returns a dictionary with expected keys and that it correctly passes parameters based on the engine type.
- 380-386: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [380-413]
  The test `test_error_prompt_replacement` has been updated to reflect changes in the `SmartDataframe` class. The dataframe is now initialized with actual data and the `lake._retry_run_code` method is called instead of `_dl._retry_run_code`. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 420-432: The test `test_logs` has been updated to reflect changes in the `SmartDataframe` class. The `lake.logger.log` method is called instead of `_dl._logger.log`. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 439-453: The test `test_updates_verbose_config_with_setters` has been updated to reflect changes in the `SmartDataframe` class. The `lake._logger.verbose` property is checked instead of `_dl._logger.verbose`. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 460-471: The test `test_updates_save_logs_config_with_setters` has been updated to reflect changes in the `SmartDataframe` class. The `lake._logger.save_logs` property is checked instead of `_dl._logger.save_logs`. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 477-487: The test `test_updates_enable_cache_config_with_setters` has been updated to reflect changes in the `SmartDataframe` class. The `lake.enable_cache` property and `lake._cache` object are checked instead of `_dl.enable_cache` and `_dl._cache`. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 537-557: The tests for loading dataframes (`test_load_dataframe_from_list_of_dicts`, `test_load_dataframe_from_dict`, `test_load_dataframe_from_pandas_dataframe`, `test_load_dataframe_from_saved_dfs`, `test_load_dataframe_from_other_dataframe_type`) have been updated to reflect changes in the `SmartDataframe` class. The `_load_dataframe` method is called instead of `_load_df` and the `dataframe` property is checked instead of `_df`. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
```python
_engine: str
```
The `_engine` attribute has been added to the class. It seems to be used to differentiate between different DataFrame implementations (e.g., pandas, polars). However, there's no default value provided for this attribute. It would be better to provide a default value or at least a comment explaining what values are expected for this attribute.
```diff
- _engine: str
+ _engine: str = None  # Expected values: 'pandas', 'polars'
```
```python
def to_dict(self, orient="dict", into=dict, as_series=True):
    """
    A proxy-call to the dataframe's `.to_dict()`.
    """
    if self._engine == "pandas":
        return self.dataframe.to_dict(orient=orient, into=into)
    elif self._engine == "polars":
        return self.dataframe.to_dict(as_series=as_series)
    raise RuntimeError(
        f"{self.__class__} object has unknown engine type. "
        f"Possible engines: 'pandas', 'polars'. Actual '{self._engine}'."
    )
```
The `to_dict` method has been updated to handle different DataFrame engines. This is a good approach to make the class more flexible and adaptable to different DataFrame implementations. However, the error message in the `RuntimeError` could be improved by including the actual engine type.
```diff
- raise RuntimeError(
-     f"{self.__class__} object has unknown engine type. "
-     f"Possible engines: 'pandas', 'polars'. Actual '{self._engine}'."
- )
+ raise RuntimeError(f"Unknown engine type '{self._engine}' for {self.__class__}. "
+                    "Expected engine types are 'pandas' or 'polars'.")
```
Overall, the changes made in this PR seem to improve the flexibility and adaptability of the `DataframeAbstract` class. The addition of the `_engine` attribute allows the class to handle different DataFrame implementations, and the use of proxy-calls to DataFrame methods ensures that the class can work with any DataFrame that implements these methods. The use of `ABC` is also a good practice for defining abstract base classes in Python.
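To make the dispatch concrete, here is a runnable sketch of the same proxy pattern over a plain pandas frame (`EngineDispatchSketch` is a made-up name, not the PR's class):

```python
import pandas as pd

class EngineDispatchSketch:
    def __init__(self, df):
        self.dataframe = df
        self._engine = "pandas" if isinstance(df, pd.DataFrame) else "polars"

    def to_dict(self, orient="dict", into=dict, as_series=True):
        # Forward only the parameters the underlying engine understands
        if self._engine == "pandas":
            return self.dataframe.to_dict(orient=orient, into=into)
        return self.dataframe.to_dict(as_series=as_series)

print(EngineDispatchSketch(pd.DataFrame({"a": [1, 2]})).to_dict(orient="list"))
# {'a': [1, 2]}
```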
Review Status
Actionable comments generated: 7
Files ignored due to filter (1)
- pyproject.toml
Files selected for processing (10)
- pandasai/helpers/code_manager.py (9 hunks)
- pandasai/helpers/memory.py (1 hunks)
- pandasai/prompts/base.py (1 hunks)
- pandasai/prompts/generate_python_code.py (2 hunks)
- pandasai/smart_dataframe/__init__.py (8 hunks)
- pandasai/smart_datalake/__init__.py (16 hunks)
- tests/prompts/test_correct_error_prompt.py (2 hunks)
- tests/prompts/test_generate_python_code_prompt.py (2 hunks)
- tests/test_smartdataframe.py (14 hunks)
- tests/test_smartdatalake.py (4 hunks)
Files skipped from review due to trivial changes (3)
- pandasai/prompts/base.py
- pandasai/prompts/generate_python_code.py
- tests/prompts/test_generate_python_code_prompt.py
Additional comments (Suppressed): 50
tests/prompts/test_correct_error_prompt.py (1)
- 15-27: The `FakeLLM` and `SmartDataframe` objects are created but not used in the test. If they are necessary for the test, please ensure that they are utilized correctly. Otherwise, consider removing them to reduce unnecessary complexity.
tests/test_smartdatalake.py (4)
- 72-77: The fixture `smart_datalake` has been updated to use the new attribute `lake` instead of the old `datalake`. Ensure that all references to this fixture in the test suite have been updated accordingly.
- 118-124: The middleware setup and assertion have been updated to use the new `lake` attribute. This change is consistent with the update in the `smart_datalake` fixture.
- 131-142: The mock method `_get_sample_head` returns a DataFrame directly, replacing the previous `_get_head_csv` which returned a string representation of a CSV. This change simplifies the code and makes it more efficient by avoiding unnecessary conversions between formats.
- 149-164: The format of `last_prompt` has been updated to include additional information about the DataFrame(s) being used. This change improves the clarity of the prompt for users. However, ensure that this new format is correctly handled wherever `last_prompt` is used.
pandasai/smart_datalake/__init__.py (10)
76-83: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
92-96: Similar to the previous comment, the logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
129-133: Again, the logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
189-192: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
267-268: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
280-281: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
303-309: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
345-346: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
400-402: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
464-473: The logger is now directly assigned to the instance instead of being stored in a private variable. This change should be verified across the codebase to ensure that all references to the logger have been updated accordingly.
pandasai/helpers/code_manager.py (4)
- 1-11: The import statements are well organized and necessary for the functionality of the code. The addition of `collections` and `node_visitors` is justified by their usage in the new hunks.
- 15-21: The addition of `Generator` from the `typing` module is justified as it's used in the function `_tokenize_operand`.
- 28-45: The addition of the `_ast_comparatos_map` and `_current_code_executed` attributes to the class seems to be necessary for the new functionality added to the class. However, please ensure that these attributes are properly initialized and cleaned up where necessary.
- 295-302: The removal of "dfs" from the environment dictionary is justified as it's replaced by the use of the `_get_samples` method in the `execute_code` method.
tests/test_smartdataframe.py (13)
- 2-8: The import of the `collections` module is not used in the new hunks. Please ensure that it's used elsewhere in the code or remove it if it's unnecessary.
  ```diff
  - from collections import defaultdict
  ```
- 108-119: The fixture `smart_dataframe_mocked_df` has been added. Ensure that this fixture is being used in the tests and that it provides the necessary mock setup for the tests.
- 130-137: The tests `test_init` and `test_init_without_llm` have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 181-217: The expected prompt in the test `test_last_prompt_with_privacy_enforced` has been updated. Make sure that this change reflects the actual output of the function under test.
- 221-264: The test `test_to_dict_passing_parameters` has been added. This test checks whether the correct parameters are passed to the `to_dict` method depending on the engine type. Ensure that this test covers all possible scenarios and edge cases.
- 381-387: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [381-414]
  The test `test_retry_run_code_with_custom_prompt` has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 421-433: The test `test_log` has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 438-454: The test `test_updates_verbose_config_with_setters` has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 460-474: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [460-488]
  The test `test_updates_enable_cache_config_with_setters` has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 490-491: The test `test_updates_configs_with_setters` has been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 538-559: The tests `test_load_dataframe_from_list_of_dicts`, `test_load_dataframe_from_dict`, `test_load_dataframe_from_pandas_dataframe`, and `test_load_dataframe_from_saved_dfs` have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 575-591: The tests `test_load_dataframe_from_other_dataframe_type` and `test_import_csv_file` have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
- 637-647: The tests `test_import_invalid_file` and `test_import_pandas_series` have been updated to reflect changes in the class properties. Ensure that these changes are consistent with the updates made to the `SmartDataframe` class.
pandasai/smart_dataframe/__init__.py (18)
- 22-28: The import of `cached_property` from `functools` is new but it's not used anywhere in the provided hunks. Please ensure that it's used somewhere else in the code or remove it if it's unnecessary.
- 41-59: The class `SmartDataframeCore` has been introduced which seems to handle the core functionality of loading and managing dataframes. This is a good practice as it separates concerns and makes the code more maintainable.
- 65-93: The `_load_dataframe` method has been refactored to handle different types of inputs including connectors, strings (file paths), pandas Series, lists, and dictionaries. The error handling for invalid input data is also well implemented.
- 41-95: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [94-115]
  The `_import_from_file` method is missing in the new hunk. If this method is defined elsewhere in the code, please ensure that it correctly handles file imports and raises an appropriate error for invalid file formats.
- 129-151: The `_validate_and_convert_dataframe` method is a good addition as it encapsulates the logic for validating and converting the dataframe input into a single function. This improves the readability and maintainability of the code.
- 153-174: The methods `load_connector` and `_unload_connector` are well implemented to manage the loading and unloading of connectors. Good use of instance variables to track the state of the loaded dataframe.
- 176-192: The getter and setter for the `dataframe` property have been well implemented. They handle the loading and unloading of the connector when necessary and validate and convert the dataframe before setting it.
- 207-220: The properties `engine`, `connector`, and `has_connector` are well implemented and provide a clean interface for accessing these attributes.
- 223-305: The `SmartDataframe` class has been refactored to use the `SmartDataframeCore` for handling the core functionality. This is a good practice as it separates concerns and makes the code more maintainable.
- 313-317: The `add_middlewares` method now delegates to the `lake` attribute instead of the previously used `_dl` attribute. Please ensure that the `lake` attribute is correctly initialized and has the `add_middlewares` method.
- 325-328: The `chat` method now delegates to the `lake` attribute instead of the previously used `_dl` attribute. Please ensure that the `lake` attribute is correctly initialized and has the `chat` method.
- 330-342: The `column_hash` method has been updated to handle the case where the dataframe is not loaded and a connector is present. This is a good improvement as it ensures that the column hash can still be obtained in this scenario.
- 344-354: The `save` method remains unchanged and continues to use the `DfConfigManager` for saving the dataframe configuration. This is consistent with the previous implementation.
- 355-364: The `load_connector` method now delegates to the `_core` attribute. This is consistent with the new design of using `SmartDataframeCore` for handling the core functionality.
- 365-403: The `_truncate_head_columns` method is a new addition and it seems to handle the truncation of column values well for both pandas and polars dataframes. However, please ensure that the `polars` package is installed if it's required by your project.
- 405-419: The `_get_sample_head` method has been updated to handle the case where the dataframe is not loaded and a connector is present. This is a good improvement as it ensures that a sample head can still be obtained in this scenario.
- 325-423: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [421-427]
  The `_load_from_config` method remains unchanged and continues to use the `DfConfigManager` for loading the saved dataframe configuration. This is consistent with the previous implementation.
- 429-445: The properties `dataframe`, `engine`, `connector`, and the setter for `connector` now delegate to the `_core` attribute. This is consistent with the new design of using `SmartDataframeCore` for handling the core functionality.
```python
prompt.to_string()
== """
```
The method `to_string()` is called on the `prompt` object, but it seems like this should be `__str__()`, based on the old hunk and the description of the test (`test_str_with_args`). Please verify if this is a typo or an intended change.
```diff
- prompt.to_string()
+ str(prompt)
```
```diff
 def get_conversation(self, limit: int = 1) -> str:
     return "\n".join(
         [
-            f"{'User' if message['is_user'] else 'Bot'}: {message['message']}"
-            for message in self._messages[-limit:]
+            f"{f'User {i+1}' if message['is_user'] else f'Assistant {i}'}: "
+            f"{message['message']}"
+            for i, message in enumerate(self._messages[-limit:])
         ]
     )
```
The new implementation of the `get_conversation` method introduces numbering for the User and Assistant messages. However, there's a potential issue with the numbering logic. The enumeration starts from 0, so the first user message will be "User 1" but the first assistant message will be "Assistant 0". To keep consistency, both should start from 1.
- f"{f'User {i+1}' if message['is_user'] else f'Assistant {i}'}: "
+ f"{f'User {i+1}' if message['is_user'] else f'Assistant {i+1}'}: "
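A standalone sketch of the off-by-one (the message dicts mirror the shape used above):

```python
messages = [
    {"is_user": False, "message": "How can I help?"},
    {"is_user": True, "message": "Show sales by region"},
]

# Current logic: the user label uses i + 1 but the assistant label uses i
for i, message in enumerate(messages):
    label = f"User {i+1}" if message["is_user"] else f"Assistant {i}"
    print(f"{label}: {message['message']}")
# Output:
#   Assistant 0: How can I help?
#   User 2: Show sales by region
# With the suggested fix (i + 1 for both), the labels become
# "Assistant 1" and "User 2".
```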
```python
        if not use_error_correction_framework:
            raise exc

    def _required_dfs(self, code: str) -> List[str]:
        """
        List the index of the DataFrames that are needed to execute the code.
        The goal is to avoid running the connectors if the code does not
        need them.

        Args:
            code (str): Python code to execute

        Returns:
            List[int]: A list of the index of the DataFrames that are needed
                to execute the code.
        """

        required_dfs = []
        for i, df in enumerate(self._dfs):
            if f"dfs[{i}]" in code:
                required_dfs.append(df)
            else:
                required_dfs.append(None)
        return required_dfs

    def execute_code(
```
The addition of the `_required_dfs` method is a good improvement, as it helps avoid unnecessary execution of connectors when they are not needed by the code. This can potentially improve performance.
```python
        on the generated code.

        """
        self._current_code_executed = code

        for middleware in self._middlewares:
            code = middleware(code)
```
Assigning `self._current_code_executed = code` before the middleware loop is a good practice, as it allows the currently executing code to be tracked and used by other methods.
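A rough sketch of why the ordering matters: the raw generated code is snapshotted before middlewares mutate it, so later steps (such as the filter extraction shown further below) can parse exactly what the LLM produced. The middleware here is hypothetical:

```python
def append_cleanup(code: str) -> str:
    # Hypothetical middleware: appends a cleanup call to the generated code
    return code + "\nplt.close('all')"


middlewares = [append_cleanup]

code = "plt.plot(dfs[0]['sales'])"
current_code_executed = code  # snapshot taken before any middleware runs
for middleware in middlewares:
    code = middleware(code)

assert current_code_executed == "plt.plot(dfs[0]['sales'])"
assert code.endswith("plt.close('all')")
```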
````python
        ```"""
        )

        # List the required dfs, so we can avoid running the connectors
        # if the code does not need them
        dfs = self._required_dfs(code_to_run)
        environment: dict = self._get_environment()
        environment["dfs"] = self._get_samples(dfs)

        caught_error = self._execute_catching_errors(code_to_run, environment)
        if caught_error is not None:
````
Using `_required_dfs` in the `execute_code` method is likewise a good improvement, as it avoids unnecessary execution of connectors when the code does not need them and can potentially improve performance.
```diff
     analyze_data = environment.get("analyze_data", None)

-    return analyze_data(self._get_original_dfs())
+    return analyze_data(self._get_originals(dfs))
+
+def _get_samples(self, dfs):
+    """
+    Get samples from the dfs
+
+    Args:
+        dfs (list): List of dfs
+
+    Returns:
+        list: List of samples
+    """
+    samples = []
+    for df in dfs:
+        if df is not None:
+            samples.append(df.head_df)
+        else:
+            samples.append(None)
+    return samples

-def _get_original_dfs(self):
-    dfs = []
-    for df in self._dfs:
-        if df.engine == "polars":
-            dfs.append(df.original.to_pandas())
-        else:
-            dfs.append(df.original)
-
-    return dfs
+def _get_originals(self, dfs):
+    """
+    Get original dfs
+
+    Args:
+        dfs (list): List of dfs
+
+    Returns:
+        list: List of dfs
+    """
+    original_dfs = []
+    for index, df in enumerate(dfs):
+        if df is None:
+            original_dfs.append(None)
+            continue
+
+        if df.has_connector:
+            extracted_filters = self._extract_filters(self._current_code_executed)
+            filters = extracted_filters.get(f"dfs[{index}]", [])
+            df.connector.set_additional_filters(filters)
+            df.load_connector(temporary=len(filters) > 0)
+
+        original_dfs.append(df.dataframe)
+
+    return original_dfs
```
The addition of the `_get_samples` and `_get_originals` methods is a good improvement, as it provides a clean way to obtain samples and original dataframes from the list of dataframes. This can potentially improve the readability and maintainability of the code.
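Roughly, the two helpers split one pass over the df list into a cheap phase and an expensive one: `_get_samples` feeds small heads into the dry run, and `_get_originals` loads full data only for the final execution. A simplified sketch with stub objects (the connector-filtering branch is omitted):

```python
class StubDf:
    """Hypothetical stand-in exposing just head_df and dataframe."""

    def __init__(self, head, full):
        self.head_df = head
        self.dataframe = full


def get_samples(dfs):
    # Position-preserving: None stays None, so indices keep lining up
    return [df.head_df if df is not None else None for df in dfs]


def get_originals(dfs):
    return [df.dataframe if df is not None else None for df in dfs]


dfs = [StubDf(head=[1, 2], full=[1, 2, 3, 4]), None]
print(get_samples(dfs))    # [[1, 2], None]
print(get_originals(dfs))  # [[1, 2, 3, 4], None]
```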
```python
        }
        """
        comparisons = defaultdict(list)
        current_df = "dfs[0]"

        visitor = AssignmentVisitor()
        visitor.visit(tree)
        assignments = visitor.assignment_nodes

        call_visitor = CallVisitor()
        call_visitor.visit(tree)
        calls = call_visitor.call_nodes

        for node in ast.walk(tree):
            if isinstance(node, ast.Compare):
                is_call_on_left = isinstance(node.left, ast.Call)
                is_polars = False
                is_calling_col = False
                try:
                    is_polars = node.left.func.value.id in ("pl", "polars")
                    is_calling_col = node.left.func.attr == "col"
                except AttributeError:
                    pass

                if is_call_on_left and is_polars and is_calling_col:
                    df_name = self._get_nearest_func_call(
                        node.lineno, calls, "filter"
                    ).func.value.id
                    current_df = self._get_df_id_by_nearest_assignment(
                        node.lineno, assignments, df_name
                    )
                    left_str = node.left.args[0].value

                    for op, right in zip(node.ops, node.comparators):
                        op_str = self._ast_comparatos_map.get(type(op), "Unknown")
                        right_str = right.value

                        comparisons[current_df].append((left_str, op_str, right_str))
                elif isinstance(node.left, ast.Subscript):
                    name, *slices = self._tokenize_operand(node.left)
                    current_df = (
                        self._get_df_id_by_nearest_assignment(
                            node.lineno, assignments, name
                        )
                        or current_df
                    )
                    left_str = name if not slices else slices[-1]

                    for op, right in zip(node.ops, node.comparators):
                        op_str = self._ast_comparatos_map.get(type(op), "Unknown")
                        name, *slices = self._tokenize_operand(right)
                        right_str = name if not slices else slices[-1]

                        comparisons[current_df].append((left_str, op_str, right_str))
        return comparisons

    def _extract_filters(self, code) -> dict[str, list]:
        """
        Extract filters to be applied to the dataframe from passed code.

        Args:
            code (str): A snippet of code to be parsed.

        Returns:
            dict: The dictionary containing all filters parsed from
                the passed code. The dictionary has the following structure:
                {
                    "<df_number>": [
                        ("<left_operand>", "<operator>", "<right_operand>")
                    ]
                }

        Raises:
            SyntaxError: If the code is unable to be parsed by `ast.parse()`.
            Exception: If any exception is raised during working with nodes
                of the code tree.
        """
        try:
            parsed_tree = ast.parse(code)
        except SyntaxError:
            self._logger.log(
                "Invalid code passed for extracting filters", level=logging.ERROR
            )
            self._logger.log(f"{traceback.format_exc()}", level=logging.DEBUG)
            raise

        try:
            filters = self._extract_comparisons(parsed_tree)
        except Exception:
            self._logger.log(
                "Unable to extract filters for passed code", level=logging.ERROR
            )
            self._logger.log(f"{traceback.format_exc()}", level=logging.DEBUG)
            raise

        return filters

    @property
    def middlewares(self):
        return self._middlewares
```
The addition of several utility methods and properties (`_get_nearest_func_call`, `_tokenize_operand`, `_get_df_id_by_nearest_assignment`, `_extract_comparisons`, `_extract_filters`) is a good improvement, as it provides a way to extract filters from code, tokenize operands, and resolve a dataframe's id from its nearest assignment. This can potentially improve the readability and maintainability of the code.
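As a worked example of the output shape described in the `_extract_filters` docstring, here is a heavily simplified AST-based extractor (Python 3.9+). Unlike the real implementation, it ignores assignments and polars `filter()` calls and attributes everything to `dfs[0]`:

```python
import ast
from collections import defaultdict

OPS = {ast.Eq: "==", ast.NotEq: "!=", ast.Lt: "<",
       ast.LtE: "<=", ast.Gt: ">", ast.GtE: ">="}


def extract_filters_sketch(code: str) -> dict:
    filters = defaultdict(list)
    for node in ast.walk(ast.parse(code)):
        # Match the pandas-style pattern: df['col'] <op> <constant>
        if isinstance(node, ast.Compare) and isinstance(node.left, ast.Subscript):
            col = node.left.slice
            if isinstance(col, ast.Constant):
                for op, right in zip(node.ops, node.comparators):
                    if isinstance(right, ast.Constant):
                        filters["dfs[0]"].append(
                            (col.value, OPS.get(type(op), "Unknown"), right.value)
                        )
    return dict(filters)


print(extract_filters_sketch("filtered = df[df['loan_status'] == 'PAIDOFF']"))
# {'dfs[0]': [('loan_status', '==', 'PAIDOFF')]}
```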
Summary by CodeRabbit

- New `SQLConnector`, `MySQLConnector`, and `PostgreSQLConnector` classes for connecting to SQL databases.
- New `YahooFinanceConnector` class for retrieving stock data from Yahoo Finance.
- Refactored the `SmartDataframe` class to improve modularity and maintainability. Introduced a new `SmartDataframeCore` class that handles the loading and management of the dataframe.
- Extended the `CodeManager` class with additional methods for better code execution and filter extraction.
- New tests for the `SQLConnector`, `CodeManager`, and `SmartDataframe` classes.