-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add output_type
parameter (#519)
#562
feat: add output_type
parameter (#519)
#562
Conversation
* (feat): update prompt template in `GeneratePythonCodePrompt`, add `output_type_hint` variable to be interpolated * (feat): update `.chat()` method for `SmartDataframe` and `SmartDatalake`, add optional `output_type` parameter * (feat): add `get_output_type_hint()` in `GeneratePythonCodePrompt` class * (feat): add "output_type_hint" to `default_values` when forming prompt's template context * (tests): update tests in `TestGeneratePythonCodePrompt` * (tests): add tests for checking `output_type` interpotaion to a prompt
* (refactor): update setting value for `{output_type_hint}` in prompt class
There are some tests failed also, will check it ASAP. |
* (tests): fix error in `TestGeneratePythonCodePrompt` with confused actual prompt's content and excepted prompt's content (which led to tests being failed) * (refactor): update test method `test_str_with_args()` with `parametrize` decorator, remove duplication of code (DRY)
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. @@ Coverage Diff @@
## main #562 +/- ##
==========================================
+ Coverage 83.42% 83.66% +0.23%
==========================================
Files 52 54 +2
Lines 2594 2699 +105
==========================================
+ Hits 2164 2258 +94
- Misses 430 441 +11
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Tests are passed. I'd like to keep this in WIP state for a while to complete this option:
|
* (tests): parametrizing `test_run_passing_output_type()` with different output types.
* (refactor): move output types and their hints to a separate python module * (feat): validation for inappropriate output type and value * (tests): update tests accordingly, add a new test method to check if the message about incorrect output type is added to logs
@gventuri looking forward for your feedback :) Refactored the code a bit.
I couldn't manage to create the correction logic yet, unfortunately; i'd like to add this within next PR's if it's possible. Sounds like it should be an addition to existing Error Correction Framework. |
WalkthroughThis update introduces a flexible output type handling system in the pandasai library. It adds an output type factory function and several output type classes, allowing users to specify the desired output type for data analysis results. The changes also include updates to the Changes
TipsChat with CodeRabbit Bot (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 7
Files selected for processing (8)
- pandasai/helpers/output_types/init.py (1 hunks)
- pandasai/helpers/output_types/_output_types.py (1 hunks)
- pandasai/llm/huggingface_text_gen.py (3 hunks)
- pandasai/prompts/generate_python_code.py (2 hunks)
- pandasai/smart_dataframe/init.py (1 hunks)
- pandasai/smart_datalake/init.py (9 hunks)
- tests/prompts/test_generate_python_code_prompt.py (3 hunks)
- tests/test_smartdataframe.py (2 hunks)
Files skipped from review due to trivial changes (1)
- tests/prompts/test_generate_python_code_prompt.py
Additional comments (Suppressed): 20
pandasai/helpers/output_types/__init__.py (1)
- 11-16: The
output_types_map
dictionary is a good way to map the string representation of the output type to the corresponding class. This makes it easy to add or remove output types in the future.tests/test_smartdataframe.py (3)
14-21: The new imports from
pandasai.helpers.output_types
are correctly added to support the new output type functionality. Theoutput_type_factory
andoutput_types_map
will be used in the tests to create instances of output types and validate their behavior.226-282: This test case (
test_run_passing_output_type
) is checking if theoutput_type
parameter is correctly passed to thechat()
method and if it affects the prompt as expected. It's a good practice to ensure that the new feature works as intended.284-316: The test case
test_run_incorrect_output_type_returned
checks whether an error is logged when the returned output type does not match the expected output type. This is a good way to ensure that the system behaves correctly when encountering unexpected values.pandasai/helpers/output_types/_output_types.py (5)
30-64: The
validate
method in theBaseOutputType
class is well implemented. It checks both the type and value of the result, providing detailed error messages if either check fails. This will be very helpful for debugging.67-81: The
NumberOutputType
class correctly validates that the actual value is a number (int, float, or Decimal). Good use of Python's built-inisinstance
function to check the data type.83-97: The
DataFrameOutputType
class correctly validates that the actual value is a DataFrame (either pandas or polars). This is a good way to ensure that the output matches the expected format.120-134: The
StringOutputType
class correctly validates that the actual value is a string. This is a straightforward and effective way to ensure that the output matches the expected format.136-160: The
DefaultOutputType
class does not perform any validation on the type or value of the result. This is appropriate given its role as a fallback when no specific output type is specified.pandasai/llm/huggingface_text_gen.py (3)
14-20: No change in the code logic or functionality. The only difference is in the formatting of the
temperature
attribute's default value, which has been changed from1E-3
to1e-3
. Both represent the same value and this change does not affect the program execution.29-35: No change in the code logic or functionality. The only difference is in the formatting of the
for
loop where parentheses aroundkey, val
have been removed. This is a style preference and does not affect the program execution.76-81: No change in the code logic or functionality. The code remains identical to the previous version.
pandasai/prompts/generate_python_code.py (1)
- 60-66: The new hunk has replaced the hardcoded output types with a variable
output_type_hint
. This is a good change as it makes the code more flexible and adaptable to different output types. However, ensure that theoutput_type_hint
variable is properly initialized and assigned a valid value before being used here.pandasai/smart_dataframe/__init__.py (1)
- 318-339: The function signature of
chat
has been changed to include an optionaloutput_type
parameter. Ensure that all calls to this function throughout the codebase have been updated to match the new signature, or can handle the new optional parameter.pandasai/smart_datalake/__init__.py (6)
27-30: The import of
output_type_factory
is a new addition to the code. Ensure that this function is correctly implemented and tested, as it will be used to create instances of output type classes based on the providedoutput_type
parameter.207-211: The method signature for
_get_prompt()
has been updated to include anoutput_type_hint
parameter. This change should be reflected wherever this method is called in the codebase.227-234: The instantiation of the
prompt
object now includes theoutput_type_hint
parameter. Make sure that all prompt classes can handle this new parameter.262-283: The
chat()
method now accepts anoutput_type
parameter. This is a significant change and needs to be thoroughly tested. Also, ensure that all calls tochat()
throughout the codebase have been updated to match the new signature.296-297: The
output_type_helper
is created using theoutput_type_factory()
. Ensure that the factory function correctly handles all possible values ofoutput_type
.314-318: The
output_type_hint
is now passed to thegenerate_code()
method of the LLM. Ensure that the LLM can handle this new hint and generates appropriate code based on it.
class PlotOutputType(BaseOutputType): | ||
@property | ||
def template_hint(self): | ||
return """- type (must be "plot") | ||
- value (must be a string containing the path of the plot image)""" | ||
|
||
@property | ||
def name(self): | ||
return "plot" | ||
|
||
def _validate_value(self, actual_value: Any) -> bool: | ||
if not isinstance(actual_value, str): | ||
return False | ||
|
||
path_to_plot_pattern = r"^(\/[\w.-]+)+(/[\w.-]+)*$|^[^\s/]+(/[\w.-]+)*$" | ||
if re.match(path_to_plot_pattern, actual_value): | ||
return True | ||
|
||
return False | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PlotOutputType
class uses a regular expression to validate that the actual value is a string representing a valid file path. This is a good approach, but it might be worth considering whether there are any edge cases that this regex doesn't cover. For example, it doesn't seem to handle spaces in file paths.
Updated code: | ||
""" # noqa: E501 | ||
|
||
def __init__(self): | ||
def __init__(self, **kwargs): | ||
default_import = "import pandas as pd" | ||
engine_df_name = "pd.DataFrame" | ||
output_type_hint = kwargs.pop("output_type_hint") | ||
|
||
self.set_var("default_import", default_import) | ||
self.set_var("engine_df_name", engine_df_name) | ||
self.set_var("output_type_hint", output_type_hint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The __init__()
method now accepts keyword arguments and extracts the output_type_hint
from them. This is a good practice as it allows for greater flexibility when initializing instances of this class. However, there's no error handling in case output_type_hint
is not provided in kwargs
. It would be better to provide a default value or raise an appropriate exception if it's missing.
- output_type_hint = kwargs.pop("output_type_hint")
+ output_type_hint = kwargs.pop("output_type_hint", None)
+ if output_type_hint is None:
+ raise ValueError("`output_type_hint` must be provided.")
Args: | ||
query (str): Query to run on the dataframe | ||
output_type (Optional[str]): Add a hint for LLM of which | ||
type should be returned by `analyze_data()` in generated | ||
code. Possible values: "number", "dataframe", "plot", "string": | ||
* number - specifies that user expects to get a number | ||
as a response object | ||
* dataframe - specifies that user expects to get | ||
pandas/polars dataframe as a response object | ||
* plot - specifies that user expects LLM to build | ||
a plot | ||
* string - specifies that user expects to get text | ||
as a response object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output_type
parameter is a string that can take one of four values: "number", "dataframe", "plot", "string". It might be better to use an Enum for this instead of a string. This would make it easier to manage and validate these options, and would also provide better autocompletion support in IDEs.
- output_type: Optional[str] = None
+ output_type: Optional[OutputType] = None
Where OutputType
is an Enum:
from enum import Enum
class OutputType(Enum):
NUMBER = "number"
DATAFRAME = "dataframe"
PLOT = "plot"
STRING = "string"
Raises: | ||
ValueError: If the query is empty | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no validation for the output_type
parameter. If an invalid value is passed, it could lead to unexpected behavior or errors. Consider adding validation to check if the provided output_type
is one of the expected values.
if output_type and output_type not in OutputType.values():
raise ValueError(f"Invalid output type: {output_type}. Expected one of {OutputType.values()}")
pandasai/smart_datalake/__init__.py
Outdated
self._config.callback.on_code(code) | ||
return code | ||
|
||
# def _check_output_type(self, result, excepted_output_type): | ||
# actual_output_type = result.get('type') | ||
# if actual_output_type != excepted_output_type: | ||
# self.logger.log( | ||
# message=f"Inappropriate type of the result returned by LLM!\n" | ||
# f"Actual type: '{actual_output_type}'\n" | ||
# f"Expected type: '{excepted_output_type}'" | ||
# ) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are commented lines of code. If these lines are not needed, consider removing them to maintain code cleanliness.
* (chore): remove unused lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 3
Files selected for processing (1)
- pandasai/smart_datalake/init.py (8 hunks)
Additional comments (Suppressed): 5
pandasai/smart_datalake/__init__.py (5)
27-30: The import of
output_type_factory
from..helpers.output_types
is new. Ensure that the module and function exist and are accessible from this location.210-211: The method signature for
_get_prompt()
has been updated to include a new parameteroutput_type_hint
. Make sure all calls to this method have been updated accordingly.262-283: The
chat()
method now accepts an optionaloutput_type
argument. This is a significant change as it allows users to specify the expected output type of the query. Ensure that this change does not break existing functionality whereoutput_type
is not provided.296-297: The
output_type_factory
function is used to create an instance of an output type helper based on the providedoutput_type
argument. Ensure that this function correctly handles all possible values ofoutput_type
, includingNone
.369-373: The result of the query is now validated against the expected output type using the
validate()
method of the output type helper. This is a good practice as it ensures that the result matches the user's expectations. However, ensure that thevalidate()
method correctly handles all possible types and values of results.
pandasai/smart_datalake/__init__.py
Outdated
output_type_hint=output_type_helper.template_hint, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output_type_hint
is now passed to the _get_prompt()
method. This change assumes that all prompts will accept this argument. If there are any prompts that do not accept this argument, this will cause an error.
- default_values=default_values,
+ default_values=default_values,
+ output_type_hint=output_type_helper.template_hint,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job, only a couple of minor notes!
return "dataframe" | ||
|
||
def _validate_value(self, actual_value: Any) -> bool: | ||
if isinstance(actual_value, (pd.DataFrame, pl.DataFrame)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather use df_config.df_type like this:
if df_type(actual_value) is not None:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also allows you not having to import polars and pandas!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent, since this releases the module from having excessive imports of third party.
Done
tests/test_smartdataframe.py
Outdated
- value (must be a number) | ||
Example output: { "type": "text", "value": "The average loan amount is $15,000." } | ||
{output_type_hint} | ||
Example output: {{ "type": "text", "value": "The average loan amount is $15,000." }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe even the example output might change based on the type so that we don't risk hallucinating the LLM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Positive. Really great notice, this might have turned a dull confusion for the AI. I must just lost sight of that
Done
pandasai/smart_datalake/__init__.py
Outdated
@@ -206,6 +207,7 @@ def _get_prompt( | |||
key: str, | |||
default_prompt: Type[Prompt], | |||
default_values: Optional[dict] = None, | |||
output_type_hint: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would pass it as part of default values in the generate python code
prompt only, so that this is accessible both on the default prompt and in the custom prompts. This might not be needed in every prompt in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, you're right, it makes the additional context to be in one place default_values
.
Apologize for spreading the setting logic like this, just we've got rly complicated implementation in ._get_prompt()
as for me. As i've mentioned: some variables're being set in __init__()
, the rest are being set in _get_prompt()
itself, when iterating over default_values
. That was kind a tricky moment when i'd had output types mapping inside prompt class before.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nautics889 yeah, totally, it's super involved, we'll definitely need to simplify that a little bit in the future 😄
* (refactor): pass `output_type_hint` in the `default_values`, don't bother with setting this template variable in prompt's `__init__()` * (fix): correct templates examples, add them as a part of `output_type_hint` * (refactor): use `df_type()` (utility functions) to get rid from imports of third packages in _output_types.py * (refactor): add logging to the factory-function `output_type_factory()` to enhance verbosity for behaviour inspection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Files selected for processing (6)
- pandasai/helpers/output_types/init.py (1 hunks)
- pandasai/helpers/output_types/_output_types.py (1 hunks)
- pandasai/prompts/generate_python_code.py (2 hunks)
- pandasai/smart_datalake/init.py (6 hunks)
- tests/prompts/test_generate_python_code_prompt.py (3 hunks)
- tests/test_smartdataframe.py (3 hunks)
Files skipped from review due to trivial changes (2)
- pandasai/prompts/generate_python_code.py
- tests/prompts/test_generate_python_code_prompt.py
Additional comments (Suppressed): 19
pandasai/helpers/output_types/__init__.py (4)
14-19: The
output_types_map
dictionary is a good way to map string identifiers to their corresponding classes. This makes it easy to add or remove output types in the future without having to modify the factory function.22-52: The
output_type_factory
function is well designed and follows the Factory Method pattern, which provides an interface for creating objects in a superclass, but allows subclasses to alter the type of objects that will be created. This design pattern is particularly useful when a class can't anticipate the class of objects it must create. Here, the factory method is used to create an instance of the appropriate output type based on theoutput_type
parameter.53-60: Good use of logging to warn the user when an unknown
output_type
is passed. This helps in debugging and understanding any issues related to incorrectoutput_type
values.62-69: The factory function correctly defaults to
DefaultOutputType
when the providedoutput_type
is not found in theoutput_types_map
. This is a good fallback mechanism.tests/test_smartdataframe.py (4)
14-24: The new imports
DefaultOutputType
,output_types_map
, andoutput_type_factory
frompandasai.helpers.output_types
are necessary for the new functionality of specifying output types. Ensure that these classes and functions are implemented correctly.210-222: The docstring has been updated to reflect the new possible values for the 'type' key in the returned dictionary. The previous value "text" has been replaced with "string", which is more accurate. Also, examples for each type have been provided, which improves clarity.
233-288: This test case checks whether the
output_type_hint
is correctly interpolated into the prompt template when differentoutput_type
values are passed to thechat()
method. It's a good practice to test such dynamic behavior.290-322: This test case checks if a warning log is generated when there's a mismatch between the user-specified output type and the actual output type returned by the code generated by LLM. This is an important test to ensure that the system behaves as expected in case of such mismatches.
pandasai/helpers/output_types/_output_types.py (6)
1-4: The imports are well organized and only necessary modules are imported. Good job on keeping the import section clean.
9-64: The
BaseOutputType
class is well designed with abstract methods and properties that enforce a consistent interface for all subclasses. This will make it easier to add new output types in the future.66-81: The
NumberOutputType
class correctly validates whether the actual value is of type int, float, or Decimal. This ensures that the output type matches the expected type.83-96: The
DataFrameOutputType
class uses thedf_type
function to validate if the actual value is a pandas DataFrame. However, thedf_type
function is not defined in this file. Make sure it's imported from the correct module and works as expected.120-134: The
StringOutputType
class correctly validates whether the actual value is a string. This ensures that the output type matches the expected type.48-63: The
validate
method in theBaseOutputType
class correctly checks both the type and value of the result, and logs any mismatches. This is a good practice for debugging and error tracking.pandasai/smart_datalake/__init__.py (5)
24-30: The new import statement for
output_type_factory
is correctly placed in alphabetical order. This factory function will be used to create instances of different output types based on the providedoutput_type
parameter.255-276: The method signature for
chat()
has been updated to include an optionaloutput_type
parameter. This parameter allows users to specify the expected output type of the query result. The docstring has been updated accordingly to explain the purpose and possible values of this parameter. Ensure that all calls to this function throughout the codebase have been updated to match the new signature.289-293: An instance of the appropriate output type helper is created using the
output_type_factory
function. This helper will be used later to validate the output type of the query result. The logger is passed as a parameter to the factory function, which is good for consistency in logging across different parts of the application.303-304: The
output_type_hint
is added to the context dictionary. This hint will be interpolated into the prompt template when generating Python code from the query. This is a good way to guide the language model towards generating code that produces the expected output type.361-367: The output type of the query result is validated using the
validate()
method of the output type helper. If the validation fails, a warning log is generated with the validation logs. This is a good practice as it ensures that the output type matches the user's expectation and provides useful feedback in case of a mismatch.
@gventuri |
output_type
parameter (#519)output_type
parameter (#519)
@nautics889 thanks a lot for the great effort, looking forward to playing around with this feature (which IMHO is a super interesting one). I've merged on the |
This PR supposed to implement a feature described in #519.
This PR adds a new parameter called
output_type
to be passed in.chat()
methods. See original requirements for details.GeneratePythonCodePrompt
, addoutput_type_hint
variable to be interpolated.chat()
method forSmartDataframe
andSmartDatalake
, add optionaloutput_type
parameterget_output_type_hint()
inGeneratePythonCodePrompt
classdefault_values
when forming prompt's template contextTestGeneratePythonCodePrompt
output_type
interpotaion to a promptouput_type
toGeneratePythonCodePrompt
Summary by CodeRabbit
pandasai
package. Users can now specify the expected output type when interacting with theSmartDataFrame
andSmartDataLake
classes, improving the predictability of results.huggingface_text_gen.py
for better readability.