
feat : Airtable connector Support #635

Merged

Conversation

Tanmaypatil123
Contributor

@Tanmaypatil123 Tanmaypatil123 commented Oct 10, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.

Implemented the Airtable connector with the help of the official docs.
The required parameters follow Airtable's new authentication policy: an authorization token is used for authentication, and a base ID and table name are required to connect and load data.

Summary by CodeRabbit

  • New Feature: Introduced the AirtableConnector to the pandasai library, enabling users to connect to Airtable databases and retrieve data directly into a Pandas DataFrame. This feature enhances data accessibility and manipulation.
  • New Feature: Added caching functionality to the AirtableConnector to improve performance by storing and reusing data.
  • Documentation: Updated the connectors.md file to include instructions and examples on how to use the new AirtableConnector.
  • Test: Added a comprehensive suite of unit tests for the AirtableConnector to ensure its functionality and reliability.
  • New Feature: Enhanced error handling with the introduction of InvalidRequestError and APIKeyNotFoundError exceptions.

@coderabbitai
Contributor

coderabbitai bot commented Oct 10, 2023

Walkthrough

This pull request introduces the AirtableConnector class to the pandasai library, enabling users to connect to Airtable databases, fetch data, and perform operations on the data. The connector uses the requests library for HTTP requests and returns results as a Pandas DataFrame. The PR also includes unit tests for the new class and updates the documentation with usage examples.

Changes

File Summary
pandasai/connectors/... Introduced AirtableConnector and AirtableConnectorConfig classes for connecting to Airtable databases, fetching data, and performing operations.
tests/connectors/... Added unit tests for the AirtableConnector class, verifying the functionality of its methods and properties.
docs/connectors.md Updated documentation with information about the new Airtable connector and usage examples.
examples/from_airtable.py Added an example script demonstrating how to use the AirtableConnector class.
pandasai/exceptions.py Introduced new exception classes for improved error handling.

"In the land of code, changes are afoot,
A new connector, isn't that good? 🎉
With Airtable in sight, and data to boot,
Our pandasai library, gets a new recruit! 🐼📚"



@codecov-commenter

codecov-commenter commented Oct 10, 2023

Codecov Report

Merging #635 (f6955eb) into main (720811f) will decrease coverage by 0.84%.
Report is 5 commits behind head on main.
The diff coverage is 66.66%.


@@            Coverage Diff             @@
##             main     #635      +/-   ##
==========================================
- Coverage   84.43%   83.59%   -0.84%     
==========================================
  Files          65       67       +2     
  Lines        3019     3188     +169     
==========================================
+ Hits         2549     2665     +116     
- Misses        470      523      +53     
Files Coverage Δ
pandasai/connectors/__init__.py 100.00% <100.00%> (ø)
pandasai/connectors/base.py 87.64% <100.00%> (+0.58%) ⬆️
pandasai/exceptions.py 88.88% <100.00%> (+4.27%) ⬆️
pandasai/helpers/openai_info.py 80.00% <ø> (ø)
pandasai/connectors/airtable.py 65.00% <65.00%> (ø)

... and 10 files with indirect coverage changes


Contributor

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 9

Files that changed from the base of the PR between 720811f and 197740b.
Files selected for processing (5)
  • pandasai/connectors/__init__.py (2 hunks)
  • pandasai/connectors/airtable.py (1 hunks)
  • pandasai/connectors/base.py (1 hunks)
  • pandasai/helpers/openai_info.py (2 hunks)
  • tests/connectors/test_airtable.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • pandasai/connectors/__init__.py
Additional comments (Suppressed): 2
pandasai/connectors/base.py (1)
  • 23-30: The AirtableConnectorConfig class is introduced to hold the configuration for the Airtable connector. Ensure that the token and baseID are securely stored and not logged or exposed in any way. Also, consider adding input validation or type checking for these fields to prevent misuse or misconfiguration.
pandasai/helpers/openai_info.py (1)
  • 45-53: No changes have been made to the function signature or the docstring of get_openai_token_cost_for_model(). Ensure that the function's usage elsewhere in the codebase is still valid.
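The input validation the bot asks for can be sketched without pydantic (the real `AirtableConnectorConfig` is a pydantic model; this plain-Python stand-in is only illustrative, including the masked `repr` that keeps the token out of logs):

```python
class AirtableConfigSketch:
    """Illustrative config with basic validation and a masked repr,
    so the token is neither accepted empty nor leaked into logs."""

    def __init__(self, token: str, baseID: str, table: str):
        fields = {"token": token, "baseID": baseID, "table": table}
        for name, value in fields.items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        self.token = token
        self.baseID = baseID
        self.table = table

    def __repr__(self):
        # Mask the credential so tracebacks and logs never expose it.
        return (
            f"AirtableConfigSketch(token='***', "
            f"baseID={self.baseID!r}, table={self.table!r})"
        )
```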

Comment on lines +19 to +20
self.assertEqual(self.connector._config, self.config)
self.assertEqual(self.connector._root_url, self.root_url)
Contributor

The test test_constructor_and_properties only checks if the _config and _root_url properties are set correctly. It would be beneficial to also verify if other properties like token, baseID, and table are correctly set in the AirtableConnector instance.

],
"offset": "itrowYGFfoBEIob3C/recsxnHUagIce7nB2",
}
print(expected_data_json)
Contributor

@coderabbitai coderabbitai bot Oct 10, 2023

The print statement seems to be a leftover from debugging and should be removed to keep the test output clean.

-        print(expected_data_json)

Collaborator

@Tanmaypatil123 this is a leftover!

Contributor

Looks like humans are talking. I'll be quiet for now. If you need me again, just tag @coderabbitai in a new comment.

Comment on lines 80 to 84
self.connector.head = Mock(return_value=expected_data)
head_data = self.connector.head()

self.assertEqual(head_data.equals(expected_data), True)
self.assertLessEqual(len(head_data), 5)
Contributor

Similar to the test_execute method, in the test_head method, you are mocking the head method of the connector and then calling it. This doesn't test the actual head method implementation. Instead, you should mock the requests library's method that head uses to make the API call, and then check if head returns the expected data.
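The pattern the bot describes can be sketched as follows. The connector class here is a simplified stand-in, not pandasai's actual implementation; the point is that only the `requests.get` call is patched, so the real parsing logic is exercised:

```python
import unittest
from unittest.mock import Mock, patch

import requests


class MiniAirtableConnector:
    """Simplified stand-in that fetches records over HTTP."""

    def __init__(self, api_key, base_id, table):
        self._url = f"https://api.airtable.com/v0/{base_id}/{table}"
        self._headers = {"Authorization": f"Bearer {api_key}"}

    def execute(self):
        response = requests.get(self._url, headers=self._headers)
        return [record["fields"] for record in response.json()["records"]]


class TestExecute(unittest.TestCase):
    @patch("requests.get")
    def test_execute_parses_records(self, mock_get):
        # Patch the HTTP layer, not the method under test, so the
        # record-parsing code in execute() actually runs.
        mock_get.return_value = Mock(
            status_code=200,
            json=Mock(return_value={"records": [{"fields": {"Name": "Report"}}]}),
        )
        connector = MiniAirtableConnector("key", "app123", "tasks")
        self.assertEqual(connector.execute(), [{"Name": "Report"}])
        mock_get.assert_called_once()
```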

pandasai/helpers/openai_info.py (comment resolved)
pandasai/connectors/airtable.py (comment outdated, resolved)
Comment on lines 38 to 54
    def _init_connection(self, config: BaseConnectorConfig):
        """
        make connection to database
        """
        config = config.dict()
        _session = requests.Session()
        _session.headers = {"Authorization": f"Bearer {config['token']}"}
        url = f"{self._root_url}{config['baseID']}/{config['table']}"
        response = _session.head(url=url)
        if response.status_code == 200:
            self._session = _session
        else:
            raise ValueError(
                f"""Failed to connect to Airtable.
                Status code: {response.status_code},
                message: {response.text}"""
            )
Contributor

The _init_connection method is making a HEAD request to the Airtable API to check the connection. This is not a common practice and might lead to unnecessary network traffic. Consider removing this check or replacing it with a more lightweight method, if available in the Airtable API.

    def _init_connection(self, config: BaseConnectorConfig):
        """
        make connection to database
        """
        config = config.dict()
        _session = requests.Session()
        _session.headers = {"Authorization": f"Bearer {config['token']}"}
        url = f"{self._root_url}{config['baseID']}/{config['table']}"
        response = _session.head(url=url)
        if response.status_code == 200:
            self._session = _session
        else:
            raise ValueError(
                f"""Failed to connect to Airtable. 
                Status code: {response.status_code}, 
                message: {response.text}"""
            )

pandasai/connectors/airtable.py (comment outdated, resolved)
Comment on lines 83 to 92
    def head(self):
        """
        Return the head of the table that
        the connector is connected to.

        Returns:
            DataFrameType: The head of the data source
            that the connector is connected to.
        """
        return self._response.head()
Contributor

The head method is returning the head of the _response DataFrame. However, it's not checking if _response is None or empty. This could lead to an AttributeError if head is called before execute. Consider adding a check for this.

    def head(self):
        """
        Return the head of the table that
        the connector is connected to.

        Returns:
            DataFrameType: The head of the data source
            that the connector is connected to.
        """
        if self._response is not None:
            return self._response.head()
        else:
            raise ValueError("No data available. Execute the connector first.")

Contributor

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Files that changed from the base of the PR between 197740b and 826aa02.
Files selected for processing (1)
  • tests/connectors/test_airtable.py (1 hunks)
Additional comments (Suppressed): 1
tests/connectors/test_airtable.py (1)
  • 11-13: The dict() method is used on the AirtableConnectorConfig instance, but it's not clear if this method is defined in the AirtableConnectorConfig class. If it's not defined, this will raise an AttributeError. Please verify.

Comment on lines 48 to 50
self.connector.execute = Mock(return_value=expected_data)
execute_data = self.connector.execute()
self.assertEqual(execute_data.equals(expected_data), True)
Contributor

The execute method is mocked to return expected_data, and then it's called and the returned data is compared with expected_data. This test will always pass and doesn't test the actual functionality of the execute method. Consider mocking the requests library's methods instead, and then test if execute returns the correct data.

Comment on lines 79 to 81
self.connector.head = Mock(return_value=expected_data)
head_data = self.connector.head()

Contributor

@coderabbitai coderabbitai bot Oct 10, 2023

Similar to the execute method, the head method is mocked to return expected_data, and then it's called and the returned data is compared with expected_data. This test will always pass and doesn't test the actual functionality of the head method. Consider mocking the requests library's methods instead, and then test if head returns the correct data.

Collaborator

@Tanmaypatil123 mocking the method that you are testing is a bad practice, as it always returns the value you are forcing it to return.

Contributor

Looks like humans are talking. I'll be quiet for now. If you need me again, just tag @coderabbitai in a new comment.

Collaborator

@gventuri gventuri left a comment

Let's try to increase the test coverage, add an example (from_airtable.py), and also add it to the connectors.md file.


    def __init__(
        self,
        config: Optional[Union[AirtableConnectorConfig, dict]] = None,
Collaborator

Let's add a cache mechanism similar to the other connectors!

Collaborator

@Tanmaypatil123 check out yahoo finance connector for references
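As a rough sketch of the kind of file-based cache being requested (file layout, JSON format, and interval are illustrative assumptions, not the actual pandasai or Yahoo Finance connector implementation):

```python
import json
import os
import tempfile
import time


class FileCache:
    """Illustrative file cache: reuse stored JSON if younger than interval."""

    def __init__(self, cache_dir, interval_seconds=600):
        self._dir = cache_dir
        self._interval = interval_seconds
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, name):
        return os.path.join(self._dir, f"{name}.json")

    def get(self, name, fetch):
        """Return cached data when fresh; otherwise call fetch() and store it."""
        path = self._path(name)
        if os.path.exists(path) and time.time() - os.path.getmtime(path) < self._interval:
            with open(path) as f:
                return json.load(f)
        data = fetch()
        with open(path, "w") as f:
            json.dump(data, f)
        return data


# Usage: the second call is served from disk, so fetch() runs only once.
cache = FileCache(tempfile.mkdtemp())
calls = []
fetch = lambda: calls.append(1) or {"rows": 3}
first = cache.get("table", fetch)
second = cache.get("table", fetch)
```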

elif not config:
airtable_env_vars = {
"token": "AIRTABLE_AUTH_TOKEN",
"baseID": "AIRTABLE_BASE_ID",
Collaborator

Could we make baseID snake case (base_id)?

make connection to database
"""
config = config.dict()
_session = requests.Session()
Collaborator

Why is _session underscored?

message: {_response.text}"""
)
return self._response

Collaborator

Let's also add a def fallback_name, similarly to all the other connectors.

Example:

    @property
    def fallback_name(self):
        """
        Return the fallback name of the connector.

        Returns:
            str: The fallback name of the connector.
        """
        return self._config.table

            DataFrameType: The head of the data source
            that the connector is connected to.
        """
        return self._response.head()
Collaborator

The head method should be independent. What happens if a _response hasn't been defined yet?


elif not config:
airtable_env_vars = {
"token": "AIRTABLE_AUTH_TOKEN",
Collaborator

Let's rename to api_key to be consistent with the variable names in the LLMs.

Connector configuration for Airtable data.
"""

token: str
Collaborator

api_key

"""

token: str
baseID: str
Collaborator

base_id

@gventuri
Collaborator

Additionally:

  • the config should accept a param "where" similar to all the other connectors
  • we should either handle pagination or see how many records we can get max (check out the maxRecords param)
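Airtable's list-records endpoint pages results via an `offset` token returned with each page (and accepts a `maxRecords` cap). A dependency-free sketch of the loop, with the HTTP call abstracted into a `get_page` callable so it can stand in for a real `requests.get` with the offset passed as a query parameter:

```python
def fetch_all_records(get_page, max_records=None):
    """Follow Airtable-style pagination: each page may carry an 'offset'
    token that must be echoed back to retrieve the next page.

    get_page(offset) -> dict with "records" and, while more pages
    remain, an "offset" key.
    """
    records, offset = [], None
    while True:
        page = get_page(offset)
        records.extend(page["records"])
        if max_records is not None and len(records) >= max_records:
            return records[:max_records]
        offset = page.get("offset")
        if offset is None:
            return records


# Fake two-page API for illustration.
pages = {
    None: {"records": [{"id": "rec1"}, {"id": "rec2"}], "offset": "tok"},
    "tok": {"records": [{"id": "rec3"}]},
}
all_records = fetch_all_records(lambda off: pages[off])
```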

@gventuri
Collaborator

Also, we should fall back to the env variables for base id and token, similarly to what we've done in the other connectors (i.e. mysql).
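A minimal sketch of that env-variable fallback (the variable and key names here are assumptions for illustration, not necessarily the final names used in the PR):

```python
import os


def load_airtable_config(config=None):
    """Fill missing connector settings from environment variables,
    mirroring the fallback pattern used by the other connectors."""
    config = dict(config or {})
    env_fallbacks = {
        "api_key": "AIRTABLE_API_KEY",
        "base_id": "AIRTABLE_BASE_ID",
        "table": "AIRTABLE_TABLE_NAME",
    }
    for key, env_var in env_fallbacks.items():
        if not config.get(key):
            config[key] = os.environ.get(env_var)
    missing = [k for k in env_fallbacks if not config.get(k)]
    if missing:
        raise ValueError(f"Missing Airtable settings: {', '.join(missing)}")
    return config
```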

@gventuri
Collaborator

Btw did you try it? It doesn't seem to work for me.

Contributor

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 11

Files that changed from the base of the PR between 826aa02 and cd78117.
Files selected for processing (5)
  • docs/connectors.md (1 hunks)
  • examples/from_airtable.py (1 hunks)
  • pandasai/connectors/airtable.py (1 hunks)
  • pandasai/connectors/base.py (1 hunks)
  • tests/connectors/test_airtable.py (1 hunks)
Additional comments (Suppressed): 8
pandasai/connectors/base.py (2)
  • 23-30: The new AirtableConnectorConfig class is introduced to handle the configuration for the Airtable connector. Ensure that the api_key, base_id, and database attributes are correctly used in the AirtableConnector class and that the api_key and base_id are properly secured and not exposed in any logs or error messages.

  • 33-35: The SQLBaseConnectorConfig class is unchanged. No action required.

examples/from_airtable.py (4)
  • 6-16: The AirtableConnector is initialized with a configuration dictionary. Ensure that the api_key, table, and base_id values are replaced with actual values before running the code. Also, the where clause is optional and can be used to filter the data. If not needed, it can be removed.

  • 19-19: The OpenAI object is initialized with an API key. Make sure to replace "OPENAI_API_KEY" with your actual OpenAI API key.

  • 20-20: The SmartDataframe is initialized with the AirtableConnector and a configuration dictionary. The llm key in the configuration dictionary is set to the OpenAI object. This is a good practice as it allows for easy configuration of different components.

  • 22-22: The chat method is called on the SmartDataframe object with a query string. This is a good practice as it allows for easy querying of the data. However, ensure that the query string is relevant to your data.

docs/connectors.md (2)
  • 187-189: The new hunk starts with a continuation of the previous section. Ensure that the context is clear for readers.

  • 193-194: The introduction of the Airtable connector is clear and concise, explaining its purpose and usage.

Comment on lines 95 to 103
    @patch("requests.get")
    def test_columns_count_property(self, mock_request_get):
        # Test columns_count property
        mock_request_get.return_value.json.return_value = json.loads(
            self.expected_data_json
        )
        mock_request_get.return_value.status_code = 200
        rows_count = self.connector.columns_count
        self.assertEqual(rows_count, 3)
Contributor

In the test_columns_count_property method, you are testing the columns_count property. This is a good practice. However, consider adding a test case for when the JSON response does not contain the expected data to ensure your code handles such scenarios correctly.
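A sketch of the defensive case the bot suggests, using a hypothetical parsing helper (not pandasai's actual code) that tolerates missing or empty `records`:

```python
def columns_count(payload):
    """Hypothetical helper: count columns from an Airtable-style payload,
    returning 0 instead of raising when 'records' is absent or empty."""
    records = payload.get("records") or []
    if not records:
        return 0
    return len(records[0].get("fields", {}))


# Well-formed and malformed payloads the suggested test would cover.
good = {"records": [{"fields": {"Name": "a", "Status": "x", "Due": "y"}}]}
empty = {"records": []}
missing = {}
```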

Comment on lines +198 to +218
from pandasai.connectors import AirtableConnector
from pandasai import SmartDataframe


airtable_connectors = AirtableConnector(
    config={
        "api_key": "AIRTABLE_API_TOKEN",
        "table": "AIRTABLE_TABLE_NAME",
        "base_id": "AIRTABLE_BASE_ID",
        "where": [
            # this is optional and filters the data to
            # reduce the size of the dataframe
            ["Status", "=", "In progress"]
        ]
    }
)

df = SmartDataframe(airtable_connectors)

df.chat("How many rows are there in data ?")
Contributor

The code snippet demonstrates the usage of the Airtable connector. It's clear and easy to understand. However, the api_key, table, and base_id are hardcoded. In a real-world scenario, these should be stored securely and not exposed in the code.

airtable_connectors = AirtableConnector(
    config={
-        "api_key": "AIRTABLE_API_TOKEN",
-        "table":"AIRTABLE_TABLE_NAME",
-        "base_id":"AIRTABLE_BASE_ID",
+        "api_key": os.getenv("AIRTABLE_API_TOKEN"),
+        "table": os.getenv("AIRTABLE_TABLE_NAME"),
+        "base_id": os.getenv("AIRTABLE_BASE_ID"),
        "where" : [
            # this is optional and filters the data to
            # reduce the size of the dataframe
            ["Status" ,"=","In progress"]
        ]
    }
)

This change uses the os.getenv function to retrieve the values from environment variables, which is a common practice for handling sensitive data like API keys.

Comment on lines 26 to 27
if config["api_key"] and config["base_id"] and config["table"]:
config = AirtableConnectorConfig(**config)
Contributor

The code assumes that if config is a dictionary, it will always contain the keys "api_key", "base_id", and "table". If any of these keys are missing, a KeyError will be raised. Consider adding error handling to check for the presence of these keys before accessing them.

- if config["api_key"] and config["base_id"] and config["table"]:
+ if "api_key" in config and "base_id" in config and "table" in config:

Comment on lines 51 to 64
url=url, headers={"Authorization": f"Bearer {config['api_key']}"}
)
if response.status_code == 200:
self.logger.log(
"""
Connected to Airtable.
"""
)
else:
raise ValueError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The code does not handle potential exceptions that could be raised by the requests.head call. Consider adding a try-except block to handle potential requests.exceptions.RequestException exceptions.

- response = requests.head(
-     url=url, headers={"Authorization": f"Bearer {config['api_key']}"}
- )
+ try:
+     response = requests.head(
+         url=url, headers={"Authorization": f"Bearer {config['api_key']}"}
+     )
+ except requests.exceptions.RequestException as e:
+     raise ValueError(f"Failed to connect to Airtable: {str(e)}")

Comment on lines 137 to 149
response = requests.get(
url=url, headers={"Authorization": f"Bearer {self._config.api_key}"}
)
if response.status_code == 200:
data = response.json()
data = self.preprocess(data=data)
self._save_cache(data)
else:
raise ValueError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The code does not handle potential exceptions that could be raised by the requests.get call. Consider adding a try-except block to handle potential requests.exceptions.RequestException exceptions.

- response = requests.get(
-     url=url, headers={"Authorization": f"Bearer {self._config.api_key}"}
- )
+ try:
+     response = requests.get(
+         url=url, headers={"Authorization": f"Bearer {self._config.api_key}"}
+     )
+ except requests.exceptions.RequestException as e:
+     raise ValueError(f"Failed to connect to Airtable: {str(e)}")

Comment on lines 172 to 183
response = requests.get(
url=url, headers={"Authorization": f"Bearer {self._config.api_key}"}
)
if response.status_code == 200:
data = response.json()
data = self.preprocess(data=data)
else:
raise ValueError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The code does not handle potential exceptions that could be raised by the requests.get call. Consider adding a try-except block to handle potential requests.exceptions.RequestException exceptions.

- response = requests.get(
-     url=url, headers={"Authorization": f"Bearer {self._config.api_key}"}
- )
+ try:
+     response = requests.get(
+         url=url, headers={"Authorization": f"Bearer {self._config.api_key}"}
+     )
+ except requests.exceptions.RequestException as e:
+     raise ValueError(f"Failed to connect to Airtable: {str(e)}")

Contributor

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 10

Files that changed from the base of the PR between cd78117 and 1ec1dd5.
Files selected for processing (5)
  • docs/connectors.md (1 hunks)
  • examples/from_airtable.py (1 hunks)
  • pandasai/connectors/airtable.py (1 hunks)
  • pandasai/connectors/base.py (1 hunks)
  • pandasai/exceptions.py (1 hunks)
Additional comments (Suppressed): 5
pandasai/exceptions.py (2)
  • 8-16: The new InvalidRequestError exception is introduced to handle unsuccessful requests. Ensure that this exception is properly caught and handled in the code where requests are made. Also, the docstring for the exception could be improved to provide more information about when and why this exception might be raised.

  • 18-20: The APIKeyNotFoundError exception has been moved down in the code. Ensure that this does not affect any imports or usage of this exception in other parts of the codebase.

pandasai/connectors/base.py (1)
  • 23-33: The AirtableConnectorConfig class is well defined with appropriate attributes. However, ensure that the api_key, base_id, and database are properly validated before use to prevent potential security issues or errors.
docs/connectors.md (1)
  • 187-189: The new hunk starts with a code snippet that seems to be a continuation from a previous section. Ensure that this transition is smooth and the context is clear to the reader.
pandasai/connectors/airtable.py (1)
  • 149-163: The code makes a GET request to the Airtable API to fetch data. However, it does not handle potential network errors raised during the request.

Comment on lines 6 to 16
airtable_connectors = AirtableConnector(
    config={
        "api_key": "AIRTABLE_API_TOKEN",
        "table": "AIRTABLE_TABLE_NAME",
        "base_id": "AIRTABLE_BASE_ID",
        "where": [
            # this is optional and filters the data to
            # reduce the size of the dataframe
            ["Status", "==", "In progress"]
        ],
    }
Contributor

The AirtableConnector is initialized with a hardcoded configuration. This might not be an issue if this is just an example, but in a production environment, sensitive data like api_key and base_id should be stored securely and not hardcoded. Consider using environment variables or a secure configuration management system.

}
)

llm = OpenAI("OPENAI_API_KEY")
Contributor

The OpenAI instance is initialized with a hardcoded API key. As with the AirtableConnector, this is a security concern. API keys should be stored securely and not hardcoded.

llm = OpenAI("OPENAI_API_KEY")
df = SmartDataframe(airtable_connectors, config={"llm": llm})

response = df.chat("How many rows are there in data ?")
Contributor

The chat method is called on the SmartDataframe object without any error handling. If an error occurs during the execution of this method, it could cause the program to crash. Consider wrapping this call in a try/except block to handle potential exceptions.

@@ -20,6 +20,17 @@ class BaseConnectorConfig(BaseModel):
where: list[list[str]] = None
Contributor

The where parameter is not used in the AirtableConnectorConfig class. If it's not needed, consider removing it to avoid confusion.

Comment on lines 34 to 36
class SQLBaseConnectorConfig(BaseConnectorConfig):
"""
Base Connector configuration.
Contributor

The SQLBaseConnectorConfig class seems to be duplicated. If it's not intended, consider removing it to maintain code cleanliness.

Comment on lines 193 to 218
The Airtable connector allows you to connect to Airtable Projects Tables, by simply passing the `base_id`, `api_key`, and `table_name` of the table you want to analyze.

To use the Airtable connector, you only need to import it into your Python code and pass it to a `SmartDataframe` or `SmartDatalake` object:

```python
from pandasai.connectors import AirtableConnector
from pandasai import SmartDataframe


airtable_connectors = AirtableConnector(
    config={
        "api_key": "AIRTABLE_API_TOKEN",
        "table": "AIRTABLE_TABLE_NAME",
        "base_id": "AIRTABLE_BASE_ID",
        "where": [
            # this is optional and filters the data to
            # reduce the size of the dataframe
            ["Status", "==", "In progress"]
        ]
    }
)

df = SmartDataframe(airtable_connectors)

df.chat("How many rows are there in data ?")
```
Contributor

The new hunk introduces the Airtable connector and provides a code snippet demonstrating its usage. The code seems to be correct and well-explained. However, the api_key, table, and base_id are hardcoded strings. It's recommended to use environment variables or secure storage for sensitive data like API keys.

airtable_connectors = AirtableConnector(
    config={
-        "api_key": "AIRTABLE_API_TOKEN",
-        "table":"AIRTABLE_TABLE_NAME",
-        "base_id":"AIRTABLE_BASE_ID",
+        "api_key": os.getenv("AIRTABLE_API_TOKEN"),
+        "table": os.getenv("AIRTABLE_TABLE_NAME"),
+        "base_id": os.getenv("AIRTABLE_BASE_ID"),
        "where" : [
            # this is optional and filters the data to
            # reduce the size of the dataframe
            ["Status" ,"==","In progress"]
        ]
    }
)

Don't forget to import the os module at the beginning of your script.

Comment on lines +56 to +69
url=url, headers={"Authorization": f"Bearer {config['api_key']}"}
)
if response.status_code == 200:
self.logger.log(
"""
Connected to Airtable.
"""
)
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The code makes a HEAD request to the Airtable API to check the connection. However, it does not handle potential network errors that could occur during the request, such as a timeout or a connection error. It would be better to wrap the request in a try/except block and handle these potential errors.

Contributor

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 5

Commits Files that changed from the base of the PR and between 1ec1dd5 and 2171296.
Files selected for processing (2)
  • pandasai/connectors/airtable.py (1 hunks)
  • pandasai/connectors/base.py (1 hunks)
Additional comments (Suppressed): 2
pandasai/connectors/base.py (2)
  • 23-30: The AirtableConnectorConfig class has been added to handle the configuration for the Airtable connector. Ensure that the api_key, base_id, and database attributes are correctly used in the AirtableConnector class and that the api_key and base_id are properly secured and not exposed in any logs or error messages.

  • 33-35: The SQLBaseConnectorConfig class has not been modified in this hunk. Ensure that any changes to this class in other hunks do not affect its current functionality.

Comment on lines 95 to 104
cache_path = self._get_cache_path()
if not os.path.exists(cache_path):
return None

# If the file is older than 1 day , delete it.
if os.path.getmtime(cache_path) < time.time() - self._cache_interval:
if self.logger:
self.logger.log(f"Deleting expired cached data from {cache_path}")
os.remove(cache_path)
return None
Contributor

The cache invalidation logic is based on the modification time of the cache file. However, this approach might not work as expected if the system time is changed. A more reliable approach would be to store the cache creation time within the cache file itself and use that for invalidation.
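One way to store the creation time inside the cache artifact itself, sketched with a JSON file for brevity (the connector actually uses Parquet; the function names here are illustrative):

```python
import json
import os
import time

def save_cache(path, payload):
    # Record the creation time inside the cache file so invalidation
    # does not depend on the filesystem's modification time.
    with open(path, "w") as f:
        json.dump({"created_at": time.time(), "payload": payload}, f)

def load_cache(path, max_age_seconds):
    """Return the cached payload, or None if missing or expired."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        blob = json.load(f)
    if time.time() - blob["created_at"] > max_age_seconds:
        os.remove(path)  # expired: drop it, as the connector does
        return None
    return blob["payload"]
```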

Comment on lines 148 to 162
url = f"{self._root_url}{self._config.base_id}/{self._config.table}"
response = requests.get(
url=url,
headers={"Authorization": f"Bearer {self._config.api_key}"},
)
if response.status_code == 200:
data = response.json()
data = self.preprocess(data=data)
self._save_cache(data)
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The code does not handle potential network errors that might occur during the requests.get call. It would be better to wrap this call in a try-except block and handle potential network errors.

-        response = requests.get(
-            url=url,
-            headers={"Authorization": f"Bearer {self._config.api_key}"},
-        )
-        if response.status_code == 200:
-            data = response.json()
-            data = self.preprocess(data=data)
-            self._save_cache(data)
-        else:
-            raise InvalidRequestError(
-                f"""Failed to connect to Airtable. 
-                    Status code: {response.status_code}, 
-                    message: {response.text}"""
-            )
+        try:
+            response = requests.get(
+                url=url,
+                headers={"Authorization": f"Bearer {self._config.api_key}"},
+            )
+            response.raise_for_status()
+            data = response.json()
+            data = self.preprocess(data=data)
+            self._save_cache(data)
+        except requests.exceptions.RequestException as e:
+            raise InvalidRequestError(
+                f"""Failed to connect to Airtable. 
+                    Error: {str(e)}"""
+            )

Comment on lines 176 to 179
if self._config.where:
for i in self._config.where:
filter_string = f"{i[0]} {i[1]} '{i[2]}'"
df = df.query(filter_string)
Contributor

The code assumes that the 'where' attribute in the config is a list of conditions. However, it does not validate if this is the case. If 'where' is not a list or does not contain valid conditions, the code will fail at runtime. It would be better to add a check to ensure 'where' is a list and contains valid conditions.
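A validation helper along the suggested lines might look like this sketch; the accepted operator set is an assumption matching what pandas `DataFrame.query` comparisons support:

```python
VALID_OPERATORS = {"==", "!=", ">", ">=", "<", "<="}

def validate_where(where):
    """Check that `where` is a list of [column, operator, value] triples
    before any pandas query strings are built from it."""
    if not isinstance(where, list):
        raise ValueError("`where` must be a list of conditions")
    for cond in where:
        if not (isinstance(cond, (list, tuple)) and len(cond) == 3):
            raise ValueError(f"each condition must be [column, op, value], got {cond!r}")
        if cond[1] not in VALID_OPERATORS:
            raise ValueError(f"unsupported operator: {cond[1]!r}")
    return True
```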

Contributor
@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 9

Commits Files that changed from the base of the PR and between 2171296 and a1f25b8.
Files selected for processing (1)
  • pandasai/connectors/airtable.py (1 hunks)

Comment on lines +21 to +25
def __init__(
self,
config: Optional[Union[AirtableConnectorConfig, dict]] = None,
cache_interval: int = 600,
):
Contributor

The cache_interval parameter is not documented in the function docstring. Please add a description for it.

Comment on lines +26 to +32
if isinstance(config, dict):
if "api_key" in config and "base_id" in config and "table" in config:
config = AirtableConnectorConfig(**config)
else:
raise KeyError(
"Please specify all api_key,table,base_id properly in config ."
)
Contributor

The check only verifies that "api_key", "base_id", and "table" exist as keys in the config dictionary; it does not validate their values. A key present with a value that evaluates to False (like an empty string) would pass the check and fail later at request time. Consider validating the values in addition to key presence.

Comment on lines +34 to +42
elif not config:
airtable_env_vars = {
"api_key": "AIRTABLE_API_TOKEN",
"base_id": "AIRTABLE_BASE_ID",
"table": "AIRTABLE_TABLE_NAME",
}
config = AirtableConnectorConfig(
**self._populate_config_from_env(config, airtable_env_vars)
)
Contributor

The config parameter is not used in the _populate_config_from_env method call. If it's not needed, consider removing it.

Comment on lines +78 to +83
cache_dir = os.path.join(os.getcwd(), "")
try:
cache_dir = os.path.join((find_project_root()), "cache")
except ValueError:
cache_dir = os.path.join(os.getcwd(), "cache")
return os.path.join(cache_dir, f"{self._config.table}_data.parquet")
Contributor

The _get_cache_path method does not handle potential os exceptions. Consider adding a try-except block to handle possible file system errors.

Comment on lines 156 to 179
"""
Feteches data from airtable server through
API and converts it to DataFrame.
"""
url = f"{self._root_url}{self._config.base_id}/{self._config.table}"
params = {}

if self._config.where:
params["filterByFormula"] = self.build_formula()
response = requests.get(
url=url,
headers={"Authorization": f"Bearer {self._config.api_key}"},
params=params,
)
if response.status_code == 200:
data = response.json()
data = self.preprocess(data=data)
self._save_cache(data)
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The fetch_data method does not handle potential requests exceptions. Consider adding a try-except block to handle possible network errors or timeouts.
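Separately from the error handling, the `build_formula` referenced in this hunk can be sketched as follows. This assumes Airtable's formula syntax with `{Field}` references and single-quoted values, and that each `where` entry is a `[column, operator, value]` triple; escaping of quotes inside values is intentionally omitted.

```python
def build_formula(where):
    """Build an Airtable filterByFormula string from [column, op, value]
    triples, AND-ing multiple conditions together."""
    clauses = [f"{{{col}}}{op}'{val}'" for col, op, val in where]
    if len(clauses) == 1:
        return clauses[0]
    return "AND(" + ",".join(clauses) + ")"
```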

df = pd.DataFrame(records)
return df

def head(self):
Collaborator

It should be cached using @cache; for reference, check sql.py.

DatFrameType: The head of the data source
that the conector is connected to .
"""
return self.fetch_data().head()
Collaborator

Here, fetch only a few records, say the first 5 rows, instead of all of them.

int: The number of rows in the data source that the connector is
connected to.
"""
data = self.execute()
Collaborator

Keep rows_count as a property and store the result on the instance so we don't have to call the API again and again.

int: The number of columns in the data source that the connector is
connected to.
"""
data = self.execute()
Collaborator

Same as rows_count: avoid calling again and again. For this, use the head function instead.

int: The hash code that is unique to the columns of the data source
that the connector is connected to.
"""
data = self.execute()
Collaborator

The where queries should be included in the hash as well. Otherwise it will return wrong data from the cache.

Returns:
DataFrameType: The result of the connector.
"""
return self.fetch_data()
Collaborator

You have cached it but never used it. You can check for the cache like:

cached = self._cached() or self._cached(include_additional_filters=True)
if cached:
    return pd.read_parquet(cached)

Contributor
@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 10

Commits Files that changed from the base of the PR and between a1f25b8 and 3f8f55f.
Files selected for processing (3)
  • docs/connectors.md (1 hunks)
  • examples/from_airtable.py (1 hunks)
  • pandasai/connectors/airtable.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • docs/connectors.md
Additional comments (Suppressed): 3
examples/from_airtable.py (3)
  • 6-16: The AirtableConnector is initialized with a configuration dictionary. Ensure that the api_key, table, and base_id values are replaced with actual values before running the script. Also, the where clause is optional and can be used to filter the data. If not needed, it can be removed.

  • 19-19: The OpenAI class is initialized with an API key. Make sure to replace "OPENAI_API_KEY" with your actual OpenAI API key.

  • 22-22: The chat method is called on the SmartDataframe object. Ensure that the question passed to the chat method is relevant to your data.

Comment on lines +38 to +47

elif not config:
airtable_env_vars = {
"api_key": "AIRTABLE_API_TOKEN",
"base_id": "AIRTABLE_BASE_ID",
"table": "AIRTABLE_TABLE_NAME",
}
config = AirtableConnectorConfig(
**self._populate_config_from_env(config, airtable_env_vars)
)
Contributor

The code is not handling the case where the config is not a dictionary and is not None. This could lead to unexpected behavior if a user passes an invalid configuration. Consider adding an else clause to handle this case.

Comment on lines +58 to +74
config = config.dict()
url = f"{self._root_url}{config['base_id']}/{config['table']}"
response = requests.head(
url=url, headers={"Authorization": f"Bearer {config['api_key']}"}
)
if response.status_code == 200:
self.logger.log(
"""
Connected to Airtable.
"""
)
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The _init_connection method is making a HEAD request to the Airtable API to check the connection. This is a good practice, but it would be better to handle more specific HTTP status codes. For example, a 401 status code means that the API key is invalid, and a 404 status code means that the base or table does not exist. Handling these specific cases can provide more informative error messages to the user.
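A sketch of the suggested status-code mapping, using a stand-in exception class and messages based on standard HTTP semantics (the exact codes Airtable returns for each failure mode are an assumption):

```python
class AirtableRequestError(Exception):
    """Stand-in exception for this sketch."""

_STATUS_MESSAGES = {
    401: "invalid or expired API key",
    403: "token lacks permission for this base",
    404: "base or table not found",
    429: "rate limited, retry later",
}

def raise_for_airtable_status(status_code, body=""):
    """Turn common HTTP status codes into specific error messages."""
    if 200 <= status_code < 300:
        return
    detail = _STATUS_MESSAGES.get(status_code, body or "unexpected error")
    raise AirtableRequestError(f"Airtable request failed ({status_code}): {detail}")
```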

Comment on lines +104 to +109
# If the file is older than 1 day , delete it.
if os.path.getmtime(cache_path) < time.time() - self._cache_interval:
if self.logger:
self.logger.log(f"Deleting expired cached data from {cache_path}")
os.remove(cache_path)
return None
Contributor

The cache invalidation logic is based on the modification time of the cache file. This could lead to unexpected behavior if the cache file is manually modified. Consider using a separate metadata file or a database to store the cache creation time.

Comment on lines 146 to 154
cached = self._cached() or self._cached(include_additional_filters=True)
if cached:
return pd.read_parquet(cached)

if isinstance(self.instance, pd.DataFrame):
return self.instance
else:
self.instance = self.fetch_data()
return self.instance
Contributor

The execute method is checking if self.instance is a DataFrame and returning it if it is. This could lead to unexpected behavior if self.instance is modified elsewhere in the code. Consider always fetching the data from the server or the cache in the execute method.

Comment on lines 167 to 191

def fetch_data(self):
"""
Feteches data from airtable server through
API and converts it to DataFrame.
"""
url = f"{self._root_url}{self._config.base_id}/{self._config.table}"
params = {}
if self._config.where:
params["filterByFormula"] = self.build_formula()
response = requests.get(
url=url,
headers={"Authorization": f"Bearer {self._config.api_key}"},
params=params,
)
if response.status_code == 200:
data = response.json()
data = self.preprocess(data=data)
self._save_cache(data)
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The fetch_data method is making a GET request to the Airtable API to fetch the data. This is a good practice, but it would be better to handle more specific HTTP status codes, similar to the _init_connection method.

Comment on lines 199 to 204
records = [
{"id": record["id"], **record["fields"]} for record in data["records"]
]

df = pd.DataFrame(records)
return df
Contributor

The preprocess method is transforming the JSON response from the Airtable API into a DataFrame. This is a good practice, but it assumes that every record in the response has an id field and a fields field. Consider adding error handling for cases where these fields are missing.

Comment on lines +207 to +215
def head(self):
"""
Return the head of the table that
the connector is connected to.

Returns :
DatFrameType: The head of the data source
that the conector is connected to .
"""
Contributor

The head method is decorated with the cache decorator from the functools module, so the result is computed once and reused on subsequent calls (the method's only argument is self, so each instance gets a single cached result). Note that this keeps a reference to the instance alive for the lifetime of the cache; functools.cached_property may be a better fit for a zero-argument method.

Comment on lines 223 to 253
@cached_property
def rows_count(self):
"""
Return the number of rows in the data source that the connector is
connected to.

Returns:
int: The number of rows in the data source that the connector is
connected to.
"""
if self._rows_count is not None:
return self._rows_count
data = self.execute()
self._rows_count = len(data)
return self._rows_count

@cached_property
def columns_count(self):
"""
Return the number of columns in the data source that the connector is
connected to.

Returns:
int: The number of columns in the data source that the connector is
connected to.
"""
if self._columns_count is not None:
return self._columns_count
data = self.execute()
self._columns_count = len(data.columns)
return self._columns_count
Contributor

The rows_count and columns_count methods are decorated with the cached_property decorator from the functools module. This is a good practice for properties that are expensive to compute. However, these methods are also storing their results in instance variables (_rows_count and _columns_count). This is redundant and could lead to inconsistencies if these instance variables are modified elsewhere in the code. Consider removing the instance variables and relying solely on the cached_property decorator.
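A minimal sketch of relying on `cached_property` alone, without the shadow instance variables; the `Connector` class and `_execute` method here are illustrative stand-ins for the real connector:

```python
from functools import cached_property

class Connector:
    """Illustrative connector relying on cached_property alone."""

    def _execute(self):
        # Stands in for the expensive Airtable fetch.
        return [{"id": 1}, {"id": 2}, {"id": 3}]

    @cached_property
    def rows_count(self):
        # Computed once per instance; no _rows_count shadow variable needed.
        return len(self._execute())
```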

Comment on lines +256 to +264
def column_hash(self):
"""
Return the hash code that is unique to the columns of the data source
that the connector is connected to.

Returns:
int: The hash code that is unique to the columns of the data source
that the connector is connected to.
"""
Contributor

The column_hash property is computing a hash of the column names and the filter formula. This is a good practice for caching purposes. However, it assumes that self.instance is a DataFrame. Consider adding error handling for cases where self.instance is not a DataFrame.

@@ -225,8 +246,11 @@ def columns_count(self):
int: The number of columns in the data source that the connector is
connected to.
"""
if self._columns_count is not None:
return self._columns_count
data = self.execute()
Collaborator

Here, for the column count, use head(), since it is also cached using @cache and it will have less data.

if cached:
    return pd.read_parquet(cached)

if isinstance(self.instance, pd.DataFrame):
Collaborator

This self.instance is not needed because we are already caching it.



class AirtableConnector(BaseConnector):
"""
Airtable connector to retrieving record data.
"""

_rows_count: int = None
_columns_count: int = None
instance = None
Collaborator

These are redundant because we already cache the dataframe.

@@ -200,9 +213,14 @@ def head(self):
DatFrameType: The head of the data source
that the conector is connected to .
"""
return self.fetch_data().head()
# return self.fetch_data().head()
if isinstance(self.instance, pd.DataFrame):
Collaborator

Let's remove this instance check; for head, call the API with a limit of 5 records.
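A sketch of a `head` that asks the API for only a few records, using Airtable's documented `maxRecords`/`pageSize` list-records query parameters; the request function is injected (standing in for `requests.get`) so the sketch is self-contained:

```python
def fetch_head(get, root_url, base_id, table, api_key, n=5):
    """Fetch only the first n records of a table.

    `get` stands in for requests.get.
    """
    response = get(
        f"{root_url}{base_id}/{table}",
        headers={"Authorization": f"Bearer {api_key}"},
        params={"maxRecords": n, "pageSize": n},
    )
    return response.json()["records"]
```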

Contributor
@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 10

Commits Files that changed from the base of the PR and between 3f8f55f and 8c0e07b.
Files selected for processing (2)
  • pandasai/connectors/airtable.py (1 hunks)
  • tests/connectors/test_airtable.py (1 hunks)
Additional comments (Suppressed): 1
pandasai/connectors/airtable.py (1)
  • 39-47: The code checks if the config is None and if so, it tries to populate it from environment variables. However, the _populate_config_from_env method is not defined in this class or its parent class. Please ensure that this method is implemented and correctly populates the configuration from environment variables.

Comment on lines 95 to 103
@patch("requests.get")
def test_columns_count_property(self, mock_request_get):
# Test columns_count property
mock_request_get.return_value.json.return_value = json.loads(
self.expected_data_json
)
mock_request_get.return_value.status_code = 200
rows_count = self.connector.columns_count
self.assertEqual(rows_count, 4)
Contributor

In the test_columns_count_property method, you are testing the columns_count property of the AirtableConnector class. This is a good practice. However, you are not asserting that the requests.get method was called with the correct arguments. Consider adding an assertion to check that the correct URL and headers were passed to the requests.get method.

Comment on lines +26 to +37
def __init__(
self,
config: Optional[Union[AirtableConnectorConfig, dict]] = None,
cache_interval: int = 600,
):
if isinstance(config, dict):
if "api_key" in config and "base_id" in config and "table" in config:
config = AirtableConnectorConfig(**config)
else:
raise KeyError(
"Please specify all api_key,table,base_id properly in config ."
)
Contributor

The __init__ method checks if the config parameter is a dictionary and if it contains the keys api_key, base_id, and table. If not, it raises a KeyError. This could be improved by providing a more specific error message for each missing key. This way, the user will know exactly which key is missing from the configuration.

-            if "api_key" in config and "base_id" in config and "table" in config:
-                config = AirtableConnectorConfig(**config)
-            else:
-                raise KeyError(
-                    "Please specify all api_key,table,base_id properly in config ."
-                )
+            missing_keys = [key for key in ["api_key", "base_id", "table"] if key not in config]
+            if missing_keys:
+                raise KeyError(f"Missing keys in config: {', '.join(missing_keys)}")
+            config = AirtableConnectorConfig(**config)

Comment on lines +54 to +74
def _init_connection(self, config: BaseConnectorConfig):
"""
make connection to database
"""
config = config.dict()
url = f"{self._root_url}{config['base_id']}/{config['table']}"
response = requests.head(
url=url, headers={"Authorization": f"Bearer {config['api_key']}"}
)
if response.status_code == 200:
self.logger.log(
"""
Connected to Airtable.
"""
)
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The _init_connection method sends a HEAD request to the Airtable API to check the connection. If the status code is not 200, it raises an InvalidRequestError. However, the error message could be improved by including the URL that failed to connect. This would help in debugging connection issues.

-                f"""Failed to connect to Airtable. 
-                    Status code: {response.status_code}, 
-                    message: {response.text}"""
+                f"""Failed to connect to Airtable at {url}. 
+                    Status code: {response.status_code}, 
+                    message: {response.text}"""

Comment on lines +90 to +109
def _cached(self, include_additional_filters: bool = False):
"""
Returns the cached Airtable data if it exists and
is not older than the cache interval.

Returns :
DataFrame | None : The cached data if
it exists and is not older than the cache
interval, None otherwise.
"""
cache_path = self._get_cache_path(include_additional_filters)
if not os.path.exists(cache_path):
return None

# If the file is older than 1 day , delete it.
if os.path.getmtime(cache_path) < time.time() - self._cache_interval:
if self.logger:
self.logger.log(f"Deleting expired cached data from {cache_path}")
os.remove(cache_path)
return None
Contributor

The _cached method checks if the cache file exists and if it's older than the cache interval. If it's older, it deletes the file. However, it doesn't handle the case where the file deletion fails. It would be better to handle this case and log an error message.

-            os.remove(cache_path)
+            try:
+                os.remove(cache_path)
+            except OSError as e:
+                if self.logger:
+                    self.logger.log(f"Failed to delete expired cache file {cache_path}: {str(e)}")

Comment on lines 192 to 206
data = []
while True:
response = self._request_api(params=params)

if response.status_code == 200:
res = response.json()
data.append(res)
if len(res["records"]) < 100:
break
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)
Contributor

The _fetch_data method sends a GET request to the Airtable API to fetch data. If the status code is not 200, it raises an InvalidRequestError. Similar to the _init_connection method, the error message could be improved by including the URL that failed to fetch data.

-                f"""Failed to connect to Airtable. 
-                    Status code: {response.status_code}, 
-                    message: {response.text}"""
+                f"""Failed to fetch data from Airtable at {url}. 
+                    Status code: {response.status_code}, 
+                    message: {response.text}"""

Comment on lines 214 to 230
def _preprocess(self, data):
"""
Preprocesses Json response data
To prepare dataframe correctly.
"""
columns = set()
data_dict_list = []
for item in data:
for entry in item["records"]:
data_dict = {"id": entry["id"], "createdTime": entry["createdTime"]}
for field_name, field_value in entry["fields"].items():
data_dict[field_name] = field_value
columns.add(field_name)
data_dict_list.append(data_dict)

df = pd.DataFrame(data_dict_list)
return df
Contributor

The _preprocess method processes the JSON response data from the Airtable API and converts it to a DataFrame. However, it doesn't handle the case where the fields key is missing in the entry dictionary. This could lead to a KeyError. It would be better to handle this case and set a default value for fields.

-                for field_name, field_value in entry["fields"].items():
+                for field_name, field_value in entry.get("fields", {}).items():

"""
Preprocesses Json response data
To prepare dataframe correctly.
"""
records = [
Collaborator

Let's add this logic to the _fetch_data function to get rid of the three nested loops.
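A single-pass flattening along these lines might look like the sketch below (stdlib-only; `flatten_records` is an illustrative name, and building the DataFrame from the resulting row dicts is left to the caller):

```python
def flatten_records(pages):
    """Flatten Airtable JSON pages into flat row dicts in one pass,
    tolerating records that lack a 'fields' key."""
    rows = []
    for page in pages:
        for entry in page.get("records", []):
            row = {"id": entry["id"], "createdTime": entry.get("createdTime")}
            row.update(entry.get("fields", {}))
            rows.append(row)
    return rows
```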

Contributor
@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 5

Commits Files that changed from the base of the PR and between 8c0e07b and f6955eb.
Files selected for processing (2)
  • pandasai/connectors/airtable.py (1 hunks)
  • tests/connectors/test_airtable.py (1 hunks)
Additional comments (Suppressed): 8
tests/connectors/test_airtable.py (1)
  • 57-58: The fallback_name property is tested twice in this test suite (also at lines 80-83). Consider removing one to avoid redundancy.
-    def test_fallback_name(self):
-        self.assertEqual(self.connector.fallback_name, self.config["table"])
pandasai/connectors/airtable.py (7)
  • 39-47: If config is not provided, the code attempts to populate it from environment variables. This is a good practice for sensitive data like API keys. However, the _populate_config_from_env method is not shown in this hunk. Ensure that it correctly retrieves and validates the environment variables.

  • 58-74: The _init_connection method sends a HEAD request to the Airtable API to check the connection. If the response status code is not 200, it raises an InvalidRequestError. This is a good practice for validating the connection and providing a clear error message if the connection fails. However, consider adding a retry mechanism for transient network errors.

  • 104-109: The _cached method checks if the cache file exists and if it's older than the cache interval. If it's older, it deletes the file. This is a good practice for managing cache files. However, consider handling potential exceptions when deleting the file.

  • 123-127: The _save_cache method saves the DataFrame to a Parquet file. This is a good practice for caching data. However, consider handling potential exceptions when saving the file.

  • 192-215: The _fetch_data method fetches data from the Airtable API and converts it to a DataFrame. It uses pagination to retrieve all records. This is a good practice for handling large datasets. However, consider adding a limit to the number of requests to prevent potential infinite loops.

  • 228-234: The head method retrieves the first 5 records from the Airtable API. This is a good practice for previewing data. However, consider handling potential exceptions when making the request.

  • 262-266: The rows_count and columns_count methods return the number of rows and columns in the data source, respectively. They use the @cached_property decorator to cache the results. This is a good practice for optimizing performance. However, consider handling potential exceptions when executing the connector.
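The bounded pagination suggested above (following Airtable's `offset` continuation token) can be sketched as follows; `fetch_page` is an injected stand-in for the HTTP call, and the hard cap on iterations guards against infinite loops:

```python
def fetch_all_records(fetch_page, page_size=100, max_pages=1000):
    """Collect every record by following the `offset` token,
    with a hard cap on the number of requests."""
    records, offset = [], None
    for _ in range(max_pages):
        page = fetch_page(offset=offset, page_size=page_size)
        records.extend(page["records"])
        offset = page.get("offset")
        if offset is None:  # no continuation token: last page reached
            return records
    raise RuntimeError(f"gave up after {max_pages} pages")
```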

Comment on lines +60 to +68
@patch("requests.get")
def test_execute(self, mock_request_get):
mock_request_get.return_value.json.return_value = json.loads(
self.expected_data_json
)
mock_request_get.return_value.status_code = 200
execute_data = self.connector.execute()
self.assertEqual(type(execute_data), pd.DataFrame)
self.assertEqual(len(execute_data), 3)
Contributor

The test_execute method tests the execute method of the AirtableConnector class. However, it only checks the type and length of the returned data. It would be beneficial to also verify the content of the returned DataFrame to ensure it matches the expected data.

Comment on lines +27 to +37
self,
config: Optional[Union[AirtableConnectorConfig, dict]] = None,
cache_interval: int = 600,
):
if isinstance(config, dict):
if "api_key" in config and "base_id" in config and "table" in config:
config = AirtableConnectorConfig(**config)
else:
raise KeyError(
"Please specify all api_key,table,base_id properly in config ."
)
Contributor

The config parameter is optional and can be either a dictionary or an AirtableConnectorConfig object. If it's a dictionary, it must contain the keys api_key, base_id, and table. If these keys are not present, a KeyError is raised. Consider providing a more detailed error message to help the user understand what each key represents.

- raise KeyError("Please specify all api_key,table,base_id properly in config .")
+ raise KeyError("Please specify all api_key, table, and base_id properly in the config. api_key is your Airtable API key, base_id is the ID of the base you are connecting to, and table is the name of the table within the base.")

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 11

Commits: files changed from the base of the PR, between f6955eb and b2fe3a7.
Files selected for processing (1)
  • pandasai/connectors/airtable.py (1 hunks)

Comment on lines +27 to +30
self,
config: Optional[Union[AirtableConnectorConfig, dict]] = None,
cache_interval: int = 600,
):

The cache_interval parameter is not documented in the function docstring. Please add a description for it.

Comment on lines +32 to +37
if "api_key" in config and "base_id" in config and "table" in config:
config = AirtableConnectorConfig(**config)
else:
raise KeyError(
"Please specify all api_key,table,base_id properly in config ."
)

The check for the presence of "api_key", "base_id", and "table" in the config dictionary is not robust. If any of these keys have a value that evaluates to False (like an empty string), the check will pass, but the subsequent code may fail. Consider using dict.get() to check for the presence of these keys.
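A minimal sketch of that stricter check (`validate_config` is a hypothetical helper, not part of the PR), rejecting both missing keys and empty values:

```python
REQUIRED_KEYS = ("api_key", "base_id", "table")

def validate_config(config: dict) -> dict:
    # dict.get() returns None for missing keys; `not` also rejects empty strings.
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(
            f"Missing or empty config keys: {', '.join(missing)}. "
            "api_key, base_id, and table are all required."
        )
    return config
```

With this, a config like `{"api_key": "", "base_id": "b", "table": "t"}` fails up front with a named key instead of surfacing later in the request code.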

Comment on lines +59 to +74
url = f"{self._root_url}{config['base_id']}/{config['table']}"
response = requests.head(
url=url, headers={"Authorization": f"Bearer {config['api_key']}"}
)
if response.status_code == 200:
self.logger.log(
"""
Connected to Airtable.
"""
)
else:
raise InvalidRequestError(
f"""Failed to connect to Airtable.
Status code: {response.status_code},
message: {response.text}"""
)

The connection initialization does not handle potential network errors that could occur when making the request. Consider adding a try-except block to handle exceptions like requests.exceptions.RequestException.
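A sketch of that guard, collapsing all `requests` exceptions into a builtin `ConnectionError` for brevity (the PR's own `InvalidRequestError` would slot in the same way; the timeout value is an assumption, since the PR sends none):

```python
import requests

def check_connection(root_url: str, base_id: str, table: str, api_key: str) -> None:
    url = f"{root_url}{base_id}/{table}"
    try:
        response = requests.head(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=10,  # assumed; the PR's request has no timeout
        )
    except requests.exceptions.RequestException as exc:
        # DNS failures, refused connections, and timeouts all land here.
        raise ConnectionError(f"Failed to reach Airtable: {exc}") from exc
    if response.status_code != 200:
        raise ConnectionError(
            f"Failed to connect to Airtable. "
            f"Status code: {response.status_code}, message: {response.text}"
        )
```

Without the `try/except`, a dropped network raises a raw `requests` traceback from deep inside the constructor instead of a single descriptive error.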

Comment on lines +83 to +87
cache_dir = os.path.join(os.getcwd(), "")
try:
cache_dir = os.path.join((find_project_root()), "cache")
except ValueError:
cache_dir = os.path.join(os.getcwd(), "cache")

The code is trying to find the project root and if it fails, it defaults to the current working directory. This could lead to inconsistent behavior depending on where the script is run from. Consider making the cache directory a configurable option.
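One way to make that configurable, sketched with a hypothetical `cache_dir` parameter (an explicit setting wins; the current-directory fallback stays as the default):

```python
import os
from typing import Optional

def resolve_cache_dir(cache_dir: Optional[str] = None) -> str:
    # An explicit cache_dir takes priority; otherwise fall back to ./cache.
    if cache_dir is None:
        cache_dir = os.path.join(os.getcwd(), "cache")
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```

Callers who run the connector from varying working directories can then pin the cache to one stable location.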

Comment on lines +105 to +109
if os.path.getmtime(cache_path) < time.time() - self._cache_interval:
if self.logger:
self.logger.log(f"Deleting expired cached data from {cache_path}")
os.remove(cache_path)
return None

The cache invalidation strategy is based on the modification time of the cache file. This could lead to issues if the system clock is changed or if the file is manually modified. Consider using a more robust cache invalidation strategy, like storing the cache creation time in the file itself or in a separate metadata file.
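A sketch of the sidecar-metadata approach (the `.meta.json` naming and JSON layout are assumptions, not part of the PR):

```python
import json
import os
import time

def write_cache_meta(cache_path: str) -> None:
    # Record creation time next to the parquet file instead of trusting mtime.
    with open(cache_path + ".meta.json", "w") as f:
        json.dump({"created_at": time.time()}, f)

def cache_expired(cache_path: str, interval: int) -> bool:
    meta_path = cache_path + ".meta.json"
    if not os.path.exists(meta_path):
        return True  # no metadata: treat the cache as stale
    with open(meta_path) as f:
        created_at = json.load(f).get("created_at", 0)
    return (time.time() - created_at) > interval
```

Because the timestamp travels with the cache rather than living in filesystem metadata, copying or touching the parquet file no longer silently extends its lifetime.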

Comment on lines +146 to +154
cached = self._cached() or self._cached(include_additional_filters=True)
if cached:
return pd.read_parquet(cached)

if isinstance(self._instance, pd.DataFrame):
return self._instance
else:
self._instance = self._fetch_data()


The execute method does not handle potential errors that could occur when fetching data from the API. Consider adding error handling to the _fetch_data method or to the execute method itself.

Comment on lines +192 to +213
data = []
while True:
response = self._request_api(params=params)

if response.status_code != 200:
raise InvalidRequestError(
f"Failed to connect to Airtable. "
f"Status code: {response.status_code}, "
f"message: {response.text}"
)

res = response.json()
records = res.get("records", [])
data.extend({"id": record["id"], **record["fields"]} for record in records)

if len(records) < 100 or "offset" not in res:
break

if "offset" in res:
params["offset"] = res["offset"]

return pd.DataFrame(data)

The data fetching loop does not have a maximum iteration count, which could lead to an infinite loop if the API continually returns an "offset" in the response. Consider adding a maximum iteration count or a timeout to prevent potential infinite loops.
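A sketch of the same loop with a hard page cap; `MAX_PAGES` and the injectable `request_page` callable are illustrative, not the connector's actual interface:

```python
MAX_PAGES = 1000  # assumed cap; tune to the largest base you expect

def fetch_all(request_page) -> list:
    # request_page(params) -> dict mimicking one Airtable API response page.
    data, params = [], {}
    for _ in range(MAX_PAGES):
        res = request_page(params)
        records = res.get("records", [])
        data.extend({"id": r["id"], **r["fields"]} for r in records)
        if len(records) < 100 or "offset" not in res:
            break
        params["offset"] = res["offset"]
    else:
        # Loop exhausted without breaking: the API kept returning offsets.
        raise RuntimeError(f"Pagination did not terminate within {MAX_PAGES} pages")
    return data
```

The `for`/`else` turns a silent infinite loop into a loud, bounded failure while leaving the normal early-exit path untouched.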

Comment on lines +225 to +231
data = self._request_api(params={"maxRecords": 5})
return pd.DataFrame(
[
{"id": record["id"], **record["fields"]}
for record in data.json()["records"]
]
)

The head method does not handle potential errors that could occur when making the API request. Consider adding error handling to this method.

Comment on lines +261 to +263
data = self.head()
self._columns_count = len(data.columns)
return self._columns_count

The rows_count and columns_count methods execute the connector or fetch the head of the data, respectively, even if the count is already stored. Consider checking if the count is already stored before executing the connector or fetching the head.
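A minimal sketch of that memoization (class and attribute names are assumed; `fetch` stands in for the connector's execute/head call):

```python
import pandas as pd

class CountsMixin:
    def __init__(self, fetch):
        self._fetch = fetch          # callable returning a DataFrame
        self._rows_count = None
        self._columns_count = None

    @property
    def rows_count(self) -> int:
        if self._rows_count is None:  # fetch only on first access
            self._rows_count = len(self._fetch())
        return self._rows_count

    @property
    def columns_count(self) -> int:
        if self._columns_count is None:
            self._columns_count = len(self._fetch().columns)
        return self._columns_count
```

Repeated property reads then cost nothing, instead of re-executing the connector each time.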

Comment on lines +276 to +279
self._instance = self.execute()
columns_str = "|".join(self._instance.columns)
columns_str += "WHERE" + self._build_formula()
return hashlib.sha256(columns_str.encode("utf-8")).hexdigest()

The column_hash property does not handle potential errors that could occur when executing the connector. Consider adding error handling to this property.

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 7

Commits: files changed from the base of the PR, between b2fe3a7 and 88c419d.
Files selected for processing (1)
  • tests/connectors/test_airtable.py (1 hunks)

Comment on lines +56 to +59
self.assertEqual(self.connector._config, self.config)
self.assertEqual(self.connector._root_url, self.root_url)
self.assertEqual(self.connector._cache_interval, 600)


The test_constructor_and_properties test inspects the internal state of the AirtableConnector object. This is generally discouraged: it makes the test fragile and tightly couples it to the implementation. Prefer testing observable behavior instead, for example that the execute method returns the expected results for a given configuration.

Comment on lines +63 to +72
@patch("requests.get")
def test_execute(self, mock_request_get):
mock_request_get.return_value.json.return_value = json.loads(
self.expected_data_json
)
mock_request_get.return_value.status_code = 200
execute_data = self.connector.execute()
self.assertEqual(type(execute_data), pd.DataFrame)
self.assertEqual(len(execute_data), 3)


In the test_execute method, you are mocking the requests.get method and setting the return value to a predefined JSON response. This is a good practice for isolating the unit under test. However, you should also test the case where the request fails (e.g., the status code is not 200). This will help ensure that your code handles errors correctly.
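A sketch of the missing negative-path test. To keep the snippet self-contained, `execute` is a minimal stand-in for the connector's request-and-check step, and `InvalidRequestError` is redefined locally (in the real suite it would be imported from pandasai and `self.connector.execute()` called instead):

```python
import unittest
from unittest.mock import patch

import requests

class InvalidRequestError(Exception):
    """Local stand-in for pandasai's exception of the same name."""

def execute():
    # Minimal stand-in for the connector's request + status-code check.
    response = requests.get("https://api.airtable.com/v0/appXXX/tableXXX")
    if response.status_code != 200:
        raise InvalidRequestError(
            f"Status code: {response.status_code}, message: {response.text}"
        )
    return response.json()

class TestExecuteFailure(unittest.TestCase):
    @patch("requests.get")
    def test_execute_raises_on_http_error(self, mock_get):
        mock_get.return_value.status_code = 401
        mock_get.return_value.text = "Unauthorized"
        with self.assertRaises(InvalidRequestError):
            execute()
```

The same pattern, with different status codes, covers the head, rows_count, and columns_count failure paths the later comments mention.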

Comment on lines +73 to +81
@patch("requests.get")
def test_head(self, mock_request_get):
mock_request_get.return_value.json.return_value = json.loads(
self.expected_data_json
)
mock_request_get.return_value.status_code = 200
execute_data = self.connector.head()
self.assertEqual(type(execute_data), pd.DataFrame)
self.assertLessEqual(len(execute_data), 5)

Similar to the test_execute method, the test_head method should also test the case where the request fails.

Comment on lines +88 to +96
@patch("requests.get")
def test_rows_count_property(self, mock_request_get):
# Test rows_count property
mock_request_get.return_value.json.return_value = json.loads(
self.expected_data_json
)
mock_request_get.return_value.status_code = 200
rows_count = self.connector.rows_count
self.assertEqual(rows_count, 3)

In the test_rows_count_property method, you are testing the rows_count property of the AirtableConnector object. This is a good practice as it verifies that the property returns the expected value. However, you should also test the case where the request fails.

Comment on lines +98 to +106
@patch("requests.get")
def test_columns_count_property(self, mock_request_get):
# Test columns_count property
mock_request_get.return_value.json.return_value = json.loads(
self.expected_data_json
)
mock_request_get.return_value.status_code = 200
columns_count = self.connector.columns_count
self.assertEqual(columns_count, 3)

In the test_columns_count_property method, similar to the test_rows_count_property method, you should also test the case where the request fails.

Comment on lines +108 to +111
def test_build_formula_method(self):
formula = self.connector._build_formula()
expected_formula = "AND(Status='In progress')"
self.assertEqual(formula,expected_formula)

The test_build_formula_method method is testing the _build_formula method of the AirtableConnector object. This is a good practice as it verifies that the method returns the expected value. However, you should also test the case where the where parameter is not provided in the configuration. This will help ensure that your code handles this case correctly.
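A self-contained sketch of both cases; `build_formula` reimplements the behavior implied by the expected output above, and the empty-string result for a missing `where` is an assumption about the desired behavior, not the PR's current code:

```python
from typing import Optional

def build_formula(where: Optional[list] = None) -> str:
    # `where` is a list of [column, operator, value] triples, as in the test config.
    if not where:
        return ""  # assumed: no filterByFormula is sent when unfiltered
    clauses = [f"{column}='{value}'" for column, _op, value in where]
    return "AND(" + ",".join(clauses) + ")"

assert build_formula([["Status", "=", "In progress"]]) == "AND(Status='In progress')"
assert build_formula() == ""  # the case the current suite never exercises
```

Adding the unfiltered assertion to the suite pins down what `_build_formula` should return when no `where` clause is configured.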

Comment on lines +113 to +123
@patch("requests.get")
def test_column_hash(self,mock_request_get):
mock_request_get.return_value.json.return_value = json.loads(
self.expected_data_json
)
mock_request_get.return_value.status_code = 200
returned_hash = self.connector.column_hash
self.assertEqual(
returned_hash,
"e4cdc9402a0831fb549d7fdeaaa089b61aeaf61e14b8a044bc027219b2db941e"
)

In the test_column_hash method, similar to the test_rows_count_property and test_columns_count_property methods, you should also test the case where the request fails.

@gventuri gventuri merged commit cec0a61 into Sinaptik-AI:main Oct 18, 2023
9 checks passed