-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support dataframe interchange protocol #4244
Conversation
This allows plotly express to take in any dataframe that supports the dataframe protocol, see: https://data-apis.org/blog/dataframe_protocol_rfc/ https://data-apis.org/dataframe-protocol/latest/index.html Test includes an example with vaex, which should work with vaexio/vaex#1509 (not yet released)
…erchange-protocol
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
a5f792b
to
9b2154e
Compare
@@ -233,6 +246,24 @@ def test_build_df_with_index(): | |||
assert_frame_equal(tips.reset_index()[out["data_frame"].columns], out["data_frame"]) | |||
|
|||
|
|||
def test_build_df_using_interchange_protocol_mock(add_interchange_module_for_old_pandas): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a mock test, "plotly" do not need to have a test for every library that supports this protocol.
Signed-off-by: Anatoly Myachev <[email protected]>
This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have |
You're right. The new path will be used even if the library has |
@nicolaskruchten CI is green, may I know what do you think about merging it? |
I'll endorse this PR but I'll defer to @alexcjohnson and @LiamConnors for timing on when to merge it, how to document it etc :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💃 This seems good to me in principle although I haven't tested it manually and I don't have a strong sense of what the list is of which non-Pandas dataframes will go through the new codepath and which will be zero-copy, so it's not super-clear to me how to document this.
This PR also needs a changelog entry (which should probably contain some of the answers to the questions above)
Hi @alexcjohnson! May I have your opinion on this pull request? |
I haven't looked enough into what you do with the DataFrame objects, but I just wanted to note that if
So it may be more user-friendly to first try |
|
Right, and that's not the default (I thought it was - sorry, I should've checked) OK, then using the interchange protocol wouldn't be a regression for polars Maybe a separate PR could be made so that if it is a polars dataframe, then But that's a separate discussion, sorry everyone for the noise +1 for proceeding with this! |
This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance. That and a changelog entry 😎 |
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
CHANGELOG.md
Outdated
@@ -17,6 +17,7 @@ This project adheres to [Semantic Versioning](http://semver.org/). | |||
- Add `legend.xref` and `legend.yref` to enable container-referenced positioning of legends [[#6589](https://github.com/plotly/plotly.js/pull/6589)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development. | |||
- Add `colorbar.xref` and `colorbar.yref` to enable container-referenced positioning of colorbars [[#6593](https://github.com/plotly/plotly.js/pull/6593)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development. | |||
- `px` methods now accept data-frame-like objects that support a `to_pandas()` method, such as polars, cudf, vaex etc | |||
- `px` methods now accept data-frame-like objects that support a [dataframe interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html), such as polars, vaex, modin etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if cudf supports this protocol
…erchange-protocol
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@@ -1,6 +1,6 @@ | |||
requests==2.25.1 | |||
tenacity==6.2.0 | |||
pandas==2.0.1 | |||
pandas==2.0.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change, the protocol cannot be used in tests (for example, vaex test).
Signed-off-by: Anatoly Myachev <[email protected]>
…from_vaex' test Signed-off-by: Anatoly Myachev <[email protected]>
@alexcjohnson thanks for the answer! I made the changes you mentioned. Ready for review. |
Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :) |
@ritchie46 Sure, I can add. Does polars have a function like |
Yes, there is |
Signed-off-by: Anatoly Myachev <[email protected]>
@alexcjohnson friendly ping :) |
I tried this out by creating a new venv and installing
|
Hmm we definitely don't want a hard dependency on |
pyarrow isn't strictly required for interchanging to pandas, I think it's just that polars implements the interchange protocol by going via pyarrow's implementation of it why not swap the branches, and try |
@LiamConnors thanks for the feedback.
I think right now I can rewrite the logic as follows (so that when the version of try:
pandas.api.interchange.from_dataframe(df_not_pandas)
except (ImportError, NotImplemetedError):
df_not_pandas.to_pandas()
@MarcoGorelli I guess we should try the more standardized and basic way first than specialized. In the case when the dataframe library has both options implemented, then the execution of the code will not reach a potentially better way. |
Signed-off-by: Anatoly Myachev <[email protected]>
@nicolaskruchten CI build failed for another reason. Could you manually restart it if possible? |
Co-authored-by: Alex Johnson <[email protected]>
@LiamConnors thanks for testing this out - would you try again, both without |
@@ -19,3 +19,5 @@ matplotlib==2.2.3 | |||
scikit-image==0.18.1 | |||
psutil==5.7.0 | |||
kaleido | |||
vaex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you not add modin here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_build_df_using_interchange_protocol_mock
test should be enough for modin. On the other hand, I do not want to expand the list of dependencies, there are already enough of them.
I guess this was already there, because I now get this:
|
Do I understand correctly that this should not stop the merge of this pull request? |
df_not_pandas = args["data_frame"] | ||
try: | ||
df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas) | ||
except (ImportError, NotImplementedError) as exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll want ModuleNotFoundError
, else you'll get the reported error if pyarrow
isn't installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ModuleNotFoundError
comes from to_pandas
call. The same behavior should already be on the master branch.
Stack from error above:
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
df_pandas = df_not_pandas.to_pandas()
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah you're right - it doesn't make any difference then, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're OK:
>>> issubclass(ModuleNotFoundError, ImportError)
True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with pyarrow
installed and didn't encounter any other issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand correctly that this should not stop the merge of this pull request?
Yep, at that point we're out of fallbacks, all we can do is allow that error to propagate, so the user is alerted to install pyarrow
if they want to use this feature.
💃 Merging, we'll likely release this with the next plotly.js, which I'm guessing will be in a week or so. Thanks again @anmyachev, and everyone else for your comments.
Thanks for the review! |
This pull request continues the work that began in #3387.