Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dataframe interchange protocol #4244

Merged
merged 17 commits into from
Jun 30, 2023

Conversation

anmyachev
Copy link
Contributor

This pull request continues the work that began in #3387.

maartenbreddels and others added 5 commits June 12, 2023 18:10
This allows plotly express to take in any dataframe that supports
the dataframe protocol, see:
https://data-apis.org/blog/dataframe_protocol_rfc/
https://data-apis.org/dataframe-protocol/latest/index.html

Test includes an example with vaex, which should work with
vaexio/vaex#1509
(not yet released)
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@@ -233,6 +246,24 @@ def test_build_df_with_index():
assert_frame_equal(tips.reset_index()[out["data_frame"].columns], out["data_frame"])


def test_build_df_using_interchange_protocol_mock(add_interchange_module_for_old_pandas):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a mock test, "plotly" do not need to have a test for every library that supports this protocol.

Signed-off-by: Anatoly Myachev <[email protected]>
@nicolaskruchten
Copy link
Contributor

This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have .to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a .to_pandas() like... CuDF and Polars? So hard to keep track of all the flavours 🍨

@anmyachev
Copy link
Contributor Author

This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have .to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a .to_pandas() like... CuDF and Polars? So hard to keep track of all the flavours 🍨

You're right. The new path will be used even if the library has to_pandas method. I believe this is the preferred path, as this protocol aims to become a widely accepted standard, which should be accompanied by good support. From a performance point of view, it seems most likely that this way could be more performant (for example, in the protocol itself there is the following mention: Must be zero-copy wherever possible), at least as good as a simple to_pandas call.

@anmyachev
Copy link
Contributor Author

@nicolaskruchten CI is green, may I know what do you think about merging it?

@nicolaskruchten
Copy link
Contributor

I'll endorse this PR but I'll defer to @alexcjohnson and @LiamConnors for timing on when to merge it, how to document it etc :)

@nicolaskruchten nicolaskruchten self-requested a review June 15, 2023 20:17
Copy link
Contributor

@nicolaskruchten nicolaskruchten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💃 This seems good to me in principle although I haven't tested it manually and I don't have a strong sense of what the list is of which non-Pandas dataframes will go through the new codepath and which will be zero-copy, so it's not super-clear to me how to document this.

This PR also needs a changelog entry (which should probably contain some of the answers to the questions above)

@anmyachev
Copy link
Contributor Author

Hi @alexcjohnson! May I have your opinion on this pull request?

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Jun 20, 2023

I haven't looked enough into what you do with the DataFrame objects, but I just wanted to note that if df is a polars DataFrame, then:

  • pd.api.interchange.from_dataframe(df) will probably make some copy (currently, the result doesn't use pyarrow)
  • df.to_pandas() should typically be zero-copy

So it may be more user-friendly to first try to_pandas, and then fallback to the interchange protocl? (cc @ritchie46 in case I got this wrong)

@ritchie46
Copy link

df.to_pandas() should typically be zero-copy

df.to_pandas(use_pyarrow_extension_array=True) is zero copy yes.

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Jun 22, 2023

Right, and that's not the default (I thought it was - sorry, I should've checked)

OK, then using the interchange protocol wouldn't be a regression for polars

Maybe a separate PR could be made so that if it is a polars dataframe, then to_pandas with use_pyarrow_extension_array=True should be passed

But that's a separate discussion, sorry everyone for the noise

+1 for proceeding with this!

@alexcjohnson
Copy link
Collaborator

This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.

That and a changelog entry 😎

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
CHANGELOG.md Outdated
@@ -17,6 +17,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
- Add `legend.xref` and `legend.yref` to enable container-referenced positioning of legends [[#6589](https://github.com/plotly/plotly.js/pull/6589)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development.
- Add `colorbar.xref` and `colorbar.yref` to enable container-referenced positioning of colorbars [[#6593](https://github.com/plotly/plotly.js/pull/6593)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development.
- `px` methods now accept data-frame-like objects that support a `to_pandas()` method, such as polars, cudf, vaex etc
- `px` methods now accept data-frame-like objects that support a [dataframe interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html), such as polars, vaex, modin etc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if cudf supports this protocol

@@ -1,6 +1,6 @@
requests==2.25.1
tenacity==6.2.0
pandas==2.0.1
pandas==2.0.2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, the protocol cannot be used in tests (for example, vaex test).

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev
Copy link
Contributor Author

This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.

That and a changelog entry 😎

@alexcjohnson thanks for the answer! I made the changes you mentioned. Ready for review.

@ritchie46
Copy link

Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)

@anmyachev
Copy link
Contributor Author

anmyachev commented Jun 26, 2023

Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)

@ritchie46 Sure, I can add. Does polars have a function like from_pandas in vaex? (for example, vaex.from_pandas(iris_pandas) call in the test)

@ritchie46
Copy link

Yes, there is pl.from_pandas. 👍

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev
Copy link
Contributor Author

@alexcjohnson friendly ping :)

@LiamConnors
Copy link
Member

I tried this out by creating a new venv and installing pandas and polars. When I run an app I get the following traceback asking for pyarrow>=11.0.0

Traceback (most recent call last):
  File "app.py", line 19, in <module>
    fig = px.scatter(df, x="date", y="float")
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
    return make_figure(args=locals(), constructor=go.Scatter)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2012, in make_figure
    args = build_dataframe(args, constructor)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1317, in build_dataframe
    df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
    raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.

@nicolaskruchten
Copy link
Contributor

Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?

@MarcoGorelli
Copy link
Contributor

pyarrow isn't strictly required for interchanging to pandas, I think it's just that polars implements the interchange protocol by going via pyarrow's implementation of it

why not swap the branches, and try to_pandas first?

@anmyachev
Copy link
Contributor Author

@LiamConnors thanks for the feedback.

Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?

I think right now I can rewrite the logic as follows (so that when the version of plotly is updated, the code does not stop working for users):

try:
    pandas.api.interchange.from_dataframe(df_not_pandas)
except (ImportError, NotImplemetedError):
    df_not_pandas.to_pandas()

why not swap the branches, and try to_pandas first?

@MarcoGorelli I guess we should try the more standardized and basic way first than specialized. In the case when the dataframe library has both options implemented, then the execution of the code will not reach a potentially better way.

@anmyachev
Copy link
Contributor Author

@nicolaskruchten CI build failed for another reason. Could you manually restart it if possible?

@alexcjohnson
Copy link
Collaborator

@LiamConnors thanks for testing this out - would you try again, both without pyarrow and then with it installed? AFAICT the logic is solid at this point, so if you're happy with how it works I'll be happy to approve and merge. Thanks for the iterations @anmyachev 🙇

@@ -19,3 +19,5 @@ matplotlib==2.2.3
scikit-image==0.18.1
psutil==5.7.0
kaleido
vaex
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you not add modin here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_build_df_using_interchange_protocol_mock test should be enough for modin. On the other hand, I do not want to expand the list of dependencies, there are already enough of them.

@LiamConnors
Copy link
Member

I guess this was already there, because I now get this:

Traceback (most recent call last):
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1318, in build_dataframe
    df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
    raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "app.py", line 19, in <module>
    fig = px.scatter(df, x="date", y="float")
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
    return make_figure(args=locals(), constructor=go.Scatter)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2022, in make_figure
    args = build_dataframe(args, constructor)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
    df_pandas = df_not_pandas.to_pandas()
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
    record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'

@anmyachev
Copy link
Contributor Author

I guess this was already there, because I now get this:

Traceback (most recent call last):
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1318, in build_dataframe
    df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
    raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "app.py", line 19, in <module>
    fig = px.scatter(df, x="date", y="float")
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
    return make_figure(args=locals(), constructor=go.Scatter)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2022, in make_figure
    args = build_dataframe(args, constructor)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
    df_pandas = df_not_pandas.to_pandas()
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
    record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'

Do I understand correctly that this should not stop the merge of this pull request?

df_not_pandas = args["data_frame"]
try:
df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
except (ImportError, NotImplementedError) as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll want ModuleNotFoundError, else you'll get the reported error if pyarrow isn't installed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ModuleNotFoundError comes from to_pandas call. The same behavior should already be on the master branch.

Stack from error above:

  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
    df_pandas = df_not_pandas.to_pandas()
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
    record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah you're right - it doesn't make any difference then, thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're OK:

>>> issubclass(ModuleNotFoundError, ImportError)
True

@LiamConnors LiamConnors self-requested a review June 30, 2023 18:41
Copy link
Member

@LiamConnors LiamConnors left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with pyarrow installed and didn't encounter any other issues.

Copy link
Collaborator

@alexcjohnson alexcjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that this should not stop the merge of this pull request?

Yep, at that point we're out of fallbacks, all we can do is allow that error to propagate, so the user is alerted to install pyarrow if they want to use this feature.

💃 Merging, we'll likely release this with the next plotly.js, which I'm guessing will be in a week or so. Thanks again @anmyachev, and everyone else for your comments.

@alexcjohnson alexcjohnson merged commit 6a9bbea into plotly:master Jun 30, 2023
@anmyachev
Copy link
Contributor Author

Thanks for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants