[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Interface) #39195

Open
5 of 8 tasks
jorisvandenbossche opened this issue Dec 12, 2023 · 4 comments

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 12, 2023

#37797 added a Python layer for working with the C Data Interface through capsules and defined dunder methods, described at https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html.

Currently, libraries use pyarrow's semi-private _export_to_c / _import_from_c methods to get the C Data Interface structs or to construct pyarrow objects from those structs.
With the formalized PyCapsule protocol, the structs are instead exchanged through the defined dunder methods __arrow_c_schema__ / __arrow_c_array__ / __arrow_c_stream__, which return the data as a C Data Interface struct wrapped in a capsule instead of as a raw integer (export). In addition, the pyarrow constructors (pa.array(..), pa.table(..), ..) now also accept any object implementing the relevant dunder method (import).
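As a rough illustration of the shape of these dunder methods, the sketch below writes them down as typing.Protocol classes. The class names are hypothetical and chosen only for this example; the method names and the optional requested_schema argument follow the linked PyCapsule interface documentation. Return values are annotated as plain object because the standard library exposes no public capsule type.

```python
from typing import Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ArrowSchemaExportable(Protocol):  # hypothetical name
    def __arrow_c_schema__(self) -> object:
        """Return a capsule wrapping an ArrowSchema struct."""
        ...


@runtime_checkable
class ArrowArrayExportable(Protocol):  # hypothetical name
    def __arrow_c_array__(
        self, requested_schema: Optional[object] = None
    ) -> Tuple[object, object]:
        """Return a (schema capsule, array capsule) pair."""
        ...


@runtime_checkable
class ArrowStreamExportable(Protocol):  # hypothetical name
    def __arrow_c_stream__(
        self, requested_schema: Optional[object] = None
    ) -> object:
        """Return a capsule wrapping an ArrowArrayStream struct."""
        ...
```

Note that isinstance(obj, ArrowStreamExportable) only checks that the method exists; it does not validate the signature or what the method returns.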

This new Python-level interface has the benefits of being 1) more robust (pointers are not passed around as integers, and the capsule ensures the data is released if an error happens midway), and 2) not tied to pyarrow the library. This means that other libraries can also start accepting any Arrow-compatible data structure, instead of hardcoding support for pyarrow and using pyarrow's _export_to_c.

We want to encourage other libraries to start supporting this protocol as well, which can mean:

  • Add the dunders on their own objects where appropriate (so that other libraries will recognize your data structures as holding Arrow-compatible data)
  • When ingesting data (e.g. where there might be specific pyarrow support right now), recognize objects that implement the dunder methods
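A minimal sketch of both bullet points, using only duck typing so that it does not import pyarrow. The names arrow_protocol_of and FrameWrapper are hypothetical, and the order in which the dunders are checked is a choice made for this example rather than something the protocol prescribes.

```python
def arrow_protocol_of(obj):
    """Return "stream", "array" or "schema" for the first Arrow PyCapsule
    dunder that `obj` exposes, or None if it exposes none of them."""
    candidates = (
        ("stream", "__arrow_c_stream__"),
        ("array", "__arrow_c_array__"),
        ("schema", "__arrow_c_schema__"),
    )
    for kind, dunder in candidates:
        if callable(getattr(obj, dunder, None)):
            return kind
    return None


class FrameWrapper:
    """Hypothetical container that exports its data by delegating to an
    inner object that already implements __arrow_c_stream__."""

    def __init__(self, inner):
        self._inner = inner

    def __arrow_c_stream__(self, requested_schema=None):
        # Forward the request so the inner object produces the capsule.
        return self._inner.__arrow_c_stream__(requested_schema)
```

A consumer can branch on the result of arrow_protocol_of(obj) instead of checking for a pyarrow type. On the producer side, if the inner object is a pyarrow Table, then per the description above passing a FrameWrapper to pa.table(..) should be accepted, since the constructor only needs the dunder method.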

Linking some issues here: pola-rs/polars#12530, data-apis/dataframe-api#279, pandas-dev/pandas#56587, OSGeo/gdal#9043 and OSGeo/gdal#9132

In addition, there are also some steps we can take within the Arrow project itself to further promote this:

There are also some other follow-up discussions about certain aspects of the protocol or APIs we provide in pyarrow:

@pitrou
Member

pitrou commented Dec 13, 2023

cc @wjones127

@kylebarron
Contributor

I made a feature request for duckdb here: duckdb/duckdb#10716

@kylebarron
Contributor

kylebarron commented Jul 23, 2024

I've been working a bit to promote the protocol; here's a running tally:

implemented: (at least partially, return objects with pycapsule protocol and/or check for existence of protocol in constructors)

in-progress:

issues:

@WillAyd
Contributor

WillAyd commented Dec 19, 2024

@kylebarron's list is awesome - is there a more formal home in the documentation where we can put it?

I'd also add pantab to the list https://github.com/innobi/pantab
