Export to Pandas #249

Open
MarcoGorelli opened this issue Oct 21, 2024 · 1 comment

@MarcoGorelli

Say I have an object df that implements __arrow_c_stream__ (for example, a Polars DataFrame).

Currently, I can convert that to pandas by doing:

import pyarrow as pa

pa.table(df).to_pandas()

In this case, I would only be using PyArrow as a container, without needing its compute functionality.

Does arro3 provide a way to convert from Polars to pandas without having to go via PyArrow?

@kylebarron
Owner

Short answer: close, but not stable and tested enough to put in prod. And no top-level to_pandas API (yet).

There are a couple of ways of converting Arrow data to NumPy/Python scalars, but beyond that, there are specifics around pandas semantics that I don't know well enough. At this point, it might be easiest to just look at how pyarrow implements it and port that, because I assume there will be specifics around null handling. But I'd first want to see how many lines of code the pyarrow implementation is.

Scalar.as_py

This is the best-supported way right now. Only interval, decimal, and run-end encoded (REE) arrays are not currently supported.

So you could call this for every scalar and pass that to a pd.DataFrame constructor, but that would be pretty slow.
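To make the per-scalar approach concrete, here is a minimal sketch. It does not use arro3 itself: FakeScalar is a hypothetical stand-in for any scalar object exposing as_py(), and columns_to_pydict is an illustrative helper name, not part of any library.

```python
from typing import Any, Dict, Iterable, List, Mapping


class FakeScalar:
    """Hypothetical stand-in for an Arrow scalar that exposes as_py()."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def as_py(self) -> Any:
        return self._value


def columns_to_pydict(columns: Mapping[str, Iterable[Any]]) -> Dict[str, List[Any]]:
    """Convert each column of scalars to a list of plain Python values.

    This makes one Python-level as_py() call per value, which is why the
    approach is slow for large tables.
    """
    return {
        name: [scalar.as_py() for scalar in column]
        for name, column in columns.items()
    }


# Two small columns, with None standing in for nulls.
columns = {
    "a": [FakeScalar(1), FakeScalar(None), FakeScalar(3)],
    "b": [FakeScalar("x"), FakeScalar("y"), FakeScalar(None)],
}
data = columns_to_pydict(columns)
```

The resulting dict of lists is a shape that the pd.DataFrame constructor accepts, though pandas would then have to infer dtypes and null handling on its own.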

Array.to_numpy/ChunkedArray.to_numpy

This is implemented for primitive numeric types:

DataType::Float16 => impl_primitive!(Float16Type),
DataType::Float32 => impl_primitive!(Float32Type),
DataType::Float64 => impl_primitive!(Float64Type),
DataType::UInt8 => impl_primitive!(UInt8Type),
DataType::UInt16 => impl_primitive!(UInt16Type),
DataType::UInt32 => impl_primitive!(UInt32Type),
DataType::UInt64 => impl_primitive!(UInt64Type),
DataType::Int8 => impl_primitive!(Int8Type),
DataType::Int16 => impl_primitive!(Int16Type),
DataType::Int32 => impl_primitive!(Int32Type),
DataType::Int64 => impl_primitive!(Int64Type),

For non-numeric types, that function could just fall back to creating a list of Python scalars and passing that to the numpy constructor.
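A rough sketch of what such a fallback might look like, written in Python rather than in arro3's Rust. FakeScalar and to_numpy_object_fallback are hypothetical names used for illustration only, and the choice of an object-dtype array is an assumption about what the fallback would return.

```python
from typing import Any, Iterable

import numpy as np


class FakeScalar:
    """Hypothetical stand-in for an Arrow scalar that exposes as_py()."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def as_py(self) -> Any:
        return self._value


def to_numpy_object_fallback(scalars: Iterable[Any]) -> np.ndarray:
    """Build a 1-D object-dtype array from scalars via as_py()."""
    values = [scalar.as_py() for scalar in scalars]
    # Fill a preallocated object array element by element so that nested
    # lists stay as single elements instead of becoming extra dimensions.
    out = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        out[i] = value
    return out


strings = to_numpy_object_fallback(
    [FakeScalar("a"), FakeScalar(None), FakeScalar("c")]
)
nested = to_numpy_object_fallback([FakeScalar([1, 2]), FakeScalar([3, 4])])
```

Filling the array element by element rather than calling np.array(values, dtype=object) directly is a deliberate choice here: with equal-length nested lists, the direct constructor would produce a 2-D array.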

@kylebarron changed the title from "does arro3 provide a way to convert between a pycapsule-compliant object and pandas?" to "Export to Pandas" on Nov 7, 2024