Feature requests #195
This is relatively easy to do; you just have to map input into a
I've done this for other bindings (e.g. https://geoarrow.org/geoarrow-rs/python/latest/api/io/functions/#geoarrow.rust.io.ParquetFile and https://kylebarron.dev/parquet-wasm/classes/bundler_parquet_wasm.ParquetFile.html). I'd say it's in scope but I'm not sure of a good way to expose both sync and async methods from a single class.
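One possible shape (a hedged sketch, not necessarily how geoarrow-rs or parquet-wasm do it; ParquetFile and read_impl are illustrative names, and it assumes pyo3 plus the pyo3-async-runtimes crate with its tokio feature): the sync method blocks on the future with a runtime, and a sibling *_async method returns the same future to Python as an awaitable. A real implementation would reuse one shared runtime rather than building one per call.
use pyo3::prelude::*;

#[pyclass]
struct ParquetFile {
    path: String,
}

// Placeholder for the real async Parquet read (e.g. returning a row count).
async fn read_impl(_path: &str) -> usize {
    0
}

#[pymethods]
impl ParquetFile {
    /// Sync variant: release the GIL and block on the future with a tokio runtime.
    fn read(&self, py: Python) -> PyResult<usize> {
        let path = self.path.clone();
        py.allow_threads(move || {
            let rt = tokio::runtime::Runtime::new()
                .map_err(|e| pyo3::exceptions::PyRuntimeError::new_err(e.to_string()))?;
            Ok(rt.block_on(read_impl(&path)))
        })
    }

    /// Async variant: hand the same future back to Python as an awaitable.
    fn read_async<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
        let path = self.path.clone();
        pyo3_async_runtimes::tokio::future_into_py(py, async move { Ok(read_impl(&path).await) })
    }
}
The downside is that every operation gets duplicated as a sync/async pair, which is why I'm not sure it's the right design.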
This should be relatively easy. I already started prototyping this in #156, which adds a
The goal isn't to have the scope strictly as small as possible, but rather to have minimal new code here. E.g. if the functionality already exists as a compute function in The core module (
dt compute functions are pretty easy to implement. See
This is a bit more complex because we'd need more work on the binding side to be able to express the parquet filter to pass to Rust. Let's work on the other topics first and come back to this. PRs for these are welcome if you're interested; preferably one at a time.
I mean in the python API, of course. So input would be a list of strings.
I think the latest pyo3 should be able to pass bytes-like objects back and forth between python and rust without copying, and you no longer need to implement Buffer.
There seems to be less functionality there than via pyarrow compute.
The Python API needs to map some Python input into a Rust
For data export (Rust to Python) we want to expose existing Arrow
Yes, indeed. There's relatively less functionality in core
Agreed. List[str] is what pyarrow supports, but it cares about columns, not nested things. So I would say that either the dotted approach or the list[list[str]] approach should be implemented, where the former would need some way to deal with schema name parts including ".". Integers would be fine at the rust API border, but I don't think any end user wants to do that; they think in fields/columns.
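For illustration (a hypothetical helper, not arro3's API): resolving top-level names to indices at the boundary is a one-liner against the schema, which is why integers are fine internally even though users pass names. Nested/dotted selection would need extra logic to walk child fields.
use arrow_schema::{ArrowError, Schema};

// Resolve user-facing column names (e.g. a Python list[str]) to the integer
// indices the lower-level Rust APIs want.
fn columns_to_indices(schema: &Schema, names: &[String]) -> Result<Vec<usize>, ArrowError> {
    names.iter().map(|name| schema.index_of(name)).collect()
}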
Right, it's only one half of the solution.
Totally agree
A python list would have done :). Eventually each buffer is just a uint8 slice after all.
I believe you, but why would CPU kernels ever need async?
I'd probably go for
I don't follow. There's no place here where a Python list could be zero-copy, either for import or export.
DataFusion is a pluggable query engine. It's not the kernels themselves that need async, but the fact that they're part of a query engine that might need async to access the data, and which uses an async engine in its scheduler. It might be possible to use some of DataFusion's operations directly in a sync manner, but I doubt it. I think you'd need to use the main DataFusion APIs, which are async.
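Roughly, even a "synchronous" wrapper ends up owning a runtime and blocking on the async API (a sketch assuming the datafusion and tokio crates, not arro3 code):
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;

// A sync wrapper has to own a runtime and block on DataFusion's async calls.
fn run_query_sync(sql: &str) -> Result<Vec<RecordBatch>, DataFusionError> {
    let rt = tokio::runtime::Runtime::new()
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    rt.block_on(async {
        let ctx = SessionContext::new();
        let df = ctx.sql(sql).await?;
        df.collect().await
    })
}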
+1
A python list containing buffers (memoryview, bytearray, etc): you can iterate over each item and try_into to test it has the correct type. These days, maybe it casts directly to Vec of PyBackedBytes
I still don't understand what you're saying...
#[pyfunction]
fn accept_buffer_protocol(py: Python, buf: PyBuffer<u8>) {
dbg!(buf.to_vec(py).unwrap());
}
Running that gives this error:
import numpy as np
from arro3.compute import accept_buffer_protocol
arr = np.array([1, 2, 3], dtype=np.uint8)
accept_buffer_protocol(arr)
# [arro3-compute/src/lib.rs:54:5] buf.to_vec(py).unwrap() = [
# 1,
# 2,
# 3,
# ]
arr = np.array([1, 2, 3], dtype=np.uint64)
accept_buffer_protocol(arr)
# ---------------------------------------------------------------------------
# BufferError Traceback (most recent call last)
# File /Users/kyle/github/kylebarron/arro3/tests/tmp.py:1
# ----> 1 accept_buffer_protocol(arr)
# BufferError: buffer contents are not compatible with u8
And if I try a function with a PyList input, it doesn't work:
import numpy as np
from arro3.compute import accept_buffer_protocol
arr = np.array([1, 2, 3], dtype=np.uint8)
accept_buffer_protocol(arr)
# ---------------------------------------------------------------------------
# TypeError Traceback (most recent call last)
# File /Users/kyle/github/kylebarron/arro3/tests/tmp.py:1
# ----> 1 accept_buffer_protocol(arr)
# 2 # [arro3-compute/src/lib.rs:54:5] buf.to_vec(py).unwrap() = [
# 3 # 1,
# 4 # 2,
# 5 # 3,
# 6 # ]
# TypeError: argument 'buf': 'ndarray' object cannot be converted to 'PyList'
I mean something like x: PyList
let bufs: Vec<PyBackedBuffer> = x.iter().map(|buf| buf.try_into()).collect::<Result<_, _>>()?;
but as I say, this may "just work" with
But yes, I imagine those buffers are always u8, and that this is a requirement, i.e., the calling code would need to use
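To make that concrete, a small sketch (assuming pyo3's PyBackedBytes, since pyo3 doesn't have a PyBackedBuffer type; accept_buffer_list is just an illustrative name):
use pyo3::prelude::*;
use pyo3::pybacked::PyBackedBytes;
use pyo3::types::PyList;

// Accept a Python list of bytes-like objects and extract each element as an
// owned, immutable byte buffer.
#[pyfunction]
fn accept_buffer_list(x: &Bound<'_, PyList>) -> PyResult<usize> {
    let bufs: Vec<PyBackedBytes> = x
        .iter()
        .map(|item| item.extract::<PyBackedBytes>())
        .collect::<PyResult<_>>()?;
    // Each buffer derefs to &[u8]; here we just sum the lengths.
    Ok(bufs.iter().map(|b| b.len()).sum())
}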
An interesting implementation of cross-language buffers: https://github.com/milesgranger/cramjam/blob/master/src/io.rs; functions can generally take python objects or rust-side RustyBuffers. This has the additional feature of "read/seek" file-like methods, which you don't need.
PyBackedBuffer does not support buffer protocol objects; it only supports bytes-like objects such as bytes and bytearray. I think this is for safety reasons, because technically Rust needs the external buffers to be immutable. I figured out another approach from your suggestion, which is probably cleaner than vendoring the upstream implementation:
pub enum AnyBufferProtocol {
UInt8(PyBuffer<u8>),
UInt16(PyBuffer<u16>),
UInt32(PyBuffer<u32>),
UInt64(PyBuffer<u64>),
Int8(PyBuffer<i8>),
Int16(PyBuffer<i16>),
Int32(PyBuffer<i32>),
Int64(PyBuffer<i64>),
Float32(PyBuffer<f32>),
Float64(PyBuffer<f64>),
}
impl<'py> FromPyObject<'py> for AnyBufferProtocol {
fn extract_bound(ob: &Bound<'py, PyAny>) -> PyResult<Self> {
if let Ok(buf) = ob.extract::<PyBuffer<u8>>() {
Ok(Self::UInt8(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<u16>>() {
Ok(Self::UInt16(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<u32>>() {
Ok(Self::UInt32(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<u64>>() {
Ok(Self::UInt64(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<i8>>() {
Ok(Self::Int8(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<i16>>() {
Ok(Self::Int16(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<i32>>() {
Ok(Self::Int32(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<i64>>() {
Ok(Self::Int64(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<f32>>() {
Ok(Self::Float32(buf))
} else if let Ok(buf) = ob.extract::<PyBuffer<f64>>() {
Ok(Self::Float64(buf))
} else {
Err(PyValueError::new_err("Not a buffer protocol object"))
}
}
}
This will have a slight amount of overhead from trying to extract all of those different buffer types, but that's probably on the order of microseconds. Note that these are physical types, so we still need to check for the logical type (e.g. a boolean-typed buffer protocol object will still export as a uint8 array).
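One way to do that logical-type check (a sketch of the idea, not necessarily what arro3 ends up doing) is to branch on the buffer's PEP 3118 format string rather than only on the physical element width:
use arrow_schema::DataType;

// Map a PEP 3118 / struct-module format code to the Arrow logical type to
// produce. Platform-dependent codes (e.g. "l", "n") and byte-order prefixes
// would also need the element size; this only covers the fixed-size codes.
fn format_to_arrow_type(format: &str) -> Option<DataType> {
    match format {
        "?" => Some(DataType::Boolean), // one byte per element physically, boolean logically
        "b" => Some(DataType::Int8),
        "B" => Some(DataType::UInt8),
        "h" => Some(DataType::Int16),
        "H" => Some(DataType::UInt16),
        "i" => Some(DataType::Int32),
        "I" => Some(DataType::UInt32),
        "q" => Some(DataType::Int64),
        "Q" => Some(DataType::UInt64),
        "f" => Some(DataType::Float32),
        "d" => Some(DataType::Float64),
        _ => None,
    }
}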
Are the buffers required by the arrow internal implementation really typed? I would have assumed they are all always u8. Note that buffers-on-the-edge was written two years ago, and pyo3 has done a lot since.
At the core, the arrow
I don't think this in particular has changed: PyO3/pyo3#2824 (comment)
Then we are at cross purposes - I would expect that to be defined in the schema, and indeed pass byte buffers.
Except that we always have immutable buffers like you say, so we can make some stronger guarantees. I also made one (incomplete) of these, by the way, and thought at the time that there must be a better way!
I think it's very important for end-user usability to allow this, because it makes things just work in many cases with raw numeric data. If you know what you're doing and want to do something different, you can view the numpy array as uint8 yourself.
This conversation did get me to look at the existing buffer protocol implementation in arro3 and I revamped it in #204. So now we can interpret buffer protocol objects zero-copy as Arrow arrays! This is an improvement from before, because previously we would always copy buffer protocol input into newly allocated Arrow memory.
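For reference, the zero-copy import works roughly like this (my sketch, not the literal #204 code; u8_array_from_python_memory is an illustrative name): wrap the Python-owned memory in an arrow Buffer with a custom allocation whose owner keeps the exporting object alive, so no bytes are copied.
use std::ptr::NonNull;
use std::sync::Arc;
use arrow_array::UInt8Array;
use arrow_buffer::alloc::Allocation;
use arrow_buffer::{Buffer, ScalarBuffer};

// Wrap externally owned (Python-owned) memory as an Arrow array without copying.
// `owner` keeps the exporting Python object alive for the Buffer's lifetime; in
// practice it would hold something like the PyBuffer handle.
//
// Safety: `ptr` must point to `len` bytes that stay valid and immutable for as
// long as `owner` (and therefore the returned array) is alive.
unsafe fn u8_array_from_python_memory(
    ptr: NonNull<u8>,
    len: usize,
    owner: Arc<dyn Allocation>,
) -> UInt8Array {
    let buffer = Buffer::from_custom_allocation(ptr, len, owner);
    UInt8Array::new(ScalarBuffer::new(buffer, 0, len), None)
}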
Agreed, but isn't it easier to handle that on the python side? |
Doesn't that require the user to do extra steps? We don't have a Python side to arro3. Additionally, the nice part of making this a pure-rust implementation is that it's not just in arro3... it's also in pyo3-arrow. So anyone else creating a Python library and using pyo3-arrow will automatically support numpy arrays in all their primitive-type arrow APIs. It's not that much code on the Rust side either.
I don't think I grokked that there was no python at all! Unfortunately, my attempts at rust-python coding are not idiomatic in rust at all. In that specific case, there is a good reason to have quite a lot on the python side. I wonder if you happen to know of a good resource to learn my way around a high-level rust codebase (i.e., with many layers of traits and generics)?
There is virtually no Python. The only Python is a couple user-facing enums and types, and stub files for static type hinting.
There are two parts to the arro3 repo: arro3 itself, the Python library, and pyo3-arrow, a standalone Rust crate. If I implement Python code in arro3, it can't be reused by other Rust developers. Meanwhile, the entire core of arro3 is implemented in pyo3-arrow. So if you create a function in your own library with pyo3-arrow:
use pyo3::prelude::*;
use pyo3_arrow::PyArray;
#[pyfunction]
pub fn my_python_function(array: PyArray) {}
Then that function will automatically accept any Arrow array (from any library implementing the Arrow PyCapsule Interface) or any buffer protocol object.
The best way to have idiomatic Python-Rust is to implement the
@kylebarron, sorry to bother you with this, but it seems the right place to ask given the above: given the docs are focused on using Python as the driver, does this library in its current form support having rust as the driver calling python?
EDIT: To clarify, this question relates to https://crates.io/crates/pyo3-arrow, which links back to this repo.
@pbower please create a new issue to discuss this.
It would be great:
Doing all of this would essentially answer what is envisaged in dask/fastparquet#931: getting what we really need out of arrow without the cruft. It would interoperate nicely with awkward, for example.
Other nice to haves (and I realise you wish to keep the scope as small as possible):