Array.from_numpy() not working with numpy array of strings or categories (possibly user issue) #256

Open
ATL2001 opened this issue Nov 10, 2024 · 6 comments


ATL2001 commented Nov 10, 2024

Hey Kyle, I was working with apply_categorical_cmap in lonboard yesterday (which is great) and couldn't get it to work with a pandas Series, whether the Series held strings or a categorical dtype. If I create a pyarrow array from the Series and lonboard passes that pyarrow array to arro3, it works. But when I pass a pandas Series or a numpy array of strings to arro3 directly, it says the data type is unsupported. I'm not sure whether I'm doing something wrong or something isn't working as expected.

Any ideas? Thanks!

import sys

from arro3 import core
from arro3.core import Array
import numpy as np
import pyarrow as pa
import pandas as pd

print("python: ", sys.version)
print("arro3:  ", core.__version__)
print("numpy:  ", np.__version__)
print("pyarrow:", pa.__version__)
print("pandas: ", pd.__version__)

## make a pandas series of strings
series = pd.Series(["a", "b","c"])

## convert pandas series to pyarrow array, then to arro3 array works
Array.from_arrow(pa.Array.from_pandas(series)) # works!

## make arro3 array from pandas series, fails
try:
    Array.from_numpy(series)
except Exception as ex:
    print(f"make arro3 array from pandas series exception: {ex}")

## convert pandas series to numpy (dtype object), make arro3 array from numpy, fails
np_array = pd.Series(["a", "b","c"]).to_numpy()
try:
    Array.from_numpy(np_array)
except Exception as ex:
    print(f"make arro3 array from numpy (dtype object) exception: {ex}")

## convert pandas series to numpy (dtype U1), make arro3 array from numpy, fails
np_array = pd.Series(["a", "b","c"]).to_numpy(dtype="U1")
try:
    Array.from_numpy(np_array)
except Exception as ex:
    print(f"make arro3 array from numpy (dtype U1) exception: {ex}")

## convert pandas series of categorical to numpy, make arro3 array from numpy, fails
np_array = pd.Series(["a", "b","c"]).astype('category').to_numpy()
try:
    Array.from_numpy(np_array)
except Exception as ex:
    print(f"make arro3 array from numpy categorical exception: {ex}")

Output:

python:  3.11.9 (main, Aug 14 2024, 04:18:20) [MSC v.1929 64 bit (AMD64)]
arro3:   0.4.2
numpy:   2.1.2
pyarrow: 18.0.0
pandas:  2.2.3
make arro3 array from pandas series exception: Unsupported data type object
make arro3 array from numpy (dtype object) exception: Unsupported data type object
make arro3 array from numpy (dtype U1) exception: Unsupported data type <U1
make arro3 array from numpy categorical exception: Unsupported data type object
@kylebarron
Owner

It's really hard to infer the Arrow data type of arbitrary Python objects. from_numpy works with any array that has a known, consistent data type, but a pandas string array is usually stored as, essentially, a list of Python objects (that's what data type object means).

You can make this work by specifying the type in the Array constructor:

import pandas as pd
import pyarrow as pa
from arro3.core import Array, DataType

s = pd.Series(["a", "b", "c", "d"])
arr = Array(s, type=DataType.utf8())
pa.array(arr)

This uses the fact that a series is also a sequence.
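
To make the point about object dtype concrete, here is a small NumPy-only sketch. It only illustrates why there is nothing to infer from an object array; the helper name `element_types` is made up for this example and is not part of arro3.

```python
import numpy as np

# An object-dtype array only records "these are Python objects".
# The element types can be discovered only by inspecting every value.
obj_arr = np.array(["a", "b", "c"], dtype=object)
mixed_arr = np.array(["a", 1, None], dtype=object)


def element_types(arr):
    """Return the set of Python type names found in an array (hypothetical helper)."""
    return {type(x).__name__ for x in arr}


print(obj_arr.dtype, element_types(obj_arr))
print(mixed_arr.dtype, element_types(mixed_arr))

# A fixed-width Unicode array, by contrast, carries its type in the dtype itself.
uni_arr = np.array(["a", "b", "c"])
print(uni_arr.dtype.kind, uni_arr.dtype)
```

Both object arrays report the same dtype even though one of them mixes strings, integers, and None, which is why an explicit type is needed for that case.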

Unrelated: you should always use ```py on your code blocks for Python code so it gets syntax highlighting.

@kylebarron
Owner

We should probably update from_numpy to defer to the constructor automatically, which would mean adding an optional type argument to from_numpy.
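
One possible reading of that idea, sketched in plain Python so it runs without arro3. Everything here is an assumption for illustration: the function name `choose_conversion_path`, the rule that an explicit type always goes through the constructor, and the use of NumPy dtype kinds as a stand-in for the per-type checks in the Rust code.

```python
import numpy as np

# NumPy dtype kinds roughly matching what from_numpy handles today:
# b = bool, i = signed int, u = unsigned int, f = float.
# Note: kind is coarser than the real per-type checks (for example,
# np.longdouble also has kind "f" but has no matching branch).
_FAST_PATH_KINDS = frozenset("biuf")


def choose_conversion_path(arr, type=None):
    """Decide which route a hypothetical from_numpy(arr, type=None) would take.

    `type` mirrors the proposed optional argument and shadows the builtin
    only inside this function.
    """
    if type is not None:
        # Assumption: an explicit type always defers to the constructor.
        return "constructor"
    if arr.dtype.kind in _FAST_PATH_KINDS:
        return "from_numpy"
    raise ValueError(
        f"Unsupported data type {arr.dtype}; pass an explicit type"
    )


print(choose_conversion_path(np.array([1, 2, 3], dtype=np.int16)))
print(choose_conversion_path(np.array(["a", "b"], dtype=object), type="utf8"))
```

With no type given, a string or object array would still raise, but the error could point the user at the type argument instead of stopping at "Unsupported data type".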

@ATL2001
Author

ATL2001 commented Nov 11, 2024

I'm with you on the object datatype, which is why I tried

np_array = pd.Series(["a", "b","c"]).to_numpy(dtype="U1")
print(np_array.dtype)

which results in the numpy array having the "<U1" dtype instead of object. (Good call on the py in the code block, I'll remember that in the future!)

similar to how

np_array = pd.Series([1,2,3]).to_numpy(dtype=np.int16)
Array.from_numpy(np_array)

results in an arro3.core.Array<Int16>

whereas

np_array = pd.Series([1,2,3]).to_numpy(dtype=np.int32)
Array.from_numpy(np_array)

results in an arro3.core.Array<Int32>

I haven't dug into the arro3 code at all, so I don't know whether it would be possible to add support for the numpy Unicode string dtype.

@kylebarron
Owner

Yeah we'd need to update the if/else block here:

let dtype = array.dtype();
if is_type::<half::f16>(py, &dtype) {
    numpy_to_arrow!(half::f16, Float16Type)
} else if is_type::<f32>(py, &dtype) {
    numpy_to_arrow!(f32, Float32Type)
} else if is_type::<f64>(py, &dtype) {
    numpy_to_arrow!(f64, Float64Type)
} else if is_type::<u8>(py, &dtype) {
    numpy_to_arrow!(u8, UInt8Type)
} else if is_type::<u16>(py, &dtype) {
    numpy_to_arrow!(u16, UInt16Type)
} else if is_type::<u32>(py, &dtype) {
    numpy_to_arrow!(u32, UInt32Type)
} else if is_type::<u64>(py, &dtype) {
    numpy_to_arrow!(u64, UInt64Type)
} else if is_type::<i8>(py, &dtype) {
    numpy_to_arrow!(i8, Int8Type)
} else if is_type::<i16>(py, &dtype) {
    numpy_to_arrow!(i16, Int16Type)
} else if is_type::<i32>(py, &dtype) {
    numpy_to_arrow!(i32, Int32Type)
} else if is_type::<i64>(py, &dtype) {
    numpy_to_arrow!(i64, Int64Type)
} else if is_type::<bool>(py, &dtype) {
    let arr = array.downcast::<PyArray1<bool>>()?;
    Ok(Arc::new(BooleanArray::from(arr.to_owned_array().to_vec())))
} else {
    Err(PyValueError::new_err(format!("Unsupported data type {}", dtype)).into())
}

so that it checks for string types
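
As a rough illustration of what a string branch would have to deal with, here is a Python (not Rust) sketch of how NumPy lays out a fixed-width Unicode array: each element is a fixed number of 4-byte code points, padded with NULs. The helper name `decode_fixed_width_unicode` is hypothetical, and this is not how arro3 does or would implement it; it only shows the buffer layout a new branch would need to unpack into variable-length strings.

```python
import numpy as np


def decode_fixed_width_unicode(arr):
    """Decode a 1-D NumPy "U" array by reading its raw UTF-32 code points."""
    if arr.dtype.kind != "U":
        raise ValueError(f"expected a Unicode array, got {arr.dtype}")
    # Each character occupies 4 bytes, so itemsize // 4 is the fixed width.
    width = arr.dtype.itemsize // 4
    # Reinterpret the buffer as code points, one row per string.
    code_points = np.ascontiguousarray(arr).view(np.uint32).reshape(len(arr), width)
    # Shorter strings are padded with trailing NULs, which are stripped here.
    return [
        "".join(chr(c) for c in row.tolist()).rstrip("\x00")
        for row in code_points
    ]


arr = np.array(["a", "bc", "def"])
print(arr.dtype.itemsize)               # 12 bytes: 3 characters * 4 bytes each
print(decode_fixed_width_unicode(arr))  # ['a', 'bc', 'def']
```

In practice `arr.tolist()` gives the same list of Python strings; the manual version is only there to show that the padding has to be removed and the fixed-width layout converted before the data can become an Arrow string array.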

@ATL2001
Author

ATL2001 commented Nov 11, 2024

I can't promise I can get to it anytime soon, but looking at the rest of what's there, I think my Rust skills (although... rusty (that's a Rust joke)) are probably up to the challenge.

@kylebarron
Owner

That sounds awesome! You're welcome to make an effort and put up a PR, and then I can give you more direction once the PR is up.
