Array.from_numpy() not working with numpy array of strings or categories (possibly user issue) #256
Comments
It's really hard to infer the Arrow data type of arbitrary Python objects. You can make this work by specifying the type:
s = pd.Series(["a", "b", "c", "d"])
arr = Array(s, type=DataType.utf8())
pa.array(arr)
This uses the fact that a series is also a sequence. Unrelated: you should always use |
We should probably update |
I'm with you on the object datatype, which is why I tried
np_array = pd.Series(["a", "b", "c"]).to_numpy(dtype="U1")
print(np_array.dtype)
which results in the numpy array being "<U1" datatype instead of object (good call on the py in the code block, I'll remember that in the future!), similar to how
np_array = pd.Series([1, 2, 3]).to_numpy(dtype=np.int16)
Array.from_numpy(np_array)
results in an arro3.core.Array<Int16>, whereas
np_array = pd.Series([1, 2, 3]).to_numpy(dtype=np.int32)
Array.from_numpy(np_array)
results in an arro3.core.Array<Int32>. I haven't dug into the arro3 code at all to know if it would be possible to add something to use the numpy Unicode string dtype |
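For anyone following along, here is a minimal sketch of the dtype difference described in this comment. It uses plain NumPy rather than pandas (an assumption made to keep the example small) and does not call arro3 at all; it only shows what NumPy reports for each array.

```python
import numpy as np

# Letting NumPy infer the dtype from Python strings gives a fixed-width Unicode dtype.
unicode_arr = np.array(["a", "b", "c"])

# Forcing dtype=object stores references to arbitrary Python objects instead,
# which is the case where the Arrow type is hard to infer.
object_arr = np.array(["a", "b", "c"], dtype=object)

# An explicitly typed integer array, for comparison.
int16_arr = np.array([1, 2, 3], dtype=np.int16)

for name, arr in [("unicode", unicode_arr), ("object", object_arr), ("int16", int16_arr)]:
    # dtype.kind is a one-character code: "U" Unicode, "O" object, "i" signed integer.
    print(name, arr.dtype, arr.dtype.kind, arr.dtype.itemsize)
```

The `kind` code is what distinguishes a typed Unicode array from an object array, so it is a plausible thing to branch on when deciding whether a string conversion is possible.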
Yeah we'd need to update the if/else block here: arro3/pyo3-arrow/src/interop/numpy/from_numpy.rs Lines 26 to 54 in 652eb6d
so that it checks for string types |
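To make the idea concrete, here is a rough Python sketch of what "an if/else block that also checks for string types" could look like. This is not the Rust code in from_numpy.rs, and `arrow_type_name` is a hypothetical helper invented for illustration; it only mirrors the general shape of dispatching on a NumPy dtype's `kind` and `itemsize`, with the Arrow type names written as plain strings.

```python
def arrow_type_name(kind: str, itemsize: int) -> str:
    """Map a NumPy dtype kind and item size (in bytes) to an Arrow type name.

    Hypothetical helper for illustration only; not part of arro3 or pyo3-arrow.
    For a NumPy array `a`, pass `a.dtype.kind` and `a.dtype.itemsize`.
    """
    bits = itemsize * 8
    if kind == "i":
        return f"Int{bits}"
    elif kind == "u":
        return f"UInt{bits}"
    elif kind == "f":
        return f"Float{bits}"
    elif kind == "b":
        return "Boolean"
    elif kind == "U":
        # The extra branch discussed above: fixed-width Unicode arrays
        # (such as "<U1") would map to an Arrow string type.
        return "Utf8"
    else:
        # Object arrays (kind "O") land here, since their element type is unknown.
        raise TypeError(f"unsupported NumPy dtype kind: {kind!r}")


print(arrow_type_name("i", 2))  # int16
print(arrow_type_name("U", 4))  # "<U1" uses 4 bytes per element
```

Object arrays still raise in this sketch, which matches the point made earlier in the thread that the type of arbitrary Python objects is hard to infer.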
I can't promise I can get to it anytime soon, but looking at what's there, I think my rust skills (although... rusty (that's a rust joke)) are probably up to the challenge |
That sounds awesome! You're welcome to make an effort and put up a PR, and then I can give you more direction once the PR is up. |
Hey Kyle, I was working with apply_categorical_cmap in lonboard yesterday (which is great) and couldn't get it to work with a pandas series, regardless of whether the series had a string or a categorical datatype. If I create a pyarrow array from the series and lonboard passes that pyarrow array to arro3, it works, but when I try to send a pandas series or a numpy array of strings to arro3, it says it's an unsupported data type. I'm not sure if I'm doing something wrong or if something isn't working as expected.
Any ideas? Thanks!