Meaning of Column.offset? #67
Comments
An alternative meaning could be that, if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are? (Personally, I don't really like that we use the same class for both.)
The docstring seems to indicate it's indeed for chunks: dataframe-api/protocol/dataframe_protocol.py Lines 119 to 128 in 27b8e1c
So, the simplest way to support chunking would be to always return the same buffer for a particular column, but with a different offset and length, right?
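To make that question concrete, here is a minimal sketch of "same buffer, different offset and length" chunking. `ChunkView` and `get_chunks` are hypothetical names made up for illustration and are not part of the protocol; the sketch assumes the offset is counted in elements, not bytes.

```python
from array import array
from dataclasses import dataclass


@dataclass
class ChunkView:
    """Hypothetical stand-in for a chunk Column: shared buffer + offset + length."""
    buffer: array
    offset: int  # assumed to be in elements, not bytes
    length: int

    def values(self):
        # Interpret the shared buffer through (offset, length).
        return list(self.buffer[self.offset:self.offset + self.length])


def get_chunks(buffer, n_chunks):
    """Split one buffer into up to n_chunks views without copying it."""
    if len(buffer) == 0:
        return []
    size = -(-len(buffer) // n_chunks)  # ceiling division
    return [
        ChunkView(buffer, start, min(size, len(buffer) - start))
        for start in range(0, len(buffer), size)
    ]


data = array("q", range(10))
chunks = get_chunks(data, 3)
for chunk in chunks:
    print(chunk.offset, chunk.length, chunk.values())
```

Every chunk refers to the same `data` object; only `offset` and `length` differ.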
Ah, sorry, I wasn't thinking about the case where your original data isn't chunked but you could return it in chunks. Yeah, so that's indeed ambiguous in the spec: is the offset only informative about where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer? Given that it says "may be > 0 if using chunks", it might actually be the second (your interpretation).
It was indeed meant for supporting an offset into a data buffer - this could be for chunking, or perhaps for other reasons like returning a subset of rows from the original dataframe/buffer and not wanting to create a new buffer.
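As an illustration of returning a subset of rows without creating a new buffer, here is a rough sketch using an Arrow-style variable-length string layout (an offsets buffer plus a data buffer). The helper names `build_string_buffers` and `read_strings` are hypothetical, and the layout is a simplification (no validity bitmap), not the protocol's own definition.

```python
from array import array


def build_string_buffers(strings):
    """Build an int32 offsets buffer (n + 1 entries) and a UTF-8 data buffer."""
    offsets = array("i", [0])
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def read_strings(offsets, data, offset, length):
    """Read `length` rows starting at row `offset` from the shared buffers."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(offset, offset + length)
    ]


offsets, data = build_string_buffers(["foo", "", "barbaz", "qux"])

# The full column and the row subset share the same two buffers;
# only (offset, length) changes.
full = read_strings(offsets, data, offset=0, length=4)
subset = read_strings(offsets, data, offset=1, length=2)
print(full)
print(subset)
```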
Yes indeed. Although in practice I think chunks are normally coming from different buffers, because if all data fits in a single buffer then chunking isn't necessary.
Same basic principle, but
That's not it, I hope the docstring is clear enough. If not, we should extend it.
Agreed, it was a bit of a compromise between "I want one class per concept" and "I want as few classes as possible" opinions.
The latter.
Is its use similar to that in Arrow, such that if you slice a string array, it is still backed by the same buffers, but the offset and length of the column convey which part of the buffers should be used?
If that is the case, this can always be 0 for numpy and primitive Arrow arrays (except for Arrow booleans, since they are bit-packed), since we can always slice them, right?
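A small sketch of why the boolean case differs, under the assumption of an Arrow-style bitmap with least-significant-bit ordering. `pack_bits` and `read_bits` are hypothetical helpers written for this example. A byte-aligned primitive buffer can be sliced zero-copy so that the offset stays 0, whereas a bit-packed slice that starts mid-byte cannot be expressed as a buffer slice and needs an explicit offset.

```python
from array import array


def pack_bits(values):
    """Pack booleans into bytes, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def read_bits(buf, offset, length):
    """Read `length` booleans starting at bit `offset`."""
    return [
        bool((buf[(offset + i) // 8] >> ((offset + i) % 8)) & 1)
        for i in range(length)
    ]


# Primitive case: slicing the buffer itself is zero-copy, so offset can be 0.
ints = array("q", [10, 20, 30, 40, 50])
view = memoryview(ints)[2:4]
print(view.tolist())

# Boolean case: rows 3..7 start in the middle of the first byte,
# so the slice is described by (offset=3, length=5) on the same bitmap.
bools = [True, False, True, True, False, False, True, False, True, True]
packed = pack_bits(bools)
sliced = read_bits(packed, offset=3, length=5)
print(sliced)
```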