Meaning of Column.offset? #67

maartenbreddels · 2021-09-16T12:26:01Z

Is its use similar to Arrow's, such that if you slice a string array, you still back it with the same buffers, while the offset and length of the column convey which part of the buffer should be used?
If that is the case, this can always be 0 for NumPy and primitive Arrow arrays (except for Arrow booleans, since they are bit-packed), since we can always slice them, right?
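
To make the distinction concrete, here is a minimal sketch of why fixed-width data can be exported with an offset of 0 while bit-packed booleans cannot. It assumes Arrow-style least-significant-bit-first packing; `read_bits` is a hypothetical helper written for this illustration, not part of the protocol or of any library.

```python
import numpy as np

# Fixed-width data: slicing produces a view whose data pointer already points
# at the first selected element, so an exporter could report offset == 0.
values = np.arange(10, dtype=np.int64)
view = values[3:7]
assert np.shares_memory(values, view)  # no copy was made
byte_shift = (
    view.__array_interface__["data"][0] - values.__array_interface__["data"][0]
)
assert byte_shift == 3 * values.itemsize  # the pointer moved by 3 elements

# Bit-packed booleans (assumed least-significant bit first): element 3 sits in
# the middle of byte 0, so no byte-aligned pointer can address it directly.
mask = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], dtype=np.uint8)
packed = np.packbits(mask, bitorder="little")


def read_bits(packed_bytes, offset, length):
    """Hypothetical helper: decode `length` bits starting at bit `offset`."""
    bits = np.unpackbits(packed_bytes, bitorder="little")
    return bits[offset:offset + length]


# The consumer needs an explicit offset to find the slice inside the bytes.
sliced = read_bits(packed, offset=3, length=4)
```

Here `sliced` holds the same values as `mask[3:7]`, but only because the offset was passed along with the packed buffer.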

jorisvandenbossche · 2021-09-16T12:29:55Z

Alternative meaning could be that if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?

(personally I don't really like that we use the same class for both ..)

jorisvandenbossche · 2021-09-16T12:30:32Z

The docstring seems to indicate it's indeed for chunks:

dataframe-api/protocol/dataframe_protocol.py

Lines 119 to 128 in 27b8e1c

    
    @property
    def offset(self) -> int:
        """
        Offset of first element.

        May be > 0 if using chunks; for example for a column with N chunks of
        equal size M (only the last chunk may be shorter),
        ``offset = n * M``, ``n = 0 .. N-1``.
        """
        pass
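
The arithmetic in that docstring can be spelled out with a short sketch. The function name `chunk_offsets` and its parameters are made up for this illustration; the snippet only reproduces the `offset = n * M` rule and takes no position on how a consumer should interpret the offset.

```python
from typing import List, Tuple


def chunk_offsets(total_rows: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) per chunk; only the last chunk may be shorter."""
    return [
        (start, min(chunk_size, total_rows - start))
        for start in range(0, total_rows, chunk_size)
    ]


# 10 rows in chunks of M = 4 gives N = 3 chunks with offsets 0, 4, 8.
layout = chunk_offsets(total_rows=10, chunk_size=4)
```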

maartenbreddels · 2021-09-16T12:32:24Z

So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?

jorisvandenbossche · 2021-09-16T12:47:37Z

Ah, sorry, I wasn't thinking about the case where your original data isn't chunked but you could return it in chunks.

Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer? Given that it says "may be > 0 if using chunks", it might actually be the second (your interpretation).

rgommers · 2021-09-22T11:29:15Z

It was indeed meant for supporting an offset into a data buffer - this could be for chunking, or perhaps for other reasons like returning a subset of rows from the original dataframe/buffer and not wanting to create a new buffer.

> So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?

Yes indeed. In practice, though, I think chunks normally come from different buffers, because if all the data fits in a single buffer then chunking isn't necessary.
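
A minimal sketch of that "same buffer, different offset and length" approach might look like the following. `SketchColumn` and its attributes are hypothetical stand-ins invented for this example, not the protocol's actual `Column` class; the sketch assumes the offset is counted in elements of a fixed-width buffer.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SketchColumn:
    """Hypothetical stand-in for a protocol Column; not the real API."""

    buffer: np.ndarray  # the full, shared data buffer
    offset: int         # index of this column's first element in `buffer`
    length: int         # number of elements that belong to this column

    def get_chunks(self, n_chunks):
        """Yield up to `n_chunks` chunks that all reuse `buffer`."""
        if self.length == 0:
            return
        step = -(-self.length // n_chunks)  # ceiling division
        for start in range(0, self.length, step):
            yield SketchColumn(
                self.buffer,
                self.offset + start,
                min(step, self.length - start),
            )

    def to_numpy(self):
        """Consumer side: the offset decides which part of the buffer to read."""
        return self.buffer[self.offset:self.offset + self.length]


column = SketchColumn(np.arange(10), offset=0, length=10)
chunks = list(column.get_chunks(3))
```

Every chunk holds a reference to the same array, and only `offset` and `length` differ. The same mechanism would cover the other case mentioned above, returning a subset of rows without creating a new buffer.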

> Is its use similar to Arrow's, such that if you slice a string array, you still back it with the same buffers, while the offset and length of the column convey which part of the buffer should be used?

Same basic principle, but Column.offset is just a single value. Column.get_buffers returns an "offsets" buffer that's for variable-length data:

            - "offsets": a two-element tuple whose first element is a buffer
                         containing the offset values for variable-size binary
                         data (e.g., variable-length strings) and whose second
                         element is the offsets buffer's associated dtype. None
                         if the data buffer does not have an associated offsets
                         buffer.
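
To illustrate how the two notions of "offset" differ, here is a rough sketch for variable-length strings. It assumes an Arrow-like layout in which the offsets buffer has one more entry than there are strings, and it assumes that `Column.offset` counts elements and is applied to the offsets buffer; neither assumption is confirmed in this thread. `read_strings` is a hypothetical helper, not part of the protocol.

```python
import numpy as np

strings = ["foo", "", "barbaz", "qux"]
encoded = [s.encode("utf-8") for s in strings]

# "data" buffer: the bytes of all strings, back to back.
data = np.frombuffer(b"".join(encoded), dtype=np.uint8)

# "offsets" buffer (assumed layout): len(strings) + 1 entries, where string i
# occupies data[offsets[i]:offsets[i + 1]].
offsets = np.concatenate(
    [[0], np.cumsum([len(b) for b in encoded])]
).astype(np.int64)


def read_strings(data, offsets, column_offset, length):
    """Hypothetical helper: decode `length` strings from `column_offset` on."""
    out = []
    for i in range(column_offset, column_offset + length):
        start, stop = offsets[i], offsets[i + 1]
        out.append(data[start:stop].tobytes().decode("utf-8"))
    return out


# The single-valued column offset selects elements; the offsets buffer then
# locates each element's bytes inside the shared data buffer.
subset = read_strings(data, offsets, column_offset=1, length=2)
```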

> Alternative meaning could be that if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?

That's not it. I hope the docstring is clear enough; if not, we should extend it.

> (personally I don't really like that we use the same class for both ..)

Agreed, it was a bit of a compromise between "I want one class per concept" and "I want as few classes as possible" opinions.

> Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer?

The latter.

maartenbreddels mentioned this issue Sep 16, 2021

Draft for the Dataframe interchange protocol vaexio/vaex#1509

Merged

4 tasks

rgommers added the interchange-protocol label Sep 21, 2021

dchigarev mentioned this issue Feb 28, 2022

FEAT-#4144: Implement dataframe exchange protocol for pandas storage format modin-project/modin#4150

Merged

8 tasks

cnpryer mentioned this issue Jun 28, 2022

WIP: Implement DataFrame Interchange Protocol pola-rs/polars#3727

Closed

3 tasks

stinodego mentioned this issue Dec 1, 2022

feat(python): DataFrame interchange protocol implementation pola-rs/polars#5662

Closed

28 tasks
