Draft for the Dataframe interchange protocol #1509
Conversation
Excellent work, I gave it a first pass, hope that helps.
if self._col.values[0] in self.labels:
    for i in self._col.values:
        codes[np.where(codes == i)] = np.where(self.labels == i)  # if values are same as labels
else:
    codes = self._col.values  # values are already codes for the labels
I don't think this is needed. Could you explain what you are trying to do here?
I also think this will not be needed once I am able to use expression.codes.
But for now, if I take this example:
df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df = df.categorize('year', min_value=2012, max_value=2019)
df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
and use the .values function on the year and weekday columns, I get two different types of output. In the first case it is a list of labels and in the other case I get the codes:
array([2012, 2015, 2019])
array([0, 4, 6])
To get the codes for the year column I calculate them as in lines 402-404. For now I can't seem to find another way.
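For illustration, a minimal sketch in plain numpy (not the vaex code above) of the same label-to-code mapping the loop performs — the arrays are invented example data:

import numpy as np

# Look up each value's position in the label array to recover its code.
labels = np.array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
values = np.array(['Mon', 'Fri', 'Sun'])  # values that come back as labels
label_to_code = {label: code for code, label in enumerate(labels)}
codes = np.array([label_to_code[v] for v in values])
print(codes)  # [0 4 6]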
# If it is internal, kind must be categorical (23)
# If it is external (call from_dataframe), dtype must give type of the data
if self._col.df.is_category(self._col):
    return (_DtypeKind.CATEGORICAL, 64, 'u', '=')  # what should be the default??
This will have to be revised. See first two items in data-apis/dataframe-api#46 (comment)
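For reference, one possible direction (a sketch under my own assumptions, not the revision agreed in that discussion) is to describe the physical dtype of the category codes instead of hard-coding 64/'u'. The helper and format-string map below are hypothetical:

import numpy as np

# Hypothetical helper: report the bit width and Arrow-style format string of
# the codes array rather than a fixed (64, 'u').
_CATEGORICAL = 23  # kind used in the snippet above

def categorical_dtype(codes: np.ndarray):
    fmt = {'int8': 'c', 'int16': 's', 'int32': 'i', 'int64': 'l'}[str(codes.dtype)]
    return (_CATEGORICAL, codes.dtype.itemsize * 8, fmt, '=')

print(categorical_dtype(np.array([0, 4, 6], dtype='int8')))  # (23, 8, 'c', '=')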
# Join the chunks into tuple for now
df_new = vaex.concat(dataframe)
df_new._buffers = _buffers
A nested list may be hard to read. Could be changed to a list of dictionaries with column names as keys if that would be better.
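To make the two layouts concrete (placeholder values, purely illustrative — not the real buffer objects):

# Current shape: one list of per-column buffers per chunk.
buffers_nested = [
    ["buf_col_a_chunk0", "buf_col_b_chunk0"],
    ["buf_col_a_chunk1", "buf_col_b_chunk1"],
]
# Suggested alternative: one dict per chunk keyed by column name.
buffers_by_name = [
    {"col_a": "buf_col_a_chunk0", "col_b": "buf_col_b_chunk0"},
    {"col_a": "buf_col_a_chunk1", "col_b": "buf_col_b_chunk1"},
]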
@maartenbreddels @JovanVeljanoski don't know if it will stay green =) nevertheless I think this is ready for a review.
Convert an int, uint, float or bool column to an arrow array
"""
if col.offset != 0:
    raise NotImplementedError("column.offset > 0 not handled yet")
Is offset used in Vaex?
This allows plotly express to take in any dataframe that supports the dataframe protocol, see: https://data-apis.org/blog/dataframe_protocol_rfc/ https://data-apis.org/dataframe-protocol/latest/index.html Test includes an example with vaex, which should work with vaexio/vaex#1509 (not yet released)
Nice work! My biggest comment/change was the change of describe_null, which can be 3 options (no null, np.ma or arrow). I hope you agree with that.
Would love to see string support!
PS: Please keep the line endings as line feeds.
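For context, a minimal sketch of those three cases, assuming the interchange protocol's null-description kinds (0 = non-nullable, 3 = bit mask, 4 = byte mask); the actual vaex checks may differ:

import numpy as np
import pyarrow as pa

def describe_null(values):
    # Arrow-backed column: nulls live in a validity bitmap, a 0 bit marks a null
    if isinstance(values, (pa.Array, pa.ChunkedArray)):
        return 3, 0
    # numpy masked array: nulls live in a byte mask, a 1 marks a null
    if isinstance(values, np.ma.MaskedArray):
        return 4, 1
    # plain numpy array: no nulls encoded
    return 0, None

print(describe_null(pa.array([1, None, 3])))                   # (3, 0)
print(describe_null(np.ma.masked_array([1, 2], mask=[0, 1])))  # (4, 1)
print(describe_null(np.array([1, 2, 3])))                      # (0, None)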
if self.dtype[0] == _k.BOOL and isinstance(self._col.values, (pa.Array, pa.ChunkedArray)):
    buffer = _VaexBuffer(np.array(self._col.tolist(), dtype=bool))
else:
    buffer = _VaexBuffer(self._col.to_numpy())
This could make a copy. We can use .values/.evaluate(), but then we'd have to support arrow and numpy arrays, which should be fine. Maybe the following works:

Suggested change:
- buffer = _VaexBuffer(self._col.to_numpy())
+ buffer = _VaexBuffer(np.asarray(self._col.evaluate(), dtype=bool))
# If arrow array is boolean .to_numpy changes values for some reason
# For that reason data is transferred to numpy through .tolist
if self.dtype[0] == _k.BOOL and isinstance(self._col.values, (pa.Array, pa.ChunkedArray)):
    buffer = _VaexBuffer(np.array(self._col.tolist(), dtype=bool))
Not sure why that happens, I'll see if a test fails because of that.
Maybe I should restate: the values changed in a strange way when transferring data with buffer_to_ndarray in the case of a bool arrow array if .to_numpy was used.
But I will try to use the previous suggestion (.values/.evaluate()) for arrow and numpy and I will delete the comment if it works.
I get the same strange change of values with evaluate() as well. It happens in the case of arrow arrays (int and bool) with missing values:
df = vaex.from_arrays(
    arrow_int_m=pa.array([0, 1, 2, None, 0], mask=np.array([0, 0, 0, 1, 1], dtype=bool)),
    arrow_float_m=pa.array([0.5, 1.5, 2.5, None, 0.5], mask=np.array([0, 0, 0, 1, 0], dtype=bool)),
    arrow_bool_m=pa.array([True, False, True, None, True], mask=np.array([0, 0, 1, 1, 0], dtype=bool)),
)
col = df.__dataframe__().get_column_by_name('arrow_int_m')
b, d = col.get_buffers()["data"]
buffer_to_ndarray(b, d)
output:
array([ 0, 4607182418800017408, 4611686018427387904,
-2251799813685248, -2251799813685248], dtype=int64)
I can commit with the error so you can have a look?
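For what it's worth, the reported integers look like float64 bit patterns reinterpreted as int64 (my guess, not confirmed in this thread), which would mean the data buffer comes back as float64 (nulls promoted to NaN) while the dtype descriptor still says int64. A quick check of the first three values:

import numpy as np

# 0.0, 1.0 and 2.0 viewed through their raw int64 bit patterns reproduce the
# first three values of the strange output above.
print(np.array([0.0, 1.0, 2.0]).view(np.int64))
# [                  0 4607182418800017408 4611686018427387904]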
size = self.num_rows()
i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
iterator = []
for i1, i2, chunk in i:
    iterator.append(_VaexColumn(self._df[i1:i2]))
return iterator
Maybe I misunderstood, but I think we can simply replace the block above with:

yield self
I added this part thinking one could read a dataframe (that is not chunked) in chunks by specifying n_chunks (item 12 from the protocol-design-requirements).
But there are some errors left, sorry about that. It should be:
def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["_VaexDataFrame"]:
    """
    Return an iterator yielding the chunks.
    TODO: details on ``n_chunks``
    """
    if n_chunks is None:
        size = self.num_rows()
        n_chunks = self.num_chunks()
        i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
        iterator = []
        for i1, i2, chunk in i:
            iterator.append(_VaexDataFrame(self._df[i1:i2]))
        return iterator
    elif self.num_chunks() == 1:
        size = self.num_rows()
        i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
        iterator = []
        for i1, i2, chunk in i:
            iterator.append(_VaexDataFrame(self._df[i1:i2]))
        return iterator
    else:
        raise ValueError("Dataframe is already chunked.")
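A possible consolidation of the two near-identical branches above, as a sketch only (same assumptions about evaluate_iterator, not the merged implementation):

def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["_VaexDataFrame"]:
    """Return an iterator yielding the chunks."""
    if n_chunks is not None and self.num_chunks() != 1:
        raise ValueError("Dataframe is already chunked.")
    n_chunks = n_chunks or self.num_chunks()
    size = self.num_rows()
    # evaluate_iterator yields (start, end, chunk) ranges over the first column
    for i1, i2, _ in self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks):
        yield _VaexDataFrame(self._df[i1:i2])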
Awesome to see the strings coming in, getting close!
@maartenbreddels I added the method suggested to produce the buffers for string dtype. Also added the mask handling in the …
@AlenkaF many many thanks for your work, this is actually already released in vaex-core 4.6.0a3
This is really great to see - thanks a lot @AlenkaF and @maartenbreddels!
* 🐛 handle offset for categories
* Draft for the Dataframe interchange protocol
* Adding test for virtual column plus typo.
* Roundtrip test change plus some corrections in functions parameters
* Apply suggestions from code review
* Dtype for arrow dict plus use of arrow dict in convert_categorical_column
* Add missing value handling
* Added chunk handling and tests
* Corrected usage of metadata for categories
* Applying changes from general dataframe protocol
* Delete copy error
* Change sentinel value handling in convert_categorical_column
* Add select_columns() and test
* Update to _get_data_buffer() for Arrow Dictionary
* Minor commenting changes
* Correct typo error
* Add _VaexBuffer test
* Add tests and correction for _VaexColumn
* Added tests for _VaexDataFrame
* Added more tests and one correction for format_str
* format to LF and black
* support passing in allow_copy
* correct descibe_null for arrow and numpy
* correct _get_validity_buffer to match describe_null
* correct describe_null, convert_categorical_column and test_categorical_ordinal for categorical dtypes
* Apply suggestions from code review
* correct get_chunks for _VaexDataFrame
* Replace return with yield in get_chunks
* Check for LF and run black with -l 220
* Black with line length 220
* Add string dtype support
* Add Arrow Dict check to describe_categorical
* avoid copying data for strings
* small fix
* also test sliced dataframe
* test that we do not copy data
* Apply string no-mem copy suggestions
* fix and test get_chunks
* use future ordinal encoding feature
* make test work with dict encoded

Co-authored-by: Maarten A. Breddels <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
This is the first draft of the Dataframe protocol implementation for Vaex.
It works for categorize().
It is erroring with Arrow Dictionary when trying to convert the dtype with arrow_type.to_pandas_dtype(), line 305 in _dtype_from_vaexdtype. Do you have any suggestions?
What is still on my todo list:
* expression.codes, see ✨ Expression.codes gives the numerical codes for categories/dict encoded #1503
cc @maartenbreddels @JovanVeljanoski
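Regarding the Arrow Dictionary error mentioned above, a small illustration of one possible angle (my assumption, not a confirmed fix): inspect the index and value types of the dictionary type separately rather than converting the whole type at once.

import pyarrow as pa

# A dictionary-encoded Arrow type carries two dtypes: one for the codes and
# one for the categories; they can be inspected separately.
t = pa.dictionary(pa.int32(), pa.string())
print(t.index_type)   # int32  -> physical dtype of the codes
print(t.value_type)   # string -> dtype of the category values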