-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove cudf._lib.column in favor of pylibcudf. #17760
base: branch-25.02
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving CMake. I did not review the Python code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small suggestions. I haven't diligently gone through every line but I think most of it is just Column.from_pylibcudf -> ColumnBase.from_pylibcudf
?
return cudf.core.column.build_column( # type: ignore[return-value] | ||
data=self.data, | ||
dtype=self.dtype, | ||
mask=mask, | ||
size=self.size, | ||
offset=0, | ||
children=self.children, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: (and possibly this is just moving code so it was already there before). What happens if the previous column offset was > 0
. Does that not mean the data
buffer pointer is now in the wrong place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this was part of just moving code, but the docstring claims
Replaces the mask buffer of the column and returns a new column. This will zero the column offset, compute a new mask buffer if necessary, and compute new data Buffers zero-copy that use pointer arithmetic to properly adjust the pointer.
Last changed in #4057
@property | ||
def null_count(self) -> int: | ||
if self._null_count is None: | ||
if not self.nullable or self.size == 0: | ||
self._null_count = 0 | ||
else: | ||
with acquire_spill_lock(): | ||
self._null_count = plc.null_mask.null_count( | ||
self.base_mask.get_ptr(mode="read"), # type: ignore[union-attr] | ||
self.offset, | ||
self.offset + self.size, | ||
) | ||
return self._null_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: pylibcudf columns claim to have a correct null count by construction, why must we launch a kernel here? Is it because we don't always start from a pylibcudf column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it because we don't always start from a pylibcudf column?
Yes correct. At least in cudf classic there are routines that use build_column
to make a new column, and build_column
accepts a null_count: int | None
so it appears cudf classic lazily computes the null_count
if not provided.
python/cudf/cudf/utils/dtypes.py
Outdated
@@ -628,6 +628,35 @@ def dtype_to_pylibcudf_type(dtype) -> plc.DataType: | |||
return plc.DataType(SUPPORTED_NUMPY_TO_PYLIBCUDF_TYPES[dtype]) | |||
|
|||
|
|||
def dtype_from_pylibcudf_column(col: plc.Column): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: type for the return type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Done
cpdef Column column_from_column_view(Column col): | ||
""" | ||
Return a new Column from a Column.view(). | ||
""" | ||
return Column.from_libcudf(move(make_unique[column](col.view()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this one when you can just call col.column_from_self_view
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops this was a remnant when I thought this should have been a free function. Removed
cpdef Column column_from_self_view(self): | ||
"""Return a new column from self.view().""" | ||
return Column.from_libcudf(move(make_unique[column](self.view()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: shall we call this what it is? cpdef Column copy(self)
? Or possibly deepcopy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, this made me realize pylibcudf Column already has a copy
method. Thanks I was able to not reinvent the wheel here then
Description
Removes
cudf._lib.column.Column
and moves its methods and attributes tocolumn.core.column.ColumnBase
cudf.core._internals
needed to start returningpylibcudf.Column
s to avoid circular importspylibcudf.Column.column_from_self_view
to return apylibcudf.Column
from its own view. This was meant to replace this snippet(Appears this is needed to calculate the children from a new column with a different size, open to better ways to do this)
Checklist