Remove cudf._lib.column in favor of pylibcudf. #17760

Open
wants to merge 9 commits into
base: branch-25.02

mroeschke
Contributor

Description

Removes cudf._lib.column.Column and moves its methods and attributes to cudf.core.column.ColumnBase.

  • Some methods in cudf.core._internals needed to start returning pylibcudf.Columns to avoid circular imports
  • I added pylibcudf.Column.column_from_self_view to return a pylibcudf.Column from its own view. This was meant to replace this snippet:

        children = Column.from_unique_ptr(
            move(make_unique[column](self.view()))
        ).base_children

(It appears this is needed to compute the children of a new column with a different size; I'm open to better ways to do this.)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke added the Python, improvement, and non-breaking labels on Jan 17, 2025
@mroeschke self-assigned this on Jan 17, 2025
@mroeschke requested review from a team as code owners on January 17, 2025 03:18
@github-actions bot added the CMake and pylibcudf labels on Jan 17, 2025
Contributor

@bdice bdice left a comment

Approving CMake. I did not review the Python code.

Contributor

@wence- wence- left a comment

Some small suggestions. I haven't diligently gone through every line but I think most of it is just Column.from_pylibcudf -> ColumnBase.from_pylibcudf?

Comment on lines +311 to +318
return cudf.core.column.build_column(  # type: ignore[return-value]
    data=self.data,
    dtype=self.dtype,
    mask=mask,
    size=self.size,
    offset=0,
    children=self.children,
)
Contributor

Question (possibly this is just moved code, so it was already there before): what happens if the previous column offset was > 0? Does that not mean the data buffer pointer is now in the wrong place?

Contributor Author

Yeah, this was just part of moving code, but the docstring claims:

Replaces the mask buffer of the column and returns a new column. This will zero the column offset, compute a new mask buffer if necessary, and compute new data Buffers zero-copy that use pointer arithmetic to properly adjust the pointer.

Last changed in #4057
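
As a host-side analogy for the docstring's claim (this is not cuDF code), a zero-copy slice can stand in for the offset-adjusted data buffer. The sketch below uses Python's memoryview; rebase_data and its parameters are hypothetical names, and the byte arithmetic is an assumption about how an element offset maps onto the start of the new buffer.

```python
from array import array


def rebase_data(base: memoryview, offset: int, size: int, itemsize: int) -> memoryview:
    """Return a zero-copy byte view covering `size` elements from element `offset`.

    Hypothetical helper: mimics adjusting a buffer pointer so that the
    resulting column can use offset=0.
    """
    start = offset * itemsize          # byte position of the first element
    stop = start + size * itemsize     # one past the last byte
    return base[start:stop]            # slicing a memoryview does not copy


values = array("i", [10, 20, 30, 40, 50])
base = memoryview(values).cast("B")    # raw bytes of the original buffer

# Old column: offset=2, size=3. New column: offset=0 over the rebased view.
data = rebase_data(base, offset=2, size=3, itemsize=values.itemsize)
typed = data.cast("i")

print(typed.tolist())                  # [30, 40, 50]

# The view shares memory with the base buffer, so writes are visible.
values[2] = 99
print(typed[0])                        # 99
```

If self.data already refers to a view like this one, passing offset=0 alongside it would be consistent, which appears to be what the docstring describes.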

Comment on lines +320 to +332
@property
def null_count(self) -> int:
    if self._null_count is None:
        if not self.nullable or self.size == 0:
            self._null_count = 0
        else:
            with acquire_spill_lock():
                self._null_count = plc.null_mask.null_count(
                    self.base_mask.get_ptr(mode="read"),  # type: ignore[union-attr]
                    self.offset,
                    self.offset + self.size,
                )
    return self._null_count
Contributor

question: pylibcudf columns claim to have a correct null count by construction, so why must we launch a kernel here? Is it because we don't always start from a pylibcudf column?

Contributor Author

Is it because we don't always start from a pylibcudf column?

Yes, correct. At least in cudf classic, there are routines that use build_column to make a new column, and build_column accepts null_count: int | None, so it appears cudf classic lazily computes the null_count if it is not provided.
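
The lazy pattern described here can be sketched in plain Python. This is an illustration only: LazyNullCountColumn is a hypothetical class, the mask is assumed to be an Arrow-style validity bitmask (least significant bit first, 1 = valid), and the bit loop stands in for the plc.null_mask.null_count kernel call.

```python
from typing import Optional


class LazyNullCountColumn:
    """Hypothetical stand-in for a column that caches its null count."""

    def __init__(
        self,
        mask: Optional[bytes],
        offset: int,
        size: int,
        null_count: Optional[int] = None,
    ) -> None:
        self.mask = mask
        self.offset = offset
        self.size = size
        # None means "not computed yet"; a caller may supply the count up front.
        self._null_count = null_count

    @property
    def null_count(self) -> int:
        if self._null_count is None:
            if self.mask is None or self.size == 0:
                self._null_count = 0
            else:
                # Count valid bits in [offset, offset + size), then invert.
                valid = sum(
                    (self.mask[i // 8] >> (i % 8)) & 1
                    for i in range(self.offset, self.offset + self.size)
                )
                self._null_count = self.size - valid
        return self._null_count


col = LazyNullCountColumn(mask=bytes([0b00101101]), offset=1, size=5)
print(col.null_count)  # 2 (bits 1 and 4 are unset within bits 1..5)
```

The count is computed at most once per column, so a column built with a known null_count never pays for the scan.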

@@ -628,6 +628,35 @@ def dtype_to_pylibcudf_type(dtype) -> plc.DataType:
    return plc.DataType(SUPPORTED_NUMPY_TO_PYLIBCUDF_TYPES[dtype])


def dtype_from_pylibcudf_column(col: plc.Column):
Contributor

nit: annotate the return type?

Contributor Author

Thanks. Done

Comment on lines 472 to 476
cpdef Column column_from_column_view(Column col):
    """
    Return a new Column from a Column.view().
    """
    return Column.from_libcudf(move(make_unique[column](col.view())))
Contributor

Do you need this one when you can just call col.column_from_self_view?

Contributor Author

Oops, this was a remnant from when I thought this should be a free function. Removed.

Comment on lines 176 to 178
cpdef Column column_from_self_view(self):
    """Return a new column from self.view()."""
    return Column.from_libcudf(move(make_unique[column](self.view())))
Contributor

question: shall we call this what it is? cpdef Column copy(self)? Or possibly deepcopy?

Contributor Author

Oh wow, this made me realize the pylibcudf Column already has a copy method. Thanks, I was able to avoid reinventing the wheel here.

@github-actions bot removed the pylibcudf label on Jan 21, 2025
Labels
CMake, improvement, non-breaking, Python
Projects
Status: In Progress

3 participants