Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Host compression #17656

Open
wants to merge 46 commits into
base: branch-25.02
Choose a base branch
from
Open

Host compression #17656

wants to merge 46 commits into from

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Dec 23, 2024

Description

Add compression APIs to make the nvCOMP use transparent.
Remove direct dependency on nvCOMP in the ORC and Parquet writers.
Add multi-threaded host-side compression; currently off by default, can only be enabled via LIBCUDF_USE_HOST_COMPRESSION environment variable.

Currently the host compression adds D2H + H2D transfers. Avoiding the H2D transfer requires large changes to the writers.

Also moved handling of the AUTO compression type to the options classes, which should own such defaults (translate AUTO to SNAPPY in this case).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Dec 23, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 23, 2024
@vuule vuule added feature request New feature or request non-breaking Non-breaking change labels Dec 23, 2024
@@ -342,7 +342,7 @@ class writer::impl {
// Writer options.
stripe_size_limits const _max_stripe_size;
size_type const _row_index_stride;
CompressionKind const _compression_kind;
compression_type const _compression;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping this as cudf's compression type leads to much fewer conversions

@vuule vuule marked this pull request as ready for review January 8, 2025 19:03
@vuule vuule requested a review from a team as a code owner January 8, 2025 19:03
@vuule vuule requested review from mythrocks and PointKernel January 8, 2025 19:03
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review -

cpp/src/io/comp/comp.cpp Show resolved Hide resolved
cpp/src/io/comp/comp.cpp Show resolved Hide resolved
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I like how it identifies common utilities between ORC and Parquet to reduce duplication and improve consistency for ease of use.

void set_compression(compression_type comp)
{
_compression = comp;
if (comp == compression_type::AUTO) { _compression = compression_type::SNAPPY; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my education, Is it common sense that AUTO should be just SNAPPY?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compression_type is common for all file formats, so AUTO may mean different compression types for different formats.
I guess we could remove AUTO and set a concrete compression type as the default for each format.

cpp/src/io/comp/comp.cpp Outdated Show resolved Hide resolved
cpp/src/io/comp/comp.hpp Outdated Show resolved Hide resolved
cpp/src/io/comp/comp.hpp Show resolved Hide resolved
@vuule vuule requested a review from a team as a code owner January 10, 2025 21:06
@github-actions github-actions bot added the CMake CMake build issue label Jan 10, 2025
@vuule vuule requested a review from shrshi January 10, 2025 21:07
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! :)

stream.synchronize();

std::vector<std::future<size_t>> tasks;
auto const streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we cap the stream pool size to 32 here to avoid the warning in get_streams? (

if (count > STREAM_POOL_SIZE) {
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about that warning! Yup, should be capped at stream pool size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can consider moving the STREAM_POOL_SIZE variable to the header stream_pool.hpp and then invoke fork_streams as

Suggested change
auto const streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
auto const streams = cudf::detail::fork_streams(stream, std::min(STREAM_POOL_SIZE, h_comp_pool().get_thread_count()));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to use cudf::detail::global_cuda_stream_pool().get_stream_pool_size(), that should work without changes to the pool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the change, thanks for the suggestion!
also added an env var, since this reminded me that the default thread count could be bad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, good idea about the thread count env var, I'll use it in the JSON PR #17708 as well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Burndown
Development

Successfully merging this pull request may close these issues.

3 participants