Host compression #17656

vuule · 2024-12-23T21:29:24Z

Description

Add compression APIs that make the use of nvCOMP transparent.
Remove the direct dependency on nvCOMP in the ORC and Parquet writers.
Add multi-threaded host-side compression; it is currently off by default and can only be enabled via the LIBCUDF_USE_HOST_COMPRESSION environment variable.
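
A minimal sketch of how an opt-in switch like this can be read, assuming nothing about the actual libcudf implementation. The helpers parse_flag and use_host_compression are hypothetical names, and the set of accepted values ("1", "ON", "TRUE" and lowercase variants) is an assumption; the PR description does not say which values enable the feature.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: interprets a raw environment value as a boolean flag.
// The accepted spellings are an assumption, not taken from libcudf.
bool parse_flag(char const* raw, bool default_value)
{
  if (raw == nullptr) { return default_value; }  // variable not set
  std::string const value{raw};
  return value == "1" || value == "ON" || value == "on" || value == "TRUE" || value == "true";
}

// Hypothetical wrapper: host compression stays off unless the variable enables it.
bool use_host_compression()
{
  return parse_flag(std::getenv("LIBCUDF_USE_HOST_COMPRESSION"), false);
}

// Usage checks for the parsing rules above.
inline void example_usage()
{
  assert(parse_flag("ON", false));
  assert(!parse_flag(nullptr, false));
}
```

Keeping the parsing separate from the getenv call makes the default (off) easy to test without touching the process environment.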

Currently, host compression adds D2H and H2D transfers. Avoiding the H2D transfer would require large changes to the writers.

Also moved handling of the AUTO compression type to the options classes, which should own such defaults (translate AUTO to SNAPPY in this case).

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Co-authored-by: Bradley Dice <[email protected]>


copy-pr-bot · 2024-12-23T21:29:27Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


vuule · 2025-01-08T17:53:49Z

cpp/src/io/orc/writer_impl.hpp

@@ -342,7 +342,7 @@ class writer::impl {
  // Writer options.
  stripe_size_limits const _max_stripe_size;
  size_type const _row_index_stride;
-  CompressionKind const _compression_kind;
+  compression_type const _compression;


Keeping this as cudf's compression type leads to far fewer conversions.

shrshi

Partial review -

cpp/src/io/comp/comp.cpp

PointKernel

Looks good.

I like how it identifies common utilities between ORC and Parquet to reduce duplication and improve consistency for ease of use.

PointKernel · 2025-01-09T00:40:41Z

cpp/include/cudf/io/orc.hpp

+  void set_compression(compression_type comp)
+  {
+    _compression = comp;
+    if (comp == compression_type::AUTO) { _compression = compression_type::SNAPPY; }


For my education, is it common sense that AUTO should be just SNAPPY?

compression_type is common for all file formats, so AUTO may mean different compression types for different formats.
I guess we could remove AUTO and set a concrete compression type as the default for each format.
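
A self-contained sketch of the pattern discussed in this thread: the options class resolves AUTO to a concrete type once, so downstream writer code never has to handle AUTO. The enum and the class writer_options_sketch are simplified stand-ins written for illustration, not the real cudf::io types; only the AUTO-to-SNAPPY mapping comes from the diff above.

```cpp
#include <cassert>

// Stand-in for cudf's compression_type; only the values needed here are listed.
enum class compression_type { NONE, AUTO, SNAPPY, ZSTD };

// Hypothetical, simplified options class (not the real ORC writer options).
class writer_options_sketch {
 public:
  void set_compression(compression_type comp)
  {
    // Resolve AUTO here, so the writer only ever sees a concrete type.
    _compression = (comp == compression_type::AUTO) ? compression_type::SNAPPY : comp;
  }

  [[nodiscard]] compression_type get_compression() const { return _compression; }

 private:
  compression_type _compression = compression_type::SNAPPY;
};

// Usage: AUTO is translated when it is set, every other value is stored as-is.
inline void example_usage()
{
  writer_options_sketch opts;
  opts.set_compression(compression_type::AUTO);
  assert(opts.get_compression() == compression_type::SNAPPY);
}
```

Because the mapping lives in the format-specific options class, another format could resolve AUTO to a different type without changing the shared enum.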

cpp/src/io/comp/comp.cpp

cpp/src/io/comp/comp.hpp


shrshi

Looks good to me! :)

shrshi · 2025-01-10T23:38:10Z

cpp/src/io/comp/comp.cpp

+  stream.synchronize();
+
+  std::vector<std::future<size_t>> tasks;
+  auto const streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());


Nit: should we cap the stream pool size to 32 here to avoid the warning in get_streams?

cudf/cpp/src/utilities/stream_pool.cpp, line 131 in dc2a75c:

if (count > STREAM_POOL_SIZE) {

I forgot about that warning! Yup, should be capped at stream pool size.

We can consider moving the STREAM_POOL_SIZE variable to the header stream_pool.hpp and then invoke fork_streams as:

Suggested change

-  auto const streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
+  auto const streams = cudf::detail::fork_streams(stream, std::min(STREAM_POOL_SIZE, h_comp_pool().get_thread_count()));

I'm trying to use cudf::detail::global_cuda_stream_pool().get_stream_pool_size(); that should work without changes to the pool.

Made the change, thanks for the suggestion!
Also added an env var, since this reminded me that the default thread count could be bad.

Ooh, good idea about the thread count env var, I'll use it in the JSON PR #17708 as well!
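
The capping idea from this thread can be sketched without any CUDA dependencies. Both functions below are hypothetical illustrations: capped_stream_count mirrors the std::min suggestion, and the round-robin mapping of tasks to streams is an assumption about how a smaller set of streams could be shared, not something stated in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of streams to fork, never more than the pool holds.
// Both arguments share one type so that std::min deduces cleanly.
std::size_t capped_stream_count(std::size_t thread_count, std::size_t stream_pool_size)
{
  return std::min(thread_count, stream_pool_size);
}

// Assumed scheme: each task gets a stream index in round-robin order,
// so tasks share streams when there are more tasks than streams.
std::vector<std::size_t> assign_streams(std::size_t num_tasks, std::size_t num_streams)
{
  assert(num_streams > 0);  // precondition: at least one stream to assign
  std::vector<std::size_t> assignment(num_tasks);
  for (std::size_t i = 0; i < num_tasks; ++i) { assignment[i] = i % num_streams; }
  return assignment;
}
```

With a pool of 32 streams and 64 worker threads, capped_stream_count returns 32, which stays at the threshold checked in the quoted get_streams condition.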


vuule and others added 20 commits December 17, 2024 16:55

random clean up

81dcfa6

jesus

4f7794d

Merge branch 'branch-25.02' into comp-headers-cleanup

3166acb

style

b3f03e8

style

53205c5

Merge branch 'branch-25.02' into comp-headers-cleanup

05d07ba

Merge branch 'branch-25.02' into comp-headers-cleanup

7d23502

Update cpp/src/io/comp/common.hpp

324d635

Co-authored-by: Bradley Dice <[email protected]>

Merge branch 'branch-25.02' into comp-headers-cleanup

350db40

Merge branch 'branch-25.02' into comp-headers-cleanup

0cf8375

fix

947fbd4

Merge branch 'comp-headers-cleanup' of https://github.com/vuule/cudf …

0a64f1c

…into comp-headers-cleanup

fix some more

54d9bb9

Merge branch 'branch-25.02' of https://github.com/rapidsai/cudf into …

2ca535b

…high-lvl-comp-api

avoid part of nvcomp enabled checks in writers

963f066

single-threaded host comp

e119ad8

now with more threads

3ab8c41

decouple orc writer from nvcomp

a14b351

Merge branch 'branch-25.02' of https://github.com/rapidsai/cudf into …

a8e1dec

…high-lvl-comp-api

decouple pq writer from nvcomp

0b6b72d

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 23, 2024

github-actions bot assigned vuule Dec 23, 2024

vuule added feature request New feature or request non-breaking Non-breaking change labels Dec 23, 2024

vuule added 5 commits December 23, 2024 14:08

missed DEFLATE

d83abac

simplify

e2dce81

clean up

9b8fc71

fix

7aaf5ed

Merge branch 'branch-25.02' into high-lvl-comp-api

9a5ca7d

vuule added 12 commits January 2, 2025 11:23

style some more

e010f9f

Merge branch 'branch-25.02' into high-lvl-comp-api

19cd311

Merge branch 'branch-25.02' into high-lvl-comp-api

3094173

Merge branch 'branch-25.02' of https://github.com/rapidsai/cudf into …

b83c1ff

…high-lvl-comp-api

handle AUTO compression in options

8ceecff

Merge branch 'high-lvl-comp-api' of https://github.com/vuule/cudf int…

7fa6055

…o high-lvl-comp-api

Merge branch 'branch-25.02' into high-lvl-comp-api

b2cdcf4

style

b5b06aa

Merge branch 'high-lvl-comp-api' of https://github.com/vuule/cudf int…

bfae53a

…o high-lvl-comp-api

Merge branch 'branch-25.02' into high-lvl-comp-api

11ca033

Merge branch 'branch-25.02' into high-lvl-comp-api

ea04f43

Merge branch 'branch-25.02' into high-lvl-comp-api

3e403b0

vuule commented Jan 8, 2025


vuule added 2 commits January 8, 2025 11:02

remove unused function

ae1b980

Merge branch 'branch-25.02' into high-lvl-comp-api

a725970

vuule marked this pull request as ready for review January 8, 2025 19:03

vuule requested a review from a team as a code owner January 8, 2025 19:03

vuule requested review from mythrocks and PointKernel January 8, 2025 19:03

shrshi reviewed Jan 8, 2025


PointKernel approved these changes Jan 9, 2025


vuule added 2 commits January 10, 2025 13:06

code review suggestions

05b5f3d

Merge branch 'high-lvl-comp-api' of https://github.com/vuule/cudf int…

70baa8d

…o high-lvl-comp-api

vuule requested a review from a team as a code owner January 10, 2025 21:06

Merge branch 'branch-25.02' into high-lvl-comp-api

b81c7f5

github-actions bot added the CMake CMake build issue label Jan 10, 2025

vuule requested a review from shrshi January 10, 2025 21:07

shrshi approved these changes Jan 10, 2025


vuule added 2 commits January 10, 2025 17:03

env var; limit num streams

c72b66d

Merge branch 'high-lvl-comp-api' of https://github.com/vuule/cudf int…

559ca43

…o high-lvl-comp-api
