
chore(bench): improve heuristic to run throughput benchmarks #1868

Open · wants to merge 1 commit into base: main
Conversation

@soonum (Contributor) commented Dec 13, 2024

This changes the way we define the number of elements to load in the throughput pipeline.
Light operations need more elements to saturate a backend. Conversely, heavy operations require fewer elements to run in a reasonable time.
This improvement dramatically decreases the total benchmark duration.

This has been tested with the following operations:

  • bitand (light operation)
  • add (mid operation)
  • div_rem (heavy operation)

Saturation was tested successfully on the following backends:

  • CPU
  • GPU

@cla-bot cla-bot bot added the cla-signed label Dec 13, 2024
@soonum soonum self-assigned this Dec 13, 2024
@soonum (Contributor, Author) commented Dec 13, 2024

I need a review of the design before extending it to all the other benchmark functions that handle throughput variants.

@IceTDrinker (Member) commented

What's the idea of the heuristic?

Load depending on how many PBS there are in an operation?

@soonum (Contributor, Author) commented Dec 13, 2024

> What's the idea of the heuristic? Load depending on how many PBS there are in an operation?

Yes, this is the idea. The load is computed as the number of available threads divided by the number of PBS needed for one operation.
This value is then used as a coefficient within the previous implementation. Typically the coefficient is greater than 1.0 for quick operations, if the machine is big enough, and below 1.0 for slow operations.
That way, the number of elements to process during the throughput benchmark is generated dynamically.
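The heuristic described above could be sketched roughly as follows. This is an illustrative sketch only, not the actual tfhe-rs implementation: the function and parameter names are assumptions, and the real code plugs the coefficient into the pre-existing element-count computation.

```rust
/// Hypothetical sketch of the loading heuristic: scale a baseline element
/// count by (available threads / PBS per operation), so light operations
/// (few PBS) get many elements and heavy operations get fewer.
fn throughput_num_elements(num_threads: usize, pbs_per_op: usize, base_elements: usize) -> usize {
    // Coefficient > 1.0 for light ops on a big enough machine,
    // < 1.0 for heavy ops needing many PBS per operation.
    let coefficient = num_threads as f64 / pbs_per_op as f64;
    // Scale the baseline and round up so at least one element is loaded.
    ((base_elements as f64) * coefficient).ceil() as usize
}
```

For example, with 64 threads, a light op costing 1 PBS would be loaded with 64x the baseline, while a heavy op costing 256 PBS would get a quarter of it.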

@IceTDrinker (Member) commented

> Yes, this is the idea. The load is computed as the number of available threads divided by the number of PBS needed for one operation. This value is then used as a coefficient within the previous implementation. Typically the coefficient is greater than 1.0 for quick operations, if the machine is big enough, and below 1.0 for slow operations. That way, the number of elements to process during the throughput benchmark is generated dynamically.

Did you check that the throughput measured with the new loading factor is similar to the old one?

@soonum (Contributor, Author) commented Dec 13, 2024

Yes I've just checked and they are the same 🎉

@soonum soonum force-pushed the dt/bench/throughput_heuristic branch 8 times, most recently from 1fb02c8 to d3593b0 Compare December 20, 2024 09:48
@soonum (Contributor, Author) commented Dec 20, 2024

get_pbs_count() does not increment the atomic counter on GPU.
Thus I cannot test this new implementation on GPU for now.
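For context, a PBS counter of the kind discussed here can be implemented with a shared atomic. The sketch below is an assumption about the general shape, not the actual tfhe-rs pbs-stats internals; the static name and `record_pbs` helper are illustrative.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative global counter; the real implementation lives behind the
// "pbs-stats" feature in tfhe-rs and may differ.
static PBS_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Called from every thread that performs a PBS. Many threads bumping a
/// single shared counter is also why enabling the stats can cost CPU
/// performance, as noted later in this thread.
fn record_pbs() {
    PBS_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Read the number of PBS recorded so far.
fn get_pbs_count() -> usize {
    PBS_COUNT.load(Ordering::Relaxed)
}
```

If the GPU path never calls the increment, the counter stays at zero there, which matches the symptom described above.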

@IceTDrinker (Member) commented

> get_pbs_count() does not increment the atomic counter on GPU. Thus I cannot test this new implementation on GPU for now.

Use the CPU stats; they should be similar, I would think.

@soonum soonum force-pushed the dt/bench/throughput_heuristic branch 8 times, most recently from 603cd8b to dc59088 Compare January 6, 2025 14:52
@soonum soonum marked this pull request as ready for review January 6, 2025 15:19
@soonum (Contributor, Author) commented Jan 6, 2025

The PBS count for GPU is fixed. Ready for review now.

@agnesLeroy (Contributor) left a comment


Hey! Thanks a lot @soonum! I only have some minor fixes 🙂 do you know how long the throughput benches take now?

tfhe/benches/integer/bench.rs Show resolved Hide resolved
tfhe/benches/integer/bench.rs Outdated Show resolved Hide resolved
tfhe/benches/integer/bench.rs Outdated Show resolved Hide resolved
tfhe/benches/integer/bench.rs Show resolved Hide resolved
tfhe/benches/integer/signed_bench.rs Outdated Show resolved Hide resolved
tfhe/benches/integer/signed_bench.rs Outdated Show resolved Hide resolved
tfhe/benches/integer/signed_bench.rs Outdated Show resolved Hide resolved
tfhe/benches/integer/signed_bench.rs Show resolved Hide resolved
tfhe/benches/integer/signed_bench.rs Outdated Show resolved Hide resolved
tfhe/benches/integer/signed_bench.rs Outdated Show resolved Hide resolved
@soonum (Contributor, Author) commented Jan 7, 2025

> Hey! Thanks a lot @soonum! I only have some minor fixes 🙂 do you know how long the throughput benches take now?

Yes, for Cuda benchmarks we're now down to 46 minutes for de-duplicated operations on 64 bits.
I'll launch a run over all precisions on the default ops today.

@soonum soonum force-pushed the dt/bench/throughput_heuristic branch 3 times, most recently from 0a7b264 to 4c0162f Compare January 9, 2025 10:27
This is done to load the backend with enough elements to saturate it,
while avoiding long execution times for heavy operations like
multiplication or division.
@soonum soonum force-pushed the dt/bench/throughput_heuristic branch from 4c0162f to edb6501 Compare January 9, 2025 10:30
@soonum soonum requested a review from agnesLeroy January 9, 2025 11:11
@IceTDrinker (Member) left a comment


I saw I was still supposed to review something here?

Btw @soonum, I just thought about something: enabling the PBS stats can have an impact on CPU performance, since we update a single counter from many threads. So I guess there is a need (only for CPU) to do a first pass measuring the PBS counts for all ops and precisions, then relaunch the throughput benchmarks loading those data from a file, to avoid the counter having an adverse effect on the measurements.

We must not launch the latency PBS benchmarks with the pbs-stats feature.
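The two-pass scheme suggested above could look something like the following sketch. The file format and both function names are assumptions for illustration; the actual benchmark harness may store the counts differently.

```rust
use std::collections::HashMap;
use std::fs;

/// First pass (with pbs-stats enabled): dump per-operation PBS counts
/// to a file, one "op_name count" pair per line.
fn save_pbs_counts(path: &str, counts: &HashMap<String, usize>) -> std::io::Result<()> {
    let body: String = counts
        .iter()
        .map(|(op, n)| format!("{op} {n}\n"))
        .collect();
    fs::write(path, body)
}

/// Second pass (stats disabled): reload the counts so the throughput
/// run can size its load without the shared-counter overhead.
fn load_pbs_counts(path: &str) -> std::io::Result<HashMap<String, usize>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .filter_map(|line| {
            let (op, n) = line.split_once(' ')?;
            Some((op.to_string(), n.parse().ok()?))
        })
        .collect())
}
```

The measured throughput pass then never touches the atomic counter, which is the adverse effect the comment is worried about.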
