chore(bench): improve heuristic to run throughput benchmarks #1868
base: main
Conversation
I need a review of the design before extending it to all the other benchmark functions handling throughput variants.
What's the idea of the heuristic? Load depending on how many PBS are in an operation?
Yes, that's the idea. The load is computed as the number of available threads divided by the number of PBS needed for one operation.
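The heuristic described above can be sketched in Rust. This is a minimal illustration, not the actual tfhe-rs code: the function names and the `max(1)` floor (so heavy operations still get at least one element) are assumptions.

```rust
// Hypothetical sketch of the throughput-loading heuristic: queue
// available_threads / pbs_per_op elements per benchmark iteration.
// A light operation (few PBS) gets many elements to saturate the backend;
// a heavy one (many PBS) gets few, keeping total runtime reasonable.
fn throughput_num_elements(available_threads: usize, pbs_per_op: usize) -> usize {
    // Assumes pbs_per_op >= 1; floor at 1 so heavy ops still run.
    (available_threads / pbs_per_op).max(1)
}

fn main() {
    // Query the actual parallelism of the machine running the benchmarks.
    let threads = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    for pbs in [1, 8, 64] {
        println!("{} PBS/op -> {} elements", pbs, throughput_num_elements(threads, pbs));
    }
}
```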
Did you check that the measured throughput with the new loading factor is similar to the old one?
Yes, I've just checked and they are the same 🎉
Force-pushed from 1fb02c8 to d3593b0
Use the CPU stats, they should be similar I would think.
Force-pushed from 603cd8b to dc59088
The PBS count for GPU is fixed. Ready for review now.
Hey! Thanks a lot @soonum! I only have some minor fixes 🙂 Do you know how long the throughput benches take now?
Yes, for Cuda benchmarks, we're now down to 46 mins for de-duplicated operations in 64 bits.
Force-pushed from 0a7b264 to 4c0162f
This is done to fill the backend with enough elements while avoiding long execution times for heavy operations like multiplication or division.
Force-pushed from 4c0162f to edb6501
I saw I was still supposed to review something here?
Btw @soonum, I just thought about something: enabling the PBS stats can have an impact on CPU performance, since we update a single counter from many threads. So I guess there is a need (only for CPU) to do a first pass that measures PBS counts for all operations and precisions, then relaunch the throughput benchmarks loading those data from a file, to avoid an adverse effect on measurement precision.
We must not launch the latency PBS benchmarks with the pbs-stats feature enabled.
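The two-pass idea discussed above could be sketched as follows: a first pass records per-operation PBS counts to a file, and the throughput run reads them back so the pbs-stats counters never run alongside the timed benchmarks. The CSV-style file format, file name, and function names here are assumptions for illustration, not anything in the repository.

```rust
use std::collections::HashMap;
use std::fs;

// Persist the per-operation PBS counts measured in the first pass,
// one "operation,count" line per entry (hypothetical format).
fn save_pbs_counts(path: &str, counts: &HashMap<String, u64>) -> std::io::Result<()> {
    let body: String = counts.iter().map(|(op, n)| format!("{op},{n}\n")).collect();
    fs::write(path, body)
}

// Load the counts back in the throughput pass, so the stats counters
// don't need to be enabled while timing.
fn load_pbs_counts(path: &str) -> std::io::Result<HashMap<String, u64>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .filter_map(|l| {
            let (op, n) = l.split_once(',')?;
            Some((op.to_string(), n.parse().ok()?))
        })
        .collect())
}

fn main() -> std::io::Result<()> {
    // Example data; the real counts would come from a pbs-stats run.
    let mut counts = HashMap::new();
    counts.insert("mul_u64".to_string(), 960);
    counts.insert("add_u64".to_string(), 32);
    save_pbs_counts("pbs_counts.csv", &counts)?;
    let loaded = load_pbs_counts("pbs_counts.csv")?;
    assert_eq!(loaded, counts);
    println!("loaded {} ops", loaded.len());
    Ok(())
}
```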
This changes the way we define the number of elements to load in the throughput pipeline.
Light operations need more elements to saturate a backend; conversely, heavy operations require fewer elements to run in a decent time.
This improvement will dramatically decrease total benchmark duration.
This has been tested with the following operations:
Saturation has been tested successfully on the following backends: