dot_product #24
@jalvesz, thanks for your work. I found another solution for this issue, using a `volatile` result variable, `real(rk), volatile :: a`:
```fortran
real(rk), volatile :: a  ! volatile prevents the compiler from optimizing the loop away

call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
do nl = 1, bench%nloops
   a = dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
```
Brilliant! I did a commit on my fork of your project with the proposed idea. Testing on a different computer, I saw your BLAS interface in m2 actually being the most performant one. So this seems to vary quite a bit from machine to machine.
And I think we should consider reducing the number of tests.
5 sounds reasonable.
Yes, I did many tests.
I changed the number of elements.
Your new results look like the results I was obtaining with one of the machines I tested. I'll recheck later with your new version.
I couldn't find the system specifications for the GitHub server. I have now tested this approach on both my system and GitHub (I have to push new tests to GitHub to check the results consistently). On my system, I achieved the desired results, but on GitHub, not so much. I believe relying on benchmarking on the GitHub server without a self-hosted runner is not a good idea. I will change the CI to push results from my system. What's your opinion?

```fortran
call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
do nl = 1, bench%nloops
   a = dot_product(u,v)
   call prevent_optimization(a,nl) ! loop-invariant
end do
call bench%stop_benchmark(cmp_gflops)
```

```fortran
subroutine prevent_optimization(a, nl)
   real(rk), intent(in) :: a
   integer, intent(in) :: nl
   if (a == 0.0_rk) print*, nl, 'a = 0.0'
end subroutine prevent_optimization
```
The function to avoid optimization works quite well. I think the benchmark results heavily depend on the system's load. I keep obtaining different trends, especially for the BLAS case, between two different computers: one basically running only the benchmark, the other with several applications open. Most probably the GitHub server is also under these circumstances, so it cannot reach optimal performance. Maybe the benchmark should be run under two scenarios: (a) a dedicated machine, on which a socket can be allocated exclusively for the benchmark, and (b) a workstation with more than one load. Those should cover the HPC and "daily work" limit cases. The reference data files should then be tagged with at least the characteristics of the CPU + RAM + OS. It is always difficult to draw the line of "fairness"/"representativeness" for these kinds of benchmarks.
Yes, exactly.
Yes, that's a good idea. I think that after gathering and improving the infrastructure, it will be possible to run all benchmarks on a cluster using Slurm and sbatch, as you mentioned. I am also considering, for the next version, running the entire benchmark code multiple times and obtaining average values. I have updated the Python scripts and pushed the latest changes. I believe the dot_product benchmark is working well for now. I will work on the matmul results to generate the same kind of plots. Thanks again for contributing and testing!
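A minimal sketch of the averaging idea, assuming a hypothetical wrapper `run_dot_product_benchmark` around the existing `bench%start_benchmark` / loop / `bench%stop_benchmark` sequence (the name and interface are illustrative, not part of the project):

```fortran
program average_runs_sketch
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   integer, parameter :: nreps = 10
   real(rk) :: gflops, gflops_sum
   integer :: rep

   ! Repeat the whole benchmark and average the measured rates.
   gflops_sum = 0.0_rk
   do rep = 1, nreps
      call run_dot_product_benchmark(gflops)
      gflops_sum = gflops_sum + gflops
   end do
   print '(a, f8.3)', 'average GFLOPS over runs: ', gflops_sum / real(nreps, rk)

contains

   subroutine run_dot_product_benchmark(gflops)
      real(rk), intent(out) :: gflops
      ! Stand-in for one full benchmark run; in practice this would
      ! execute the timed dot_product loop and return the measured rate.
      gflops = 1.0_rk
   end subroutine run_dot_product_benchmark

end program average_runs_sketch
```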
Here is just a wild thought to consider: what about benchmarking matmul and dot_product together in a traditional Conjugate Gradient with a fixed number of iterations? These two kernels are usually used together, and it also means the program has to switch kernels quite fast, which might limit optimizations, thus providing a more "realistic" view of the performance of each algorithm in the context of working in tandem on a more complex problem.
Yes, of course! I think this is a good idea. We could create a simple example. I will check if I have an implementation in ForSolver, or if you have one, we could start with that. But what should be our reference? Should we consider other libraries?
I've been thinking about this idea. Most of the problems I have at hand are sparse systems, while here the idea is to benchmark matmul on square matrices and dot_product. I was also thinking that what matters for the benchmark is performance; in that case, we could generate a synthetic problem that lets us evaluate the performance of the different algorithms. Here is one idea: given a matrix size, create a symmetric positive definite matrix by first filling the upper triangular part (U) with random numbers and then building the full matrix from it. All implementations should converge to similar results within a given tolerance. The result itself would be irrelevant to a certain degree; the important point would be to verify consistency and speed-ups. What do you think? (A sketch of this setup follows below.)
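A minimal sketch of such a synthetic benchmark. The SPD construction `A = U^T U + n*I` is an assumption (the thread only says the upper triangular part is filled with random numbers); the fixed-iteration Conjugate Gradient exercises both matmul and dot_product in every step, as discussed above:

```fortran
program cg_bench_sketch
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   integer, parameter :: n = 500, niter = 100
   real(rk) :: A(n,n), U(n,n), b(n), x(n), r(n), p(n), Ap(n)
   real(rk) :: alpha, beta, rs_old, rs_new
   integer :: i, k

   ! Build a synthetic SPD matrix. NOTE: A = U^T U + n*I is an assumed
   ! construction, chosen so that A is symmetric positive definite.
   call random_number(U)
   do i = 2, n
      U(i, 1:i-1) = 0.0_rk          ! keep only the upper triangle
   end do
   A = matmul(transpose(U), U)
   do i = 1, n
      A(i,i) = A(i,i) + real(n, rk) ! diagonal shift guarantees SPD
   end do
   call random_number(b)

   ! Conjugate Gradient with a fixed number of iterations: each step uses
   ! one matmul (A*p) and several dot_product calls.
   x = 0.0_rk
   r = b
   p = r
   rs_old = dot_product(r, r)
   do k = 1, niter
      Ap     = matmul(A, p)
      alpha  = rs_old / dot_product(p, Ap)
      x      = x + alpha*p
      r      = r - alpha*Ap
      rs_new = dot_product(r, r)
      beta   = rs_new / rs_old
      p      = r + beta*p
      rs_old = rs_new
   end do
   print '(a, es12.4)', 'final residual norm: ', sqrt(rs_old)
end program cg_bench_sketch
```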
This is generally a good idea. I have two initial questions:

> If you want to benchmark
Yes, you are right! I was overcomplicating things; I would split the operations, though. The idea is to time each operation as they play along.
Not necessarily, but to get an idea of how a naive implementation, a simple intrinsic, or an implementation with some tweaks would behave in comparison against the reference, BLAS. (A sketch of these variants follows.)
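As an illustration of the variants being compared, a minimal sketch. The `ddot` call is the standard reference-BLAS double-precision dot product; the module and function names are illustrative:

```fortran
module dot_variants
   implicit none
   integer, parameter :: rk = kind(1.0d0)
contains
   ! Naive implementation: a plain accumulation loop.
   pure function dot_naive(u, v) result(a)
      real(rk), intent(in) :: u(:), v(:)
      real(rk) :: a
      integer :: i
      a = 0.0_rk
      do i = 1, size(u)
         a = a + u(i)*v(i)
      end do
   end function dot_naive

   ! Simple intrinsic: defer to the compiler's dot_product.
   pure function dot_intrinsic(u, v) result(a)
      real(rk), intent(in) :: u(:), v(:)
      real(rk) :: a
      a = dot_product(u, v)
   end function dot_intrinsic

   ! Reference: BLAS ddot(n, dx, incx, dy, incy).
   function dot_blas(u, v) result(a)
      real(rk), intent(in) :: u(:), v(:)
      real(rk) :: a
      real(rk), external :: ddot
      a = ddot(size(u), u, 1, v, 1)
   end function dot_blas
end module dot_variants
```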
The cumulative variable + the `print` force the compiler to avoid optimizing, in order to actually print the correct value. Removed m2 as it was stagnating:

[benchmark result plots for ifort, ifx, and gfortran]
Originally posted by @jalvesz in jalvesz/fast_math#8 (comment)