Own `sqrt` and `log` returning NaN for "correct" multi-thread behaviour #1781
Conversation
Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

- Purpose and scope
- Code quality
- Documentation
- Testing
- Performance
- Verification

Created with ❤️ by the Trixi.jl community.
Thanks a lot for the initial investigation!

- Could you please report some performance numbers from elixirs with and without bounds checking?
- How do these full elixir runs vary when executing them multiple times?
- Could you please post some benchmarks like `@benchmark Trixi.rhs!(...)`? Benchmarks like `x = rand(10^4); @btime sqrt.(x)` are not really meaningful for us since we don't perform such uniform operations on vectors. Benchmarking `Trixi.rhs!` would be better, I think.
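For reference, a minimal sketch of such an `rhs!` benchmark (it mirrors the `@benchmarkable` setup posted later in this thread; `semi`, `sol`, and `tspan` are assumed to come from a previously run elixir, and picking `sol.u[end]` as input is an arbitrary choice):

```julia
# Assumes an elixir such as examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl
# has already been run, so that `semi`, `sol`, and `tspan` exist in the session.
using BenchmarkTools

t0 = tspan[1]
u0 = copy(sol.u[end])   # a representative state vector from the finished run
du = similar(u0)

@benchmark Trixi.rhs!($du, $u0, $semi, $t0)
```

Interpolating the variables with `$` keeps global-variable access out of the measured time.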
Some reports on examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with `surface_flux = flux_hllc`:

Custom implementation, 1 Thread:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 7.880 ms … 11.458 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.562 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.565 ms ± 180.060 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▅▇▇█▅▂
▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▃▃▃▄▄▄▅▆▇█████████▆▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
7.88 ms Histogram: frequency by time 9.13 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
4 Threads:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 2.503 ms … 4.925 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.709 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.741 ms ± 161.206 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▇█▇▁
▂▂▁▂▂▂▂▃▅█████▇▇▆▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▂▂▁▂▂▁▁▂▂▂▁▂▂▂ ▃
2.5 ms Histogram: frequency by time 3.5 ms <
Memory estimate: 3.73 KiB, allocs estimate: 9.

Standard, 1 Thread:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 8.083 ms … 11.872 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.670 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.724 ms ± 244.216 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▆▆█▅▃ ▁
▂▂▁▂▂▂▂▂▂▂▂▂▃▄▄▅▆▇█████████▆▆▄▄▄▃▄▃▃▃▃▃▂▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
8.08 ms Histogram: frequency by time 9.65 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
4 Threads:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 2.449 ms … 4.265 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.676 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.686 ms ± 89.104 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆█▆▆▅▄▃
▂▁▁▁▂▂▂▁▂▂▂▂▂▂▂▃▃▅▇█████████▆▅▃▂▂▂▂▁▂▂▂▂▂▂▁▂▂▁▁▁▂▂▂▁▂▁▂▂▂▂ ▃
2.45 ms Histogram: frequency by time 3.02 ms <
Memory estimate: 3.73 KiB, allocs estimate: 9.
Custom 1 Thread:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 50.370 ms … 70.366 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 52.456 ms ┊ GC (median): 0.00%
Time (mean ± σ): 52.799 ms ± 1.403 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▇█▆▃▂
▂▁▁▂▂▁▂▂▂▃▅███████████▇▅▄▄▃▃▃▃▃▃▂▃▂▂▂▃▂▂▃▂▂▂▂▁▂▂▂▁▂▂▂▁▁▁▂▁▂ ▃
50.4 ms Histogram: frequency by time 58 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
8 Threads:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 14.900 ms … 19.075 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.890 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.838 ms ± 431.002 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▁▄▂▂ ▁▄ ▆█▇▆▅▃▅▁
▂▁▁▁▁▁▁▁▁▂▁▂▂▁▂▃▁▁▄▂▄▃▄▄▄▅▄▇██████▇████████████▇▆▆▄▃▂▂▃▃▂▂▂▂ ▄
14.9 ms Histogram: frequency by time 18 ms <
Memory estimate: 1.41 KiB, allocs estimate: 5.

Standard, 1 Thread:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 50.199 ms … 66.471 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 51.948 ms ┊ GC (median): 0.00%
Time (mean ± σ): 52.097 ms ± 1.086 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃█▇█▄▄
▂▂▂▂▂▂▂▂▃▃▄▄▆████████▅▆▄▃▃▃▂▂▂▂▃▂▂▂▁▂▂▂▂▂▂▂▁▁▁▂▂▂▁▁▁▂▁▁▁▁▁▂ ▃
50.2 ms Histogram: frequency by time 56.3 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
8 Threads:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 14.270 ms … 18.180 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.018 ms ┊ GC (median): 0.00%
Time (mean ± σ): 15.950 ms ± 373.664 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▃▄▄▇▇█▇▃▂▁
▂▁▁▁▁▁▁▁▁▁▂▁▂▁▁▂▂▃▃▃▃▃▃▄▅▄▄▄▄▆▅▄▆▅▇▇▆▆█████████████▆▆▅▄▃▃▂▂▃ ▄
14.3 ms Histogram: frequency by time 16.7 ms <
Memory estimate: 1.41 KiB, allocs estimate: 5.
Co-authored-by: Hendrik Ranocha <[email protected]>
Changed the title: `sqrt` and `log` returning NaN for correct multi-thread behaviour → `sqrt` and `log` returning NaN for "correct" multi-thread behaviour
@ranocha Maybe I found something that suits our needs: as for `sqrt`, we can use the LLVM intrinsics directly,

```julia
log_(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
log_(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32, ), x)
```

which actually return NaN for negative arguments instead of throwing a `DomainError`. To still enable usage of algorithmic differentiation we would still provide

```julia
log_(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)
```

Repeating the benchmarks from above for examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with `surface_flux = flux_hllc`:

```julia
t0 = tspan[1]
u0 = sol.u[2]
du = similar(u0)
using BenchmarkTools
b = @benchmarkable Trixi.rhs!(du, u0, semi, t0) evals=5 samples=2000 seconds=120
run(b)
```

Custom, 1 Thread:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 8.090 ms … 11.248 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.781 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.819 ms ± 234.715 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▇█▆▅▃▂▂▂▃▅▃▂▁
▂▁▂▁▁▁▂▂▃▂▂▃▂▂▂▃▄▆▇████████████████▅▄▄▄▄▃▄▃▃▃▃▃▂▂▂▂▃▂▂▂▂▂▂▂ ▄
8.09 ms Histogram: frequency by time 9.64 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
4 Threads:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 2.368 ms … 13.217 ms ┊ GC (min … max): 0.00% … 14.48%
Time (median): 2.734 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.796 ms ± 361.907 μs ┊ GC (mean ± σ): 0.03% ± 0.32%
▂▅██▁
▂▁▂▂▃▃▃▄▇█████▆▄▄▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▁▁▁▂▂ ▃
2.37 ms Histogram: frequency by time 4.2 ms <
Memory estimate: 3.73 KiB, allocs estimate: 9.

Base, 1 thread:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 8.238 ms … 10.269 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.764 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.788 ms ± 209.589 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▅█▆▅▆▆▃▄▆▅▅▄▄▁▁▁
▁▁▁▁▁▁▁▁▁▃▃▅▅▆█▇███████████████████▇▆▄▂▃▃▃▃▂▂▂▂▂▁▁▂▂▃▁▂▂▁▁▁ ▄
8.24 ms Histogram: frequency by time 9.49 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
4 threads:
BenchmarkTools.Trial: 2000 samples with 5 evaluations.
Range (min … max): 2.454 ms … 13.782 ms ┊ GC (min … max): 0.00% … 13.19%
Time (median): 2.803 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.866 ms ± 349.961 μs ┊ GC (mean ± σ): 0.03% ± 0.30%
▁ ▂▃▄▆▇███▇▆▄▄▂▂▂▁▁▁ ▂
▅▆▆▇████████████████████▇██▇█▇▆█▆▇▆▇▇▆▇█▆▅▅▅▄▅▅▁▅▅▅▄▆▅▅▄▅▅▅ █
2.45 ms Histogram: log(frequency) by time 4.03 ms <
Memory estimate: 3.73 KiB, allocs estimate: 9.
```julia
t0 = tspan[1]
u0 = sol.u[2]
du = similar(u0)
using BenchmarkTools
b = @benchmarkable Trixi.rhs!(du, u0, semi, t0) evals=5 samples=2000 seconds=120
run(b)
```

Custom, 1 thread:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 51.155 ms … 67.260 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 53.327 ms ┊ GC (median): 0.00%
Time (mean ± σ): 53.673 ms ± 1.674 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▅▅█▇▆▂▁▄▅ ▂▂
▂▃▄▄▇██████████████▇▆▇▅▆▄▅▄▃▃▃▃▂▃▂▁▃▂▂▂▃▂▃▁▂▂▁▁▃▁▁▁▁▁▂▁▁▁▂▂ ▄
51.2 ms Histogram: frequency by time 61.1 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
8 threads:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 14.807 ms … 19.546 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.817 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.778 ms ± 373.889 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂ ▂▃▆▃█▆▆▂▂
▂▁▂▁▁▂▁▁▁▁▁▁▁▂▂▂▂▃▂▃▁▃▂▂▂▃▃▂▃▆▅▅▅▆█▇██▇█████████▇▇▅▄▄▄▃▃▃▂▃▂ ▄
14.8 ms Histogram: frequency by time 17.7 ms <
Memory estimate: 1.41 KiB, allocs estimate: 5.
Base 1 thread:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 50.650 ms … 66.282 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 53.364 ms ┊ GC (median): 0.00%
Time (mean ± σ): 53.588 ms ± 1.332 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃▇█▆▅▇▄▃▁
▂▁▁▁▂▁▃▂▂▄▅▇███████████▇▇▆▅▅▃▄▃▄▃▃▃▂▂▂▂▂▁▂▁▁▂▁▁▁▁▁▂▁▁▁▂▁▁▁▂ ▄
50.6 ms Histogram: frequency by time 59.6 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
8 threads:
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
Range (min … max): 15.194 ms … 19.291 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.770 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.745 ms ± 412.283 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▃▃▅█▆▄▄▃▂
▂▂▁▁▁▂▃▂▁▁▂▃▃▃▃▃▃▃▄▄▄▃▆▄▅▇▇█████████████▆▇▄▃▃▃▃▃▂▃▂▂▂▁▂▁▂▂▂▂ ▄
15.2 ms Histogram: frequency by time 18 ms <
Memory estimate: 1.41 KiB, allocs estimate: 5.

These look almost identical to me, which I would consider a success.
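As a side note on the correctness aspect (not part of the timings above): a quick check, using the intrinsic-based definitions sketched in the previous comment plus a hypothetical `sqrt_` analog, that these versions really do return NaN for negative arguments instead of throwing:

```julia
# Intrinsic-based variants as sketched above; `sqrt_` is a hypothetical analog.
log_(x::Float64)  = ccall("llvm.log.f64",  llvmcall, Float64, (Float64,), x)
sqrt_(x::Float64) = ccall("llvm.sqrt.f64", llvmcall, Float64, (Float64,), x)

log_(-1.0)         # NaN, whereas Base.log(-1.0) throws a DomainError
sqrt_(-1.0)        # NaN, whereas Base.sqrt(-1.0) throws a DomainError
isnan(log_(-1.0))  # true
```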
I'll see what I can do - unfortunately, for a reliable performance measurement I would need to block an entire node of our compute cluster for multiple hours, which might take quite some time to get scheduled. Alternatively, I can run this as a non-exclusive job at the expense of possibly less reliable results.
Unfortunately, I get an error when (presumably) executing the benchmarks of the `latency` group:

```
ERROR: ArgumentError: Package Trixi not found in current path.
- Run `import Pkg; Pkg.add("Trixi")` to install the Trixi package.
Stacktrace:
[1] macro expansion
@ Base ./loading.jl:1766 [inlined]
[2] macro expansion
@ Base ./lock.jl:267 [inlined]
[3] __require(into::Module, mod::Symbol)
@ Base ./loading.jl:1747
[4] #invoke_in_world#3
@ Base ./essentials.jl:921 [inlined]
[5] invoke_in_world
@ Base ./essentials.jl:918 [inlined]
[6] require(into::Module, mod::Symbol)
@ Base ./loading.jl:1740
```

with standard output:

```
PkgBenchmark: creating benchmark tuning file /rwthfs/rz/cluster/home/git/Trixi.jl/benchmark/tune.json...
(1/28) tuning "tree_2d_dgsem/elixir_euler_vortex_mortar.jl"...
(1/4) tuning "p3_rhs!"...
done (took 15.714234339 seconds)
(2/4) tuning "p7_rhs!"...
done (took 76.10260231 seconds)
(3/4) tuning "p7_analysis"...
done (took 21.453400237 seconds)
(4/4) tuning "p3_analysis"...
done (took 15.056647826 seconds)
done (took 131.906980299 seconds)
(2/28) tuning "tree_3d_dgsem/elixir_mhd_ec.jl"...
(1/4) tuning "p3_rhs!"...
done (took 29.225925116 seconds)
(2/4) tuning "p7_rhs!"...
done (took 375.223802844 seconds)
(3/4) tuning "p7_analysis"...
done (took 61.658343698 seconds)
(4/4) tuning "p3_analysis"...
done (took 16.326966469 seconds)
done (took 485.785440819 seconds)
(3/28) tuning "structured_3d_dgsem/elixir_euler_ec.jl"...
(1/4) tuning "p3_rhs!"...
done (took 21.638970102 seconds)
(2/4) tuning "p7_rhs!"...
done (took 145.468492825 seconds)
(3/4) tuning "p7_analysis"...
done (took 49.897601131 seconds)
(4/4) tuning "p3_analysis"...
done (took 14.41875533 seconds)
done (took 235.043929672 seconds)
(4/28) tuning "tree_3d_dgsem/elixir_euler_ec.jl"...
(1/4) tuning "p3_rhs!"...
done (took 84.156037201 seconds)
(2/4) tuning "p7_rhs!"...
done (took 1140.237702335 seconds)
(3/4) tuning "p7_analysis"...
done (took 317.161217237 seconds)
(4/4) tuning "p3_analysis"...
done (took 38.72742346 seconds)
done (took 1583.759410713 seconds)
(5/28) tuning "unstructured_2d_dgsem/elixir_euler_wall_bc.jl"...
(1/4) tuning "p3_rhs!"...
done (took 12.216714106 seconds)
(2/4) tuning "p7_rhs!"...
done (took 22.011360655 seconds)
(3/4) tuning "p7_analysis"...
done (took 12.402302703 seconds)
(4/4) tuning "p3_analysis"...
done (took 11.527143089 seconds)
done (took 61.965998148 seconds)
(6/28) tuning "tree_3d_dgsem/elixir_euler_shockcapturing.jl"...
(1/4) tuning "p3_rhs!"...
done (took 91.256684659 seconds)
(2/4) tuning "p7_rhs!"...
done (took 1200.799535653 seconds)
(3/4) tuning "p7_analysis"...
done (took 268.97470469 seconds)
(4/4) tuning "p3_analysis"...
done (took 32.808439139 seconds)
done (took 1597.80109128 seconds)
(7/28) tuning "tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl"...
(1/4) tuning "p3_rhs!"...
done (took 10.464582383 seconds)
(2/4) tuning "p7_rhs!"...
done (took 16.305267609 seconds)
(3/4) tuning "p7_analysis"...
done (took 18.016344489 seconds)
(4/4) tuning "p3_analysis"...
done (took 12.809092958 seconds)
done (took 60.668745842 seconds)
(8/28) tuning "benchmark/elixir_2d_euler_vortex_p4est.jl"...
(1/4) tuning "p3_rhs!"...
done (took 11.264458026 seconds)
(2/4) tuning "p7_rhs!"...
done (took 30.218253739 seconds)
(3/4) tuning "p7_analysis"...
done (took 19.14552961 seconds)
(4/4) tuning "p3_analysis"...
done (took 13.664613786 seconds)
done (took 78.138848087 seconds)
(9/28) tuning "tree_3d_dgsem/elixir_advection_extended.jl"...
(1/4) tuning "p3_rhs!"...
done (took 18.862721663 seconds)
(2/4) tuning "p7_rhs!"...
done (took 204.710356465 seconds)
(3/4) tuning "p7_analysis"...
done (took 155.833574637 seconds)
(4/4) tuning "p3_analysis"...
done (took 21.683521389 seconds)
done (took 404.458885476 seconds)
(10/28) tuning "structured_2d_dgsem/elixir_advection_extended.jl"...
(1/4) tuning "p3_rhs!"...
done (took 12.846023065 seconds)
(2/4) tuning "p7_rhs!"...
done (took 22.663477303 seconds)
(3/4) tuning "p7_analysis"...
done (took 18.438284368 seconds)
(4/4) tuning "p3_analysis"...
done (took 15.56976736 seconds)
done (took 72.939499736 seconds)
(11/28) tuning "tree_2d_dgsem/elixir_advection_extended.jl"...
(1/4) tuning "p3_rhs!"...
done (took 9.233539848 seconds)
(2/4) tuning "p7_rhs!"...
done (took 17.373932421 seconds)
(3/4) tuning "p7_analysis"...
done (took 11.447854133 seconds)
(4/4) tuning "p3_analysis"...
done (took 12.105896462 seconds)
done (took 53.381192204 seconds)
(12/28) tuning "tree_2d_dgsem/elixir_euler_ec.jl"...
(1/4) tuning "p3_rhs!"...
done (took 18.572986465 seconds)
(2/4) tuning "p7_rhs!"...
done (took 108.93184312 seconds)
(3/4) tuning "p7_analysis"...
done (took 27.673408382 seconds)
(4/4) tuning "p3_analysis"...
done (took 19.16687415 seconds)
done (took 177.845071313 seconds)
(13/28) tuning "structured_2d_dgsem/elixir_euler_ec.jl"...
(1/4) tuning "p3_rhs!"...
done (took 9.990776535 seconds)
(2/4) tuning "p7_rhs!"...
done (took 29.410013647 seconds)
(3/4) tuning "p7_analysis"...
done (took 17.955411267 seconds)
(4/4) tuning "p3_analysis"...
done (took 12.113499793 seconds)
done (took 72.332992949 seconds)
(14/28) tuning "latency"...
(1/5) tuning "polydeg_3"...
PkgBenchmark: Running benchmarks...
```

The script I execute is

```julia
using PkgBenchmark, Trixi
results = judge(Trixi,
BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --project=. --check-bounds=no --threads=2`), # target
BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --project=. --check-bounds=no --threads=2`, id="main") # baseline
)
#export_markdown(pkgdir(Trixi, "benchmark", "results.md"), results)
export_markdown("results.md", results) while I also tried using PkgBenchmark, Trixi
results = judge(Trixi,
BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --check-bounds=no --threads=2`), # target
BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --check-bounds=no --threads=2`, id="main") # baseline
)
#export_markdown(pkgdir(Trixi, "benchmark", "results.md"), results)
export_markdown("results.md", results) I installed Trixi in dev mode from my fork of Trixi and switched to the to be tested branch. |
Did you install the development version of Trixi.jl also in the benchmark project, as done in .github/workflows/benchmark.yml (lines 44 to 47, commit 14151e6) in our GitHub action? I think the docs should be improved to describe this step in more detail (or at all 😅).
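For anyone hitting the same error, a sketch of what that workflow step boils down to (the exact commands live in .github/workflows/benchmark.yml; the paths here are assumptions): make the benchmark project use the local development checkout of Trixi.jl instead of a registered release.

```julia
# Run from the root of the Trixi.jl checkout. This mirrors (approximately)
# what the benchmark GitHub action does for the benchmark/ project.
using Pkg
Pkg.activate("benchmark")               # assumed path of the benchmark project
Pkg.develop(PackageSpec(path = pwd()))  # use the local Trixi.jl checkout
Pkg.instantiate()
```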
No - I will give this a try 👍
I just found a problem in the benchmarks config. You need to update your local …
I'm running some stuff locally. It looks like the benchmark setup is a bit bit-rotten...
Hm, I still get the

```
ERROR: ArgumentError: Package Trixi not found in current path.
- Run `import Pkg; Pkg.add("Trixi")` to install the Trixi package.
```

error, even after instantiating the package in the benchmarks directory, both on main and on the NaNMath branch.
Here is what I get on one of our servers:

1 thread: [result tables omitted]
2 threads: [result tables omitted]

It would be interesting to see results from another server/run.
1 Thread: PkgBenchmark judge report — Results ("A ratio greater than …"), Benchmark Group List ("Here's a list of all the benchmark groups executed by this job"), and Julia versioninfo for Target and Baseline; report tables omitted.
2 Threads: PkgBenchmark judge report — Results ("A ratio greater than …"), Benchmark Group List ("Here's a list of all the benchmark groups executed by this job"), and Julia versioninfo for Target and Baseline; report tables omitted.
Thanks for running the benchmarks, too. As far as I understand, no benchmarks show regressions in either of the two comparisons (the same number of threads across your/my servers, or a fixed server with a different number of threads). Thus, I assume that there are no serious performance regressions in this PR.

Thanks a lot! This is nearly ready to merge - I just have a minor comment.
Co-authored-by: Hendrik Ranocha <[email protected]>
I ran another test to make sure, and between both runs on the same system there are no shared elixirs with increased runtime for both single- and multi-threaded execution.
Co-authored-by: Joshua Lampert <[email protected]>
So we're ready to go from your point of view?
Yes! I plan to file an issue/PR to the …
Motivation: See #1766
Inspiration for implementation: https://discourse.julialang.org/t/fastest-sqrt-and-log-with-negative-check/107575
I replaced for the moment only those `sqrt` and `log` where the argument can turn negative. Not sure if we want to use the custom implementation of `sqrt_` if it is really faster (for whatever reason).

Making sure we do not lose (too much) performance:

Example derived from examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with `surface_flux = flux_hllc`:

Main:
NaNSqrt & NaNLog:
Verification using BenchmarkTools (I repeated these a couple of times):

Not sure what is going on with the `sqrt_`, but `log_` is marginally (0.3–0.5 µs per 10000 floats) slower (as one might expect).
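For illustration, a sketch of this kind of micro-verification, using the branch-based `log_` fallback quoted in this thread and an analogous, hypothetical `sqrt_` (the PR's actual definitions may differ):

```julia
using BenchmarkTools

# Branch-based fallbacks that return NaN instead of throwing for negative input
# (the log_ definition is taken from this thread; sqrt_ is the analogous sketch).
sqrt_(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.sqrt(x)
log_(x::Real)  = x < zero(x) ? oftype(x, NaN) : Base.log(x)

x = rand(10^4)  # strictly positive inputs, so only the extra branch is measured

@btime sqrt.($x);
@btime sqrt_.($x);
@btime log.($x);
@btime log_.($x);
```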