Replies: 11 comments
-
Could you isolate this more? I'm not seeing this on a quick test.
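For example, something along these lines (just a sketch; the array size and exponent are arbitrary) times the `^` on its own for both precisions:

```julia
using CUDA

# Time x .^ p in isolation, for both precisions, after a warm-up pass.
for T in (Float32, Float64)
    x = CUDA.rand(T, 2^24); y = similar(x)
    y .= x .^ T(2.5)                    # warm-up / compilation
    t = CUDA.@elapsed y .= x .^ T(2.5)  # GPU time of one pass
    println(T, ": ", round(t * 1e6; digits=1), " μs")
end
```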
-
It is basically what the results from the tests using kernel programming report. I re-did some tests using array programming and it seems that already the simple triad 2D is about 10% slower in Float32. The example below shows that, besides the power operation itself, all other tests report Float32 calculations to be about 10% slower than Float64 ones. In the kernel programming implementation, this difference is about 10% for triad 2D and grows to about 30% when a power operation is included within the triad 2D kernel (see here).
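To be explicit, here is a minimal sketch of the array-programming variant I mean (the array size, `s` and the exponent are placeholders, not the exact benchmark values):

```julia
using CUDA

# Triad 2D in array (broadcast) style, without and with the power operation,
# for both precisions. Each expression is run once for warm-up, then timed.
for T in (Float32, Float64)
    n = 8192  # adjust to your GPU memory
    A = CUDA.zeros(T, n, n); B = CUDA.rand(T, n, n); C = CUDA.rand(T, n, n)
    s, p = T(2), T(2.5)
    A .= B .+ s .* C;      t0 = CUDA.@elapsed A .= B .+ s .* C       # plain triad
    A .= B .+ s .* C .^ p; t1 = CUDA.@elapsed A .= B .+ s .* C .^ p  # triad with ^
    println(T, ":  triad ", round(t0 * 1e3; digits=2), " ms,  triad+^ ",
            round(t1 * 1e3; digits=2), " ms")
end
```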
-
Sorry to be dense, but what exactly is the issue? We're just calling libdevice functions here; what exactly do you expect CUDA.jl to do? It's not like we can implement Float32 power using Float64 just because it happens to be faster, but you're expecting it to be twice as fast (which isn't guaranteed in any way). Unless CUDA C code doing the exact same operation performs better, this isn't a bug. And FWIW, on other hardware (Quadro RTX 5000, so not even consumer hardware, where the difference would be even larger) the performance characteristics are completely different.
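If you want to check what gets called, you can dump the compiled device code for a toy kernel (this is not the benchmark code; names and values here are arbitrary):

```julia
using CUDA

function pow_kernel!(y, x, p)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = x[i]^p   # the libdevice pow under discussion
    end
    return
end

x = CUDA.rand(Float32, 1024); y = similar(x)
# Print the PTX this launch compiles to, to inspect what the ^ lowers to.
@device_code_ptx @cuda threads=256 blocks=4 pow_kernel!(y, x, 2.5f0)
```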
-
The original issue was that Triad 2D benchmarks implemented in CUDA.jl using a kernel programming approach achieve unexpected performance that I cannot fully explain. All tests are performed using large arrays, making sure to discard launch and other overhead effects, and taking care to have the same number of bytes transferred in the single and double precision tests. Using the effective memory throughput makes it possible to compare against the total or peak memory bandwidth of the targeted GPUs, so results remain comparable across devices.
These results do not suggest there is a bug here, but I am trying to understand the reasons behind these minor to more significant performance degradations - and to see if there is anything one could improve. I'll prepare CUDA C kernels for comparison.
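For concreteness, the metric is computed along these lines (a sketch of the idea, not the code from the repository):

```julia
# Effective memory throughput of the triad: count only the bytes the algorithm
# has to move (read B and C, write A) and divide by the measured kernel time,
# so Float32 and Float64 runs can both be compared against the device's peak
# memory bandwidth.
T_eff_GBs(A, B, C, t_s) = (sizeof(A) + sizeof(B) + sizeof(C)) / t_s / 1e9

# e.g. T_eff_GBs(A, B, C, CUDA.@elapsed A .= B .+ s .* C .^ p)
```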
-
Why do you expect to attain peak memory bandwidth when there's a nontrivial compute component? The effect of the `^` on the measured throughput is to be expected.
-
You are correct wrt the effect of including `^` in the kernel.
-
Are your exponents integer? If so you can try decomposing the power into multiplications. Otherwise, you can try looking into alternative implementations of `pow`, or fast math mode.
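For illustration, a sketch of what I mean (the `pow_int` helper is just an example name; Julia already unrolls literal integer exponents like `x^3` into multiplications via `Base.literal_pow`):

```julia
# For an exponent that is an integer but only known at run time, repeated
# multiplication avoids the floating-point pow call; fine for small exponents.
@inline function pow_int(x, n::Integer)
    r = one(x)
    for _ in 1:n
        r *= x
    end
    return r
end

# inside the kernel:  A[ix,iy] = B[ix,iy] + s * pow_int(C[ix,iy], 3)
```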
-
This indeed works super well, and we are already doing it for integer exponents.
I'll definitely look at the suggestions - thanks! I am cautious with fast-math mode though, as it can sometimes return unexpected results.
-
FYI, CUDA.jl is trying to do so as well: CUDA.jl/src/device/intrinsics/math.jl, lines 235 to 250 at f7880a3.
-
Yeah - that's cool. AMDGPU has it as well, which works fine for integer powers.
-
I've converted this into a discussion since this isn't really a CUDA.jl issue.
-
As reported in luraess/JuliaGPUPerf#1, there is an issue significantly affecting performance when doing the ^ operation on Float32 within GPU Triad 2D kernels:
A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_float
(see here). Performance (memory throughput in GB/s) is reduced by nearly a factor of 2 compared to the same experiment using Float64.
See https://github.com/luraess/JuliaGPUPerf/blob/main/cuda_bench.jl for a reproducer (and the README for the performance output).
All testing was done using Julia v1.7 and CUDA.jl v3.8, on devices using the CUDA 11.4 stack without artifacts.
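For context, a condensed sketch of the kernel in question (launch configuration and values here are placeholders; the linked cuda_bench.jl has the actual setup):

```julia
using CUDA

# Triad 2D kernel with a Float32 power, as in the expression above.
function triad2d!(A, B, C, s, pow_float)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if ix <= size(A, 1) && iy <= size(A, 2)
        @inbounds A[ix, iy] = B[ix, iy] + s * C[ix, iy]^pow_float
    end
    return
end

n = 8192
A = CUDA.zeros(Float32, n, n); B = CUDA.rand(Float32, n, n); C = CUDA.rand(Float32, n, n)
threads = (32, 8)
blocks  = cld.((n, n), threads)
@cuda threads=threads blocks=blocks triad2d!(A, B, C, 2f0, 2.5f0)        # warm-up
t = CUDA.@elapsed @cuda threads=threads blocks=blocks triad2d!(A, B, C, 2f0, 2.5f0)
println("effective throughput: ", round(3 * sizeof(A) / t / 1e9; digits=1), " GB/s")
```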