Replies: 11 comments
-
Could you isolate this more? I'm not seeing this on a quick test.
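For example, something along these lines (just a sketch; the array size and exponent are arbitrary) times the `^` on its own for both precisions:

```julia
using CUDA

# Time x .^ p in isolation, for both precisions, after a warm-up pass.
for T in (Float32, Float64)
    x = CUDA.rand(T, 2^24); y = similar(x)
    y .= x .^ T(2.5)                    # warm-up / compilation
    t = CUDA.@elapsed y .= x .^ T(2.5)  # GPU time of one pass
    println(T, ": ", round(t * 1e6; digits=1), " μs")
end
```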
-
It is basically what the results from the tests using kernel programming report. I re-did some tests using array programming and it seems that already the simple triad 2D is about 10% slower in Float32. The example below shows that, besides the power operation itself, all other tests report Float32 calculations to be about 10% slower than Float64 ones. In the kernel programming implementation, this difference is about 10% for triad 2D and grows to about 30% when a power operation is included within the triad 2D kernel (see here).
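To be explicit, here is a minimal sketch of the array-programming variant I mean (the array size, `s` and the exponent are placeholders, not the exact benchmark values):

```julia
using CUDA

# Triad 2D in array (broadcast) style, without and with the power operation,
# for both precisions. Each expression is run once for warm-up, then timed.
for T in (Float32, Float64)
    n = 8192  # adjust to your GPU memory
    A = CUDA.zeros(T, n, n); B = CUDA.rand(T, n, n); C = CUDA.rand(T, n, n)
    s, p = T(2), T(2.5)
    A .= B .+ s .* C;      t0 = CUDA.@elapsed A .= B .+ s .* C       # plain triad
    A .= B .+ s .* C .^ p; t1 = CUDA.@elapsed A .= B .+ s .* C .^ p  # triad with ^
    println(T, ":  triad ", round(t0 * 1e3; digits=2), " ms,  triad+^ ",
            round(t1 * 1e3; digits=2), " ms")
end
```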
-
Sorry to be dense, but what exactly is the issue? We're just calling libdevice functions here; what exactly do you expect CUDA.jl to do? It's not like we can implement Float32 power using Float64 just because it happens to be faster, but you're expecting it to be twice as fast (which isn't guaranteed in any way). Unless CUDA C code doing the exact same operation performs better, this isn't a bug. And FWIW, on other hardware (Quadro RTX 5000, so not even consumer hardware, where the difference would be even larger) the performance characteristics are completely different.
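If you want to check what gets called, you can dump the compiled device code for a toy kernel (this is not the benchmark code; names and values here are arbitrary):

```julia
using CUDA

function pow_kernel!(y, x, p)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = x[i]^p   # the libdevice pow under discussion
    end
    return
end

x = CUDA.rand(Float32, 1024); y = similar(x)
# Print the PTX this launch compiles to, to inspect what the ^ lowers to.
@device_code_ptx @cuda threads=256 blocks=4 pow_kernel!(y, x, 2.5f0)
```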
-
The original issue was that Triad 2D benchmarks implemented in CUDA.jl using a kernel programming approach achieve unexpected performance that I cannot fully explain. All tests are performed using large arrays, making sure to discard launch and other overhead effects, and taking care to have the same number of bytes transferred in the single and double precision tests. Using the effective memory throughput makes it possible to compare against the total or peak memory bandwidth of the targeted GPUs, so results remain comparable across devices.
These results do not suggest there is a bug here, but I am trying to understand the reasons behind these minor to more significant performance degradations - and to see if there is anything one could improve. I'll prepare CUDA C kernels for comparison.
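For concreteness, the metric is computed along these lines (a sketch of the idea, not the code from the repository):

```julia
# Effective memory throughput of the triad: count only the bytes the algorithm
# has to move (read B and C, write A) and divide by the measured kernel time,
# so Float32 and Float64 runs can both be compared against the device's peak
# memory bandwidth.
T_eff_GBs(A, B, C, t_s) = (sizeof(A) + sizeof(B) + sizeof(C)) / t_s / 1e9

# e.g. T_eff_GBs(A, B, C, CUDA.@elapsed A .= B .+ s .* C .^ p)
```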
-
Why do you expect to attain peak memory bandwidth when there's a nontrivial compute component? The effect of the `^` on the measured throughput is to be expected.
-
You are correct wrt the effect of including `^` in the kernel.
-
Are your exponents integer? If so you can try decomposing the power into multiplications. Otherwise, you can try looking into alternative implementations of `pow`, or fast math mode.
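For illustration, a sketch of what I mean (the `pow_int` helper is just an example name; Julia already unrolls literal integer exponents like `x^3` into multiplications via `Base.literal_pow`):

```julia
# For an exponent that is an integer but only known at run time, repeated
# multiplication avoids the floating-point pow call; fine for small exponents.
@inline function pow_int(x, n::Integer)
    r = one(x)
    for _ in 1:n
        r *= x
    end
    return r
end

# inside the kernel:  A[ix,iy] = B[ix,iy] + s * pow_int(C[ix,iy], 3)
```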
-
This indeed works super well, and we are already doing it for integer exponents.
I'll definitely look at the suggestions - thanks! I am cautious with fast-math mode though, as it can sometimes return unexpected results.
-
FYI, CUDA.jl is trying to do so as well: CUDA.jl/src/device/intrinsics/math.jl, lines 235 to 250 at f7880a3.
-
Yeah - that's cool. AMDGPU has it as well, which works fine for integer powers.
-
I've converted this into a discussion since this isn't really a CUDA.jl issue.
-
As reported in luraess/JuliaGPUPerf#1, there is an issue significantly affecting performance when doing the ^ operation on Float32 within GPU Triad 2D kernels:
A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_float
(see here). Performance (memory throughput in GB/s) is reduced by nearly a factor of 2 compared to the same experiment using Float64.
See https://github.com/luraess/JuliaGPUPerf/blob/main/cuda_bench.jl for a reproducer (and the README for the performance output).
All testing was done using Julia v1.7 and CUDA.jl v3.8, on devices using the CUDA 11.4 stack without artifacts.
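For context, a condensed sketch of the kernel in question (launch configuration and values here are placeholders; the linked cuda_bench.jl has the actual setup):

```julia
using CUDA

# Triad 2D kernel with a Float32 power, as in the expression above.
function triad2d!(A, B, C, s, pow_float)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if ix <= size(A, 1) && iy <= size(A, 2)
        @inbounds A[ix, iy] = B[ix, iy] + s * C[ix, iy]^pow_float
    end
    return
end

n = 8192
A = CUDA.zeros(Float32, n, n); B = CUDA.rand(Float32, n, n); C = CUDA.rand(Float32, n, n)
threads = (32, 8)
blocks  = cld.((n, n), threads)
@cuda threads=threads blocks=blocks triad2d!(A, B, C, 2f0, 2.5f0)        # warm-up
t = CUDA.@elapsed @cuda threads=threads blocks=blocks triad2d!(A, B, C, 2f0, 2.5f0)
println("effective throughput: ", round(3 * sizeof(A) / t / 1e9; digits=1), " GB/s")
```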