When computing pullbacks of CUDA kernels, the results are sometimes incorrect; how often depends on the code and the launch configuration. The same pullbacks are computed correctly on a single-threaded CPU. When the block size is reduced, the bug appears less often or not at all. Could it be that register spills aren't handled correctly in the AD path? The forward kernel in my example uses ~90 registers, and the adjoint kernel uses ~190 registers.
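For reference, here is a minimal sketch (not part of the original report) of how such register counts can be queried with CUDA.jl's kernel reflection utilities, assuming the gpu_residual! kernel, the ∇ wrapper, and the arrays r, H, r̄1, H̄1, n from the MWE below are already defined:

# Sketch only: compile both kernels without launching them and query register usage.
# Assumes gpu_residual!, ∇, DupNN and the arrays r, H, r̄1, H̄1, n from the MWE below.
fwd = @cuda launch = false gpu_residual!(r, H, n)
adj = @cuda launch = false ∇(gpu_residual!, DupNN(r, r̄1), DupNN(H, H̄1), Const(n))
@show CUDA.registers(fwd)   # forward kernel, ~90 registers reported above
@show CUDA.registers(adj)   # adjoint kernel, ~190 registers reported above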
MWE
Sorry for the long snippet; the bug is triggered consistently in my main computational code, and this is my best attempt at reducing the amount of code while still triggering the bug:
using Enzyme, CUDA, StaticArrays

# helper functions for static arrays
δˣₐ(S::StaticMatrix{3,<:Any}) = S[SVector(2, 3), :] .- S[SVector(1, 2), :]
δʸₐ(S::StaticMatrix{<:Any,3}) = S[:, SVector(2, 3)] .- S[:, SVector(1, 2)]

δˣ(S::StaticMatrix{3,3}) = S[SVector(2, 3), SVector(2)] .- S[SVector(1, 2), SVector(2)]
δʸ(S::StaticMatrix{3,3}) = S[SVector(2), SVector(2, 3)] .- S[SVector(2), SVector(1, 2)]

δˣ(S::StaticMatrix{2,1}) = S[2] - S[1]
δʸ(S::StaticMatrix{1,2}) = S[2] - S[1]

function av4(S::StaticMatrix{2,3})
    0.25 .* (S[SVector(1), SVector(1, 2)] .+ S[SVector(2), SVector(1, 2)] .+
             S[SVector(1), SVector(2, 3)] .+ S[SVector(2), SVector(2, 3)])
end

function av4(S::StaticMatrix{3,2})
    0.25 .* (S[SVector(1, 2), SVector(1)] .+ S[SVector(1, 2), SVector(2)] .+
             S[SVector(2, 3), SVector(1)] .+ S[SVector(2, 3), SVector(2)])
end

innˣ(S::StaticMatrix{3,<:Any}) = S[SVector(2), :]
innʸ(S::StaticMatrix{<:Any,3}) = S[:, SVector(2)]

# extract 3x3 stencil
function st3x3(M, ix, iy)
    nx, ny = oftype.((ix, iy), size(M))
    # neighbor indices
    di = oneunit(ix)
    dj = oneunit(iy)
    iW = max(ix - di, di)
    iE = min(ix + di, nx)
    iS = max(iy - dj, dj)
    iN = min(iy + dj, ny)
    return SMatrix{3,3}(M[iW, iS], M[ix, iS], M[iE, iS],
                        M[iW, iy], M[ix, iy], M[iE, iy],
                        M[iW, iN], M[ix, iN], M[iE, iN])
end

# Enzyme utils
∇(fun, args::Vararg{Any,N}) where {N} = (Enzyme.autodiff_deferred(Enzyme.Reverse, Const(fun), Const, args...); return)
const DupNN = DuplicatedNoNeed

function residual(H, n)
    # surface gradient
    ∇Hˣ = δˣₐ(H)
    ∇Hʸ = δʸₐ(H)
    # surface gradient magnitude
    ∇Sˣ = sqrt.(innʸ(∇Hˣ) .^ 2 .+ av4(∇Hʸ) .^ 2) .^ (n - 1)
    ∇Sʸ = sqrt.(av4(∇Hˣ) .^ 2 .+ innˣ(∇Hʸ) .^ 2) .^ (n - 1)
    qˣ = ∇Sˣ .* δˣ(H .^ n)
    qʸ = ∇Sʸ .* δʸ(H .^ n)
    r = δˣ(qˣ) + δʸ(qʸ)
    return r
end

function gpu_residual!(r, H, n)
    ix = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - Int32(1)) * blockDim().y + threadIdx().y
    Hₗ = st3x3(H, ix, iy)
    r[ix, iy] = residual(Hₗ, n)
    return
end

function gpu_runme()
    nthreads = 32, 8 # triggered much less often with 32, 4
    nx, ny = nthreads
    # power law exponent
    n = 3
    # arrays
    H = CuArray(collect(2000 * Float64(i + j) for i in 1:nx, j in 1:ny))
    r = CUDA.zeros(Float64, nx, ny)
    # shadows
    r̄1 = CUDA.ones(Float64, nx, ny)
    r̄2 = CUDA.ones(Float64, nx, ny)
    H̄1 = CUDA.zeros(Float64, nx, ny)
    H̄2 = CUDA.zeros(Float64, nx, ny)
    @cuda threads = nthreads gpu_residual!(r, H, n)
    @cuda threads = nthreads ∇(gpu_residual!, DupNN(r, r̄1), DupNN(H, H̄1), Const(n))
    for i in 1:1000
        r̄2 .= 1.0
        H̄2 .= 0.0
        @cuda threads = nthreads gpu_residual!(r, H, n)
        @cuda threads = nthreads ∇(gpu_residual!, DupNN(r, r̄2), DupNN(H, H̄2), Const(n))
        if H̄1 != H̄2
            display(H̄1)
            display(H̄2)
            display(H̄1 .- H̄2)
            error("CUDA: non-deterministic results at iteration $i")
        end
    end
    println("CUDA: no errors")
    return
end

gpu_runme()

# The code below is optional
function cpu_residual!(r, H, n)
    nx, ny = size(H)
    for ix in 1:nx, iy in 1:ny
        Hₗ = st3x3(H, ix, iy)
        r[ix, iy] = residual(Hₗ, n)
    end
    return
end

function cpu_runme()
    nx, ny = 32, 8
    # power law exponent
    n = 3
    # arrays
    H = collect(2000 * Float64(i + j) for i in 1:nx, j in 1:ny)
    r = zeros(Float64, nx, ny)
    # shadows
    r̄1 = ones(Float64, nx, ny)
    r̄2 = ones(Float64, nx, ny)
    H̄1 = zeros(Float64, nx, ny)
    H̄2 = zeros(Float64, nx, ny)
    cpu_residual!(r, H, n)
    Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r, r̄1), DupNN(H, H̄1), Const(n))
    for i in 1:1000
        r̄2 .= 1.0
        H̄2 .= 0.0
        cpu_residual!(r, H, n)
        Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r, r̄2), DupNN(H, H̄2), Const(n))
        if H̄1 != H̄2
            @show H̄1
            @show H̄1 .- H̄2
            error("CPU: non-deterministic results at iteration $i")
        end
    end
    println("CPU: no errors")
    return
end

cpu_runme()
Interesting. Would you be able to figure out the correct result (e.g. from the first run) so we can check against it absolutely? That would also make it easier to figure out where it goes wrong.
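For instance, a sketch of such an absolute check against the single-threaded CPU path (not from the thread; it assumes the arrays H, H̄1 and n from gpu_runme are accessible, e.g. by running the MWE bodies at top level):

# Sketch only: compute a reference pullback on the CPU and compare the GPU shadow against it.
# Assumes H, H̄1 and n from gpu_runme are accessible, e.g. the MWE is run at top level.
H_cpu = Array(H)
r_cpu = zeros(Float64, size(H_cpu))
r̄_ref = ones(Float64, size(H_cpu))
H̄_ref = zeros(Float64, size(H_cpu))
Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r_cpu, r̄_ref), DupNN(H_cpu, H̄_ref), Const(n))
@show maximum(abs.(Array(H̄1) .- H̄_ref))   # absolute deviation of the GPU pullback from the CPU reference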
Sorry for the false alarm, I suspect the issue is not Enzyme-related but floating-point-precision related. I realised that I have a function whose values have large magnitude but very low sensitivity to its arguments, and these then get multiplied by huge values on the order of the primal function values in the solver. The CPU code still produces consistent values even when they're numerical garbage, while CUDA seems to apply more relaxed floating-point rules and so produces different results between kernel invocations. I didn't expect to have such issues with Float64, tbh 😅 Closing the issue for now.
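A minimal illustration of the kind of sensitivity meant here (an analogy only, not a reproduction of the kernel behaviour): contracting a*b + c into a fused multiply-add changes the rounding, and such last-bit differences get amplified when multiplied by huge primal values.

# Illustration only: one rounding (fma) vs. two roundings (multiply, then add).
a = 1.0 + 2.0^-30
b = 1.0 - 2.0^-30
c = -1.0
unfused = a * b + c      # a*b rounds to 1.0, so the sum is 0.0
fused   = fma(a, b, c)   # a*b + c computed exactly, then rounded once: -2.0^-60
@show unfused fused
@show unfused == fused   # false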
Julia version
The same behaviour appears on Julia 1.11.2.
CUDA version
Enzyme version
Enzyme v0.13.24