
Non-deterministic results with reverse-mode autodiff on CUDA #2210

Closed · utkinis opened this issue Dec 18, 2024 · 2 comments

utkinis commented Dec 18, 2024

Bug description

When computing pullbacks of CUDA kernels, the results are sometimes incorrect; how often depends on the code and on the launch configuration. The same pullbacks are computed correctly on a single-threaded CPU. When the block size is reduced, the bug appears less often or not at all. Could it be that register spills aren't handled correctly in the AD path? The forward kernel in my example uses ~90 registers, and the adjoint kernel uses ~190 registers.
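
For completeness, the register counts quoted above can be queried with CUDA.jl's kernel introspection. This is a minimal sketch, assuming the MWE below has been loaded and that the ∇ wrapper and the shadow arrays r̄1 and H̄1 from gpu_runme are in scope:

# compile without launching and inspect register usage of the forward kernel
k_fwd = @cuda launch=false gpu_residual!(r, H, n)
println("forward kernel registers: ", CUDA.registers(k_fwd))

# same for the adjoint (reverse-pass) kernel produced through the ∇ wrapper
k_adj = @cuda launch=false ∇(gpu_residual!, DupNN(r, r̄1), DupNN(H, H̄1), Const(n))
println("adjoint kernel registers: ", CUDA.registers(k_adj))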

MWE

Sorry for the long snippet; the bug is triggered consistently by my main computational code, and this is my best attempt at reducing the amount of code while still reproducing it:

using Enzyme, CUDA, StaticArrays

# helper functions for static arrays
δˣₐ(S::StaticMatrix{3,<:Any}) = S[SVector(2, 3), :] .- S[SVector(1, 2), :]
δʸₐ(S::StaticMatrix{<:Any,3}) = S[:, SVector(2, 3)] .- S[:, SVector(1, 2)]

δˣ(S::StaticMatrix{3,3}) = S[SVector(2, 3), SVector(2)] .- S[SVector(1, 2), SVector(2)]
δʸ(S::StaticMatrix{3,3}) = S[SVector(2), SVector(2, 3)] .- S[SVector(2), SVector(1, 2)]

δˣ(S::StaticMatrix{2,1}) = S[2] - S[1]
δʸ(S::StaticMatrix{1,2}) = S[2] - S[1]

function av4(S::StaticMatrix{2,3})
    0.25 .* (S[SVector(1), SVector(1, 2)] .+ S[SVector(2), SVector(1, 2)] .+
             S[SVector(1), SVector(2, 3)] .+ S[SVector(2), SVector(2, 3)])
end

function av4(S::StaticMatrix{3,2})
    0.25 .* (S[SVector(1, 2), SVector(1)] .+ S[SVector(1, 2), SVector(2)] .+
             S[SVector(2, 3), SVector(1)] .+ S[SVector(2, 3), SVector(2)])
end

innˣ(S::StaticMatrix{3,<:Any}) = S[SVector(2), :]
innʸ(S::StaticMatrix{<:Any,3}) = S[:, SVector(2)]

# extract 3x3 stencil
function st3x3(M, ix, iy)
    nx, ny = oftype.((ix, iy), size(M))
    # neighbor indices
    di = oneunit(ix)
    dj = oneunit(iy)
    iW = max(ix - di, di)
    iE = min(ix + di, nx)
    iS = max(iy - dj, dj)
    iN = min(iy + dj, ny)
    return SMatrix{3,3}(M[iW, iS], M[ix, iS], M[iE, iS],
                        M[iW, iy], M[ix, iy], M[iE, iy],
                        M[iW, iN], M[ix, iN], M[iE, iN])
end

# Enzyme utils: ∇ wraps a kernel so that launching ∇(kernel, args...) runs its reverse pass on the device
∇(fun, args::Vararg{Any,N}) where {N} = (Enzyme.autodiff_deferred(Enzyme.Reverse, Const(fun), Const, args...); return)
const DupNN = DuplicatedNoNeed

function residual(H, n)
    # surface gradient
    ∇Hˣ = δˣₐ(H)
    ∇Hʸ = δʸₐ(H)

    # surface gradient magnitude
    ∇Sˣ = sqrt.(innʸ(∇Hˣ) .^ 2 .+ av4(∇Hʸ) .^ 2) .^ (n - 1)
    ∇Sʸ = sqrt.(av4(∇Hˣ) .^ 2 .+ innˣ(∇Hʸ) .^ 2) .^ (n - 1)

    qˣ = ∇Sˣ .* δˣ(H .^ n)
    qʸ = ∇Sʸ .* δʸ(H .^ n)

    r = δˣ(qˣ) + δʸ(qʸ)

    return r
end

function gpu_residual!(r, H, n)
    ix = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - Int32(1)) * blockDim().y + threadIdx().y

    Hₗ        = st3x3(H, ix, iy)
    r[ix, iy] = residual(Hₗ, n)

    return
end

function gpu_runme()
    nthreads = 32, 8 # triggered much less often with 32, 4
    nx, ny   = nthreads
    # power law exponent
    n = 3
    # arrays
    H = CuArray(collect(2000 * Float64(i + j) for i in 1:nx, j in 1:ny))
    r = CUDA.zeros(Float64, nx, ny)
    # shadows
    r̄1 = CUDA.ones(Float64, nx, ny)
    r̄2 = CUDA.ones(Float64, nx, ny)
    H̄1 = CUDA.zeros(Float64, nx, ny)
    H̄2 = CUDA.zeros(Float64, nx, ny)

    @cuda threads = nthreads gpu_residual!(r, H, n)
    @cuda threads = nthreads ∇(gpu_residual!, DupNN(r, r̄1), DupNN(H, H̄1), Const(n))

    for i in 1:1000
        r̄2 .= 1.0
        H̄2 .= 0.0
        @cuda threads = nthreads gpu_residual!(r, H, n)
        @cuda threads = nthreads ∇(gpu_residual!, DupNN(r, r̄2), DupNN(H, H̄2), Const(n))

        if H̄1 != H̄2
            display(H̄1)
            display(H̄2)
            display(H̄1 .- H̄2)
            error("CUDA: non-deterministic results at iteration $i")
        end
    end

    println("CUDA: no errors")

    return
end

gpu_runme()

# The code below is optional
function cpu_residual!(r, H, n)
    nx, ny = size(H)
    for ix in 1:nx, iy in 1:ny
        Hₗ        = st3x3(H, ix, iy)
        r[ix, iy] = residual(Hₗ, n)
    end
    return
end

function cpu_runme()
    nx, ny = 32, 8
    # power law exponent
    n = 3
    # arrays
    H = collect(2000 * Float64(i + j) for i in 1:nx, j in 1:ny)
    r = zeros(Float64, nx, ny)
    # shadows
    r̄1 = ones(Float64, nx, ny)
    r̄2 = ones(Float64, nx, ny)
    H̄1 = zeros(Float64, nx, ny)
    H̄2 = zeros(Float64, nx, ny)

    cpu_residual!(r, H, n)
    Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r, r̄1), DupNN(H, H̄1), Const(n))

    for i in 1:1000
        r̄2 .= 1.0
        H̄2 .= 0.0
        cpu_residual!(r, H, n)
        Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r, r̄2), DupNN(H, H̄2), Const(n))

        if H̄1 != H̄2
            @show H̄1
            @show H̄1 .- H̄2
            error("CPU: non-deterministic results at iteration $i")
        end
    end

    println("CPU: no errors")

    return
end

cpu_runme()

Julia version

The same behaviour appears on Julia 1.11.2.

Julia Version 1.10.7
Commit 4976d05258e (2024-11-26 15:57 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD EPYC 7282 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 32 default, 0 interactive, 16 GC (on 32 virtual cores)
Environment:
  JULIA_HDF5_PATH = /scratch-1/soft/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.4.0/hdf5-1.14.3-hzicuphosrqicmxhl66fzdjdkjkmzxdy
  JULIA_NUM_THREADS = 4
  JULIA_LOAD_PATH = :/scratch-1/julia_prefs/:/scratch-1/julia_prefs/
  JULIA_EDITOR = code

CUDA version

CUDA runtime 12.3, local installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 2023.3.1 (API 21.0.0)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0
- CUDA_Runtime_Discovery: 0.3.5

Toolchain:
- Julia: 1.10.7
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: local

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 38.895 GiB / 40.000 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.385 GiB / 40.000 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 30.823 GiB / 40.000 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 30.848 GiB / 40.000 GiB available)
  4: NVIDIA A100-SXM4-40GB (sm_80, 30.848 GiB / 40.000 GiB available)
  5: NVIDIA A100-SXM4-40GB (sm_80, 30.848 GiB / 40.000 GiB available)
  6: NVIDIA A100-SXM4-40GB (sm_80, 39.385 GiB / 40.000 GiB available)
  7: NVIDIA A100-SXM4-40GB (sm_80, 39.385 GiB / 40.000 GiB available)

Enzyme version

Enzyme v0.13.24

wsmoses (Member) commented Dec 18, 2024

Interesting. Would you be able to figure out the correct result (e.g. from the first run) so we can check against it absolutely? That would also make it easier to figure out where it goes wrong.
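
A minimal sketch of such a check, assuming the CPU path (cpu_residual!) from the MWE above is loaded; cpu_reference and H̄_ref are names introduced here only for illustration:

# compute a reference pullback with the single-threaded CPU path
function cpu_reference(H_host, n)
    nx, ny = size(H_host)
    r  = zeros(Float64, nx, ny)
    r̄  = ones(Float64, nx, ny)
    H̄  = zeros(Float64, nx, ny)
    Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r, r̄), DupNN(H_host, H̄), Const(n))
    return H̄
end

# before the GPU loop: compute the reference once
# H̄_ref = cpu_reference(Array(H), n)
# inside the GPU loop: compare each run against it
# Array(H̄2) ≈ H̄_ref || error("GPU pullback deviates from the CPU reference at iteration $i")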

utkinis (Author) commented Dec 19, 2024

Sorry for the false alarm; I suspect the issue is not Enzyme-related but floating-point precision related. I realised that I have a function whose values are large in magnitude but have very low sensitivity to the arguments, and these get multiplied by huge values on the order of the primal function values in the solver. The CPU code still produces consistent values even when they are numerical garbage, while CUDA seems to follow more relaxed floating-point rules and so produces different results between kernel invocations. I didn't expect to have such issues with Float64, to be honest 😅 Closing the issue for now.
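
Given that diagnosis, exact equality between runs is stricter than what the device arithmetic guarantees, so a tolerance-based comparison would separate genuine AD bugs from floating-point noise. A minimal sketch of such a check for the GPU loop in the MWE (the rtol value is chosen for illustration only):

# compare the two GPU pullbacks up to a relative tolerance instead of bitwise equality
if !isapprox(H̄1, H̄2; rtol = 1e-12)
    error("CUDA: results differ beyond floating-point noise at iteration $i")
end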

utkinis closed this as completed Dec 19, 2024