When computing pullbacks of CUDA kernels, the results are sometimes incorrect; how often depends on the code and the launch configuration. The same pullbacks are computed correctly on a single-threaded CPU. When the block size is reduced, the bug appears less often or not at all. Could it be that register spills aren't handled correctly in the AD path? The forward kernel in my example uses ~90 registers, and the adjoint kernel uses ~190 registers.
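For reference, here is a minimal sketch (not part of the original report) of how such register counts can be queried with CUDA.jl's kernel reflection utilities, assuming the gpu_residual! kernel, the ∇ wrapper, and the arrays r, H, r̄1, H̄1, n from the MWE below are already defined:

# Sketch only: compile both kernels without launching them and query register usage.
# Assumes gpu_residual!, ∇, DupNN and the arrays r, H, r̄1, H̄1, n from the MWE below.
fwd = @cuda launch = false gpu_residual!(r, H, n)
adj = @cuda launch = false ∇(gpu_residual!, DupNN(r, r̄1), DupNN(H, H̄1), Const(n))
@show CUDA.registers(fwd)   # forward kernel, ~90 registers reported above
@show CUDA.registers(adj)   # adjoint kernel, ~190 registers reported above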
MWE
Sorry for the long snippet; the bug is triggered consistently in my main computational code, and this is my best attempt at reducing the amount of code while still triggering the bug:
using Enzyme, CUDA, StaticArrays

# helper functions for static arrays
δˣₐ(S::StaticMatrix{3,<:Any}) = S[SVector(2, 3), :] .- S[SVector(1, 2), :]
δʸₐ(S::StaticMatrix{<:Any,3}) = S[:, SVector(2, 3)] .- S[:, SVector(1, 2)]

δˣ(S::StaticMatrix{3,3}) = S[SVector(2, 3), SVector(2)] .- S[SVector(1, 2), SVector(2)]
δʸ(S::StaticMatrix{3,3}) = S[SVector(2), SVector(2, 3)] .- S[SVector(2), SVector(1, 2)]

δˣ(S::StaticMatrix{2,1}) = S[2] - S[1]
δʸ(S::StaticMatrix{1,2}) = S[2] - S[1]

function av4(S::StaticMatrix{2,3})
    0.25 .* (S[SVector(1), SVector(1, 2)] .+ S[SVector(2), SVector(1, 2)] .+
             S[SVector(1), SVector(2, 3)] .+ S[SVector(2), SVector(2, 3)])
end

function av4(S::StaticMatrix{3,2})
    0.25 .* (S[SVector(1, 2), SVector(1)] .+ S[SVector(1, 2), SVector(2)] .+
             S[SVector(2, 3), SVector(1)] .+ S[SVector(2, 3), SVector(2)])
end

innˣ(S::StaticMatrix{3,<:Any}) = S[SVector(2), :]
innʸ(S::StaticMatrix{<:Any,3}) = S[:, SVector(2)]

# extract 3x3 stencil
function st3x3(M, ix, iy)
    nx, ny = oftype.((ix, iy), size(M))
    # neighbor indices
    di = oneunit(ix)
    dj = oneunit(iy)
    iW = max(ix - di, di)
    iE = min(ix + di, nx)
    iS = max(iy - dj, dj)
    iN = min(iy + dj, ny)
    return SMatrix{3,3}(M[iW, iS], M[ix, iS], M[iE, iS],
                        M[iW, iy], M[ix, iy], M[iE, iy],
                        M[iW, iN], M[ix, iN], M[iE, iN])
end

# Enzyme utils
∇(fun, args::Vararg{Any,N}) where {N} = (Enzyme.autodiff_deferred(Enzyme.Reverse, Const(fun), Const, args...); return)
const DupNN = DuplicatedNoNeed

function residual(H, n)
    # surface gradient
    ∇Hˣ = δˣₐ(H)
    ∇Hʸ = δʸₐ(H)
    # surface gradient magnitude
    ∇Sˣ = sqrt.(innʸ(∇Hˣ) .^ 2 .+ av4(∇Hʸ) .^ 2) .^ (n - 1)
    ∇Sʸ = sqrt.(av4(∇Hˣ) .^ 2 .+ innˣ(∇Hʸ) .^ 2) .^ (n - 1)
    qˣ = ∇Sˣ .* δˣ(H .^ n)
    qʸ = ∇Sʸ .* δʸ(H .^ n)
    r = δˣ(qˣ) + δʸ(qʸ)
    return r
end

function gpu_residual!(r, H, n)
    ix = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - Int32(1)) * blockDim().y + threadIdx().y
    Hₗ = st3x3(H, ix, iy)
    r[ix, iy] = residual(Hₗ, n)
    return
end

function gpu_runme()
    nthreads = 32, 8 # triggered much less often with 32, 4
    nx, ny = nthreads
    # power law exponent
    n = 3
    # arrays
    H = CuArray(collect(2000 * Float64(i + j) for i in 1:nx, j in 1:ny))
    r = CUDA.zeros(Float64, nx, ny)
    # shadows
    r̄1 = CUDA.ones(Float64, nx, ny)
    r̄2 = CUDA.ones(Float64, nx, ny)
    H̄1 = CUDA.zeros(Float64, nx, ny)
    H̄2 = CUDA.zeros(Float64, nx, ny)
    @cuda threads = nthreads gpu_residual!(r, H, n)
    @cuda threads = nthreads ∇(gpu_residual!, DupNN(r, r̄1), DupNN(H, H̄1), Const(n))
    for i in 1:1000
        r̄2 .= 1.0
        H̄2 .= 0.0
        @cuda threads = nthreads gpu_residual!(r, H, n)
        @cuda threads = nthreads ∇(gpu_residual!, DupNN(r, r̄2), DupNN(H, H̄2), Const(n))
        if H̄1 != H̄2
            display(H̄1)
            display(H̄2)
            display(H̄1 .- H̄2)
            error("CUDA: non-deterministic results at iteration $i")
        end
    end
    println("CUDA: no errors")
    return
end

gpu_runme()

# The code below is optional
function cpu_residual!(r, H, n)
    nx, ny = size(H)
    for ix in 1:nx, iy in 1:ny
        Hₗ = st3x3(H, ix, iy)
        r[ix, iy] = residual(Hₗ, n)
    end
    return
end

function cpu_runme()
    nx, ny = 32, 8
    # power law exponent
    n = 3
    # arrays
    H = collect(2000 * Float64(i + j) for i in 1:nx, j in 1:ny)
    r = zeros(Float64, nx, ny)
    # shadows
    r̄1 = ones(Float64, nx, ny)
    r̄2 = ones(Float64, nx, ny)
    H̄1 = zeros(Float64, nx, ny)
    H̄2 = zeros(Float64, nx, ny)
    cpu_residual!(r, H, n)
    Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r, r̄1), DupNN(H, H̄1), Const(n))
    for i in 1:1000
        r̄2 .= 1.0
        H̄2 .= 0.0
        cpu_residual!(r, H, n)
        Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r, r̄2), DupNN(H, H̄2), Const(n))
        if H̄1 != H̄2
            @show H̄1
            @show H̄1 .- H̄2
            error("CPU: non-deterministic results at iteration $i")
        end
    end
    println("CPU: no errors")
    return
end

cpu_runme()
Interesting. Would you be able to figure out the correct result (e.g. from the first run) so we can check against it absolutely? That would also make it easier to figure out where it goes wrong.
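For instance, a sketch of such an absolute check against the single-threaded CPU path (not from the thread; it assumes the arrays H, H̄1 and n from gpu_runme are accessible, e.g. by running the MWE bodies at top level):

# Sketch only: compute a reference pullback on the CPU and compare the GPU shadow against it.
# Assumes H, H̄1 and n from gpu_runme are accessible, e.g. the MWE is run at top level.
H_cpu = Array(H)
r_cpu = zeros(Float64, size(H_cpu))
r̄_ref = ones(Float64, size(H_cpu))
H̄_ref = zeros(Float64, size(H_cpu))
Enzyme.autodiff(Enzyme.Reverse, Const(cpu_residual!), DupNN(r_cpu, r̄_ref), DupNN(H_cpu, H̄_ref), Const(n))
@show maximum(abs.(Array(H̄1) .- H̄_ref))   # absolute deviation of the GPU pullback from the CPU reference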
Sorry for the false alarm, I suspect the issue is not Enzyme-related but floating-point-precision related. I realised that I have a function whose values have large magnitude but very low sensitivity to its arguments, and these then get multiplied by huge values on the order of the primal function values in the solver. The CPU code still produces consistent values even when they're numerical garbage, while CUDA seems to apply more relaxed floating-point rules and so produces different results between kernel invocations. I didn't expect to have such issues with Float64, tbh 😅 Closing the issue for now.
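A minimal illustration of the kind of sensitivity meant here (an analogy only, not a reproduction of the kernel behaviour): contracting a*b + c into a fused multiply-add changes the rounding, and such last-bit differences get amplified when multiplied by huge primal values.

# Illustration only: one rounding (fma) vs. two roundings (multiply, then add).
a = 1.0 + 2.0^-30
b = 1.0 - 2.0^-30
c = -1.0
unfused = a * b + c      # a*b rounds to 1.0, so the sum is 0.0
fused   = fma(a, b, c)   # a*b + c computed exactly, then rounded once: -2.0^-60
@show unfused fused
@show unfused == fused   # false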
Julia version
The same behaviour appears on Julia 1.11.2.
CUDA version
Enzyme version
Enzyme v0.13.24