ceil or % error in cuda backpropagation #1259

Closed
jakubMitura14 opened this issue Jan 29, 2024 · 14 comments

@jakubMitura14

jakubMitura14 commented Jan 29, 2024

Hello, I have a strange issue when differentiating through a CUDA kernel. The kernel itself executes fine, but during backprop the application breaks if the kernel uses ceil, floor, the "%" operation, or

ccall("extern __nv_ceilf", llvmcall, Cfloat, (Cfloat,), x)

However, when I use round, everything compiles without problems.
Usually Julia just crashes or reports issues with the garbage collector.
I use Julia 1.10, Ubuntu 20.04, an RTX 3090 GPU, and fresh installations of CUDA (v5.1.2) and Enzyme (v0.11.12).

Code to reproduce:

using Pkg
using ChainRulesCore,Zygote,CUDA,Enzyme
using KernelAbstractions
using Zygote, Lux,LuxCUDA
using Lux, Random
import NNlib, Optimisers, Plots, Random, Statistics, Zygote
using FillArrays
using LinearAlgebra
using Images,ImageFiltering
Pkg.add(url="https://github.com/EnzymeAD/Enzyme.jl.git")

#### test data
Nx, Ny, Nz = 8, 8, 8
threads = (2,2,2)
blocks = (1, 1, 1)
rng = Random.default_rng()

function sigmoid(x::Float32)::Float32
    return 1 / (1 + exp(-x))
end

#### main dummy kernel
function testKern(prim_A,A, p, Aout,Nx)
    x = (threadIdx().x + ((blockIdx().x - 1) * CUDA.blockDim_x())) 
    y = (threadIdx().y + ((blockIdx().y - 1) * CUDA.blockDim_y())) 
    z = (threadIdx().z + ((blockIdx().z - 1) * CUDA.blockDim_z())) 

    Aout[(x-1)*4+(y-1)*2+z]=(ceil(A[x,y,z]))

    return nothing
end

function testKernDeff( prim_A,dprim_A,A, dA, p
    , dp, Aout
    , dAout,Nx)
    Enzyme.autodiff_deferred(Reverse,testKern, Const, Duplicated(prim_A, dprim_A),Duplicated(A, dA), Duplicated(p, dp), Duplicated(Aout, dAout),Const(Nx) )
    return nothing
end


function calltestKern(prim_A,A, p,Nx)
    Aout = CUDA.zeros(Float32,8) 
    @cuda threads = threads blocks = blocks testKern(prim_A, A, p,  Aout,Nx)
    return Aout
end



# rrule for ChainRules.
function ChainRulesCore.rrule(::typeof(calltestKern),prim_A, A, p,Nx)
    

    Aout = calltestKern(prim_A,A, p,Nx)
    function call_test_kernel1_pullback(dAout)
        threads = (2, 2,2)
        blocks = (1, 1, 1)

        dp = CUDA.ones(size(p))
        dprim_A = CUDA.ones(size(prim_A))
        dA = CUDA.ones(size(A))
        @cuda threads = threads blocks = blocks testKernDeff(prim_A,dprim_A, A, dA, p, dp, Aout, CuArray(collect(dAout)),Nx)

        f̄ = NoTangent()
        x̄ = dA
        ȳ = dp
        
        return dprim_A,f̄, x̄, ȳ,NoTangent()
    end   
    return Aout, call_test_kernel1_pullback

end


#lux layers from http://lux.csail.mit.edu/dev/manual/interface/
struct KernelAstr<: Lux.AbstractExplicitLayer
    confA::Int
end

function KernelA(confA)
    return KernelAstr(confA)
end

function Lux.initialparameters(rng::AbstractRNG, l::KernelAstr)
    return (paramsA=CuArray(rand(rng,Float32, 3,8))
    ,Nx =l.confA )
end

function Lux.initialstates(::AbstractRNG, l::KernelAstr)::NamedTuple
    return (NxSt=l.confA , )
end

function (l::KernelAstr)(x, ps, st::NamedTuple)
    x,prim_a= x
    return calltestKern(prim_a,x, ps.paramsA,ps.Nx),st
end




conv1 = (in, out) -> Lux.Conv((3,3,3),  in => out , NNlib.tanh, stride=1, pad=Lux.SamePad())
conv2 = (in, out) -> Lux.Conv((3,3,3),  in => out , NNlib.tanh, stride=2, pad=Lux.SamePad())

# model = Lux.Chain(KernelA(Nx),KernelA(Nx)) 
function connection_before_kernelA(x,y)
    return (x,y)
end


arr = collect(range(1, stop = Nx*Ny*Nz))
arr=reshape(arr,(Nx,Ny,Nz,1,1))
arr=Float32.(arr)
x = arr
x= CuArray(x)

dev = gpu_device()
model = Lux.Chain(SkipConnection(Lux.Chain(conv1(1,3),conv2(3,3),conv2(3,3)) , connection_before_kernelA; name="prim_convs"),KernelA(Nx)) 

ps, st = Lux.setup(rng, model) .|> dev
opt = Optimisers.Adam(0.03)
opt_st = Optimisers.setup(opt, ps) |> dev
vjp_rule = Lux.Training.AutoZygote()
y_pred, st = Lux.apply(model, x, ps, st)

"""
extremely simple loss function we just want to get the result to be as close to 100 as possible
"""
function loss_function(model, ps, st, x)
    y_pred, st = Lux.apply(model, x, ps, st)
    return (sum(y_pred)), st, ()

end

function main(ps, st,opt,opt_st , vjp, data,model,
    epochs::Int)
    x = CuArray(data) #.|> Lux.gpu
    for epoch in 1:epochs

        (loss, st), back = Zygote.pullback(p -> loss_function(model, p, st, x), ps)
        gs = back((one(loss), nothing))[1]
        opt_st, ps = Optimisers.update(opt_st, ps, gs)

        @info epoch=epoch loss=loss 
    end
    return ps, st,opt,opt_st 
end
# one epoch just to check if it runs
ps, st,opt,opt_st  = main(ps, st,opt,opt_st , vjp_rule, x,model,1)

Using the ceil function in the kernel gives an error like the one below:

ERROR: InvalidIRError: compiling MethodInstance for testKernDeff(::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceMatrix{…}, ::CuDeviceMatrix{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
 [1] #ceil
   @ ~/.julia/packages/CUDA/rXson/src/device/intrinsics/math.jl:286
 [2] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:36
 [3] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
 [4] diffejulia_testKern_11872_inner2wrap
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
  [1] #ceil
    @ ~/.julia/packages/CUDA/rXson/src/device/intrinsics/math.jl:286
  [2] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:36
  [3] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [4] diffejulia_testKern_11872_inner2wrap
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [5] macro expansion
    @ ~/.julia/packages/Enzyme/KJgKj/src/compiler.jl:5310
  [6] enzyme_call
    @ ~/.julia/packages/Enzyme/KJgKj/src/compiler.jl:4988
  [7] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/KJgKj/src/compiler.jl:4930
  [8] autodiff_deferred
    @ ~/.julia/packages/Enzyme/KJgKj/src/Enzyme.jl:366
  [9] autodiff_deferred
    @ ~/.julia/packages/Enzyme/KJgKj/src/Enzyme.jl:436
 [10] testKernDeff
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:44
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:147
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:440 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:439 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/utils.jl:92
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/U36Ed/src/utils.jl:86 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:129
  [8] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:106
  [9] compile
    @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:98 [inlined]
 [10] #1075
    @ ~/.julia/packages/CUDA/rXson/src/compiler/compilation.jl:247 [inlined]
 [11] JuliaContext(f::CUDA.var"#1075#1077"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
 [12] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/rXson/src/compiler/compilation.jl:246
 [13] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
 [14] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
 [15] macro expansion
    @ ~/.julia/packages/CUDA/rXson/src/compiler/execution.jl:359 [inlined]
 [16] macro expansion
    @ ./lock.jl:267 [inlined]
 [17] cufunction(f::typeof(testKernDeff), tt::Type{Tuple{…}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/rXson/src/compiler/execution.jl:354
 [18] cufunction
    @ ~/.julia/packages/CUDA/rXson/src/compiler/execution.jl:351 [inlined]
 [19] macro expansion
    @ ~/.julia/packages/CUDA/rXson/src/compiler/execution.jl:104 [inlined]
 [20] (::var"#call_test_kernel1_pullback#17"{…})(dAout::CuArray{…})
    @ Main /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:69
 [21] ZBack
    @ ~/.julia/packages/Zygote/jxHJc/src/compiler/chainrules.jl:211 [inlined]
 [22] KernelAstr
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:102 [inlined]
 [23] apply
    @ ~/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:115 [inlined]
 [24] macro expansion
    @ ~/.julia/packages/Lux/UQeEs/src/layers/containers.jl:0 [inlined]
 [25] applychain
    @ ~/.julia/packages/Lux/UQeEs/src/layers/containers.jl:480 [inlined]
 [26] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Tuple{CuArray{…}, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [27] Chain
    @ Lux ~/.julia/packages/Lux/UQeEs/src/layers/containers.jl:478 [inlined]
 [28] apply
    @ LuxCore ~/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:115 [inlined]
 [29] loss_function
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:136 [inlined]
 [30] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Tuple{Float32, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [31] #22
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:146 [inlined]
 [32] (::Zygote.Pullback{Tuple{var"#22#23"{…}, @NamedTuple{…}}, Any})(Δ::Tuple{Float32, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [33] (::Zygote.var"#75#76"{Zygote.Pullback{Tuple{var"#22#23"{…}, @NamedTuple{…}}, Any}})(Δ::Tuple{Float32, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface.jl:91
 [34] main(ps::@NamedTuple{…}, st::@NamedTuple{…}, opt::Optimisers.Adam, opt_st::@NamedTuple{…}, vjp::ADTypes.AutoZygote, data::CuArray{…}, model::Chain{…}, epochs::Int64)
    @ Main /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:147
 [35] top-level scope
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:155
Some type information was truncated. Use `show(err)` to see complete types.

Using x % 1 in the kernel gives an error like the one below; that is, changing the line

Aout[(x-1)*4+(y-1)*2+z]=(ceil(A[x,y,z]))

into

Aout[(x-1)*4+(y-1)*2+z]=(A[x,y,z]%1)

julia: /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:1953: const AugmentedReturn& EnzymeLogic::CreateAugmentedPrimal(RequestContext, llvm::Function*, DIFFE_TYPE, llvm::ArrayRef<DIFFE_TYPE>, TypeAnalysis&, bool, bool, const FnTypeInfo&, std::vector<bool>, bool, unsigned int, bool, bool): Assertion `_overwritten_args.size() == todiff->arg_size()' failed.

[16605] signal (6.-6): Aborted
in expression starting at /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/lin_sampl/lin_sampl_main_run.jl:83
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7fed2beb8728)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
CreateAugmentedPrimal at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:1953
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:5284
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6315
delegateCallInst at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:301 [inlined]
visitCall at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/Instruction.def:209 [inlined]
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/Instruction.def:209
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreatePrimalAndGradient at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4238
EnzymeCreatePrimalAndGradient at /workspace/srcdir/Enzyme/enzyme/Enzyme/CApi.cpp:583
EnzymeCreatePrimalAndGradient at /home/jm/.julia/packages/Enzyme/KJgKj/src/api.jl:141
unknown function (ip: 0x7feb6bb93a1b)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
enzyme! at /home/jm/.julia/packages/Enzyme/KJgKj/src/compiler.jl:3133
unknown function (ip: 0x7feb6bb8ef19)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#codegen#469 at /home/jm/.julia/packages/Enzyme/KJgKj/src/compiler.jl:4767
codegen at /home/jm/.julia/packages/Enzyme/KJgKj/src/compiler.jl:4348 [inlined]
#153 at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:255
get! at ./dict.jl:479
unknown function (ip: 0x7feb6bb5f900)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
macro expansion at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:254 [inlined]
#emit_llvm#152 at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/utils.jl:92
unknown function (ip: 0x7feb8c132796)
unknown function (ip: 0x7feb8d8dcb49)
unknown function (ip: 0x7feb8d8dcb1f)
emit_llvm at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/utils.jl:86 [inlined]
#codegen#150 at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:129
codegen at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:110
unknown function (ip: 0x7feb8d8dcbe0)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#compile#149 at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:106
compile at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:98 [inlined]
#1075 at /home/jm/.julia/packages/CUDA/rXson/src/compiler/compilation.jl:247 [inlined]
JuliaContext at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
unknown function (ip: 0x7feb8d8d9915)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
compile at /home/jm/.julia/packages/CUDA/rXson/src/compiler/compilation.jl:246
actual_compilation at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
unknown function (ip: 0x7feb8d8d9399)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
cached_compilation at /home/jm/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
macro expansion at /home/jm/.julia/packages/CUDA/rXson/src/compiler/execution.jl:359 [inlined]
macro expansion at ./lock.jl:267 [inlined]
#cufunction#1097 at /home/jm/.julia/packages/CUDA/rXson/src/compiler/execution.jl:354
cufunction at /home/jm/.julia/packages/CUDA/rXson/src/compiler/execution.jl:351 [inlined]
macro expansion at /home/jm/.julia/packages/CUDA/rXson/src/compiler/execution.jl:104 [inlined]
call_test_kernel1_pullback at /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/lin_sampl/dif_custom_kern.jl:42
ZBack at /home/jm/.julia/packages/Zygote/jxHJc/src/compiler/chainrules.jl:211 [inlined]
KernelAstr at /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/lin_sampl/Lux_model.jl:40 [inlined]
apply at /home/jm/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:115 [inlined]
macro expansion at /home/jm/.julia/packages/Lux/UQeEs/src/layers/containers.jl:0 [inlined]
applychain at /home/jm/.julia/packages/Lux/UQeEs/src/layers/containers.jl:480 [inlined]
Pullback at /home/jm/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
Chain at /home/jm/.julia/packages/Lux/UQeEs/src/layers/containers.jl:478 [inlined]
apply at /home/jm/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:115 [inlined]
loss_function at /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/lin_sampl/lin_sampl_main_run.jl:55 [inlined]
Pullback at /home/jm/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
unknown function (ip: 0x7feb6bb0143e)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#25 at /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/lin_sampl/lin_sampl_main_run.jl:65 [inlined]
Pullback at /home/jm/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
#75 at /home/jm/.julia/packages/Zygote/jxHJc/src/compiler/interface.jl:91
unknown function (ip: 0x7feb8c1d6295)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
main at /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/lin_sampl/lin_sampl_main_run.jl:66
unknown function (ip: 0x7feb8c199288)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2070
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46343.1 at /home/jm/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_82703.1 at /home/jm/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 79137883 (Pool: 79030029; Big: 107854); GC: 62
Aborted (core dumped)

Changing the line

Aout[(x-1)*4+(y-1)*2+z]=(A[x,y,z]%1)

into

Aout[(x-1)*4+(y-1)*2+z]=(round(A[x,y,z]))

makes everything work.
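
To summarize the observed behavior (same index expression as in the kernel above; the forward kernel always runs, only the Enzyme reverse pass fails):

Aout[(x-1)*4+(y-1)*2+z] = ceil(A[x,y,z])   # backprop fails: InvalidIRError, unsupported call through a literal pointer
Aout[(x-1)*4+(y-1)*2+z] = A[x,y,z] % 1     # backprop fails: Enzyme assertion in CreateAugmentedPrimal, then the process aborts
Aout[(x-1)*4+(y-1)*2+z] = round(A[x,y,z])  # backprop works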

@wsmoses
Member

wsmoses commented Jan 29, 2024

This should hopefully fix the ceil one: EnzymeAD/Enzyme#1647

The mod one is more strange.

@jakubMitura14
Author

Thank you for the fast response, @wsmoses!
For the record, after getting Enzyme from GitHub my version is now Enzyme v0.11.14 (https://github.com/EnzymeAD/Enzyme.jl.git#main), and ceil is still not working.

@vchuravy
Member

You need to either build Enzyme from source (https://github.com/EnzymeAD/Enzyme.jl/blob/main/deps/build_local.jl)
or wait until we release a new version of Enzyme proper.
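
For reference, a rough sketch of what building from source with that script looks like (hedged: the exact invocation may differ and the script itself is the authoritative reference; it builds libEnzyme locally and writes a LocalPreferences.toml that points Enzyme_jll at the result):

# hedged sketch, assuming a local clone of Enzyme.jl
# git clone https://github.com/EnzymeAD/Enzyme.jl
# cd Enzyme.jl
# julia --project=deps deps/build_local.jl
# afterwards, restart Julia in an environment that picks up the generated LocalPreferences.toml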

@wsmoses
Member

wsmoses commented Feb 10, 2024

So the new Enzyme_jll has now landed, so you should see the fix if you use Enzyme.jl#main.

However, I cannot reproduce your issue locally; instead I run into:

julia> model = Lux.Chain(SkipConnection(Lux.Chain(conv1(1,3),conv2(3,3),conv2(3,3)) , connection_before_kernelA; name="prim_convs"),KernelA(Nx))
Chain(
    layer_1 = prim_convs(
        Chain(
            layer_1 = Conv((3, 3, 3), 1 => 3, tanh_fast, pad=1),  # 84 parameters
            layer_2 = Conv((3, 3, 3), 3 => 3, tanh_fast, pad=1, stride=2),  # 246 parameters
            layer_3 = Conv((3, 3, 3), 3 => 3, tanh_fast, pad=1, stride=2),  # 246 parameters
        ),
        connection_before_kernelA
    ),
    layer_2 = KernelAstr(),Error showing value of type Chain{@NamedTuple{layer_1::SkipConnection{String, Chain{@NamedTuple{layer_1::Conv{3, true, 6, typeof(tanh_fast), typeof(glorot_uniform), typeof(zeros32)}, layer_2::Conv{3, true, 6, typeof(tanh_fast), typeof(glorot_uniform), typeof(zeros32)}, layer_3::Conv{3, true, 6, typeof(tanh_fast), typeof(glorot_uniform), typeof(zeros32)}}, Nothing}, typeof(connection_before_kernelA)}, layer_2::KernelAstr}, Nothing}:
ERROR: MethodError: no method matching parameterlength(::Int64)

Closest candidates are:
  parameterlength(::Bilinear{use_bias}) where use_bias
   @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/basic.jl:395
  parameterlength(::Conv{N, use_bias}) where {N, use_bias}
   @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/conv.jl:115
  parameterlength(::InstanceNorm)
   @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/normalize.jl:346
  ...

Stacktrace:
  [1] MappingRF
    @ ./reduce.jl:100 [inlined]
  [2] afoldl(::Base.MappingRF{typeof(LuxCore.parameterlength), Base.BottomRF{typeof(Base.add_sum)}}, ::Base._InitialValue, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::Int64)
    @ Base ./operators.jl:545
  [3] _foldl_impl(op::Base.MappingRF{typeof(LuxCore.parameterlength), Base.BottomRF{typeof(Base.add_sum)}}, init::Base._InitialValue, itr::@NamedTuple{paramsA::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Nx::Int64})
    @ Base ./reduce.jl:68
  [4] foldl_impl(op::Base.MappingRF{typeof(LuxCore.parameterlength), Base.BottomRF{typeof(Base.add_sum)}}, nt::Base._InitialValue, itr::@NamedTuple{paramsA::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Nx::Int64})
    @ Base ./reduce.jl:48
  [5] mapfoldl_impl(f::typeof(LuxCore.parameterlength), op::typeof(Base.add_sum), nt::Base._InitialValue, itr::@NamedTuple{paramsA::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Nx::Int64})
    @ Base ./reduce.jl:44
  [6] mapfoldl(f::Function, op::Function, itr::@NamedTuple{paramsA::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Nx::Int64}; init::Base._InitialValue)
    @ Base ./reduce.jl:175
  [7] mapfoldl
    @ Base ./reduce.jl:175 [inlined]
  [8] mapreduce
    @ Base ./reduce.jl:307 [inlined]
  [9] sum(f::Function, a::@NamedTuple{paramsA::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Nx::Int64})
    @ Base ./reduce.jl:535
 [10] parameterlength(nt::@NamedTuple{paramsA::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Nx::Int64})
    @ LuxCore ~/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:82
 [11] parameterlength(l::KernelAstr)
    @ LuxCore ~/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:79
 [12] _layer_show(io::IOContext{Base.TTY}, layer::KernelAstr, indent::Int64, name::Symbol)
    @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/display.jl:91
 [13] _big_show(io::IOContext{Base.TTY}, obj::KernelAstr, indent::Int64, name::Symbol)
    @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/display.jl:17
 [14] _big_show(io::IOContext{Base.TTY}, obj::Chain{@NamedTuple{layer_1::SkipConnection{String, Chain{@NamedTuple{…}, Nothing}, typeof(connection_before_kernelA)}, layer_2::KernelAstr}, Nothing}, indent::Int64, name::Nothing)
    @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/display.jl:22
 [15] _big_show(io::IOContext{Base.TTY}, obj::Chain{@NamedTuple{layer_1::SkipConnection{String, Chain{@NamedTuple{…}, Nothing}, typeof(connection_before_kernelA)}, layer_2::KernelAstr}, Nothing}, indent::Int64, name::Nothing)
    @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/display.jl:12 [inlined]
 [16] show(io::IOContext{Base.TTY}, ::MIME{Symbol("text/plain")}, x::Chain{@NamedTuple{layer_1::SkipConnection{String, Chain{@NamedTuple{…}, Nothing}, typeof(connection_before_kernelA)}, layer_2::KernelAstr}, Nothing})
    @ Lux ~/.julia/packages/Lux/1VJhT/src/layers/display.jl:3
 [17] (::REPL.var"#55#56"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:273
 [18] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:569
 [19] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:259
 [20] display(d::REPL.REPLDisplay, x::Any)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:278
 [21] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:340
 [22] #invokelatest#2
    @ Base ./essentials.jl:887 [inlined]
 [23] invokelatest
    @ Base ./essentials.jl:884 [inlined]
 [24] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:315
 [25] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:284
 [26] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:569
 [27] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:282
 [28] (::REPL.var"#do_respond#80"{Bool, Bool, REPL.var"#93#103"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:911
 [29] (::REPL.var"#98#108"{Regex, Regex, Int64, Int64, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt})(::REPL.LineEdit.MIState, ::Any, ::Vararg{Any})
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1248
 [30] #invokelatest#2
    @ Base ./essentials.jl:887 [inlined]
 [31] invokelatest
    @ Base ./essentials.jl:884 [inlined]
 [32] (::REPL.LineEdit.var"#27#28"{REPL.var"#98#108"{Regex, Regex, Int64, Int64, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt}, String})(s::Any, p::Any)
    @ REPL.LineEdit ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/LineEdit.jl:1612
 [33] prompt!(term::REPL.Terminals.TextTerminal, prompt::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/LineEdit.jl:2749
 [34] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/LineEdit.jl:2651
 [35] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1312
 [36] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/REPL/src/REPL.jl:386
Some type information was truncated. Use `show(err)` to see complete types.

@jakubMitura14
Author

jakubMitura14 commented Feb 10, 2024

Thanks! Those problems also occur on my device.

Could you check on your device? In my case rounding works, and round(x+0.5) also works, which is basically equivalent to ceil.
So could you change the line

Aout[(x-1)*4+(y-1)*2+z]=(A[x,y,z]%1)

into

Aout[(x-1)*4+(y-1)*2+z]=(round(A[x,y,z]))

or

Aout[(x-1)*4+(y-1)*2+z]=(round(A[x,y,z]+0.5))
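
For anyone else hitting this, a minimal sketch of that workaround as a helper (hypothetical name; note that round uses round-half-to-even, so round(x + 0.5f0) differs from ceil at some exact integers, e.g. it maps 3.0f0 to 4.0f0 while ceil(3.0f0) == 3.0f0):

ceil_workaround(x::Float32) = round(x + 0.5f0)  # backprop-friendly substitute for ceil in this kernel

# in the kernel, instead of ceil(A[x,y,z]):
# Aout[(x-1)*4+(y-1)*2+z] = ceil_workaround(A[x,y,z])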

@wsmoses
Member

wsmoses commented Feb 10, 2024

Can you paste the error messages from what arises on your system?

@jakubMitura14
Author

Of course

ERROR: InvalidIRError: compiling MethodInstance for testKernDeff(::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceMatrix{…}, ::CuDeviceMatrix{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
  [1] #ceil
    @ ~/.julia/packages/CUDA/35NC6/src/device/intrinsics/math.jl:275
  [2] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:36
  [3] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [4] diffejulia_testKern_7487_inner2wrap
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [5] macro expansion
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:5285
  [6] enzyme_call
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:4963
  [7] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:4905
  [8] autodiff_deferred
    @ ~/.julia/packages/Enzyme/x7HCA/src/Enzyme.jl:366
  [9] autodiff_deferred
    @ ~/.julia/packages/Enzyme/x7HCA/src/Enzyme.jl:436
 [10] testKernDeff
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:44
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
 [1] #ceil
   @ ~/.julia/packages/CUDA/35NC6/src/device/intrinsics/math.jl:275
 [2] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:36
 [3] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
 [4] diffejulia_testKern_7487_inner2wrap
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/validation.jl:149
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:415 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:414 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:89
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:83 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:129
  [8] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:106
  [9] compile
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:98 [inlined]
 [10] #1037
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/compilation.jl:104 [inlined]
 [11] JuliaContext(f::CUDA.var"#1037#1040"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:47
 [12] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/compiler/compilation.jl:103
 [13] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:125
 [14] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:103
 [15] macro expansion
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:318 [inlined]
 [16] macro expansion
    @ ./lock.jl:267 [inlined]
 [17] cufunction(f::typeof(testKernDeff), tt::Type{Tuple{…}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:313
 [18] cufunction
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:310 [inlined]
 [19] macro expansion
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:104 [inlined]
 [20] (::var"#call_test_kernel1_pullback#5"{…})(dAout::CuArray{…})
    @ Main /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:69
 [21] ZBack
    @ ~/.julia/packages/Zygote/jxHJc/src/compiler/chainrules.jl:211 [inlined]
 [22] KernelAstr
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:102 [inlined]
 [23] apply
    @ ~/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:115 [inlined]
 [24] macro expansion
    @ ~/.julia/packages/Lux/UQeEs/src/layers/containers.jl:0 [inlined]
 [25] applychain
    @ ~/.julia/packages/Lux/UQeEs/src/layers/containers.jl:480 [inlined]
 [26] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Tuple{CuArray{…}, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [27] Chain
    @ ~/.julia/packages/Lux/UQeEs/src/layers/containers.jl:478 [inlined]
 [28] apply
    @ ~/.julia/packages/LuxCore/aumFq/src/LuxCore.jl:115 [inlined]
 [29] loss_function
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:136 [inlined]
 [30] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Tuple{Float32, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [31] #10
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:146 [inlined]
 [32] (::Zygote.Pullback{Tuple{var"#10#11"{…}, @NamedTuple{…}}, Any})(Δ::Tuple{Float32, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [33] (::Zygote.var"#75#76"{Zygote.Pullback{Tuple{var"#10#11"{…}, @NamedTuple{…}}, Any}})(Δ::Tuple{Float32, Nothing})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface.jl:91
 [34] main(ps::@NamedTuple{…}, st::@NamedTuple{…}, opt::Optimisers.Adam, opt_st::@NamedTuple{…}, vjp::ADTypes.AutoZygote, data::CuArray{…}, model::Chain{…}, epochs::Int64)
    @ Main /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:147
 [35] top-level scope
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:155
Some type information was truncated. Use `show(err)` to see complete types.

@wsmoses
Member

wsmoses commented Feb 10, 2024

And what version of Enzyme.jl are you on (since you need main, which commit)? And can you also show the output of st?

Specifically, the thing to check for is being on Enzyme_jll 0.0.99.
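
For reference, a quick way to check which Enzyme_jll is actually resolved (a small sketch; run it in the same environment):

using Pkg
Pkg.status("Enzyme_jll")   # the ceil fix needs Enzyme_jll v0.0.99 or newer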

@jakubMitura14
Author

I reloaded it by running pkg> add url="...":

(@v1.10) pkg> status Enzyme
Status `~/.julia/environments/v1.10/Project.toml`
  [7da242da] Enzyme v0.11.15 `https://github.com/EnzymeAD/Enzyme.jl.git#main`

@wsmoses
Member

wsmoses commented Feb 10, 2024 via email

@jakubMitura14
Author

With the code below

using Pkg
using ChainRulesCore,Zygote,CUDA,Enzyme
# using CUDAKernels
using KernelAbstractions
# using KernelGradients
using Zygote, Lux,LuxCUDA
using Lux, Random
import NNlib, Optimisers, Plots, Random, Statistics, Zygote
using FillArrays
using LinearAlgebra
using Images,ImageFiltering
# Pkg.add(url="https://github.com/EnzymeAD/Enzyme.jl.git")

#### test data
Nx, Ny, Nz = 8, 8, 8
threads = (2,2,2)
blocks = (1, 1, 1)

#### main dummy kernel
function testKern(prim_A,A, p, Aout,Nx)
    # adding one because of padding
    x = (threadIdx().x + ((blockIdx().x - 1) * CUDA.blockDim_x())) 
    y = (threadIdx().y + ((blockIdx().y - 1) * CUDA.blockDim_y())) 
    z = (threadIdx().z + ((blockIdx().z - 1) * CUDA.blockDim_z())) 

    Aout[(x-1)*4+(y-1)*2+z]=(ceil(A[x,y,z]))

    return nothing
end

function testKernDeff( prim_A,dprim_A,A, dA, p
    , dp, Aout
    , dAout,Nx)
    Enzyme.autodiff_deferred(Reverse,testKern, Const, Duplicated(prim_A, dprim_A),Duplicated(A, dA), Duplicated(p, dp), Duplicated(Aout, dAout),Const(Nx) )
    return nothing
end


function calltestKern(prim_A,A, p,Nx)
    Aout = CUDA.zeros(Float32,8) 
    @cuda threads = threads blocks = blocks testKern(prim_A, A, p,  Aout,Nx)
    return Aout
end



# rrule for ChainRules.
function ChainRulesCore.rrule(::typeof(calltestKern),prim_A, A, p,Nx)
    

    Aout = calltestKern(prim_A,A, p,Nx)
    function call_test_kernel1_pullback(dAout)
        threads = (2, 2,2)
        blocks = (1, 1, 1)

        dp = CUDA.ones(size(p))
        dprim_A = CUDA.ones(size(prim_A))
        dA = CUDA.ones(size(A))
        @cuda threads = threads blocks = blocks testKernDeff(prim_A,dprim_A, A, dA, p, dp, Aout, CuArray(collect(dAout)),Nx)

        f̄ = NoTangent()
        x̄ = dA
        ȳ = dp
        
        return dprim_A,f̄, x̄, ȳ,NoTangent()
    end   
    return Aout, call_test_kernel1_pullback

end

using ChainRulesCore

# Define the inputs
prim_A = CUDA.zeros(Float32, Nx, Ny, Nz)
A = CUDA.zeros(Float32, Nx, Ny, Nz)
p = CUDA.zeros(Float32, Nx, Ny, Nz)

# Compute the Jacobian
jacobian_result = Zygote.jacobian(calltestKern, prim_A, A, p, Nx)


the error is:

ERROR: InvalidIRError: compiling MethodInstance for testKernDeff(::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
 [1] #ceil
   @ ~/.julia/packages/CUDA/35NC6/src/device/intrinsics/math.jl:275
 [2] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:36
 [3] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
 [4] diffejulia_testKern_6744_inner2wrap
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
  [1] #ceil
    @ ~/.julia/packages/CUDA/35NC6/src/device/intrinsics/math.jl:275
  [2] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:36
  [3] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [4] diffejulia_testKern_6744_inner2wrap
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [5] macro expansion
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:5285
  [6] enzyme_call
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:4963
  [7] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:4905
  [8] autodiff_deferred
    @ ~/.julia/packages/Enzyme/x7HCA/src/Enzyme.jl:366
  [9] autodiff_deferred
    @ ~/.julia/packages/Enzyme/x7HCA/src/Enzyme.jl:436
 [10] testKernDeff
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:44
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/validation.jl:149
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:415 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:414 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:89
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:83 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:129
  [8] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:106
  [9] compile
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:98 [inlined]
 [10] #1037
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/compilation.jl:104 [inlined]
 [11] JuliaContext(f::CUDA.var"#1037#1040"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:47
 [12] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/compiler/compilation.jl:103
 [13] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:125
 [14] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:103
 [15] macro expansion
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:318 [inlined]
 [16] macro expansion
    @ ./lock.jl:267 [inlined]
 [17] cufunction(f::typeof(testKernDeff), tt::Type{Tuple{…}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:313
 [18] cufunction
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:310 [inlined]
 [19] macro expansion
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:104 [inlined]
 [20] (::var"#call_test_kernel1_pullback#5"{…})(dAout::CuArray{…})
    @ Main /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:69
 [21] ZBack
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/chainrules.jl:211 [inlined]
 [22] #291
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/lib/lib.jl:206 [inlined]
 [23] (::Zygote.var"#2169#back#293"{Zygote.var"#291#292"{…}})(Δ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:72
 [24] call_composed
    @ ./operators.jl:1045 [inlined]
 [25] (::Zygote.Pullback{Tuple{…}, Any})(Δ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [26] call_composed
    @ ./operators.jl:1044 [inlined]
 [27] #_#103
    @ ./operators.jl:1041 [inlined]
 [28] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [29] #291
    @ ~/.julia/packages/Zygote/jxHJc/src/lib/lib.jl:206 [inlined]
 [30] #2169#back
    @ ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:72 [inlined]
 [31] ComposedFunction
    @ ./operators.jl:1041 [inlined]
 [32] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface2.jl:0
 [33] (::Zygote.var"#75#76"{Zygote.Pullback{Tuple{…}, Tuple{…}}})(Δ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/compiler/interface.jl:91
 [34] withjacobian(::Function, ::CuArray{…}, ::CuArray{…}, ::Vararg{…})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/lib/grad.jl:150
 [35] jacobian(::Function, ::CuArray{…}, ::CuArray{…}, ::Vararg{…})
    @ Zygote ~/.julia/packages/Zygote/jxHJc/src/lib/grad.jl:128
 [36] top-level scope
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:89
Some type information was truncated. Use `show(err)` to see complete types.

@wsmoses
Member

wsmoses commented Feb 11, 2024

@jakubMitura14 not quite, I was thinking something more like this (which basically is just the Enzyme autodiff call):

using CUDA, Enzyme, Random
Enzyme.API.printall!(true)
Nx, Ny, Nz = 8, 8, 8
threads = (2,2,2)
blocks = (1, 1, 1)

#### main dummy kernel
function testKern(prim_A,A, p, Aout,Nx)
    # adding one because of padding
    x = (threadIdx().x + ((blockIdx().x - 1) * CUDA.blockDim_x())) 
    y = (threadIdx().y + ((blockIdx().y - 1) * CUDA.blockDim_y())) 
    z = (threadIdx().z + ((blockIdx().z - 1) * CUDA.blockDim_z())) 

    Aout[(x-1)*4+(y-1)*2+z]=(ceil(A[x,y,z]))

    return nothing
end

function testKernDeff( prim_A,dprim_A,A, dA, p
    , dp, Aout
    , dAout,Nx)
    Enzyme.autodiff_deferred(Reverse,testKern, Const, Duplicated(prim_A, dprim_A),Duplicated(A, dA), Duplicated(p, dp), Duplicated(Aout, dAout),Const(Nx) )
    return nothing
end


function calltestKern(prim_A,A, p,Nx)
    Aout = CUDA.zeros(Float32,8) 
    @cuda threads = threads blocks = blocks testKern(prim_A, A, p,  Aout,Nx)
    return Aout
end



prim_A = CUDA.zeros(Float32, Nx, Ny, Nz)
dprim_A = CUDA.zeros(Float32, Nx, Ny, Nz)
A = CUDA.zeros(Float32, Nx, Ny, Nz)
dA = CUDA.zeros(Float32, Nx, Ny, Nz)

p = CUDA.zeros(Float32, Nx, Ny, Nz)
dp = CUDA.zeros(Float32, Nx, Ny, Nz)
Aout = CUDA.zeros(Float32,8) 
dAout = CUDA.zeros(Float32,8) 

@cuda threads = threads blocks = blocks testKernDeff(prim_A,dprim_A, A, dA, p, dp, Aout, dAout,Nx)

@jakubMitura14
Author

Thanks for the example! Executing your code gave:

after simplification :
; Function Attrs: mustprogress willreturn
define void @preprocess_julia_testKern_4147_inner2({ i8 addrspace(1)*, i64, [3 x i64], i64 } %0, { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, { i8 addrspace(1)*, i64, [1 x i64], i64 } %3, i64 signext "enzyme_inactive" %4) local_unnamed_addr #10 !dbg !249 {
entry:
  %.fca.2.0.extract13 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 0, !dbg !250
  %.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 1, !dbg !250
  %.fca.3.extract17 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 3, !dbg !250
  %5 = call {}*** @julia.get_pgcstack() #11
  %6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #11, !dbg !251, !range !25
  %7 = add nuw nsw i32 %6, 1, !dbg !258
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #11, !dbg !259, !range !36
  %9 = zext i32 %8 to i64, !dbg !264
  %10 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #11, !dbg !266, !range !45
  %11 = zext i32 %10 to i64, !dbg !270
  %12 = mul nuw nsw i64 %9, %11, !dbg !272
  %13 = zext i32 %7 to i64, !dbg !274
  %14 = add nuw nsw i64 %12, %13, !dbg !276
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y() #11, !dbg !278, !range !25
  %16 = add nuw nsw i32 %15, 1, !dbg !284
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() #11, !dbg !285, !range !70
  %18 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y() #11, !dbg !290, !range !45
  %narrow = mul nuw nsw i32 %17, %18, !dbg !294
  %narrow37 = add nuw nsw i32 %16, %narrow, !dbg !296
  %19 = zext i32 %narrow37 to i64, !dbg !296
  %20 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z() #11, !dbg !298, !range !87
  %21 = add nuw nsw i32 %20, 1, !dbg !304
  %22 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z() #11, !dbg !305, !range !70
  %23 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z() #11, !dbg !310, !range !100
  %narrow38 = mul nuw nsw i32 %22, %23, !dbg !314
  %narrow39 = add nuw nsw i32 %21, %narrow38, !dbg !316
  %24 = zext i32 %narrow39 to i64, !dbg !316
  %25 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract13, i64 noundef 0) #11, !dbg !318
  %26 = call i64 @llvm.smax.i64(i64 %.fca.2.1.extract15, i64 noundef 0) #11, !dbg !318
  %27 = add nsw i64 %19, -1, !dbg !329
  %28 = add nsw i64 %24, -1, !dbg !334
  %29 = mul i64 %26, %28, !dbg !337
  %reass.add = add i64 %29, %27
  %reass.mul = mul i64 %reass.add, %25
  %30 = add i64 %reass.mul, %14, !dbg !338
  %31 = call i64 @llvm.smax.i64(i64 %.fca.3.extract17, i64 noundef 0) #11, !dbg !339
  %32 = add i64 %30, -1, !dbg !349
  %.not = icmp ult i64 %32, %31, !dbg !351
  br i1 %.not, label %L95.i, label %L96.i, !dbg !352

L95.i:                                            ; preds = %entry
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %3, 2, 0, !dbg !250
  %.fca.0.extract9 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 0, !dbg !250
  %33 = bitcast i8 addrspace(1)* %.fca.0.extract9 to float addrspace(1)*, !dbg !353
  %34 = getelementptr inbounds float, float addrspace(1)* %33, i64 %32, !dbg !353
  %35 = load float, float addrspace(1)* %34, align 4, !dbg !353, !tbaa !174
  %36 = call float @__nv_ceilf(float %35) #11, !dbg !359
  %37 = shl nuw nsw i64 %14, 2, !dbg !360
  %38 = shl nuw nsw i64 %27, 1, !dbg !360
  %39 = add nsw i64 %24, -4, !dbg !360
  %40 = add nsw i64 %39, %37, !dbg !361
  %41 = add nuw nsw i64 %40, %38, !dbg !361
  %42 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #11, !dbg !363
  %43 = add nsw i64 %41, -1, !dbg !376
  %.not40 = icmp ult i64 %43, %42, !dbg !378
  br i1 %.not40, label %julia_testKern_4147_inner.exit, label %L125.i, !dbg !379

L96.i:                                            ; preds = %entry
  call fastcc void @julia__throw_boundserror_4174() #12, !dbg !352
  unreachable, !dbg !352

L125.i:                                           ; preds = %L95.i
  call fastcc void @julia__throw_boundserror_4177() #12, !dbg !379
  unreachable, !dbg !379

julia_testKern_4147_inner.exit:                   ; preds = %L95.i
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %3, 0, !dbg !250
  %44 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*, !dbg !380
  %45 = getelementptr inbounds float, float addrspace(1)* %44, i64 %43, !dbg !380
  store float %36, float addrspace(1)* %45, align 4, !dbg !380, !tbaa !174, !noalias !386
  ret void, !dbg !250
}

; Function Attrs: mustprogress willreturn
define internal void @diffejulia_testKern_4147_inner2({ i8 addrspace(1)*, i64, [3 x i64], i64 } %0, { i8 addrspace(1)*, i64, [3 x i64], i64 } %"'", { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, { i8 addrspace(1)*, i64, [3 x i64], i64 } %"'1", { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, { i8 addrspace(1)*, i64, [3 x i64], i64 } %"'2", { i8 addrspace(1)*, i64, [1 x i64], i64 } %3, { i8 addrspace(1)*, i64, [1 x i64], i64 } %"'3", i64 signext "enzyme_inactive" %4) local_unnamed_addr #10 !dbg !389 {
entry:
  %"'de" = alloca float, align 4
  %5 = getelementptr float, float* %"'de", i64 0
  store float 0.000000e+00, float* %5, align 4
  %"'de4" = alloca float, align 4
  %6 = getelementptr float, float* %"'de4", i64 0
  store float 0.000000e+00, float* %6, align 4
  %.fca.2.0.extract13 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 0, !dbg !390
  %.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 1, !dbg !390
  %.fca.3.extract17 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 3, !dbg !390
  %7 = call {}*** @julia.get_pgcstack() #11
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #11, !dbg !391, !range !25
  %9 = add nuw nsw i32 %8, 1, !dbg !398
  %10 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #11, !dbg !399, !range !36
  %11 = zext i32 %10 to i64, !dbg !404
  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #11, !dbg !406, !range !45
  %13 = zext i32 %12 to i64, !dbg !410
  %14 = mul nuw nsw i64 %11, %13, !dbg !412
  %15 = zext i32 %9 to i64, !dbg !414
  %16 = add nuw nsw i64 %14, %15, !dbg !416
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y() #11, !dbg !418, !range !25
  %18 = add nuw nsw i32 %17, 1, !dbg !424
  %19 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() #11, !dbg !425, !range !70
  %20 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y() #11, !dbg !430, !range !45
  %narrow = mul nuw nsw i32 %19, %20, !dbg !434
  %narrow37 = add nuw nsw i32 %18, %narrow, !dbg !436
  %21 = zext i32 %narrow37 to i64, !dbg !436
  %22 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z() #11, !dbg !438, !range !87
  %23 = add nuw nsw i32 %22, 1, !dbg !444
  %24 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z() #11, !dbg !445, !range !70
  %25 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z() #11, !dbg !450, !range !100
  %narrow38 = mul nuw nsw i32 %24, %25, !dbg !454
  %narrow39 = add nuw nsw i32 %23, %narrow38, !dbg !456
  %26 = zext i32 %narrow39 to i64, !dbg !456
  %27 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract13, i64 noundef 0) #11, !dbg !458
  %28 = call i64 @llvm.smax.i64(i64 %.fca.2.1.extract15, i64 noundef 0) #11, !dbg !458
  %29 = add nsw i64 %21, -1, !dbg !469
  %30 = add nsw i64 %26, -1, !dbg !474
  %31 = mul i64 %28, %30, !dbg !477
  %reass.add = add i64 %31, %29
  %reass.mul = mul i64 %reass.add, %27
  %32 = add i64 %reass.mul, %16, !dbg !478
  %33 = call i64 @llvm.smax.i64(i64 %.fca.3.extract17, i64 noundef 0) #11, !dbg !479
  %34 = add i64 %32, -1, !dbg !489
  %.not = icmp ult i64 %34, %33, !dbg !491
  br i1 %.not, label %L95.i, label %L96.i, !dbg !492

L95.i:                                            ; preds = %entry
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %3, 2, 0, !dbg !390
  %.fca.0.extract9 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 0, !dbg !390
  %".fca.0.extract9'ipev" = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %"'1", 0, !dbg !493
  %"'ipc" = bitcast i8 addrspace(1)* %".fca.0.extract9'ipev" to float addrspace(1)*, !dbg !493
  %35 = bitcast i8 addrspace(1)* %.fca.0.extract9 to float addrspace(1)*, !dbg !493
  %"'ipg" = getelementptr inbounds float, float addrspace(1)* %"'ipc", i64 %34, !dbg !493
  %36 = getelementptr inbounds float, float addrspace(1)* %35, i64 %34, !dbg !493
  %37 = load float, float addrspace(1)* %36, align 4, !dbg !493, !tbaa !174, !alias.scope !499, !noalias !502
  %38 = call float @__nv_ceilf(float %37) #11, !dbg !504
  %39 = shl nuw nsw i64 %16, 2, !dbg !505
  %40 = shl nuw nsw i64 %29, 1, !dbg !505
  %41 = add nsw i64 %26, -4, !dbg !505
  %42 = add nsw i64 %41, %39, !dbg !506
  %43 = add nuw nsw i64 %42, %40, !dbg !506
  %44 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #11, !dbg !508
  %45 = add nsw i64 %43, -1, !dbg !521
  %.not40 = icmp ult i64 %45, %44, !dbg !523
  br i1 %.not40, label %julia_testKern_4147_inner.exit, label %L125.i, !dbg !524

L96.i:                                            ; preds = %entry
  call fastcc void @julia__throw_boundserror_4174() #12, !dbg !492
  unreachable

L125.i:                                           ; preds = %L95.i
  call fastcc void @julia__throw_boundserror_4177() #12, !dbg !524
  unreachable

julia_testKern_4147_inner.exit:                   ; preds = %L95.i
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %3, 0, !dbg !390
  %".fca.0.extract'ipev" = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %"'3", 0, !dbg !525
  %"'ipc5" = bitcast i8 addrspace(1)* %".fca.0.extract'ipev" to float addrspace(1)*, !dbg !525
  %46 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*, !dbg !525
  %"'ipg6" = getelementptr inbounds float, float addrspace(1)* %"'ipc5", i64 %45, !dbg !525
  %47 = getelementptr inbounds float, float addrspace(1)* %46, i64 %45, !dbg !525
  store float %38, float addrspace(1)* %47, align 4, !dbg !525, !tbaa !174, !alias.scope !531, !noalias !534
  br label %invertjulia_testKern_4147_inner.exit, !dbg !390

invertentry:                                      ; preds = %invertL95.i
  ret void

invertL95.i:                                      ; preds = %invertjulia_testKern_4147_inner.exit
  %48 = load float, float* %"'de", align 4, !dbg !504
  store float 0.000000e+00, float* %"'de", align 4, !dbg !504
  call void inttoptr (i64 139900552914512 to void (i8*)*)(i8* getelementptr inbounds ([14666 x i8], [14666 x i8]* @2, i32 0, i32 0)) #13, !dbg !504
  %49 = load float, float* %"'de4", align 4, !dbg !493
  store float 0.000000e+00, float* %"'de4", align 4, !dbg !493
  %50 = atomicrmw fadd float addrspace(1)* %"'ipg", float %49 monotonic, align 4, !dbg !493
  br label %invertentry

invertjulia_testKern_4147_inner.exit:             ; preds = %julia_testKern_4147_inner.exit
  %51 = load float, float addrspace(1)* %"'ipg6", align 4, !dbg !525, !tbaa !174, !alias.scope !536, !noalias !537
  store float 0.000000e+00, float addrspace(1)* %"'ipg6", align 4, !dbg !525, !tbaa !174, !alias.scope !536, !noalias !537
  %52 = load float, float* %"'de", align 4, !dbg !525
  %53 = fadd fast float %52, %51, !dbg !525
  store float %53, float* %"'de", align 4, !dbg !525
  br label %invertL95.i
}
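For orientation, the construct the validator rejects appears to be the `call void inttoptr (i64 139900552914512 ...)` inside the `invertL95.i` block above: it carries the same debug location as the `@__nv_ceilf` call, so Enzyme has evidently lowered its handling of `ceil` in the reverse pass into a call through a raw host pointer, which cannot be compiled to PTX. A minimal sketch of one possible workaround (not verified; `my_ceil` is a hypothetical helper and it assumes the values fit in `Int32`) is to avoid the `__nv_ceilf` intrinsic altogether:

```julia
# Hypothetical device-side replacement for `ceil(::Float32)`; a sketch only.
# It relies on `unsafe_trunc`, which lowers to a plain fptosi instruction,
# so no libdevice intrinsic (and no Enzyme fallback for it) is involved.
@inline function my_ceil(x::Float32)
    t = Float32(unsafe_trunc(Int32, x))  # truncate toward zero
    return x > t ? t + 1f0 : t           # round up only if a fractional part remains
end

# In the kernel body the problematic line would then read, e.g.:
# Aout[(x-1)*4+(y-1)*2+z] = my_ceil(A[x,y,z])
```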

ERROR: InvalidIRError: compiling MethodInstance for testKernDeff(::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
  [1] #ceil
    @ ~/.julia/packages/CUDA/35NC6/src/device/intrinsics/math.jl:275
  [2] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:31
  [3] testKern
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [4] diffejulia_testKern_4147_inner2wrap
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
  [5] macro expansion
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:5285
  [6] enzyme_call
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:4963
  [7] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/x7HCA/src/compiler.jl:4905
  [8] autodiff_deferred
    @ ~/.julia/packages/Enzyme/x7HCA/src/Enzyme.jl:366
  [9] autodiff_deferred
    @ ~/.julia/packages/Enzyme/x7HCA/src/Enzyme.jl:436
 [10] testKernDeff
    @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:39
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
 [1] #ceil
   @ ~/.julia/packages/CUDA/35NC6/src/device/intrinsics/math.jl:275
 [2] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:31
 [3] testKern
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
 [4] diffejulia_testKern_4147_inner2wrap
   @ /media/jm/hddData/projects/superVoxelJuliaCode/superVoxelJuliaCode/src/old/chainRulesLuxExample/kernelAe copy.jl:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/validation.jl:149
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:415 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:414 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:89
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:83 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:129
  [8] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:106
  [9] compile
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:98 [inlined]
 [10] #1037
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/compilation.jl:104 [inlined]
 [11] JuliaContext(f::CUDA.var"#1037#1040"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:47
 [12] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/compiler/compilation.jl:103
 [13] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:125
 [14] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:103
 [15] macro expansion
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:318 [inlined]
 [16] macro expansion
    @ ./lock.jl:267 [inlined]
 [17] cufunction(f::typeof(testKernDeff), tt::Type{Tuple{…}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:313
 [18] cufunction(f::typeof(testKernDeff), tt::Type{Tuple{…}})
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:310
 [19] top-level scope
    @ ~/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:104
Some type information was truncated. Use `show(err)` to see complete types.
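Since `ceil` is piecewise constant, its derivative is zero almost everywhere, so another possible workaround is to tell Enzyme not to differentiate through it at all. Below is a minimal sketch using the documented `EnzymeRules.inactive` mechanism; whether the rule is honoured for kernels compiled through `autodiff_deferred` on the GPU is an assumption and has not been verified here:

```julia
using Enzyme
import Enzyme.EnzymeRules

# Sketch: mark `ceil` as inactive so Enzyme propagates a zero derivative
# through it instead of emitting its fallback handling for __nv_ceilf.
# (Assumption: activity rules are respected inside autodiff_deferred on GPU.)
EnzymeRules.inactive(::typeof(ceil), args...) = nothing
```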

@wsmoses
Member

wsmoses commented Feb 12, 2024

Should be fixed by #1281; please reopen if not.

@wsmoses wsmoses closed this as completed Feb 12, 2024