Switch to DifferentiationInterface #29
Conversation
Can you explain why we need to remove the old interfaces, instead of deprecating them? Couldn't they just coexist for a while until users adapt their code?
The goal of the PR is that from a user perspective, nothing should break (but again, feel free to double check). There's indeed another option: only change the behavior when an ADTypes object is passed, and leave the behavior with symbols unchanged. If you want I can do that too.
Let me review the code in detail and then I will form an opinion about this. It may not be this week as I am very busy with a deadline, but I will do it ASAP.
Thanks :)
@devmotion I've made ADTypes and DifferentiationInterface compatible with 1.6, hopefully that's a significant step to facilitate Turing adoption!
@devmotion can you approve the workflow to see if the tests pass in CI?
Can someone help me figure out what is going wrong with the Tracker tests I marked as broken?
Is this just type inference? EDIT: Just saw the description of the issue above.
I can reactivate these tests so that they fail
Done, if someone approves the workflow again
Okay @tpapp I figured out the Tracker thing.
Okay, fixed it by:
If you tell me the PR is acceptable in this shape, I'll add proper documentation and ping one last time.
Probably you ran into https://github.com/FluxML/Flux.jl/issues/497? See LogDensityProblemsAD.jl/ext/LogDensityProblemsADTrackerExt.jl, lines 40 to 43 at 449e566.
Yeah it's that one, I updated the PR description to mention it. My answer would be no. The founding principle of DI would be "the AD backend knows better". If I compute a gradient and it returns some specific type, who am I to say that a promotion with Float64 is the right thing to do? Same with the conversion to Vector. Users will not expect this, and while it may be the right thing to do for Turing, it will be wrong in other scenarios. So in my view, we have two options:
Or probably even better in Tracker.
Yeah but that's not happening, is it?
I think the Tracker issue actually points to a regression caused by the switch to DifferentiationInterface that has to be addressed since otherwise I expect this will cause downstream issues. See my comment in test/runtests.jl for more details.
mode=Reverse,
shadow=nothing,
)
if !isnothing(shadow)
So the PR will lead to worse performance in downstream packages that work with pre-allocated/cached shadows?
The idea is that DifferentiationInterface has its own mechanism for preparing a gradient, which is triggered when you supply the constructor with x.
As you can see here, it does the exact same shadow construction:
https://github.com/gdalle/DifferentiationInterface.jl/blob/16d93ef4111e1c196912a3dd53ffb31e04445324/DifferentiationInterface/ext/DifferentiationInterfaceEnzymeExt/forward_onearg.jl#L40-L48
Again, the idea is to have a single shared implementation of all this boilerplate, so that if there is something to improve, it can be improved in DI.
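For context, here is a minimal sketch of that preparation mechanism. It uses DifferentiationInterface's exported functions; the exact argument order of the prepared call has changed across DI versions, so treat this as illustrative rather than as the code in this PR.

using DifferentiationInterface
using ADTypes: AutoForwardDiff
import ForwardDiff  # the backend package must be loaded

f(x) = sum(abs2, x)
backend = AutoForwardDiff()
x = randn(10)

grad = gradient(f, backend, x)           # one-shot call
prep = prepare_gradient(f, backend, x)   # builds configs/tapes/shadows once
grad2 = gradient(f, prep, backend, x)    # reuses the preparation on later calls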
I assumed there exists such an internal caching/preparation - but this will still break use cases, or at least lead to worse performance, in cases where currently a shadow is pre-allocated and passed around to multiple ADgradient calls. So from the perspective of ADgradient, ideally it would still be possible to forward a shadow to the construction in DI.
It seems there are more major issues with the Enzyme backend though: #29 (review)
shadow=nothing,
)
if !isnothing(shadow)
@warn "keyword argument `shadow` is now ignored"
If we put this in a breaking release as suggested above, IMO we should just remove the keyword argument. Alternatively, in a non-breaking release I'd prefer to make it a deprecation warning (possibly one that is forced to be displayed).
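For reference, a hedged sketch of what the "forced deprecation warning" alternative could look like, using Base.depwarn; the surrounding constructor structure is an assumption, not the code in this PR.

if !isnothing(shadow)
    # emit a deprecation warning that is displayed even when deprecations are silenced
    Base.depwarn(
        "the keyword argument `shadow` is ignored; preparation is now handled by DifferentiationInterface",
        :ADgradient;
        force = true,
    )
end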
""" | ||
Gradient AD implementation using Enzyme. | ||
""" |
This can be kept?
It's very uninformative, no one will ever see it, and this file no longer contains the actual gradient implementation
for _ in 1:100
x = randn(3)
for _ = 1:100
x = randexp(3)
Suggested change:
- x = randexp(3)
+ x = randn(3)
for _ in 1:100
x = randn(3)
for _ = 1:100
x = randexp(3)
Suggested change:
- x = randexp(3)
+ x = randn(3)
for _ in 1:100
x = randn(3)
for _ = 1:100
x = randexp(3)
Suggested change:
- x = randexp(3)
+ x = randn(3)
for _ in 1:100
x = randn(3)
for _ = 1:100
x = randexp(3)
Suggested change:
- x = randexp(3)
+ x = randn(3)
Co-authored-by: David Widmann <[email protected]>
The Enzyme rule here should not be moved to DI.
Specifically, since DI does not support multi-argument functions and in particular multiple partial activities (e.g. some arguments differentiated with respect to, others not), this will cause downstream issues for users who previously had working code.
If I'm reading this correctly, at minimum this will require the creation of a closure around logdensity + ℓ. This closure may cause a performance regression, and possibly make code which was previously differentiable no longer differentiable.
∇ℓ::ADgradientDI{B,<:Any,Nothing},
x::AbstractVector{T},
) where {B,T}
y, g = DI.value_and_gradient(Base.Fix1(logdensity, ∇ℓ.ℓ), ∇ℓ.backend, x)
@wsmoses the closure you mention is created here, it's a Base.Fix1. If it isn't fully type-inferrable, would a custom callable struct do the trick?
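For concreteness, a minimal sketch of the callable-struct alternative being asked about; the struct name is made up for illustration, and Base.Fix1 is itself just such a struct with a concretely typed field.

using LogDensityProblems: logdensity

# a callable struct that stores the log-density problem in a concretely typed field
struct LogDensityCallable{L}
    ℓ::L
end

(f::LogDensityCallable)(x) = logdensity(f.ℓ, x)

# would then be passed to DI in place of Base.Fix1(logdensity, ∇ℓ.ℓ), e.g.
# DI.value_and_gradient(LogDensityCallable(∇ℓ.ℓ), ∇ℓ.backend, x)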
BTW this was the PR in which the initial implementation with Fix1 was fixed: #4
Okay, can anyone explain to me the justification for that PR instead of acting like it should be obvious ^^? I'm sorry, really doing my best here
There's no PR description, no underlying issue, no indicative comments, just the change in the code, and I'm struggling to grasp why it was so important
Sorry, I was asleep till now (on Pacific time).
Closures can present several issues for Enzyme's ability to AD something, for the performance of differentiated programs, and even for the performance of programs that are not being differentiated.
I'll write up a few concerns below, in no particular order:
- Type stability. Since you mention this already, I won't dive into it further.
- Boxing. If a variable is modified in a closure, it may end up being boxed by Julia; in essence, Julia forces it to be captured by reference. A box is essentially a type-unstable Any, even if its contents would otherwise be type stable. Furthermore, an unnecessary box adds a level of indirection, which hurts performance (see below).
julia> function h(x,y)
           function g(z)
               x[y] *= 2; y+=1; return z*x[y]
           end
           return g
       end
h (generic function with 1 method)
julia> h([1,2,3],1)
(::var"#g#3"{Vector{Int64}}) (generic function with 1 method)
julia> fieldtypes(typeof(h([1,2,3],1)))
(Vector{Int64}, Core.Box)
julia> fn = h([1,2,3],1)
(::var"#g#3"{Vector{Int64}}) (generic function with 1 method)
julia> fn(1)
2
julia> fn(1)
3
- Indirection / Aliasing: Indirection can cause overhead. For example, consider an array of Ints vs an array of Ref{Int} (notably, both are type stable). The first is much faster than the second.
julia> function sumvals(x)
           res = 0
           @simd for i in 1:length(x)
               res += @inbounds x[i]
           end
           return res
       end

julia> function sumvals_ref(x)
           res = 0
           @simd for i in 1:length(x)
               res += @inbounds x[i][]
           end
           return res
       end
sumvals_ref (generic function with 1 method)
julia> array = collect(1:10000);
julia> rarray = collect(map(Ref, 1:10000));
julia> @btime sumvals(array)
595.877 ns (1 allocation: 16 bytes)
50005000
julia> @btime sumvals_ref(rarray)
3.534 μs (1 allocation: 16 bytes)
50005000
There are many reasons for this. First of all, there is one less load instruction per loop iteration in the first version (the first loop loads the data directly from the array; the second loads a Ref from the array and then the value behind the Ref pointer). Secondly, the data is laid out consecutively. This allows vector instructions to be applied in the first case, whereas it is impossible in the second (vector loads/stores require consecutive data). Thirdly, you get much better cache behavior: in the second case you can get a cache miss on every iteration, whereas in the first case you only miss once per cache line (since a single miss loads a consecutive cache line's worth of elements all at once).
Beyond the extra indirection, closures can also obscure aliasing information. Aliasing information is what lets the compiler prove that two pointers point to different memory, which is critical for optimization. Consider the following.
@noinline function inner(y)
    @show y
end

function twouse(x, y)
    x1 = x * 2
    inner(y)
    x2 = x * 3
    return x1 + x2
end
julia> @code_llvm twouse(2, 3)
; @ REPL[30]:1 within `twouse`
define i64 @julia_twouse_778(i64 signext %0, i64 signext %1) #0 {
top:
; @ REPL[30]:3 within `twouse`
%2 = call i64 @j_inner_780(i64 signext %1)
; @ REPL[30]:5 within `twouse`
; ┌ @ int.jl:87 within `+`
%3 = mul i64 %0, 5
; └
ret i64 %3
}
Here the two uses of x get optimized into one, fantastic. However, if we now have it go through a pointer (via a Ref):
@noinline function inner_ref(y)
    @show y[]
end

function twouse_ref(x, y)
    x1 = x[] * 2
    inner(y)
    x2 = x[] * 3
    return x1 + x2
end
julia> @code_llvm twouse_ref(Ref(2), Ref(3))
; @ REPL[33]:1 within `twouse_ref`
define i64 @julia_twouse_ref_790({}* noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 8 dereferenceable(8) %1) #0 {
top:
; @ REPL[33]:2 within `twouse_ref`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
%2 = bitcast {}* %0 to i64*
%3 = load i64, i64* %2, align 8
; └└
; ┌ @ int.jl:88 within `*`
%4 = shl i64 %3, 1
; └
; @ REPL[33]:3 within `twouse_ref`
%5 = call nonnull {}* @j_inner_792({}* nonnull %1)
; @ REPL[33]:4 within `twouse_ref`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
%6 = load i64, i64* %2, align 8
; └└
; ┌ @ int.jl:88 within `*`
%7 = mul i64 %6, 3
; └
; @ REPL[33]:5 within `twouse_ref`
; ┌ @ int.jl:87 within `+`
%8 = add i64 %7, %4
; └
ret i64 %8
}
You'll note that now the two loads of x[] are there explicitly and not combined into one. This is because the call to inner could possibly write to the same memory of x, making it possible that it changed the value of x inside the function (even though here we know it doesn't). Getting rid of the inner function call indeed does let a single load replace both uses.
julia> function twouse_ref2(x, y)
           x1 = x[] * 2
           x2 = x[] * 3
           return x1 + x2
       end
twouse_ref2 (generic function with 1 method)
julia> @code_llvm twouse_ref2(Ref(2), Ref(3))
; @ REPL[35]:1 within `twouse_ref2`
define i64 @julia_twouse_ref2_793({}* noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 8 dereferenceable(8) %1) #0 {
top:
; @ REPL[35]:2 within `twouse_ref2`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
%2 = bitcast {}* %0 to i64*
%3 = load i64, i64* %2, align 8
; └└
; @ REPL[35]:4 within `twouse_ref2`
; ┌ @ int.jl:87 within `+`
%4 = mul i64 %3, 5
; └
ret i64 %4
}
Introducing a closure means that there is significantly worse aliasing information, since each of the captured variables is behind a pointer and any aliasing information (e.g. that variables x and y don't overlap) is now lost.
This makes the regular code slow, and can also make the generated AD code much slower. As the various Enzyme papers have shown, running optimization before AD has a compounding impact on the generated code. In this case, worse aliasing information may mean that Enzyme has to cache a lot more data, dramatically slowing things down and possibly preventing previously differentiable code from being run at all (if available memory is insufficient).
- Mixed activity. Just like closures can introduce type instability, they can introduce activity instability. In this case the question is whether a variable is differentiable or not. If one stores a constant vector into a struct alongside other duplicated vectors, that creates a mixed-activity error (since there is no derivative vector for it that can be defined consistently -- in the worst case this leads to wrong results if done incorrectly, so instead we error here). Notably, the user can enable runtime activity to work around this, as described in our docs, but this comes with a performance penalty (again, some combination of previously working code no longer working and slowdowns).
- Unnecessary differentiation. Multiple arguments allow you to mark some data as non-differentiable, which feeds an analysis (activity analysis) that determines the minimum number of instructions/values to differentiate. Combining a bunch of variables together (e.g. in a struct/closure), and in particular no longer specifying some of them as non-differentiated, can prevent the analysis from proving that some code doesn't need to be differentiated. If that code isn't presently supported (e.g. due to a type instability in the function being differentiated), you'll now error. A small sketch of these last two points follows below.
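To make the last two points concrete, here is a hedged sketch (toy function and variable names made up for illustration) of how a multi-argument call lets Enzyme know that some data is constant, while a closure capturing the same data cannot express that distinction per argument:

using Enzyme

f(x, data) = sum(abs2, x .- data)

x    = [1.0, 2.0]
dx   = zeros(2)
data = [0.5, 1.5]

# multi-argument call: `data` is explicitly marked non-differentiable,
# so activity analysis never has to differentiate through it
Enzyme.autodiff(Reverse, f, Active, Duplicated(x, dx), Const(data))

# closure version: `data` is now a captured field of `g`, so the
# per-argument Const/Duplicated distinction above can no longer be stated
g = x -> f(x, data)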
I could go into more detail, but hopefully that is a reasonable explanation to start with. Closures were one of the reasons, imo, that AD.jl didn't take off, so long term I think DI.jl would be well served both by not introducing them itself (my understanding is that it does this successfully) and by not forcing users to create the same closures user-side (which would create similar issues).
I think it would be useful to revisit this down the line after DI adds support, for example, for multiple distinct arguments, as well as the ability to enable/disable certain variables from being differentiated, since both of those features are in use here.
Fun fact, to help avoid the issues from closures generated by threading: it's little known, but Enzyme even vendors a closure-free (and differentiable) threading macro: https://github.com/EnzymeAD/Enzyme.jl/blob/75e5311ac0dda08498fcfb0aa54e89930cb3e1c9/test/threads.jl#L113
Thank you for the detailed explanation. DI is unlikely to support either multiple arguments or activity specification anytime soon
@wsmoses the explanation above should end up in the Enzyme manual IMO, it contains a lot of relevant information.
Okay so given the feedback by @devmotion and @wsmoses, it seems we need
If I add both, do you think this PR is mergeable? And more importantly, do you think it is useful? It took a significant effort to make DI 1.6-compatible for this, so I don't want to keep working only to discover unexpected hurdles at each review.
I'm sorry, but at least I didn't anticipate these problems in the initial reviews. I'm not sure how useful it is to switch LogDensityProblemsAD over to DI if so many workarounds/special cases are needed. I'd suggest putting the PR on hold until these issues are fixed upstream - I hope we can remove the
Alright then :) Switching the status to draft for now
I don't think this is something with a fix in Enzyme itself, since Enzyme already supports multiple arguments, including marking some arguments as non-differentiable, without issue. The fix here has to come from DI adding support for multiple arguments, since otherwise a closure will have to be created somewhere.
Sorry for being late to this PR, I had a project with a deadline.
Generally, it is my impression that while DifferentiationInterface may be ready to replace some AD backends, some edge cases need to be worked out for a few of them, in particular ForwardDiff.
IMO a version of this PR would be useful if it just replaced the straightforward cases where equivalent functionality could be achieved without breaking changes, then the rest could be addressed later on.
OTOH, I am not sure that there is a pressing need to replace anything. Yes, out of necessity and for historical reasons this package duplicates a lot of the functionality of an abstract AD metapackage. This was made much easier by the fact that we only care about
_get_chunksize(::Chunk{C}) where {C} = C
_get_chunksize(chunk::Integer) = chunk

_default_chunk(ℓ) = _get_chunksize(dimension(ℓ))
AFAIK
- the only AD backend that uses a chunk size is ForwardDiff,
- it is enforced to be lower than ForwardDiff.DEFAULT_CHUNK_THRESHOLD, which is 12 unless overridden by a loaded preference,
so a chunk size specified here bears a somewhat indirect relation to the actual chunk size used. IMO many users are not aware of this.
The ideal interface would need to mention this to manage user expectations. E.g. using dimension(ℓ) does not do what it seemingly does in most cases.
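A hedged illustration of that caveat (exact values depend on the ForwardDiff version and on any loaded preferences, so treat the comments as indicative):

using ForwardDiff

ForwardDiff.DEFAULT_CHUNK_THRESHOLD  # 12 by default
ForwardDiff.pickchunksize(100)       # a value <= the threshold, not 100
ForwardDiff.Chunk(randn(100))        # automatic chunk selection is capped the same way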
Fix #26
This PR removes nearly all of the code in the package extensions to rely on DifferentiationInterface.jl instead.
Benefits
LogDensityProblemsAD extension so we can support Turing (compintell/Mooncake.jl#95)
Changes (done and planned)
- ADgradientDI as a concrete struct with 3 fields:
  - backend: an ADTypes object
  - ℓ: needs no introduction
  - extras: stores things like tapes and configs in the DifferentiationInterface.jl paradigm
- the ADgradient function constructs an ADgradientDI struct, either from an ADTypes object or from a Symbol (please double-check that I did not forget any translation for Symbols)
- x = randn(3) was given as input to a density supported by
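For readers skimming the description, a hedged sketch of the struct layout described above (field names taken from this description; the actual definition in the PR may differ in details such as type constraints):

struct ADgradientDI{B,L,E}
    backend::B   # an ADTypes object, e.g. AutoForwardDiff()
    ℓ::L         # the log-density problem
    extras::E    # DifferentiationInterface preparation output (tapes, configs, ...); Nothing when unprepared
end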