
Switch to DifferentiationInterface #29

Closed · wants to merge 9 commits

Conversation

gdalle (Contributor) commented May 1, 2024

Fix #26

This PR removes nearly all of the code in the package extensions to rely on DifferentiationInterface.jl instead.

Benefits

Changes (done and planned)

I did my best to make the changes non-breaking, but please double-check.

  • Define ADgradientDI as a concrete struct with 3 fields (see the sketch after this list):
    • backend: an ADTypes object
    • ℓ: the wrapped log-density problem, which needs no introduction
    • extras: stores things like tapes and configs in the DifferentiationInterface.jl paradigm
  • Make the ADgradient function construct an ADgradientDI struct, either from an ADTypes object or from a Symbol (please double-check that I did not forget any translation for Symbols)
  • Keep special cases in 3 extensions (Enzyme, FiniteDifferences, ForwardDiff) whose constructors need specific types
  • Remove irrelevant tests
  • Figure out why the Tracker tests are broken
    • "loss is infinite": because x = randn(3) was given as input to a density supported on $\mathbb{R}_+$
    • failed type inference: Tracker is fundamentally type-unstable; added an ugly hack
  • Add documentation
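For orientation, a minimal Julia sketch of the struct and fallback described in the first item, assuming the field names listed there; the gradient call mirrors the DI.value_and_gradient line quoted later in this review, and the function name is only a stand-in for the LogDensityProblems API method, not the exact PR code.

import DifferentiationInterface as DI
using LogDensityProblems: logdensity

struct ADgradientDI{B,L,E}
    backend::B   # ADTypes backend object selecting the AD package
    ℓ::L         # the wrapped log-density problem
    extras::E    # DifferentiationInterface preparation artifacts (tapes, configs, ...), or nothing
end

# Unprepared case: fall back to a plain DI call, mirroring the diff line quoted further down.
function logdensity_and_gradient(∇ℓ::ADgradientDI{<:Any,<:Any,Nothing}, x::AbstractVector)
    return DI.value_and_gradient(Base.Fix1(logdensity, ∇ℓ.ℓ), ∇ℓ.backend, x)
end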

tpapp (Owner) commented May 1, 2024

Can you explain why we need to remove the old interfaces, instead of deprecating them? Couldn't they just coexist for a while until users adapt their code?

gdalle (Contributor Author) commented May 1, 2024

The goal of the PR is that, from a user perspective, nothing should break (but again, feel free to double-check).
Internally, every gradient computation is now performed based on the ADTypes object, so the Symbols get mapped to ADTypes (instead of the other way around).

There's indeed another option: only change the behavior when an ADTypes object is passed, and leave the behavior with Symbols unchanged. If you want, I can do that too.
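For illustration only, a hypothetical Symbol-to-ADTypes translation of the kind described above (the helper name and the exact set of Symbols are assumptions; the PR's actual table may differ):

using ADTypes

function backend_from_symbol(kind::Symbol)
    kind === :ForwardDiff && return AutoForwardDiff()
    kind === :ReverseDiff && return AutoReverseDiff()
    kind === :Zygote      && return AutoZygote()
    kind === :Tracker     && return AutoTracker()
    throw(ArgumentError("no ADTypes translation for backend $kind"))
end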

tpapp (Owner) commented May 1, 2024

Let me review the code in detail and then I will form an opinion about this. It may not be this week as I am very busy with a deadline, but I will do it ASAP.

gdalle (Contributor Author) commented May 1, 2024

Thanks :)
You can compare it with #30, which is the more conservative option. However, I don't like it as much because

  • it doesn't lead to code deduplication
  • it makes the test file much more complicated, since there are different code paths for Symbols and ADTypes (instead of one falling back on the other, which is what I did in the current PR)

gdalle (Contributor Author) commented May 6, 2024

@devmotion I've made ADTypes and DifferentiationInterface compatible with Julia 1.6; hopefully that's a significant step toward facilitating Turing adoption!

gdalle requested a review from devmotion (May 7, 2024 06:03)
gdalle requested a review from devmotion (May 7, 2024 06:57)
gdalle (Contributor Author) commented May 7, 2024

@devmotion can you approve the workflow to see if the tests pass in CI?

gdalle (Contributor Author) commented May 8, 2024

Can someone help me figure out what is going wrong with the Tracker tests I marked as broken?

torfjelde (Contributor) commented May 8, 2024

Can someone help me figure out what is going wrong with the Tracker tests I marked as broken?

Is this just type inference, i.e. the @inferred test, or is it something else? Do you have a stacktrace of the error? I can't see it in the logs now that it's been disabled.

EDIT: Just saw the description of the issue above.

gdalle (Contributor Author) commented May 8, 2024

I can reactivate these tests so that they fail

gdalle (Contributor Author) commented May 8, 2024

Done, if someone approves the workflow again

gdalle (Contributor Author) commented May 9, 2024

Okay @tpapp, I figured out the Tracker thing.
You use a test logdensity that is -Inf for negative input, but then you sometimes give x = randn(3) as input. Other times it's x = rand(3) or x = randexp(3), which makes more sense to me. Is there a rationale behind this?
Tracker.withgradient has a check that errors in that case, but other backends might silently return -Inf, check that -Inf == -Inf, and get on with their lives.
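A toy illustration of the mismatch described above (the density below is made up, not the package's test logdensity):

using Random

# A log density supported on the positive orthant: -Inf off-support.
toy_logdensity(x) = all(>(0), x) ? -sum(abs2, x) : -Inf

x_bad  = randn(3)    # may contain negative entries, so toy_logdensity(x_bad) can be -Inf
x_good = randexp(3)  # strictly positive entries, so the log density is always finite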

gdalle (Contributor Author) commented May 9, 2024

Okay, fixed it by:

  • giving x = randexp(3) to every test (instead of the occasional randn or rand)
  • adding an ugly promotion and type checking hack specifically for Tracker

If you tell me the PR is acceptable in this shape, I'll add proper documentation and ping one last time

devmotion (Collaborator) commented:
adding an ugly promotion and type checking hack specifically for Tracker

Probably you ran into https://github.com/FluxML/Flux.jl/issues/497? See

# work around https://github.com/FluxML/Flux.jl/issues/497
z = T <: Real ? zero(T) : 0.0
S = typeof(z + 0.0)
S(yval)::S, (S.(first(Tracker.data.(back(1)))))::Vector{S}

Maybe just use the same fix? I also wonder whether this problem should be addressed in DI rather than in LogDensityProblemsAD.

gdalle (Contributor Author) commented May 9, 2024

Yeah, it's that one; I updated the PR description to mention it.
I could use a fix similar to the one Tamas originally used, but I would rather have all gradient computations performed by DI than keep one with a different default mechanism. Indeed, I'd rather remove this hack altogether.
So the question becomes: should we fix this Tracker type inference issue in DI?

My answer would be no. The founding principle of DI is "the AD backend knows better". If I compute a gradient and it returns some specific type, who am I to say that a promotion with Float64 is the right thing to do? Same with the conversion to Vector. Users will not expect this, and while it may be the right thing to do for Turing, it will be wrong in other scenarios. So in my view, we have two options:

  • keep the hack here (mine is just a thin wrapper over DI, which is why I prefer it to Tamas's)
  • give up on testing inferrability for Tracker

devmotion (Collaborator) commented:
So the question becomes: should we fix this Tracker type inference issue in DI?

Or probably even better in Tracker.

gdalle (Contributor Author) commented May 9, 2024

Or probably even better in Tracker.

Yeah but that's not happening, is it?

devmotion (Collaborator) left a comment:

I think the Tracker issue actually points to a regression caused by the switch to DifferentiationInterface that has to be addressed since otherwise I expect this will cause downstream issues. See my comment in test/runtests.jl for more details.

Review thread on the Enzyme extension constructor (diff context):
mode=Reverse,
shadow=nothing,
)
if !isnothing(shadow)
Collaborator:

So the PR will lead to worse performance in downstream packages that work with pre-allocated/cached shadows?

Contributor Author:

The idea is that DifferentiationInterface has its own mechanism for preparing a gradient, which is triggered when you supply the constructor with x.
As you can see here, it does the exact same shadow construction:
https://github.com/gdalle/DifferentiationInterface.jl/blob/16d93ef4111e1c196912a3dd53ffb31e04445324/DifferentiationInterface/ext/DifferentiationInterfaceEnzymeExt/forward_onearg.jl#L40-L48
Again, the idea is to have a single source for all of this boilerplate, so that if there is something to improve, it can be improved once in DI.
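For reference, a rough sketch of the preparation step being described, assuming DI's prepare_gradient(f, backend, x) form (the exact signature has varied across DI versions; the helper name is hypothetical):

import DifferentiationInterface as DI
using LogDensityProblems: logdensity

# Preparation runs once, when the wrapper is constructed with a concrete x;
# whatever it returns (tapes, configs, shadows, ...) goes into the extras field.
prepare_extras(ℓ, backend, x) = DI.prepare_gradient(Base.Fix1(logdensity, ℓ), backend, x)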

Collaborator:

I assumed there exists such an internal caching/preparation - but this will still break use cases, or at least lead to worse performance, in cases where currently a shadow is pre-allocated and passed around to multiple ADgradient calls. So from the perspective of ADgradient, ideally it would still be possible to forward a shadow to the construction in DI.

It seems there are more major issues with the Enzyme backend though: #29 (review)

shadow=nothing,
)
if !isnothing(shadow)
@warn "keyword argument `shadow` is now ignored"
Collaborator:

If we put this in a breaking release as suggested above, IMO we should just remove the keyword argument. Alternatively, in a non-breaking release I'd prefer to make it a deprecation warning (possibly one that is forced to be displayed).
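A minimal sketch of the non-breaking alternative suggested above, as a drop-in for the `if !isnothing(shadow)` branch shown in the diff context, assuming Base.depwarn's force keyword is acceptable here:

if !isnothing(shadow)
    # A deprecation warning instead of a plain @warn; force=true displays it even when
    # deprecation warnings are otherwise disabled.
    Base.depwarn("the keyword argument `shadow` is ignored and will be removed", :ADgradient; force=true)
end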

Comment on lines -1 to -3
"""
Gradient AD implementation using Enzyme.
"""
Collaborator:

This can be kept?

Contributor Author:

It's very uninformative, no one will ever see it, and this file no longer contains the actual gradient implementation

Collaborator (review comments on test/runtests.jl; the same suggestion was made on four test loops):

Diff context:
- for _ in 1:100
-     x = randn(3)
+ for _ = 1:100
+     x = randexp(3)

Suggested change:
- x = randexp(3)
+ x = randn(3)

wsmoses (Contributor) left a comment:

The Enzyme rule here should not be moved to DI.

Specifically, since DI does not support multi-argument functions, and in particular multiple partial activities (e.g. some arguments differentiated with respect to, others not), this will cause downstream issues for users who previously had working code.

If I'm reading this correctly, at minimum this will require the creation of a closure around logdensity + ℓ. This closure may cause a performance decrement, and could make code that was previously differentiable no longer differentiable.
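For context, a hedged sketch of the multi-argument, activity-annotated call pattern being referred to (not the extension's exact code): ℓ is marked Const so it is not differentiated, while x carries a shadow vector that receives the gradient.

using Enzyme
using LogDensityProblems: logdensity

function enzyme_reverse_gradient(ℓ, x::AbstractVector)
    ∂x = zero(x)  # shadow accumulator for the gradient
    Enzyme.autodiff(Enzyme.Reverse, logdensity, Enzyme.Active, Enzyme.Const(ℓ), Enzyme.Duplicated(x, ∂x))
    return ∂x
end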

∇ℓ::ADgradientDI{B,<:Any,Nothing},
x::AbstractVector{T},
) where {B,T}
y, g = DI.value_and_gradient(Base.Fix1(logdensity, ∇ℓ.ℓ), ∇ℓ.backend, x)
Contributor Author:

@wsmoses the closure you mention is created here, it's a Base.Fix1. If it isn't fully type-inferrable, would a custom callable struct do the trick?
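For concreteness, the kind of callable struct meant above would look like this (the name is hypothetical):

using LogDensityProblems: logdensity

struct LogDensityAt{L}
    ℓ::L
end

# Plays the same role as Base.Fix1(logdensity, ℓ), but with a concrete, named type.
(f::LogDensityAt)(x) = logdensity(f.ℓ, x)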

Collaborator:

BTW this was the PR in which the initial implementation with Fix1 was fixed: #4

Contributor Author:

Okay, can anyone explain to me the justification for that PR instead of acting like it should be obvious ^^? I'm sorry, really doing my best here

Contributor Author:

There's no PR description, no underlying issue, no indicative comments, just the change in the code, and I'm struggling to grasp why it was so important

Contributor:

Sorry, I was asleep until now (on Pacific time).

Closures can present several issues: for Enzyme's ability to AD something, for the performance of the differentiated program, and even for the performance of programs that are not being differentiated.

I'll write up a few concerns below, in no particular order:

  • Type stability. Since you mention this already, I won't dive further into it.

  • Boxing. If a variable is modified in a closure, it may end up being boxed by Julia; in essence, Julia forces it to be captured by reference. Even if the function is otherwise type stable, a box is essentially a type-unstable Any. Furthermore, an unnecessary box adds a level of indirection, which hurts performance (see below).

julia> function h(x,y)
           function g(z)
               x[y] *= 2; y+=1; return z*x[y]
           end
           return g
       end
h (generic function with 1 method)

julia> h([1,2,3],1)
(::var"#g#3"{Vector{Int64}}) (generic function with 1 method)

julia> fieldtypes(typeof(h([1,2,3],1)))
(Vector{Int64}, Core.Box)

julia> fn = h([1,2,3],1)
(::var"#g#3"{Vector{Int64}}) (generic function with 1 method)

julia> fn(1)
2

julia> fn(1)
3
  • Indirection / Aliasing: indirection can cause overhead. For example, consider an array of Int versus an array of Ref{Int} (notably, both type stable). The first is much faster than the second.
julia> function sumvals(x)
           res = 0
           @simd for i in 1:length(x)
             res += @inbounds x[i]
           end
           return res
           end

julia> function sumvals_ref(x)
           res = 0
           @simd for i in 1:length(x)
             res += @inbounds x[i][]
           end
           return res
       end
sumvals_ref (generic function with 1 method)

julia> array = collect(1:10000); 

julia> rarray = collect(map(Ref, 1:10000));

julia> @btime sumvals(array)
  595.877 ns (1 allocation: 16 bytes)
50005000

julia> @btime sumvals_ref(rarray)
  3.534 μs (1 allocation: 16 bytes)
50005000

There are many reasons for this. First, there is one less load instruction per iteration in the first version (the first loop loads the data directly from the array; the second loads a Ref from the array, then the value from the Ref pointer). Second, the data is laid out consecutively. This allows vector instructions to apply in the first case, whereas it is impossible in the second (vector loads/stores require consecutive data). Third, you get much better cache behavior: in the second case you can get a cache miss per iteration, whereas in the first case you get a cache miss only once per cache line (since a single miss loads a consecutive cache line's worth of elements at once).

Beyond the extra indirection, closures can obscure aliasing information, i.e. the compiler's knowledge of whether two pointers can refer to the same memory. This is critical for optimization. Consider the following.

@noinline function inner(y)
   @show y
end

function twouse(x, y)
  x1 =  x * 2
  inner(y)
  x2 = x * 3
  return x1 + x2
end

julia> @code_llvm twouse(2, 3)
;  @ REPL[30]:1 within `twouse`
define i64 @julia_twouse_778(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[30]:3 within `twouse`
  %2 = call i64 @j_inner_780(i64 signext %1)
;  @ REPL[30]:5 within `twouse`
; ┌ @ int.jl:87 within `+`
   %3 = mul i64 %0, 5
; └
  ret i64 %3
}

Here the two uses of x get optimized into one, fantastic. However, if we now make it go through a pointer (via a Ref):

@noinline function inner_ref(y)
   @show y[]
end

function twouse_ref(x, y)
  x1 =  x[] * 2
  inner(y)
  x2 = x[] * 3
  return x1 + x2
end

julia> @code_llvm twouse_ref(Ref(2), Ref(3))
;  @ REPL[33]:1 within `twouse_ref`
define i64 @julia_twouse_ref_790({}* noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 8 dereferenceable(8) %1) #0 {
top:
;  @ REPL[33]:2 within `twouse_ref`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %2 = bitcast {}* %0 to i64*
    %3 = load i64, i64* %2, align 8
; └└
; ┌ @ int.jl:88 within `*`
   %4 = shl i64 %3, 1
; └
;  @ REPL[33]:3 within `twouse_ref`
  %5 = call nonnull {}* @j_inner_792({}* nonnull %1)
;  @ REPL[33]:4 within `twouse_ref`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %6 = load i64, i64* %2, align 8
; └└
; ┌ @ int.jl:88 within `*`
   %7 = mul i64 %6, 3
; └
;  @ REPL[33]:5 within `twouse_ref`
; ┌ @ int.jl:87 within `+`
   %8 = add i64 %7, %4
; └
  ret i64 %8
}

You'll note that now the two loads of x[] appear explicitly and are not combined into one. This is because the call to inner could possibly write to the same memory as x, making it possible that the value of x changed inside the function (even though here we know it doesn't). Getting rid of the inner function call indeed lets a single load replace both uses:

julia> function twouse_ref2(x, y)
         x1 =  x[] * 2
         x2 = x[] * 3
         return x1 + x2
       end
twouse_ref2 (generic function with 1 method)

julia> @code_llvm twouse_ref2(Ref(2), Ref(3))
;  @ REPL[35]:1 within `twouse_ref2`
define i64 @julia_twouse_ref2_793({}* noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 8 dereferenceable(8) %1) #0 {
top:
;  @ REPL[35]:2 within `twouse_ref2`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %2 = bitcast {}* %0 to i64*
    %3 = load i64, i64* %2, align 8
; └└
;  @ REPL[35]:4 within `twouse_ref2`
; ┌ @ int.jl:87 within `+`
   %4 = mul i64 %3, 5
; └
  ret i64 %4
}

Introducing a closure means there is significantly worse aliasing information, since each of the captured variables is behind a pointer, and any aliasing information (e.g. that variables x and y don't overlap) is lost.

This makes the regular code slow, and can also make the generated AD code much slower. As the various Enzyme papers have shown, running optimization before AD has a compounding impact on the generated code. In this case, worse aliasing information may mean that Enzyme has to cache a lot more data, dramatically slowing things down and possibly preventing previously differentiable code from running at all (if the available memory is insufficient).

  • Mixed activity. Just as closures can introduce type instability, they can introduce activity instability; here the question is whether a variable is differentiable or not. If one stores a constant vector in a struct alongside other duplicated vectors, that creates a mixed-activity error (since no derivative vector can be defined consistently for it; in the worst case this leads to wrong results if handled incorrectly, so instead we error). Notably, the user can enable runtime activity to work around this, as described in our docs, but that comes with a performance penalty (again, some combination of previously working code no longer working, and slowdowns).

  • Unnecessary differentiation. Multiple arguments allow you to mark some data as non-differentiable, which feeds an analysis (activity analysis) that determines the minimum set of instructions/values to differentiate. Combining a bunch of variables together (e.g. in a struct/closure), and in particular no longer specifying some of them as non-differentiated, can prevent the analysis from proving that some code doesn't need to be differentiated. If that code isn't presently supported (e.g. due to a type instability in the function being differentiated), you'll now get an error.

Contributor:

I could go into more detail, but hopefully that is a reasonable explanation to start with. Closures were, in my opinion, one of the reasons AD.jl didn't take off, so in the long term I think DI.jl would be well served both by not introducing them itself (my understanding is that it does this successfully) and by not forcing users to create the same closures on their side (which would create similar issues).

I think it would be useful to revisit this down the line after DI adds support for, for example, multiple distinct arguments, as well as the ability to enable/disable differentiation of particular variables, since both of those features are in use here.

Contributor:

Fun fact, to help avoid the issues caused by closures generated by threading: it's little known, but Enzyme even vendors a closure-free (and differentiable) threading macro: https://github.com/EnzymeAD/Enzyme.jl/blob/75e5311ac0dda08498fcfb0aa54e89930cb3e1c9/test/threads.jl#L113

Contributor Author:

Thank you for the detailed explanation. DI is unlikely to support either multiple arguments or activity specification anytime soon

Owner:

@wsmoses the explanation above should end up in the Enzyme manual IMO, it contains a lot of relevant information.

gdalle (Contributor Author) commented May 13, 2024

Okay, so given the feedback from @devmotion and @wsmoses, it seems we need

  • a Tracker special case for non-finite values
  • an Enzyme special case to avoid the closure
  • maybe other special cases for backends that can differentiate with respect to several arguments (Zygote and Tracker)?

If I add all of these, do you think this PR is mergeable? And more importantly, do you think it is useful?

It took a significant effort to make DI compatible with Julia 1.6 for this, so I don't want to keep working only to discover unexpected hurdles at each review.

devmotion (Collaborator) commented:
It took a significant effort to make DI 1.6-compatible for this, so I don't want to keep working only to discover unexpected hurdles at each review

I'm sorry, but at least I didn't anticipate these problems in the initial reviews.

I'm not sure how useful it is to switch LogDensityProblemsAD over to DI if so many workarounds/special cases are needed. I'd suggest putting the PR on hold until these issues are fixed upstream: I hope we can remove the losscheck in Tracker (or at least remove the error it currently throws), and possibly we can add special cases for Base.Fix1 and Base.Fix2 for Enzyme to DI (or to Enzyme itself?).

gdalle (Contributor Author) commented May 13, 2024

Alright then :) Switching the status to draft for now

gdalle marked this pull request as draft (May 13, 2024 19:15)
wsmoses (Contributor) commented May 13, 2024

I don't think this is something with a fix in Enzyme itself, since Enzyme already supports multiple arguments, including marking some arguments as non-differentiable, without issue.

The fix here has to come from DI adding support for multiple arguments, since otherwise a closure will have to be created somewhere.

tpapp (Owner) left a comment:

Sorry for being late to this PR; I had a project with a deadline.

Generally, my impression is that while DifferentiationInterface may be ready to replace some AD backends, edge cases still need to be worked out for a few of them, in particular ForwardDiff.

IMO a version of this PR would be useful if it just replaced the straightforward cases, where equivalent functionality can be achieved without breaking changes; the rest could be addressed later on.

OTOH, I am not sure that there is a pressing need to replace anything. Yes, by necessity and for historical reasons, this package duplicates a lot of the functionality of an abstract AD metapackage. That was made much easier by the fact that we only care about $\mathbb{R}^n \to \mathbb{R}$ functions. But the code is already there, and in most cases it works fine.

Comment on lines +13 to +16
_get_chunksize(::Chunk{C}) where {C} = C
_get_chunksize(chunk::Integer) = chunk

_default_chunk(ℓ) = _get_chunksize(dimension(ℓ))
Owner:

AFAIK,

  1. the only AD backend that uses a chunk size is ForwardDiff, and
  2. the chunk size is enforced to be no larger than ForwardDiff.DEFAULT_CHUNK_THRESHOLD, which is 12 unless overridden by a loaded preference,

so a chunk size specified here bears a somewhat indirect relation to the actual chunk size used. IMO many users are not aware of this.

The ideal interface would need to mention this to manage user expectations. E.g. using dimension(ℓ) does not do what it seemingly does in most cases.
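For illustration (not part of the PR), ForwardDiff's own chunk heuristic shows the cap mentioned above:

using ForwardDiff

# The chunk size ForwardDiff actually uses is bounded by DEFAULT_CHUNK_THRESHOLD
# (12 unless overridden by a preference), regardless of the requested dimension.
ForwardDiff.pickchunksize(4)    # small dimensions are used as-is
ForwardDiff.pickchunksize(100)  # capped at ForwardDiff.DEFAULT_CHUNK_THRESHOLD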


Successfully merging this pull request may close: Can DifferentiationInterface.jl replace the AD backend extensions here? (#26)