
Update AD comparison #988

Merged: 4 commits into master from gd/update_ad, Jul 25, 2024

Conversation

@gdalle (Contributor) commented Jun 27, 2024

Update the AD benchmark to the latest versions of DifferentiationInterface.

For some reason Tapir fails; I will investigate.

@ChrisRackauckas (Member)

What about just keeping Tapir on the earlier version for now?

@gdalle (Contributor, Author) commented Jul 3, 2024

I opened an issue at Tapir.jl; let's see what @willtebbutt thinks:
compintell/Mooncake.jl#193

@wsmoses commented Jul 4, 2024

Since it sounds like, per JuliaDiff/DifferentiationInterface.jl#339 and others, DI is likely to have a nontrivial performance loss on Enzyme for things like closures, can we just add Enzyme.gradient/autodiff benchmarks here directly?

@gdalle (Contributor, Author) commented Jul 4, 2024

The benchmarking utilities of DifferentiationInterfaceTest are one of the main appeals of the package, and so this notebook is a benchmark of AD backends as called through DI. I think it can also be useful for non-advanced users who want to use DI precisely because it is simple and backend-agnostic. Thus I would recommend we keep this page as it is, possibly with a warning.

And I am aware that you're not happy about Enzyme being possibly a bit slower in DI @wsmoses ;). I just ask that you give me time to come up with a solution that makes everyone happy. In the meantime, I'm adding a warning to our docs: JuliaDiff/DifferentiationInterface.jl#344

@ChrisRackauckas (Member)

can we just add Enzyme.gradient/autodiff benchmarks here directly

We should measure the difference; that would make the discussion much more concrete.

@wsmoses commented Jul 4, 2024

Yeah, I think having both here would be useful. For this function in particular I imagine the performance would be mostly the same (given it's already a single-argument function without a closure), but we could see how things actually shake out for the cases being discussed.

@wsmoses commented Jul 4, 2024

this notebook is a benchmark of AD backends as called through DI.

Oh okay. I misunderstood this as a benchmark set of AD packages in general, rather than a DI-specific benchmark. If that is the case, maybe this should have a warning at the top saying something like: "This benchmark set is for testing the performance of DI and may have performance that differs from using AD packages directly. If you are looking for a benchmark of different AD backends please see "

@gdalle (Contributor, Author) commented Jul 4, 2024

I'm okay with things going either way. I agree that having a comparison with "native" Enzyme would make sense, especially in some complicated cases like closures. But I have already done the work of coding this backend-agnostic benchmarking pipeline, and so far it is the only thing close to a standard that we have.

So my suggestion is the following: if AD developers are unhappy with the performance of their backend on a given benchmark, they are free to contribute the necessary code to add rows to the DataFrame in this notebook. Of course said rows should contain the exact same quantities that DifferentiationInterfaceTest computes by itself, to enable comparison. In other words, now that there is a standard, if you want to beat the DI implementation, the burden falls on you to play by the same rules. Or better yet, to contribute ideas to DI and help me fix performance pitfalls. What do you think?

@wsmoses commented Jul 8, 2024

@gdalle are the DIT results basically compatible with BenchmarkTools? In that case I imagine it should be fairly simple to just add a @benchmark / @btime like below.

As for the tangible performance implications, here's an example. Asking for the closure derivatives, as in JuliaDiff/DifferentiationInterface.jl#341, will cause the derivative code to become asymptotically worse by an arbitrarily large factor (here demonstrated going from O(1) to O(N)). The linked PR failed to compile for me, so I demonstrate with the Enzyme calls that would be made directly.

As a top-level comment, I simultaneously think the work you've done is immensely valuable (especially here, to showcase a comparison), and also that it is important to provide reasonable default benchmarks/examples/docs/etc. People look to docs and benchmarks as examples of how things should be written and used, and showing them something suboptimal or erroneous has led to issues for Julia in the past (e.g. there was a Julia language doc for @inbounds that itself misused @inbounds, leading to lots of headaches, and code examples using type-unstable global variables led to type-unstable globals showing up in a lot more places). If this is meant to be something which represents how to use AD in Julia [in contrast to how to use DI], I think it is important that we give examples that show how to use these tools most effectively (as well as how to use these tools most easily, thanks to DI!).

struct MyClosure
    y::Vector{Float64}
end

function (c::MyClosure)(x)
    sum(c.y) + x
end

y = MyClosure(rand(10000))


using Enzyme
using DifferentiationInterface
using BenchmarkTools

@btime Enzyme.autodiff(Reverse, y, Active(1.0))
#  36.380 ns (1 allocation: 16 bytes)


@btime Enzyme.autodiff(Reverse, Duplicated(y, make_zero(y)), Active(1.0))
#   3.422 μs (11 allocations: 78.67 KiB)

using DifferentiationInterfaceTest
backends = [
    AutoEnzyme(mode=Enzyme.Reverse),
];

scenarios = [
    GradientScenario(x->y(x[1]); x=[1.0], y=1.0, nb_args=1),
];

# Using DI main 
data = benchmark_differentiation(backends, scenarios, logging=true)

# 3×11 DataFrame
#  Row │ backend                            scenario                           operator             calls  samples  evals  time      allocs   bytes    gc_fraction  compile_fraction 
#      │ AutoEnzy…                          Scenario…                          Symbol               Int64  Int64    Int64  Float64   Float64  Float64  Float64      Float64          
# ─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#    1 │ AutoEnzyme(mode=ReverseMode{fals…  Scenario{:gradient,1,:inplace} #…  prepare_gradient         0        1      1  0.0           0.0      0.0          0.0               0.0
#    2 │ AutoEnzyme(mode=ReverseMode{fals…  Scenario{:gradient,1,:inplace} #…  value_and_gradient!      1    20198      1  3.375e-6     20.0    544.0          0.0               0.0
#    3 │ AutoEnzyme(mode=ReverseMode{fals…  Scenario{:gradient,1,:inplace} #…  gradient!                1    20006      1  3.375e-6     11.0    288.0          0.0               0.0


# Using Rebase of https://github.com/gdalle/DifferentiationInterface.jl/pull/341
# julia> using DifferentiationInterface, Enzyme
# Precompiling DifferentiationInterfaceEnzymeExt
#         Info Given DifferentiationInterfaceEnzymeExt was explicitly requested, output will be shown live 
# ERROR: LoadError: UndefVarError: `@compat` not defined

@gdalle (Contributor, Author) commented Jul 15, 2024

Thanks Billy for the example, which indeed proves that JuliaDiff/DifferentiationInterface.jl#339 needs a little more thought.
As for the benchmark fields: DIT benchmarks are done not with BenchmarkTools but with Chairmarks, so the fields are slightly different.
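
For reference, a minimal Chairmarks call (a hedged sketch: Chairmarks exports @b/@be, and its samples carry fields such as time, allocs, bytes, gc_fraction, and compile_fraction, which is where the DataFrame columns above come from):

using Chairmarks

b = @b sum(rand(1000))  # rough analogue of BenchmarkTools' @btime
b.time                  # time in seconds, rather than pretty-printed units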

@ChrisRackauckas (Member)

You're mixing two related but different concepts. This has nothing to do with JuliaDiff/DifferentiationInterface.jl#339. JuliaDiff/DifferentiationInterface.jl#339 needs to be done for correctness reasons. If you check

@btime Enzyme.autodiff(Reverse, y, Active(1.0))

it simply gives the wrong derivatives in many cases. Yes it's fast, but:

function (c::MyClosure)(x)
    c.y[1] = x
    sum(c.y) + x
end

you'll get the wrong answer with @btime Enzyme.autodiff(Reverse, y, Active(1.0)). So there are two separate things here. One is that DI should get the right answer. It can have a fast mode, say duplicate_closure = Val(false), that removes the closure duplication as a performance optimization, but unless you can prove it's correct you must ensure those values are differentiated. In the most common Julia use case, which is using closures to enclose caches for temporary variables, you will get the wrong answer if the user sets this to false, so it should be documented appropriately.

The second is that DI should not build closures and instead should pass in const arguments, so that what it uses for caches and parameters is not caught up in the closure duplication. That's the multi-argument form, and having this be a two-argument function with a const on the second is the way to make it clear that this argument should not be differentiated.

But I would be very wary of calling an approach a "performance optimization" when it makes a very common use case simply compute the wrong answer.
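
To make the failure mode concrete, here is a hedged sketch reusing the mutating MyClosure definition just above (the derivative values are the ones Chris describes; depending on the Enzyme version, the Const call may instead raise an activity error rather than silently returning the wrong number):

using Enzyme

c = MyClosure(zeros(10))  # cache starts at zero; c(x) writes x into c.y[1]
# True derivative: c(x) = x + sum(c.y) with c.y[1] = x, so d/dx = 2.

# Treating the callable as constant drops the dataflow through c.y:
d_const = Enzyme.autodiff(Reverse, Const(c), Active(1.0))[1][1]  # reports 1.0

# Duplicating the callable shadows the cache and recovers the full derivative:
d_dup = Enzyme.autodiff(Reverse, Duplicated(c, make_zero(c)), Active(1.0))[1][1]  # 2.0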

@gdalle (Contributor, Author) commented Jul 15, 2024

My thinking was that having multiple arguments would solve both problems at once, but you're right, people may still want to put stuff in the function f itself and I can't stop them.
Do we actually pay a performance penalty for Duplicated(f, make_zero(f)) if f is not a closure, and contains no differentiable storage?

@ChrisRackauckas (Member)

Do we actually pay a performance penalty for Duplicated(f, make_zero(f)) if f is not a closure, and contains no differentiable storage?

There is no performance penalty in this case. There is a bug I'm trying to isolate (see SciML/SciMLSensitivity.jl#1067), but I can verify that it's zero overhead.
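
A quick, hedged way to sanity-check that claim yourself (sq is an illustrative stateless function; exact timings are machine-dependent, and this simply mirrors the Duplicated(f, make_zero(f)) pattern from the question):

using Enzyme, BenchmarkTools

sq(x) = x^2            # plain function: no closure, no differentiable storage
dsq = make_zero(sq)    # shadow of a stateless function carries no data

@btime Enzyme.autodiff(Reverse, Const(sq), Active(1.0))
@btime Enzyme.autodiff(Reverse, Duplicated(sq, $dsq), Active(1.0))
# If the claim holds, both report essentially identical time and allocations.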

@wsmoses commented Jul 15, 2024

The case I'd worry about isn't the ghost-type closure case, but one where inactive data is being closed over. For example a model, where you just want the loss wrt the input and not the model [duplicating the closure would end up computing the derivative wrt everything].

And I do think that this is related to the multiple arguments/activity support.

For example, @ChrisRackauckas your use case could be handled with something like:

Enzyme.gradient(Reverse, tup -> tup[1](tup[2]), (myclosure, x))

However, if the function were always differentiable, it would be impossible to represent the case where you don't differentiate wrt the function/model/etc., which is the default of other AD tools.

I agree DI should support both, but in the absence of an option, in my mind it's a matter of giving reasonable defaults (e.g. what is expected from other AD tools). You can always use Enzyme directly.
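
Expanding the tuple one-liner above into a runnable sketch with the MyClosure example from earlier (hedged: the Reverse mode argument follows Enzyme's documented gradient signature, and whether gradient accepts a heterogeneous tuple input like this depends on the Enzyme version):

using Enzyme

myclosure = MyClosure(rand(10000))

# Lifting the callable into the differentiated argument asks for derivatives
# wrt both the closed-over data and the scalar input:
g = Enzyme.gradient(Reverse, tup -> tup[1](tup[2]), (myclosure, 1.0))
# The result mirrors the input tuple: a shadow MyClosure plus a scalar derivative.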

@gdalle (Contributor, Author) commented Jul 15, 2024

For multiple arguments and activities, check out my proposal in JuliaDiff/DifferentiationInterface.jl#311 (comment); I think I'm getting closer to an interesting interface.

As for the default behavior, I think most other backends consider the function constant, so in that sense keeping the current behavior for Enzyme seems reasonable indeed, as long as I warn about it in the docs. Once multiple arguments are supported, closing over a cache is no longer necessary.

@ChrisRackauckas (Member)

However if the function were always differentiable it would be impossible to represent the case where you don't differentiate wrt the function/model/etc, which is the default of other AD tools.

The issue is that other AD backends throw an error. For example, we have this in the DiffEq FAQ because this comes up often.

https://docs.sciml.ai/DiffEqDocs/stable/basics/faq/#I-get-Dual-number-errors-when-I-solve-my-ODE-with-Rosenbrock-or-SDIRK-methods

People will differentiate code like that and get an error saying that there is a conversion of Float64 to Dual, undefined, etc., and they will then ask around or read the FAQ and use PreallocationTools.jl to fix it.

The issue with non-duplicated closures is that you simply get the wrong answer with no warning or error. SciML/SciMLSensitivity.jl#962 was done because, on those examples, you otherwise just get the wrong derivative.

I would be okay with the approach of not handling it by default if there was an error or warning to disambiguate const vs active, but you really can't just assume that every user's function won't have active elements because that's not true, and it's so not true that we specifically have a section of the FAQ dedicated to it. DI shouldn't make closures and differentiate what's not needed, but making the assumption that the user's function is not active is just not always right, and it's also silent about the fact that it's not always right.
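
For readers who haven't seen the workaround being referenced, a minimal PreallocationTools sketch (assuming its documented DiffCache/get_tmp API; the loss function and sizes are illustrative):

using PreallocationTools, ForwardDiff

cache = DiffCache(zeros(3))  # buffer usable with both Float64 and Dual numbers

function loss(x)
    tmp = get_tmp(cache, x)  # returns a buffer whose eltype matches x
    tmp .= 2 .* x            # mutate the preallocated buffer, Dual-safely
    return sum(abs2, tmp)
end

ForwardDiff.gradient(loss, ones(3))  # no Float64-to-Dual conversion error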

@gdalle (Contributor, Author) commented Jul 15, 2024

But if we clearly document in DI that f should not enclose anything but constant parameters, it's not really incorrect, right? It's just a caveat that users have to know about.
My issue is the consistency of semantics: Billy is right, I think no other backend handles closures with differentiable caches correctly. So why would Enzyme be held to a higher standard?

@ChrisRackauckas (Member)

I think no other backend handles closures with differentiable caches correctly. So why would Enzyme be held to a higher standard?

Every other backend throws an error though. I would rank it as:

  1. Supports it out of the box
  2. Throws an error out of the box, requiring PreallocationTools as a workaround
  3. Silently computes the wrong answer if you happen to use closures in a model as a cache for active values

All other AD backends do (2). Enzyme supports (1) or (3), but does not currently support (2), at least to my knowledge. Choosing (3) is not up to the standard we hold the other AD libraries to.

I understand wanting to go fast, but what about a clearly documented speedup boolean, say AutoEnzyme(skip_closure = true), documented as "this can make some codes go faster, but use it with caution as to when it would make wrong values appear"? It wouldn't even affect most benchmarks, because most code either won't use closures, or the closures enclose caches, in which case duplication is required for correctness anyway. The case where you don't need to differentiate some enclosed value because it's constant, yet duplicating it is also a major performance hit, is such a weird edge case to willingly throw out the property of "we try to be (mathematically) correct or we error". IMO that should be the first principle an AD library tries to adhere to; "fastest possible thing without tweaking the defaults" is also a good property, but it comes second.

@gdalle (Contributor, Author) commented Jul 15, 2024

Yeah I had thought of amending ADTypes too. Perhaps AutoEnzyme(constant_function=true/false) would be clearer?
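
A purely hypothetical sketch of what such a flag could control inside DI's Enzyme extension (annotate_function and the keyword name are illustrative, not real ADTypes/DI code):

using Enzyme

# constant_function=true would map f to Const(f) (fast, but assumes no active
# enclosed data); false would map it to Duplicated(f, make_zero(f)) (safe default).
annotate_function(f, constant_function::Bool) =
    constant_function ? Const(f) : Duplicated(f, make_zero(f))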

@wsmoses commented Jul 15, 2024

I think an ADTypes arg is a good solution to this (also useful for others outside DI, like SciMLSensitivity etc.).

@ChrisRackauckas I completely agree with your comments/ordering, though I am going to push back on Enzyme giving an incorrect result. A user tells Enzyme whether any variable (including the function) is differentiable. Enzyme gives a correct answer in both cases, subject to the problem it was given (e.g. considering uses of the function to be constant means any dataflow through that constant has 0 derivative). Of course there is still an issue of "what problem do we want to compute", for which picking a non-differentiable closure when we do want differentiable dataflow is not correct. But the error is not in Enzyme; it's in whatever picked the wrong Enzyme call.

By a similar token, if DI specified that the function is always non-differentiable, it would not be a bug in DI, since that is using DI for a problem out of spec. The question at hand, however, is what the spec should be (e.g. a question for users of DI, one within DI, etc.).

@ChrisRackauckas (Member)

But the error is not in Enzyme; it's in whatever picked the wrong Enzyme call.

I agree here. It's not an issue with Enzyme, but:

By a similar token, if DI specified that the function is always non-differentiable, it would not be a bug in DI, since that is using DI for a problem out of spec. The question at hand, however, is what the spec should be (e.g. a question for users of DI, one within DI, etc.).

I think that's a misunderstanding of how DI is expected to be used. If we expect DI to be used in packages, say DiffEq, NonlinearSolve, Turing, etc., then you can't expect most users to have read the documentation of DI at all. In fact, I'd argue that most users of DI won't read most of the documentation either (unless there's an error message that tells them to read a specific page). The issue that I'm seeing is that DI as a general differentiation interface is planned to be used pervasively, but if it makes a choice like this, then this is a choice made for downstream as well. And it would be very confusing to a Turing user if they used a mutable cache to make their code non-allocating with pre-allocated buffers because this blog post https://viralinstruction.com/posts/optimise/#cut_down_on_allocations said to do that, and now their code does not converge correctly. At least today, a user of Turing will get an error message.

@ChrisRackauckas (Member)

On this PR though, @gdalle should we merge with AutoTapir giving NaNs?

@gdalle (Contributor, Author) commented Jul 16, 2024

I don't know. The Tapir issue is related to both DITest and Tapir using JET as a dependency, plus some weird world-age shenanigans (see compintell/Mooncake.jl#193). @willtebbutt, your call.

@willtebbutt (Contributor)

I'm working on resolving this by removing JET as a dep of Tapir.jl. I'm expecting to have something merged today, then hopefully this will all work nicely, and you can use the most recent version of Tapir.jl.

@willtebbutt (Contributor)

@gdalle could you please try restricting the version of Tapir to 0.2.26 and up, and see what happens? It should work now.

@ChrisRackauckas (Member)

I updated the Project.toml and rebased.

@willtebbutt (Contributor)

Does it usually take an hour to find an agent for the AD testing?

@ChrisRackauckas (Member)

The sparse matrix run is 3 days, so that's occupying one. Then there's the bump here https://buildkite.com/julialang/scimlbenchmarks-dot-jl/builds/2671#0190dd16-0e34-4ef6-bb22-a261e6ae81d9 which should probably be done in a few hours.

@ChrisRackauckas (Member)

It looks like this went through and we're getting numbers. Merging. I'll probably tag a new set of the benchmarks soon too, so this might be the first tagged version.

@ChrisRackauckas merged commit 965750d into master on Jul 25, 2024
2 checks passed
@ChrisRackauckas deleted the gd/update_ad branch on July 25, 2024 at 08:13