
Mooncake less efficient than Zygote or Enzyme on Flux layers #466

Open
m-laprise opened this issue Feb 9, 2025 · 8 comments
Labels
enhancement (performance) Would reduce the time it takes to run some bit of the code

Comments

@m-laprise

(This is a follow-up to a slack thread, cc @willtebbutt & @gdalle )

I have been comparing the performance of different autodiff backends for training Flux models. In most cases I get one order of magnitude worse performance from Mooncake compared to Enzyme, and even Zygote. This may be due to the Fluxperimental Moonduo implementation (https://github.com/FluxML/Fluxperimental.jl/blob/master/ext/FluxMooncakeExt.jl) rather than something in Mooncake itself. Here is a MWE, if useful.

Simple case of two standard feed-forward layers. Benchmarks below are with Flux v0.15.2, Zygote v0.6.75 (constrained from updating further by Fluxperimental), Enzyme v0.13.30, and Mooncake v0.4.83, on Julia 1.10.5.

using Flux
using BenchmarkTools

# Create random inputs and targets
const MINIBATCHSIZE = 64
X = rand(Float32, 100, MINIBATCHSIZE)
Y = rand(Float32, 20, MINIBATCHSIZE)

# Create trivial Flux NN and loss 
model = Chain(Dense(100, 50, relu), 
              Dense(50, 20))
myloss(m, x, y) = Flux.mse(m(x), y)

# Compare time to first gradient (restarting the session for each example):

using Zygote
@btime loss, grads = Flux.withgradient($myloss, $model, $X, $Y)
# 81.875 μs (87 allocations: 126.46 KiB)

using Enzyme
@btime loss, grads = Flux.withgradient($myloss, $Duplicated(model), $X, $Y)
# 82.875 μs (129 allocations: 84.27 KiB)

using Fluxperimental, Mooncake
@btime loss, grads = Flux.withgradient($myloss, $Moonduo(model), $Moonduo(X), $Moonduo(Y))
# 837.625 μs (16045 allocations: 1.89 MiB)

fclosure(m) = myloss(m, X, Y)
@btime loss, grads = Flux.withgradient($fclosure, $Moonduo(model))
# 919.000 μs (16048 allocations: 1.89 MiB)
@willtebbutt
Member

Thank you very much for this example, I really appreciate you following up on the slack thread! I'll have a look to see if I can figure out what's going on.

@willtebbutt willtebbutt added the enhancement (performance) Would reduce the time it takes to run some bit of the code label Feb 9, 2025
@willtebbutt
Member

willtebbutt commented Feb 9, 2025

So it looks like we might need to improve Fluxperimental's use of Mooncake. I see timings similar to yours when I run your code:

julia> @btime loss, grads = Flux.withgradient($myloss, $model, $X, $Y);
  44.666 μs (102 allocations: 124.66 KiB)

julia> @benchmark Flux.withgradient($myloss, $(Moonduo(model)), $(Moonduo(X)), $(Moonduo(Y)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  269.167 μs …  29.130 ms  ┊ GC (min … max):  0.00% … 98.73%
 Time  (median):     311.833 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   373.000 μs ± 655.422 μs  ┊ GC (mean ± σ):  16.42% ± 10.48%

  █                                                              
  █▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▂▂▂▂▂ ▂
  269 μs           Histogram: frequency by time         4.06 ms <

 Memory estimate: 1.13 MiB, allocs estimate: 701.

However, it looks like Fluxperimental is failing to re-use the rule that Mooncake constructs each time (which makes sense -- there's no state being passed around, so I don't see how it could). I find that if you make use of Mooncake's rule caching functionality, you see a substantial improvement in performance:

julia> fargs = (myloss, model, X, Y);

julia> cache = Mooncake.prepare_gradient_cache(fargs...);

julia> @benchmark Mooncake.value_and_gradient!!($cache, $fargs...)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  112.875 μs … 161.791 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     120.208 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   120.802 μs ±   3.626 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▁▃█▇▇▂                                        
  ▁▁▁▁▂▂▂▃▃▃▄▄▄▃▃▃▅███████▇▄▄▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  113 μs           Histogram: frequency by time          134 μs <

 Memory estimate: 64.33 KiB, allocs estimate: 37.

There's still a bit of a performance gap though, so I've done a bit of profiling:

[profiling flame graph image omitted]

So it looks like the remaining performance gap is the same kind of thing that we're discussing in #156, i.e. it's mostly looping-related overhead.

I think the first priority here should be seeing whether @mcabbott is open, in principle, to modifying Fluxperimental to make use of Mooncake.prepare_gradient_cache, so that we can get a decent amount of stack re-use.

@gdalle
Collaborator

gdalle commented Feb 9, 2025

Note that it could also be DI.prepare_gradient, which additionally works across backends.

@mcabbott

mcabbott commented Feb 9, 2025

Note that $(Moonduo(model)) at least saves & re-uses the allocation for the gradient, while $Moonduo(model) does not. This is the only "state" which is stored in the present design.

But changing things to store more would be fine, it's certainly all experimental!

I started something perhaps similar for Reactant.jl in FluxML/Fluxperimental.jl#28 , where mr = Reactor(model) stores not just the allocation for the gradient, but also some compiled version. This is saved inside mr the first time you call gradient(loss, mr, x, y), and each subsequent call checks that loss is the same and x, y are of the expected type & size before proceeding.

(Edit: that's a slightly magical design (rather than asking you to make some special object up front & re-use it), but it does mean you can use it in a loop over for (x,y) in somedata without having to change the code much, e.g. to get first(somedata) and have the loss function in two places...)
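
The lazy-caching design described above could be sketched roughly like this. Everything here is hypothetical (CachedPrep and cached_prepare are made-up names, not Fluxperimental's or Mooncake's actual API); `prepare` stands in for an expensive step such as Mooncake.prepare_gradient_cache or a Reactant compile:

```julia
# Hypothetical sketch of the "prepare once, re-use in a loop" pattern:
# a wrapper that lazily caches an expensive prepared artifact the first
# time it is used, and rebuilds it if the loss or argument types change.
mutable struct CachedPrep{M}
    model::M
    prep::Any   # prepared rule / compiled artifact, or `nothing` initially
    key::Any    # (loss, argument types) the cache was built for
end
CachedPrep(model) = CachedPrep(model, nothing, nothing)

function cached_prepare(prepare, loss, c::CachedPrep, args...)
    key = (loss, map(typeof, args))
    if c.prep === nothing || c.key != key
        c.prep = prepare(loss, c.model, args...)  # expensive, done once
        c.key = key
    end
    return c.prep
end
```

A gradient call like gradient(loss, c, x, y) could then call cached_prepare internally, so iterating over `for (x, y) in somedata` only pays the preparation cost on the first batch.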

@willtebbutt
Member

I started something perhaps similar for Reactant.jl in FluxML/Fluxperimental.jl#28 , where mr = Reactor(model) stores not just...

Ooo yes, this would be the kind of thing that we would need. I think the only additional thing you need to store is the result of build_rrule.

@mcabbott

mcabbott commented Feb 9, 2025

One thing to note here is that, besides allocating the gradient, Fluxperimental is "standardising" it to a form Flux (really Optimisers.jl) understands -- a nested set of structs with the same field names as the model. These are all NamedTuples & Tuples here (like Zygote & unlike Enzyme). What Mooncake produces by default is more elaborate, with nested Tangent objects whose .fields contains a NamedTuple/Tuple. Hence this error:

julia> opt = Flux.setup(Adam(), model);
julia> loss, grads = Flux.withgradient(myloss, model, X, Y);  # Zygote
julia> Flux.update!(opt, model, grads[1]);  # this should work... check that at least it runs!

julia> loss, grads = Flux.withgradient(myloss, Moonduo(model), Moonduo(X), Moonduo(Y));
julia> Flux.update!(opt, model, grads[1]);  # still works, "grads" has same structure

julia> cache = Mooncake.prepare_gradient_cache(myloss, model, X, Y);
julia> Mooncake.value_and_gradient!!(cache, myloss, model, X, Y);
julia> Flux.update!(opt, model, cache.tangents[2])
ERROR: type Tangent has no field layers
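
The "standardising" step described above amounts to recursively stripping the Tangent wrappers. A toy illustration (ToyTangent and flatten are made-up names for this sketch; real code would dispatch on Mooncake's actual Tangent type):

```julia
# Toy version of Mooncake's wrapper: a Tangent holds its data in `.fields`
# as a NamedTuple, whereas Optimisers.jl expects plain nested NamedTuples.
struct ToyTangent{T<:NamedTuple}
    fields::T
end

flatten(t::ToyTangent) = map(flatten, t.fields)  # strip wrapper, recurse
flatten(t::Tuple) = map(flatten, t)              # recurse through layer tuples
flatten(x) = x                                   # leaves: arrays, numbers, nothing

g = ToyTangent((layers = (ToyTangent((weight = [1.0], bias = [2.0])),),))
flatten(g)  # (layers = ((weight = [1.0], bias = [2.0]),),)
```

The result has the same field names as the model, which is the shape Flux.update! expects.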

That's also the reason DI isn't a straightforward solution to anything Flux, as its position is (I think) to return whatever objects the backend prefers. Trying today, I'm not sure I see how prepare_gradient works?

julia> using DifferentiationInterface

help?> prepare_gradient
search: prepare_gradient prepare_hessian value_and_gradient value_and_gradient! prepare_jacobian

  prepare_gradient(f, backend, x, [contexts...]) -> prep

  Create a prep object that can be given to gradient and its variants.

  │ Warning
  │
  │  If the function changes in any way, the result of preparation will be invalidated, and
  │  you will need to run it again.

julia> prep = prepare_gradient(myloss, backend, model, X, Y);  # maybe X, Y aren't context?
ERROR: MethodError: no method matching prepare_gradient(::typeof(myloss), ::AutoMooncake{Nothing}, ::Chain{Tuple{…}}, ::Matrix{Float32}, ::Matrix{Float32})

julia> prep = prepare_gradient(m -> myloss(m, X, Y), backend, model);  # is this a bug or user error?
ERROR: MethodError: no method matching copy(::Tangent{@NamedTuple{layers::Tuple{Tangent{@NamedTuple{…}}, Tangent{@NamedTuple{…}}}}})
Stacktrace:
 [1] value_and_pullback(::var"#13#14", ::DifferentiationInterfaceMooncakeExt.MooncakeOneArgPullbackPrep{…}, ::AutoMooncake{…}, ::Chain{…}, ::Tuple{…})
   @ DifferentiationInterfaceMooncakeExt ~/.julia/packages/DifferentiationInterface/mXEZA/ext/DifferentiationInterfaceMooncakeExt/onearg.jl:33
 [2] prepare_pullback(::Function, ::AutoMooncake{Nothing}, ::Chain{Tuple{Dense{…}, Dense{…}}}, ::Tuple{Bool})
   @ DifferentiationInterfaceMooncakeExt ~/.julia/packages/DifferentiationInterface/mXEZA/ext/DifferentiationInterfaceMooncakeExt/onearg.jl:16
 [3] prepare_gradient(::var"#13#14", ::AutoMooncake{Nothing}, ::Chain{Tuple{Dense{…}, Dense{…}}})
   @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/mXEZA/src/first_order/gradient.jl:70

julia> prep = prepare_gradient(mxy -> myloss(mxy...), backend, (model, X, Y))
ERROR: MethodError: no method matching copy(::Tuple{Tangent{@NamedTuple{layers::Tuple{Tangent{…}, Tangent{…}}}}, Matrix{Float32}, Matrix{Float32}})

@gdalle
Collaborator

gdalle commented Feb 9, 2025

That's also the reason DI isn't a straightforward solution to anything Flux, as its position is (I think) to return whatever objects the backend prefers.

Indeed, as of right now DI makes no effort to standardize outputs because different backends have different conventions for nested structs, which fields are active, etc. I think it may be more trouble than it's worth because every downstream package will have different standards as well. Happy to discuss further though.

Trying today, I'm not sure I see how prepare_gradient works?

The first error you see is because there can only be one active argument, as explained on this page. So the correct syntax would be

prep = prepare_gradient(myloss, backend, model, Constant(X), Constant(Y));

However, then you'll encounter the second error, which is exactly #467. We're brainstorming a solution as we speak :)

@gdalle
Collaborator

gdalle commented Feb 9, 2025

Fix incoming in JuliaDiff/DifferentiationInterface.jl#723
