
Mooncake less efficient than Zygote or Enzyme on Flux layers #466

Open
m-laprise opened this issue Feb 9, 2025 · 8 comments
Labels
enhancement (performance) Would reduce the time it takes to run some bit of the code

Comments

@m-laprise

(This is a follow-up to a slack thread, cc @willtebbutt & @gdalle )

I have been comparing the performance of different autodiff backends for training Flux models. In most cases I get one order of magnitude worse performance from Mooncake compared to Enzyme, and even Zygote. This may be due to the Fluxperimental Moonduo implementation (https://github.com/FluxML/Fluxperimental.jl/blob/master/ext/FluxMooncakeExt.jl) rather than something in Mooncake itself. Here is a MWE, if useful.

Simple case of two standard feed-forward layers. Benchmarks below are with Flux v0.15.2, Zygote v0.6.75 (constrained from updating further by Fluxperimental), Enzyme v0.13.30, and Mooncake v0.4.83, on Julia 1.10.5.

using Flux
using BenchmarkTools

# Create random inputs and targets
const MINIBATCHSIZE = 64
X = rand(Float32, 100, MINIBATCHSIZE)
Y = rand(Float32, 20, MINIBATCHSIZE)

# Create trivial Flux NN and loss 
model = Chain(Dense(100, 50, relu), 
              Dense(50, 20))
myloss(m, x, y) = Flux.mse(m(x), y)

# Compare time to first gradient (restarting the session for each example):

using Zygote
@btime loss, grads = Flux.withgradient($myloss, $model, $X, $Y)
# 81.875 μs (87 allocations: 126.46 KiB)

using Enzyme
@btime loss, grads = Flux.withgradient($myloss, $Duplicated(model), $X, $Y)
# 82.875 μs (129 allocations: 84.27 KiB)

using Fluxperimental, Mooncake
@btime loss, grads = Flux.withgradient($myloss, $Moonduo(model), $Moonduo(X), $Moonduo(Y))
# 837.625 μs (16045 allocations: 1.89 MiB)

fclosure(m) = myloss(m, X, Y)
@btime loss, grads = Flux.withgradient($fclosure, $Moonduo(model))
# 919.000 μs (16048 allocations: 1.89 MiB)
@willtebbutt
Member

Thank you very much for this example, I really appreciate you following up on the slack thread! I'll have a look to see if I can figure out what's going on.

@willtebbutt willtebbutt added the enhancement (performance) Would reduce the time it takes to run some bit of the code label Feb 9, 2025
@willtebbutt
Member

willtebbutt commented Feb 9, 2025

So it looks like we might need to improve Fluxperimental's use of Mooncake. I see timings similar to yours when I run your code:

julia> @btime loss, grads = Flux.withgradient($myloss, $model, $X, $Y);
  44.666 μs (102 allocations: 124.66 KiB)

julia> @benchmark Flux.withgradient($myloss, $(Moonduo(model)), $(Moonduo(X)), $(Moonduo(Y)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  269.167 μs …  29.130 ms  ┊ GC (min … max):  0.00% … 98.73%
 Time  (median):     311.833 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   373.000 μs ± 655.422 μs  ┊ GC (mean ± σ):  16.42% ± 10.48%

  █                                                              
  █▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▂▂▂▂▂ ▂
  269 μs           Histogram: frequency by time         4.06 ms <

 Memory estimate: 1.13 MiB, allocs estimate: 701.

However, it looks like Fluxperimental is failing to re-use the rule that Mooncake constructs each time (which makes sense -- there's no state being passed around, so I don't see how it could). I find that if you make use of Mooncake's rule caching functionality, you see a substantial improvement in performance:

julia> fargs = (myloss, model, X, Y);

julia> cache = Mooncake.prepare_gradient_cache(fargs...);

julia> @benchmark Mooncake.value_and_gradient!!($cache, $fargs...)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  112.875 μs … 161.791 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     120.208 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   120.802 μs ±   3.626 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▁▃█▇▇▂                                        
  ▁▁▁▁▂▂▂▃▃▃▄▄▄▃▃▃▅███████▇▄▄▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  113 μs           Histogram: frequency by time          134 μs <

 Memory estimate: 64.33 KiB, allocs estimate: 37.

There's still a bit of a performance gap though, so I've done a bit of profiling:

[profiling flame graph image omitted]

So it looks like the remaining performance gap is the same kind of thing that we're discussing in #156, i.e. it's mostly looping-related overhead.

I think the first priority here should be seeing whether @mcabbott is open, in principle, to modifying Fluxperimental to make use of Mooncake.prepare_gradient_cache, so that we can get a decent amount of stack re-use.

@gdalle
Collaborator

gdalle commented Feb 9, 2025

Note that it could also be DI.prepare_gradient, which additionally works across backends.

@mcabbott

mcabbott commented Feb 9, 2025

Note that $(Moonduo(model)) at least saves & re-uses the allocation for the gradient, while $Moonduo(model) does not. This is the only "state" which is stored in the present design.

But changing things to store more would be fine, it's certainly all experimental!

I started something perhaps similar for Reactant.jl in FluxML/Fluxperimental.jl#28 , where mr = Reactor(model) stores not just the allocation for the gradient, but also some compiled version. This is saved inside mr the first time you call gradient(loss, mr, x, y), and each subsequent call checks that loss is the same and x, y are of the expected type & size before proceeding.

(Edit: that's a slightly magical design (rather than asking you to make some special object up front & re-use it), but it does mean you can use it in a loop over for (x,y) in somedata without having to change the code much, e.g. to get first(somedata) and have the loss function in two places...)
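
The lazy-caching design described above could be sketched roughly like this. Everything here is hypothetical (CachedPrep and cached_prepare are made-up names, not Fluxperimental's or Mooncake's actual API); `prepare` stands in for an expensive step such as Mooncake.prepare_gradient_cache or a Reactant compile:

```julia
# Hypothetical sketch of the "prepare once, re-use in a loop" pattern:
# a wrapper that lazily caches an expensive prepared artifact the first
# time it is used, and rebuilds it if the loss or argument types change.
mutable struct CachedPrep{M}
    model::M
    prep::Any   # prepared rule / compiled artifact, or `nothing` initially
    key::Any    # (loss, argument types) the cache was built for
end
CachedPrep(model) = CachedPrep(model, nothing, nothing)

function cached_prepare(prepare, loss, c::CachedPrep, args...)
    key = (loss, map(typeof, args))
    if c.prep === nothing || c.key != key
        c.prep = prepare(loss, c.model, args...)  # expensive, done once
        c.key = key
    end
    return c.prep
end
```

A gradient call like gradient(loss, c, x, y) could then call cached_prepare internally, so iterating over `for (x, y) in somedata` only pays the preparation cost on the first batch.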

@willtebbutt
Member

I started something perhaps similar for Reactant.jl in FluxML/Fluxperimental.jl#28 , where mr = Reactor(model) stores not just...

Ooo yes, this would be the kind of thing that we would need. I think the only additional thing you need to store is the result of build_rrule.

@mcabbott

mcabbott commented Feb 9, 2025

One thing to note here is that, besides allocating the gradient, Fluxperimental is "standardising" it to a form Flux (really Optimisers.jl) understands -- a nested set of structs with the same field names as the model. These are all NamedTuples & Tuples here (like Zygote & unlike Enzyme). What Mooncake produces by default is more elaborate, with nested Tangent objects whose .fields contains a NamedTuple/Tuple. Hence this error:

julia> opt = Flux.setup(Adam(), model);
julia> loss, grads = Flux.withgradient(myloss, model, X, Y);  # Zygote
julia> Flux.update!(opt, model, grads[1]);  # this should work... check that at least it runs!

julia> loss, grads = Flux.withgradient(myloss, Moonduo(model), Moonduo(X), Moonduo(Y));
julia> Flux.update!(opt, model, grads[1]);  # still works, "grads" has same structure

julia> cache = Mooncake.prepare_gradient_cache(myloss, model, X, Y);
julia> Mooncake.value_and_gradient!!(cache, myloss, model, X, Y);
julia> Flux.update!(opt, model, cache.tangents[2])
ERROR: type Tangent has no field layers
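
The "standardising" step described above amounts to recursively stripping the Tangent wrappers. A toy illustration (ToyTangent and flatten are made-up names for this sketch; real code would dispatch on Mooncake's actual Tangent type):

```julia
# Toy version of Mooncake's wrapper: a Tangent holds its data in `.fields`
# as a NamedTuple, whereas Optimisers.jl expects plain nested NamedTuples.
struct ToyTangent{T<:NamedTuple}
    fields::T
end

flatten(t::ToyTangent) = map(flatten, t.fields)  # strip wrapper, recurse
flatten(t::Tuple) = map(flatten, t)              # recurse through layer tuples
flatten(x) = x                                   # leaves: arrays, numbers, nothing

g = ToyTangent((layers = (ToyTangent((weight = [1.0], bias = [2.0])),),))
flatten(g)  # (layers = ((weight = [1.0], bias = [2.0]),),)
```

The result has the same field names as the model, which is the shape Flux.update! expects.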

That's also the reason DI isn't a straightforward solution to anything Flux, as its position is (I think) to return whatever objects the backend prefers. Trying today, I'm not sure I see how prepare_gradient works?

julia> using DifferentiationInterface

help?> prepare_gradient
search: prepare_gradient prepare_hessian value_and_gradient value_and_gradient! prepare_jacobian

  prepare_gradient(f, backend, x, [contexts...]) -> prep

  Create a prep object that can be given to gradient and its variants.

  │ Warning
  │
  │  If the function changes in any way, the result of preparation will be invalidated, and
  │  you will need to run it again.

julia> prep = prepare_gradient(myloss, backend, model, X, Y);  # maybe X, Y aren't context?
ERROR: MethodError: no method matching prepare_gradient(::typeof(myloss), ::AutoMooncake{Nothing}, ::Chain{Tuple{…}}, ::Matrix{Float32}, ::Matrix{Float32})

julia> prep = prepare_gradient(m -> myloss(m, X, Y), backend, model);  # is this a bug or user error?
ERROR: MethodError: no method matching copy(::Tangent{@NamedTuple{layers::Tuple{Tangent{@NamedTuple{…}}, Tangent{@NamedTuple{…}}}}})
Stacktrace:
 [1] value_and_pullback(::var"#13#14", ::DifferentiationInterfaceMooncakeExt.MooncakeOneArgPullbackPrep{…}, ::AutoMooncake{…}, ::Chain{…}, ::Tuple{…})
   @ DifferentiationInterfaceMooncakeExt ~/.julia/packages/DifferentiationInterface/mXEZA/ext/DifferentiationInterfaceMooncakeExt/onearg.jl:33
 [2] prepare_pullback(::Function, ::AutoMooncake{Nothing}, ::Chain{Tuple{Dense{…}, Dense{…}}}, ::Tuple{Bool})
   @ DifferentiationInterfaceMooncakeExt ~/.julia/packages/DifferentiationInterface/mXEZA/ext/DifferentiationInterfaceMooncakeExt/onearg.jl:16
 [3] prepare_gradient(::var"#13#14", ::AutoMooncake{Nothing}, ::Chain{Tuple{Dense{…}, Dense{…}}})
   @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/mXEZA/src/first_order/gradient.jl:70

julia> prep = prepare_gradient(mxy -> myloss(mxy...), backend, (model, X, Y))
ERROR: MethodError: no method matching copy(::Tuple{Tangent{@NamedTuple{layers::Tuple{Tangent{…}, Tangent{…}}}}, Matrix{Float32}, Matrix{Float32}})

@gdalle
Collaborator

gdalle commented Feb 9, 2025

That's also the reason DI isn't a straightforward solution to anything Flux, as its position is (I think) to return whatever objects the backend prefers.

Indeed, as of right now DI makes no effort to standardize outputs because different backends have different conventions for nested structs, which fields are active, etc. I think it may be more trouble than it's worth because every downstream package will have different standards as well. Happy to discuss further though.

Trying today, I'm not sure I see how prepare_gradient works?

The first error you see is because there can only be one active argument, as explained on this page. So the correct syntax would be

prep = prepare_gradient(myloss, backend, model, Constant(X), Constant(Y));

However, then you'll encounter the second error, which is exactly #467. We're brainstorming a solution as we speak :)

@gdalle
Collaborator

gdalle commented Feb 9, 2025

Fix incoming in JuliaDiff/DifferentiationInterface.jl#723
