Mooncake less efficient than Zygote or Enzyme on Flux layers #466
Thank you very much for this example, I really appreciate you following up on the slack thread! I'll have a look to see if I can figure out what's going on.

---
So it looks like we might need to improve things here:

```julia
julia> @btime loss, grads = Flux.withgradient($myloss, $model, $X, $Y);
  44.666 μs (102 allocations: 124.66 KiB)

julia> @benchmark Flux.withgradient($myloss, $(Moonduo(model)), $(Moonduo(X)), $(Moonduo(Y)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  269.167 μs … 29.130 ms   ┊ GC (min … max):  0.00% … 98.73%
 Time  (median):     311.833 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   373.000 μs ± 655.422 μs  ┊ GC (mean ± σ):  16.42% ± 10.48%

  █
  █▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▂▂▂▂▂ ▂
  269 μs          Histogram: frequency by time          4.06 ms <

 Memory estimate: 1.13 MiB, allocs estimate: 701.
```

However, it looks like Fluxperimental is failing to re-use the rule that Mooncake constructs each time (which makes sense -- there's no state being passed around, so I don't see how it could). I find that if you make use of Mooncake's rule caching functionality, you see a substantial improvement in performance:

```julia
julia> fargs = (myloss, model, X, Y);

julia> cache = Mooncake.prepare_gradient_cache(fargs...);

julia> @benchmark Mooncake.value_and_gradient!!($cache, $fargs...)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  112.875 μs … 161.791 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     120.208 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   120.802 μs ±   3.626 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▁▃█▇▇▂
  ▁▁▁▁▂▂▂▃▃▃▄▄▄▃▃▃▅███████▇▄▄▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  113 μs          Histogram: frequency by time           134 μs <

 Memory estimate: 64.33 KiB, allocs estimate: 37.
```

There's still a bit of a performance gap though, so I've done a bit of profiling. I think the first priority here should be seeing if @mcabbott is up for the principle of modifying Fluxperimental to make use of the rule caching functionality.

---
Note that it could also be … But changing things to store more would be fine, it's certainly all experimental! I started something perhaps similar for Reactant.jl in FluxML/Fluxperimental.jl#28, where … (Edit: …) That's a slightly magical design (rather than asking you to make some special object up front & re-use it), but it does mean you can use it in a loop over …

---
Ooo yes, this would be the kind of thing that we would need. I think that the only additional thing that you need to store is the result of `Mooncake.prepare_gradient_cache`.
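A minimal sketch of that idea: a wrapper that lazily builds the cache on first use and stores it for later calls. The `CachedGrad` name and `value_and_grad!` helper are hypothetical, not existing Mooncake or Fluxperimental API; the Mooncake calls follow the usage shown above.

```julia
import Mooncake

# Hypothetical wrapper: build Mooncake's gradient cache on the first call,
# then re-use it for every subsequent call with identically-typed arguments.
mutable struct CachedGrad{F}
    f::F
    cache::Any  # lazily-initialised; `Any` keeps the sketch simple
end
CachedGrad(f) = CachedGrad(f, nothing)

function value_and_grad!(cg::CachedGrad, args...)
    if cg.cache === nothing
        # Expensive: constructs the rule and work buffers once.
        cg.cache = Mooncake.prepare_gradient_cache(cg.f, args...)
    end
    # Cheap: re-uses the cached rule and buffers.
    return Mooncake.value_and_gradient!!(cg.cache, cg.f, args...)
end

# In a training loop, one might then write:
# cg = CachedGrad(myloss)
# for (x, y) in batches
#     loss, grads = value_and_grad!(cg, model, x, y)
# end
```

---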
One thing to note here is that, besides allocating the gradient, Fluxperimental is "standardising" it to a form Flux (really Optimisers.jl) understands -- a nested set of structs with the same field names as the model. These are all NamedTuples & Tuples here (like Zygote & unlike Enzyme). What Mooncake produces by default is more elaborate, with nested `Tangent`s:

```julia
julia> opt = Flux.setup(Adam(), model);

julia> loss, grads = Flux.withgradient(myloss, model, X, Y);  # Zygote

julia> Flux.update!(opt, model, grads[1]);  # this should work... check that at least it runs!

julia> loss, grads = Flux.withgradient(myloss, Moonduo(model), Moonduo(X), Moonduo(Y));

julia> Flux.update!(opt, model, grads[1]);  # still works, "grads" has the same structure

julia> cache = Mooncake.prepare_gradient_cache(myloss, model, X, Y);

julia> Mooncake.value_and_gradient!!(cache, myloss, model, X, Y);

julia> Flux.update!(opt, model, cache.tangents[2])
ERROR: type Tangent has no field layers
```

That's also the reason DI isn't a straightforward solution to anything Flux, as its position is (I think) to return whatever objects the backend prefers.
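For illustration, the kind of "standardising" conversion being described might look roughly like the sketch below. The `flux_grad` helper is hypothetical, and it assumes Mooncake's `Tangent` stores its contents in a `fields` NamedTuple; the real internals may differ.

```julia
using Mooncake: Tangent, NoTangent

# Hypothetical helper: recursively convert Mooncake tangents into the
# nested NamedTuples/Tuples that Optimisers.jl expects (Zygote-style).
flux_grad(t::Tangent) = map(flux_grad, t.fields)           # unwrap struct tangents
flux_grad(t::Union{Tuple,NamedTuple}) = map(flux_grad, t)  # recurse into containers
flux_grad(::NoTangent) = nothing                           # inactive fields
flux_grad(x) = x                                           # arrays & numbers pass through

# With such a conversion, the failing call above might become:
# Flux.update!(opt, model, flux_grad(cache.tangents[2]))
```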
Trying today, I'm not sure I see how the preparation mechanism is meant to be used here:

```julia
julia> using DifferentiationInterface

help?> prepare_gradient
search: prepare_gradient prepare_hessian value_and_gradient value_and_gradient! prepare_jacobian

  prepare_gradient(f, backend, x, [contexts...]) -> prep

  Create a prep object that can be given to gradient and its variants.

  │ Warning
  │
  │  If the function changes in any way, the result of preparation will be invalidated,
  │  and you will need to run it again.

julia> prep = prepare_gradient(myloss, backend, model, X, Y);  # maybe X, Y aren't context?
ERROR: MethodError: no method matching prepare_gradient(::typeof(myloss), ::AutoMooncake{Nothing}, ::Chain{Tuple{…}}, ::Matrix{Float32}, ::Matrix{Float32})

julia> prep = prepare_gradient(m -> myloss(m, X, Y), backend, model);  # is this a bug or user error?
ERROR: MethodError: no method matching copy(::Tangent{@NamedTuple{layers::Tuple{Tangent{@NamedTuple{…}}, Tangent{@NamedTuple{…}}}}})
Stacktrace:
  [1] value_and_pullback(::var"#13#14", ::DifferentiationInterfaceMooncakeExt.MooncakeOneArgPullbackPrep{…}, ::AutoMooncake{…}, ::Chain{…}, ::Tuple{…})
    @ DifferentiationInterfaceMooncakeExt ~/.julia/packages/DifferentiationInterface/mXEZA/ext/DifferentiationInterfaceMooncakeExt/onearg.jl:33
  [2] prepare_pullback(::Function, ::AutoMooncake{Nothing}, ::Chain{Tuple{Dense{…}, Dense{…}}}, ::Tuple{Bool})
    @ DifferentiationInterfaceMooncakeExt ~/.julia/packages/DifferentiationInterface/mXEZA/ext/DifferentiationInterfaceMooncakeExt/onearg.jl:16
  [3] prepare_gradient(::var"#13#14", ::AutoMooncake{Nothing}, ::Chain{Tuple{Dense{…}, Dense{…}}})
    @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/mXEZA/src/first_order/gradient.jl:70

julia> prep = prepare_gradient(mxy -> myloss(mxy...), backend, (model, X, Y))
ERROR: MethodError: no method matching copy(::Tuple{Tangent{@NamedTuple{layers::Tuple{Tangent{…}, Tangent{…}}}}, Matrix{Float32}, Matrix{Float32}})
```

---
Indeed, as of right now DI makes no effort to standardize outputs, because different backends have different conventions for nested structs, which fields are active, etc. I think it may be more trouble than it's worth, because every downstream package will have different standards as well. Happy to discuss further though.

The first error you see is because there can only be one active argument, as explained on this page. So the correct syntax would be:

```julia
prep = prepare_gradient(myloss, backend, model, Constant(X), Constant(Y));
```

However, you'll then encounter the second error, which is exactly #467. We're brainstorming a solution as we speak :)
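Assuming the `Constant`-context syntax above, the full prepared-gradient workflow with DI would then look something like this sketch (with `backend = AutoMooncake(; config=nothing)` from ADTypes; exact keyword defaults may vary):

```julia
using DifferentiationInterface
import Mooncake  # loads the DI/Mooncake extension

backend = AutoMooncake(; config=nothing)

# Only `model` is active; X and Y are marked as non-differentiated context.
prep = prepare_gradient(myloss, backend, model, Constant(X), Constant(Y))

# `prep` can then be re-used across calls with identically-typed arguments:
val, grad = value_and_gradient(myloss, prep, backend, model, Constant(X), Constant(Y))
```

---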
Fix incoming in JuliaDiff/DifferentiationInterface.jl#723

---
(This is a follow-up to a slack thread, cc @willtebbutt & @gdalle.)

I have been comparing the performance of different autodiff backends for training Flux models. In most cases I get one order of magnitude worse performance from Mooncake than from Enzyme, and even from Zygote. This may be due to the Fluxperimental Moonduo implementation (https://github.com/FluxML/Fluxperimental.jl/blob/master/ext/FluxMooncakeExt.jl) rather than something in Mooncake itself? Here is an MWE if useful.

A simple case of two standard feed-forward layers. Benchmarks below with Flux v0.15.2, Zygote v0.6.75 (constrained from updating further by Fluxperimental), Enzyme v0.13.30, and Mooncake v0.4.83, on Julia 1.10.5.
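A representative setup consistent with the types in the stacktraces above (a `Chain` of two `Dense` layers, `Float32` matrices) might look like this sketch; the sizes and loss definition are guesses, not the original MWE:

```julia
using Flux, BenchmarkTools

# Hypothetical reconstruction of the kind of MWE described: two standard
# feed-forward layers, Float32 inputs/targets, and an MSE loss.
model = Chain(Dense(32 => 64, relu), Dense(64 => 16))
X = randn(Float32, 32, 100)   # 100 samples of 32 features
Y = randn(Float32, 16, 100)   # matching targets
myloss(m, x, y) = Flux.mse(m(x), y)

@btime Flux.withgradient($myloss, $model, $X, $Y);  # Zygote baseline
```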