Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missing imports in FluxMPIExt #2589

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Alexander-Barth
Copy link

@Alexander-Barth Alexander-Barth commented Feb 13, 2025

With ROCM-aware MPI, the initialization,

import AMDGPU
using Flux
import MPI
AMDGPU.allowscalar(false)
DistributedUtils.initialize(MPIBackend)

fails with (output from 2 MPI tasks):

ERROR: LoadError: UndefVarError: `AMDGPU` not defined in `FluxMPIExt`
Stacktrace:
ERROR: LoadError: UndefVarError: `AMDGPU` not defined in `FluxMPIExt`
Stacktrace:
 [1] __initialize(::Type{MPIBackend}; cuda_devices::Nothing, amdgpu_devices::Nothing, force_cuda::Bool, caller::String, force_amdgpu::Bool)
 [1] __initialize(::Type{MPIBackend}; cuda_devices::Nothing, amdgpu_devices::Nothing, force_cuda::Bool, caller::String, force_amdgpu::Bool)
   @ FluxMPIExt ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:27
   @ FluxMPIExt ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:27
 [2] __initialize
 [2] __initialize
   @ ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:9 [inlined]
 [3] #initialize#1
   @ ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:9 [inlined]
   @ ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:47 [inlined]
 [3] #initialize#1
   @ ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:47 [inlined]
 [4] initialize(backend::Type{MPIBackend})
 [4] initialize(backend::Type{MPIBackend})
   @ Flux.DistributedUtils ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:45
 [5] top-level scope
   @ Flux.DistributedUtils ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:45

I had also the error UndefVarError: cpu_device not defined in FluxMPIExt.
These two issues are addressed in this PR.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable

@mcabbott
Copy link
Member

Thanks. Can your example be included as a test somewhere?

Other test failures seem to be Reactant.jl, cc @wsmoses about what changed.

@wsmoses
Copy link
Contributor

wsmoses commented Feb 13, 2025

the reactant that CI found is rather old, what if you update it to latest? the error thrown shouldn't ever happen on the latest so I'm a bit surprised

@Alexander-Barth
Copy link
Author

Alexander-Barth commented Feb 14, 2025

If I just enable FLUX_TEST_DISTRIBUTED_MPI:

ENV["FLUX_TEST_AMDGPU"] = "false"
ENV["FLUX_TEST_CUDA"] = "false"
ENV["FLUX_TEST_METAL"] = "false"
ENV["FLUX_TEST_CPU"] = "false"
ENV["FLUX_TEST_DISTRIBUTED_MPI"] = "true"
ENV["FLUX_TEST_DISTRIBUTED_NCCL"] = "false"
ENV["FLUX_TEST_ENZYME"] = "false"
ENV["FLUX_TEST_REACTANT"] = "false"

It seems that there are no MPI test currently:

Test Summary: | Total   Time
Flux.jl       |     0  25.3s
  Distributed |     0   0.9s
     Testing Flux tests passed 

as there are no files ending with _distributedtest.jl:

isdistributedtest(f) = endswith(f, "_distributedtest.jl")

Let me know if I miss something. I will try to add a test.

@Alexander-Barth
Copy link
Author

Alexander-Barth commented Feb 14, 2025

I just added a test which initializes the backend and does a simple reduction.
It is tested with (pure) MPI and NCCL:

julia> ENV["FLUX_TEST_DISTRIBUTED_NCCL"] = "true"
julia> include("/home/abarth/.julia/dev/Flux/test/ext_distributed/runtests.jl");
[ Info: Running Distributed Tests with 4 processes
[ Info: Running /home/abarth/.julia/dev/Flux/test/ext_distributed/reduce_distributedtest.jl with mpi backend
[ Info: Running /home/abarth/.julia/dev/Flux/test/ext_distributed/reduce_distributedtest.jl with nccl backend
Test Summary: | Pass  Total   Time
Distributed   |    2      2  13.1s

I verified that the MPI test indeed fails without the PR.

Is there anything else needed before the PR can be merged? Should I do an entry in NEWS.md?

@Alexander-Barth Alexander-Barth changed the title Fix FluxMPIExt for AMDGPU Fix FluxMPIExt Feb 14, 2025
@Alexander-Barth Alexander-Barth changed the title Fix FluxMPIExt Fix missing import in FluxMPIExt Feb 14, 2025
@Alexander-Barth Alexander-Barth changed the title Fix missing import in FluxMPIExt Fix missing imports in FluxMPIExt Feb 14, 2025
@mcabbott
Copy link
Member

Thanks for adding tests.

Have never touched MPI stuff. I think @askorupka wrote most of this, and may have opinions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants