PyTorch and MPI-enabled AMReX don't get along in load_state_dict #322

Open
RTSandberg opened this issue May 21, 2024 · 2 comments
Labels
bug: affects latest release (Bug also exists in latest release version)
bug (Something isn't working)
component: MPI (Domain decomposition and communication)
component: third party (Changes in ImpactX that reflect a change in a third-party library)

Comments

@RTSandberg
Member

RTSandberg commented May 21, 2024

On my local machine, PyTorch's internal multithreading does not get along with MPI-enabled AMReX. Unless I call torch.set_num_threads with a value of 1 or 2, the attached script hangs when the neural network tries to set its initial parameters.
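As a minimal sketch of the workaround described above, assuming the call meant is `torch.set_num_threads`: the helper name `load_with_thread_cap` is hypothetical and not taken from the attached script. It caps PyTorch's intra-op thread count right before loading the parameters.

```python
def load_with_thread_cap(model, state_dict, num_threads=1):
    """Cap PyTorch's intra-op thread count, then load the parameters.

    Hypothetical helper for illustration; `model` is any torch.nn.Module.
    """
    import torch  # imported here so the cap is set right before loading

    # Workaround from the report: 1 or 2 threads avoided the hang
    torch.set_num_threads(num_threads)
    return model.load_state_dict(state_dict)
```

Whether a cap of 1 or 2 threads is enough on other machines is not known from this report.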

The script downloads neural network parameters from a Zenodo archive and then loads them. The call to load_state_dict is the specific point of failure.

pytorch_amrex_hang_reproducer_v2.py.txt

@ax3l added the labels component: third party, component: MPI, bug, and bug: affects latest release on May 22, 2024
@ax3l
Member

ax3l commented May 22, 2024

Thank you, @RTSandberg!

For reproducibility, could you please add the OS you used and the versions of Python, pyAMReX, PyTorch, and mpi4py, as well as the MPI flavor and version?
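One way to gather the requested details is sketched below. The distribution names in `PACKAGES` are assumptions (in particular `amrex` for pyAMReX) and may need adjusting to how the packages were installed. The function name `collect_versions` is hypothetical.

```python
import platform
from importlib import metadata

# Assumed distribution names; adjust to match the local installation
PACKAGES = ("torch", "mpi4py", "amrex")


def collect_versions(packages=PACKAGES):
    """Return OS, Python, package, and MPI library versions as a dict."""
    info = {"os": platform.platform(), "python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    try:
        # Importing mpi4py.MPI initializes MPI by default
        from mpi4py import MPI

        info["mpi"] = MPI.Get_library_version().strip()
    except ImportError:
        info["mpi"] = "unknown (mpi4py not importable)"
    return info


for key, value in collect_versions().items():
    print(f"{key}: {value}")
```

The MPI library string usually names the flavor (for example Open MPI or MPICH) together with its version.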

@ax3l changed the title from "PyTorch and mpi-enabled AMReX don't get along" to "PyTorch and MPI-enabled AMReX don't get along" on May 22, 2024
@ax3l changed the title from "PyTorch and MPI-enabled AMReX don't get along" to "PyTorch and MPI-enabled AMReX don't get along in load_state_dict" on May 22, 2024
@ax3l
Member

ax3l commented May 22, 2024

If we can reduce this problem to a pure mpi4py + PyTorch issue, then we could also report this upstream in PyTorch: https://github.com/pytorch/pytorch/issues
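A possible starting point for such a reduction is sketched below. It is not the attached reproducer: the small model is a hypothetical stand-in, the function name `run_reproducer` is made up for illustration, and there is no guarantee that this version triggers the hang.

```python
def run_reproducer(num_threads=None):
    """Initialize MPI via mpi4py, then call load_state_dict on a small model."""
    from mpi4py import MPI  # importing mpi4py.MPI initializes MPI by default
    import torch
    from torch import nn

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if num_threads is not None:
        # Optional thread cap, to compare against the uncapped behavior
        torch.set_num_threads(num_threads)

    # Hypothetical stand-in for the network used in the attached script
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    state = {name: t.clone() for name, t in model.state_dict().items()}

    print(f"rank {rank}: torch threads = {torch.get_num_threads()}", flush=True)
    result = model.load_state_dict(state)  # the call reported to hang
    print(f"rank {rank}: load_state_dict returned", flush=True)
    return result
```

Saved as a script that ends with a call to `run_reproducer()`, this could be launched with something like `mpiexec -n 2 python repro.py` (the file name is hypothetical). If only the pyAMReX version hangs, that would point away from a pure mpi4py + PyTorch issue.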
