PyTorch and MPI-enabled AMReX don't get along in `load_state_dict` #322

RTSandberg · 2024-05-21T19:58:46Z

On my local machine, PyTorch's internal multithreading does not get along with MPI-enabled AMReX. Unless I call `torch.set_num_threads(1)` or `torch.set_num_threads(2)`, the attached script hangs when the neural network tries to set its initial parameters.

The script downloads neural network parameters from a Zenodo archive and then loads them. The `load_state_dict` call is the specific point of failure.
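
For reference, the workaround described above amounts to capping PyTorch's intra-op thread pool before the parameters are loaded. A minimal sketch follows; the helper name `limit_torch_threads` is hypothetical and not taken from the attached script, while `torch.set_num_threads` and `torch.get_num_threads` are the actual PyTorch calls.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported where PyTorch is absent
    torch = None


def limit_torch_threads(num_threads=1):
    """Cap PyTorch's intra-op thread pool and return the resulting count.

    Returns None if PyTorch is not installed.
    """
    if torch is None:
        return None
    torch.set_num_threads(num_threads)
    return torch.get_num_threads()


# Call this before constructing the model and calling load_state_dict.
print("PyTorch intra-op threads:", limit_torch_threads(1))
```

Values of 1 and 2 are the ones reported to avoid the hang on the reporter's machine; other values are untested here.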

pytorch_amrex_hang_reproducer_v2.py.txt

ax3l · 2024-05-22T19:03:48Z

Thank you, @RTSandberg !

For reproducibility, can you please add the OS you used, versions of Python, pyAMReX, PyTorch, MPI flavor and version, and mpi4py version?
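
One possible way to gather the requested details is sketched below. It assumes the packages are installed under the distribution names `pyAMReX`, `torch`, and `mpi4py`, which may differ for conda or source builds; the function names `collect_versions` and `mpi_library_version` are hypothetical.

```python
import platform
from importlib import metadata

# Assumption: these are the installed distribution names; adjust as needed.
PACKAGES = ("pyAMReX", "torch", "mpi4py")


def collect_versions(packages=PACKAGES):
    """Return OS, Python, and package versions as a dict of strings."""
    info = {"OS": platform.platform(), "Python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


def mpi_library_version():
    """Return the MPI flavor and version string reported via mpi4py."""
    try:
        from mpi4py import MPI
    except ImportError:
        return "mpi4py not importable"
    return MPI.Get_library_version().strip()


for key, value in collect_versions().items():
    print(f"{key}: {value}")
print("MPI:", mpi_library_version())
```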

ax3l · 2024-05-22T19:09:02Z

If we can reduce this problem to a pure mpi4py + PyTorch issue, then we could also report this upstream in PyTorch: https://github.com/pytorch/pytorch/issues
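
A starting point for such a reduced test might look like the sketch below. It is not the attached reproducer: the tiny `torch.nn.Linear` model, the in-memory buffer, and the function name `try_load_state_dict` are all assumptions for illustration, and it is not confirmed that this version triggers the hang.

```python
import io

try:
    from mpi4py import MPI  # importing mpi4py.MPI initializes MPI
    import torch
except ImportError:
    MPI = torch = None


def try_load_state_dict():
    """Build a small model and reload its own parameters.

    Returns True if load_state_dict completes, or None if mpi4py or
    PyTorch is not installed.
    """
    if torch is None:
        return None
    model = torch.nn.Linear(8, 8)  # arbitrary size, chosen for illustration
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    buffer.seek(0)
    state = torch.load(buffer)
    model.load_state_dict(state)  # the call that hangs in the original report
    return True


print("load_state_dict completed:", try_load_state_dict())
```

It could be launched with, for example, `mpiexec -n 2 python reduced_reproducer.py` (the file name is hypothetical). A larger model may be needed before PyTorch's multithreaded code path is exercised.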

RTSandberg mentioned this issue May 21, 2024

set num threads to avoid hanging ECP-WarpX/impactx#619

Merged

ax3l added the labels `component: third party`, `component: MPI`, `bug`, and `bug: affects latest release` May 22, 2024

ax3l changed the title ~~PyTorch and mpi-enabled AMReX don't get along~~ PyTorch and MPI-enabled AMReX don't get along May 22, 2024

ax3l changed the title ~~PyTorch and MPI-enabled AMReX don't get along~~ PyTorch and MPI-enabled AMReX don't get along in load_state_dict May 22, 2024
