Device to Device transfers don't work with OpenMPI + LinkX provider on AMD GPUs #10712
Comments
This is probably related to open-mpi/ompi#11076
@hppritcha Ah, thanks! do you think it's enough to apply open-mpi/ompi#12290 ? I can check with
@hppritcha Yes, that was it 👍 It seems the PR was merged into
maybe add a patch to the Open mpi spackage for PR 12290?
sorry i'm going back and forth between spack world and this one and got things mixed up.
Should I report this in the OpenMPI issues? I guess it belongs there and not really here..
yes please open an issue and I'll assign it to the author of 12290.
I have reported this to ompi issues: open-mpi/ompi#13048 This issue can be closed.
I do not know if this is a bug, if I'm doing something wrong, or if this is simply not (yet?) expected to work. I understand that the LinkX provider is experimental - just trying to make it work and would appreciate any insights.
Describe the bug
OpenMPI with shm+cxi:lnx fails to perform Device - Device transfers on the LUMI system (AMD GPUs) with the OSU benchmark. Host - Host transfers work as expected for both intra- and inter-node transfers.
To Reproduce
I have compiled libfabric 2.0.0 on LUMI, with all the new https://github.com/HewlettPackard/shs-* repositories for the Cassini headers, the cxi driver, and libcxi. I have then compiled OpenMPI against that libfabric and I'm trying to use the LinkX provider. Things work for Host - Host transfers (OSU), both for intra-node communication:
and inter-node communication:
but the Device - Device transfers fail for both intra- and inter-node communication.
Looking at the OpenMPI code, a call to fi_mr_regattr fails with the above message (probably FI_EKEYREJECTED? The cryptic 4294967030 is in fact -266 when read as a signed 32-bit integer).
Environment:
This is on the Cray LUMI system, but the cxi driver is taken from the https://github.com/HewlettPackard/shs-* repos.
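For anyone decoding the return value of the failing fi_mr_regattr call: the large number in the log looks like a negative error code printed as an unsigned 32-bit integer. The sketch below assumes 32-bit two's complement, and the helper name as_signed32 is made up for illustration. Under that assumption 4294967030 comes out as -266. To my understanding libfabric's own error codes start at 256, so the name is best confirmed against fi_errno.h or fi_strerror() rather than the system errno table.

```python
def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer.

    Hypothetical helper for illustration; assumes two's complement.
    """
    value &= 0xFFFFFFFF           # keep only the low 32 bits
    if value >= 1 << 31:          # sign bit set: the value is negative
        return value - (1 << 32)
    return value


raw = 4294967030                  # number printed in the error message
code = as_signed32(raw)
print(code)                       # signed return code
print(-code)                      # positive error number to look up
```

The positive number from the last line can then be passed to fi_strerror() in a small C program, or looked up in fi_errno.h, to get the error name.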