Set CHPL_RT_MASTERIP in hopes that it will help with our recent gasnet timeouts #26714

Merged

@bradcray bradcray commented Feb 14, 2025

Recently, GASNet test configurations have been suffering from a timeout error across multiple tests that manifests as:

*** GASNET WARNING: AMUDP_SPMDStartup_AMUDP_NDEBUG returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
  from function AMUDP_SPMDStartup
  at /ptmp/jenkins/chapel-ci/workspace/chapcs-correctness-test-gasnet-everything/third-party/gasnet/gasnet-src/other/amudp/amudp_spmd.cpp:1026
  reason: worker failed DNSLookup on master host name

Michael hypothesized that setting CHPL_RT_MASTERIP may help with this failure mode, and I agree it's worth a try, so this adds it to common-gasnet.bash (where we also set up local launching and the like) to see what happens.

This could help with resolving https://github.com/Cray/chapel-private/issues/7062, but I'm not confident.
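A minimal sketch of the kind of setting described above, for illustration only. The loopback value `127.0.0.1` is an assumption (the description does not state the value used), chosen here on the premise that these configurations launch all locales on the local node. This is not the actual contents of common-gasnet.bash.

```shell
# Hypothetical sketch, not the real common-gasnet.bash.
# Assumption: every locale runs on the local node, so the loopback address
# is a usable master IP and workers need not resolve the master host name
# through DNS.
: "${CHPL_RT_MASTERIP:=127.0.0.1}"   # keep any value the caller already set
export CHPL_RT_MASTERIP

echo "CHPL_RT_MASTERIP=${CHPL_RT_MASTERIP}"
```

Using the `:=` default keeps the script from overriding a value that a particular test configuration has already exported.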

…t timeouts

Recently, GASNet test configurations have been suffering from a timeout error:

```
*** GASNET WARNING: AMUDP_SPMDStartup_AMUDP_NDEBUG returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
  from function AMUDP_SPMDStartup
  at /ptmp/jenkins/chapel-ci/workspace/chapcs-correctness-test-gasnet-everything/third-party/gasnet/gasnet-src/other/amudp/amudp_spmd.cpp:1026
  reason: worker failed DNSLookup on master host name
```

Michael hypothesizes that setting CHPL_RT_MASTERIP may help with this
failure mode, and I think it's worth a try, so this adds it to
common-oversubscribed.bash to see what happens.

---
Signed-off-by: Brad Chamberlain <[email protected]>
@bradcray

@mppf / @riftEmber / @e-kayrakli / @tzinsky: Can one or more of you give this a look and see what you think? I'm not confident it will help, but if this file is used whenever we oversubscribe GASNet, it shouldn't hurt.

Michael points out that "oversubscribed" here isn't particularly
GASNet-specific: it generally applies to running multiple things on a
node simultaneously (potentially multiple single-locale Chapel
programs, potentially programs using other CHPL_COMM layers). He also
notes that common-gasnet.bash contains code that's more specific to
local, oversubscribed GASNet launching. That's a great point, so this
moves the setting there instead to be more precise.

---
Signed-off-by: Brad Chamberlain <[email protected]>
Review thread on util/cron/common-gasnet.bash (outdated, resolved)
@bradcray bradcray merged commit 0f2f4f0 into chapel-lang:main Feb 14, 2025
9 checks passed
@bradcray bradcray deleted the set-master-ip-for-oversubscribed-testing branch February 14, 2025 23:02
2 participants