mpich test failures on s390x #35
Comments
I have an idea of the problem. If MPICH fails and Open-MPI succeeds, then I suspect the MPICH datatypes code is broken. Can you set the MPICH build to also use
With
There's a small variation in the PMPI function triggering the error. test_mpi_dim references PMPI_Accumulate:
while the other 3 (apart from test_mpi_indexed_gets) reference PMPI_Win_unlock, e.g.
(likewise test_mpi_indexed_puts_gets and test_mpi_subarray_accs)
Actually, I need to report it might not be so straightforward. When I manually rebuild the original configuration on an s390x porterbox, without adding ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED, I get the same result: the five test_mpi_* tests fail for mpich and the other tests pass. Between the original build test errors and today's tests, our mpich was upgraded from 4.0 to 4.0.1, which may explain why the other tests now pass. Without the extra flags, the test_mpi_indexed_accs error is triggered from PMPI_Accumulate, as before, not from PMPI_Win_unlock.
Can you try again with
Hmm, with those settings (without ARMCI_STRIDED_METHOD=IOV) I'm back to 15 failures:
with a touch more error output, just adding a short description of the test
If I activate ARMCI_STRIDED_METHOD=IOV alongside ARMCI_IOV_METHOD=CONSRV, ARMCI_IOV_CHECKS=1, ARMCI_SHR_BUF_METHOD=COPY, ARMCI_RMA_NOCHECK=0, and ARMCI_NO_FLUSH_LOCAL=1, then I'm back to the 5 failures.
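The combination of settings described in this comment can be reproduced from a small driver script. This is only a minimal sketch and not part of armci-mpi: the variable names and values are the ones quoted above, while the launcher command (`mpiexec -n 2`) and the test binary path are assumptions that need adjusting to the local build tree.

```python
import os
import subprocess

# ARMCI-MPI runtime settings quoted in this thread (the "5 failures" combination).
FIVE_FAILURE_SETTINGS = {
    "ARMCI_STRIDED_METHOD": "IOV",
    "ARMCI_IOV_METHOD": "CONSRV",
    "ARMCI_IOV_CHECKS": "1",
    "ARMCI_SHR_BUF_METHOD": "COPY",
    "ARMCI_RMA_NOCHECK": "0",
    "ARMCI_NO_FLUSH_LOCAL": "1",
}


def build_env(settings, base=None):
    """Return a copy of the base environment with the ARMCI_* settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(settings)
    return env


def run_test(test_binary, settings, launcher=("mpiexec", "-n", "2")):
    """Run one test under the given settings and return its exit code.

    The launcher and process count are assumptions, not taken from the thread.
    """
    cmd = list(launcher) + [test_binary]
    return subprocess.run(cmd, env=build_env(settings)).returncode


if __name__ == "__main__":
    # Hypothetical path; point this at a test built in your own tree.
    print(run_test("./tests/mpi/test_mpi_indexed_gets", FIVE_FAILURE_SETTINGS))
```

Keeping each combination in its own dictionary makes it easy to rerun the same test with one setting removed (for example without ARMCI_STRIDED_METHOD) and compare the failure counts.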
A build of armci-mpi with mpich 4.0 fails tests on s390x. Tests pass for Intel and ARM architectures (amd64 and arm64 and their lesser counterparts).
The build log is available at https://buildd.debian.org/status/fetch.php?pkg=armci-mpi&arch=s390x&ver=0.3.1%7Ebeta-5&stamp=1645753186&raw=0 .
Tests pass with openmpi but 16 tests fail with mpich:
Further details of the errors are listed in the build log.
There are essentially only two test errors here. Most of these failures point at the same error
e.g.
looputil.c is actually in mpich not armci-mpi, maybe this is an mpich bug?
Not sure if it's relevant to looputil.c l.813 here, but we caught a bug in incorrect assumptions about how long double alignment was implemented on s390x, exposed in mpi4py, see mpi4py/mpi4py#91
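As a quick way to see what the platform reports for long double, the size and alignment can be queried on the s390x machine. This is a hedged sketch using Python's `ctypes`, which reflects the C ABI as the interpreter's build sees it; it does not inspect MPICH itself, and the helper name `long_double_layout` is made up for illustration.

```python
import ctypes
import platform


def long_double_layout():
    """Report size and alignment of C long double as seen by ctypes on this machine."""
    return {
        "machine": platform.machine(),
        "size": ctypes.sizeof(ctypes.c_longdouble),
        "alignment": ctypes.alignment(ctypes.c_longdouble),
    }


if __name__ == "__main__":
    layout = long_double_layout()
    print(layout)
    # Code that assumes alignment == size would be wrong on a platform
    # where the two values differ.
    if layout["size"] != layout["alignment"]:
        print("note: long double alignment differs from its size on this platform")
```

If the two values differ on s390x, that would be consistent with the kind of alignment assumption described in the mpi4py report, though it would not by itself show that looputil.c is affected.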
The other error is in test_mpi_indexed_gets:
I see an error like this if there is a mismatch in libmpich.so (e.g. on amd64, running armci-mpi tests with libarmci built against mpich 4.0 but then compiling tests using libmpich1.2 from mpich 3.4.1), but that kind of mismatch shouldn't apply to the s390x build-time test failure reported here.
For reference, various tests also fail at build time for other less common architectures, evidently for different reasons. Build logs are collected at https://buildd.debian.org/status/package.php?p=armci-mpi
On mips64el, test_mpi_indexed_gets fails on mpich, all tests pass with openmpi. On mipsel tests pass with mpich but fail with openmpi.
CI runtime (installation) test logs are collected at https://ci.debian.net/packages/a/armci-mpi/ (the version building with mpich is 0.3.1~beta-5 or later), showing the same test failure on s390x.