RFC: Provide equivalence of MPICH_ASYNC_PROGRESS #13074
Comments
I like this. I cannot speak to the components you mentioned, unfortunately. Curious whether you have some performance numbers showing that enabling the progress thread actually yields a benefit. I wasn't around when this feature was taken out, but my hunch is that it never lived up to its expectations. My main concern is possible disturbance of application threads, especially the main thread. If the application has only one thread, the progress thread will compete with that thread for the single core it's bound to. If the thread calling `MPI_Init` is already bound to a single core (e.g., by OpenMP), then the progress thread is bound to that core as well. Such contention is especially bad for worksharing constructs in OpenMP (for any thread) and for task parallelism (for threads discovering tasks, likely the main thread). Just something to be aware of and that we should document.
You're right. We've used […]
The progress thread should at least be floating, i.e., not bound to any specific core but able to migrate between the cores the process has available. Then the application can enable the progress thread and run with […]
I have mixed feelings about this. Long ago I was a fan, and among others I put significant effort into making this asynchronous progress thread run efficiently across the entire OMPI software stack. But it turned out to be a nightmare to run for many (aka 99.9%) of users, because everybody expects magic out of thin air, without understanding the drawbacks, the resource conflicts, the potential reduction in available memory bandwidth, and so on. Thus, while the idea is good, I don't think we should let this run without very strict constraints. The progress thread shall not just run amok across all available cores of the process (or it is bound to collide with node-level runtimes). It should only run if specific resources are dedicated to it, or if it is forced to run along with the process (overriding any resource constraints). It should be easy to turn it on and off during execution, and it should remain quiet during periods when nothing is expected to be progressed. That being said, we are running some HPC applications with a simulacrum of this (a thread spawned in a shim PMPI layer, with very precise bindings complementary to the app, spinning on a request until the end). In some cases we see some performance gains, rather minor. But we also see performance degradation, especially during compute phases that are memory bound. Turning it off, or drastically reducing the polling frequency when there are no ongoing communications (aka no posted non-blocking communications), seems to be the right way to go.
Thanks @bosilca for sharing your experience. I can see the effort and struggles spent on this feature by reading its git history. Anyway, I'd really appreciate it if you (and other maintainers) could give us some recommendations on the scope of this RFC below:
Purely FWIW: this subject was most recently raised (for me) when the CORAL machines were delivered to LLNL and ORNL some years ago. In those cases, the OS was holding back a core on each node for progress threads, and we were required to bind all such threads to that core. So we added an envar to PMIx and PRRTE (which have progress threads) that allows you to stipulate the core(s) to use for those threads. I understand other systems are considering similar arrangements (or already have configuration options for that purpose) - it might be worth extending your envar to support such environments.
Introduction
This RFC aims to discuss re-enabling the SW-based OPAL async progress thread, in order to support the same feature as `MPICH_ASYNC_PROGRESS`. It follows up the discussion in #9759. To give more details about the context and scope of this RFC:
- The OPAL async progress thread was initiated in the past; however, it has never been officially functional and enabled. Notice that `opal_progress()` is today always called by the main thread.
- There exist some use-cases of intra-node MPI applications doing lots of non-blocking communication/computation (e.g. `MPI_Ireduce`), which can be accelerated by using spare HW threads to run `opal_progress()` in the background.

In this context, we revamp this functionality in the following patch: d52a8388. The main goal of this work is to improve intra-node MPI applications. A typical use-case may be applications using a power-of-two number of MPI processes while the underlying CPU has a non-power-of-two number of available cores, leaving some cores spare. Another possible use-case may be hyper-threading.

We evaluated the performance of this patch on a single-node CPU-only system (Grace ARM CPU) and obtained a speedup of up to 1.4x on `OSU_Ireduce`.

Note: our goal is not to compete with NIC-based offloading for inter-node communication, but to provide support for full-CPU applications.
Implementation summary
- A new configure option `--enable-progress-threads` (default: disabled) sets the `OPAL_ENABLE_PROGRESS_THREADS` macro.
- When built with `--enable-progress-threads` (and runtime-enabled via `OPAL_ASYNC_PROGRESS=1`):
  - `opal_progress()` is renamed to a static `_opal_progress()`.
  - `opal_progress_init()` will spawn a thread to loop on `_opal_progress()` and store the number of processed events in an atomic counter. The main thread will then read-and-swap this counter to zero.
  - A new `opal_progress()` wrapper is introduced (always executed by thread 0), which only reads the counter updated by the background thread. If `OPAL_ASYNC_PROGRESS=0` at runtime, the `opal_progress()` wrapper instead calls `_opal_progress()` and yields the same behavior as before.
- In `MPI_Test[any,some,all]`, the compile-time macro check `#if OPAL_ENABLE_PROGRESS_THREADS == 0` was replaced by a runtime check on a boolean, `if (!opal_async_progress_thread_spawned)`, to call `opal_progress()` (since the feature can be configure-enabled but runtime-disabled), with `opal_async_progress_thread_spawned = OPAL_ENABLE_PROGRESS_THREADS && (env(OPAL_ASYNC_PROGRESS) != 0)`.
Unknown impact
We are unsure about the impact of `OPAL_ENABLE_PROGRESS_THREADS` on other components and would like to have feedback from maintainers of these modules:

- `opal/mca/btl/smcuda/btl_smcuda_component.c` when `OPAL_ENABLE_PROGRESS_THREADS=1`: one extra thread for `mca_btl_smcuda`?
- `oshmem/runtime/oshmem_shmem_init.c` and the potential (currently disabled) `shmem_opal_thread()`?
- […] `opal_progress()`.

In case this contribution is considered valuable, I can open a PR.