RFC: Provide equivalence of MPICH_ASYNC_PROGRESS #13074

Open

hominhquan opened this issue Jan 31, 2025 · 6 comments

hominhquan commented Jan 31, 2025

Introduction

This RFC aims to discuss this work on re-enabling the software-based OPAL async progress thread, in order to support the same feature as MPICH_ASYNC_PROGRESS. It follows up on the discussion in #9759.

To give more details about the context and scope of this RFC:
- The OPAL async progress thread was initiated in the past, but it has never been officially functional and enabled. Note that opal_progress() is today always called by the main thread.
- There are use cases of intra-node MPI applications that issue a lot of non-blocking communication alongside computation (e.g. MPI_Ireduce), which can be accelerated by using spare hardware threads to run opal_progress() in the background.

In this context, we revamped this functionality in the following patch: d52a8388. The main goal of this work is to improve intra-node MPI applications. A typical use case is an application running a power-of-two number of MPI processes while the underlying CPU has a non-power-of-two number of available cores, leaving some cores spare. Another possible use case is hyper-threading.

We evaluated the performance of this patch on a single-node, CPU-only system (Grace ARM CPU) and obtained a speedup of up to 1.4x on OSU_Ireduce.

Note: our goal is not to compete with NIC-based offloading for inter-node communication, but to provide support for CPU-only applications.

Implementation summary

  • New configure option --enable-progress-threads (default=disabled) to set the OPAL_ENABLE_PROGRESS_THREADS macro
  • When configured with --enable-progress-threads:
    • The feature is still disabled at runtime by default (it is enabled at runtime by setting the environment variable OPAL_ASYNC_PROGRESS=1)
    • opal_progress() is renamed to a static _opal_progress()
    • opal_progress_init() spawns a thread that loops on _opal_progress() and accumulates the number of processed events in an atomic counter. The main thread then atomically reads this counter and swaps it back to zero.
    • A new opal_progress() wrapper is introduced (always executed by thread 0) that only reads the counter updated by the background thread (see the sketch after this list). If OPAL_ASYNC_PROGRESS=0 at runtime, the opal_progress() wrapper calls _opal_progress() directly and yields the same behavior as before.
    • In MPI_Test[any,some,all], the compile-time macro check #if OPAL_ENABLE_PROGRESS_THREADS == 0 was replaced by a runtime check on a boolean, if (!opal_async_progress_thread_spawned), before calling opal_progress() (since the feature can be enabled at configure time but disabled at runtime).
    • Value of opal_async_progress_thread_spawned = OPAL_ENABLE_PROGRESS_THREADS && (env(OPAL_ASYNC_PROGRESS) != 0)
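To make the counter/wrapper interaction concrete, here is a minimal sketch of the intended design. This is not the actual code from d52a8388; the helper _opal_progress_body(), the *_sketch() functions and the variable names are illustrative stand-ins for the real OPAL symbols:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

static atomic_int  progress_events        = 0;     /* events completed by the background thread */
static atomic_bool progress_running       = false; /* cleared at finalize to stop the loop */
static bool        async_progress_spawned = false;
static pthread_t   progress_tid;

/* Stand-in for the renamed static _opal_progress(): in OPAL this would poll
 * the registered progress callbacks and return the number of completed
 * events; stubbed here so the sketch compiles. */
static int _opal_progress_body(void) { return 0; }

/* Background loop: progress as usual, but publish the event count instead of
 * returning it to a caller. */
static void *progress_thread_main(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&progress_running, memory_order_acquire)) {
        int n = _opal_progress_body();
        if (n > 0) {
            atomic_fetch_add_explicit(&progress_events, n, memory_order_relaxed);
        } else {
            sched_yield();   /* nothing progressed: give the core back */
        }
    }
    return NULL;
}

/* Init path: spawn the thread only if both the compile-time option and the
 * OPAL_ASYNC_PROGRESS environment variable ask for it. */
static void progress_init_sketch(void)
{
#if OPAL_ENABLE_PROGRESS_THREADS
    const char *env = getenv("OPAL_ASYNC_PROGRESS");
    if (NULL != env && 0 != atoi(env)) {
        atomic_store(&progress_running, true);
        if (0 == pthread_create(&progress_tid, NULL, progress_thread_main, NULL)) {
            async_progress_spawned = true;
        }
    }
#endif
}

/* Wrapper keeping the historical opal_progress() role: with the background
 * thread active, the caller only drains the counter; otherwise it progresses
 * inline exactly as before. */
static int opal_progress_sketch(void)
{
    if (async_progress_spawned) {
        return atomic_exchange_explicit(&progress_events, 0, memory_order_relaxed);
    }
    return _opal_progress_body();
}
```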

Unknown impact

  • We are not clear on the possible impact of enabling OPAL_ENABLE_PROGRESS_THREADS on other components, and would like feedback from the maintainers of these modules:
    • opal/mca/btl/smcuda/btl_smcuda_component.c when OPAL_ENABLE_PROGRESS_THREADS=1: one extra thread for mca_btl_smcuda?
    • oshmem/runtime/oshmem_shmem_init.c and the potential (currently disabled) shmem_opal_thread()?
    • Is it relevant/feasible to merge these per-component progress threads into a single one, given that a callback mechanism is already in place in opal_progress()?

If this contribution is considered valuable, I can open a PR.

@devreal
Contributor

devreal commented Jan 31, 2025

I like this. I cannot speak about the components you mentioned, unfortunately. I'm curious whether you have some performance numbers to show that enabling the progress thread actually yields a benefit. I wasn't around when this feature was taken out, but my hunch is that it never lived up to its expectations.

My main concern is possible disturbance of application threads, especially the main thread. If the application has only one thread, the progress thread will compete with that thread for the single core it's bound to. If the thread calling MPI_Init is already bound to a single core (e.g., by OpenMP), then the progress thread is bound to that core as well.

Such contention is especially bad for worksharing constructs in OpenMP (for any thread) and task parallelism (for threads discovering tasks, likely the main thread). Just something to be aware of and that we should document.

@hominhquan
Author

> My main concern is possible disturbance of application threads, especially the main thread. If the application has only one thread, the progress thread will compete with that thread for the single core it's bound to. If the thread calling MPI_Init is already bound to a single core (e.g., by OpenMP), then the progress thread is bound to that core as well.

You're right. We've used mpirun ... --map-by slot:PE=$((${OMP_NUM_THREADS}+1)) ... in our runs to add one more PE to each rank. This additional PE is intended to host the new progress thread.

@devreal
Contributor

devreal commented Jan 31, 2025

The progress thread should at least be floating, i.e., not bound to any specific core but able to migrate between the cores the process has available. Then the application can enable the progress thread and run with OMP_NUM_THREADS=$((NCORES-1)). hwloc can be used to manage the binding of the progress thread.
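For reference, a minimal sketch of what such a floating binding could look like with hwloc, assuming the process-level cpuset has already been set by mpirun; the helper name below is hypothetical:

```c
#include <hwloc.h>
#include <pthread.h>

/* Hypothetical helper: re-bind an already-created progress thread to the
 * whole cpuset owned by the process, so it can migrate across those cores
 * instead of inheriting the tight binding of the thread that spawned it. */
static int let_progress_thread_float(pthread_t progress_thread)
{
    hwloc_topology_t topo;
    hwloc_cpuset_t   set;
    int rc = -1;

    if (0 != hwloc_topology_init(&topo)) {
        return -1;
    }
    if (0 != hwloc_topology_load(topo)) {
        hwloc_topology_destroy(topo);
        return -1;
    }

    set = hwloc_bitmap_alloc();
    if (NULL != set) {
        /* cpuset the whole process may run on (e.g. as set by mpirun) */
        if (0 == hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) &&
            /* let the progress thread float anywhere inside that cpuset */
            0 == hwloc_set_thread_cpubind(topo, progress_thread, set,
                                          HWLOC_CPUBIND_THREAD)) {
            rc = 0;
        }
        hwloc_bitmap_free(set);
    }
    hwloc_topology_destroy(topo);
    return rc;
}
```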

@bosilca
Member

bosilca commented Jan 31, 2025

I have mixed feelings about this. Long ago I was a fan and, among others, put significant effort into making this asynchronous progress thread run efficiently across the entire OMPI software stack. But it turned out to be a nightmare to run for many (aka 99.9% of) users, because everybody expects magic out of thin air without understanding the drawbacks: the resource conflicts, the potential impact on memory traffic, and so on.

Thus, while the idea is good, I don't think we should let this run without very strict constraints. The progress thread shall not just run amok across all available cores of the process (or it is bound to collide with node-level runtimes). It should only run if specific resources are dedicated to it, or if it is forced to run along with the process (overriding any resource constraints). It should be easy to turn it on and off during execution, and it should remain quiet during periods of time when nothing is expected to be progressed.

That being said, we are running some HPC applications with a simulacrum of this (a thread spawned in a shim PMPI layer, with very precise bindings complementary to the app, spinning on a request until the end). In some cases we see some performance gains, though rather minor. But we also see performance degradation, especially during compute phases that are memory-bound. Turning it off, or drastically reducing the frequency of polling when there are no ongoing communications (i.e., no posted non-blocking communications), seems to be the right way to go.
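As an illustration of that last point, one possible shape for such throttling is sketched below, assuming a hypothetical counter of posted non-blocking requests maintained by the communication path (not an existing OPAL variable):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

static atomic_int  outstanding_requests = 0;    /* hypothetical: posted, not yet completed */
static atomic_bool keep_running         = true;

/* Stand-in for one pass over the progress callbacks; stubbed so the sketch compiles. */
static int poll_progress_once(void) { return 0; }

/* Busy-poll while non-blocking requests are outstanding; back off exponentially
 * (up to 1 ms of sleep) while the application is in a pure compute phase. */
static void progress_loop_with_backoff(void)
{
    long idle_ns = 1000;                               /* start at 1 us */

    while (atomic_load(&keep_running)) {
        if (atomic_load(&outstanding_requests) > 0) {
            poll_progress_once();
            idle_ns = 1000;                            /* reset the backoff */
        } else {
            struct timespec ts = { 0, idle_ns };
            nanosleep(&ts, NULL);                      /* stay quiet while idle */
            if (idle_ns < 1000000) {
                idle_ns *= 2;
            }
        }
    }
}
```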

@hominhquan
Author

Thanks @bosilca for sharing your experience. I can see the effort and struggle that went into this feature by reading its git history.

Anyway, I'd really appreciate it if you (and other maintainers) could give us some recommendations on the scope of this RFC, summarized below:

  • The patch in this RFC does not change the default behavior of the progress mechanism today (no extra thread; it is still always the main thread that calls opal_progress()).
  • The scope of this RFC is single-node MPI applications where spare CPU resources may be available (e.g. 64 MPI ranks on a 72-core CPU).
  • Users who run their application on a single node with spare CPU resources can enable the async progress thread with an environment variable, and only if it was enabled at configure time. When doing so, they should take additional care with the allocation and binding of local MPI processes.
  • We can add more documentation (and perhaps a warning message in mpirun).

@rhc54
Contributor

rhc54 commented Feb 3, 2025

Purely FWIW: this subject was most recently raised (for me) when the CORAL machines were delivered to LLNL and ORNL some years ago. In those cases, the OS was holding back a core on each node for progress threads, and we were required to bind all such threads to that core. So we added an envar to PMIx and PRRTE (which have progress threads) that allows you to stipulate the core(s) to use for those threads.

I understand other systems are considering similar arrangements (or already have configuration options for that purpose); it might be worth extending your envar to support such environments.
