
Does Open MPI support asynchronous message progression? #9759

Closed
Yiltan opened this issue Dec 8, 2021 · 23 comments

@Yiltan

Yiltan commented Dec 8, 2021

Is there a way to configure OMPI to launch a helper process to aid with message progression?

I see this old comment here:

dnl We no longer support the old OPAL_ENABLE_PROGRESS_THREADS. At

But I can't seem to work out what OMPI's new way of doing this is. I'm specifically looking to use UCX, if possible.

@MichaelLaufer

MichaelLaufer commented Dec 9, 2021

I was also researching this for a few hours earlier today and could not find an answer, and I would also be very interested in the current status of this. We have just written a new code designed for compute/communication overlap using MPI_Isend/MPI_Irecv + MPI_Waitall, only to find out that no background progress is occurring (OMPI + UCX w/ Mellanox NICs).

From what I found, MVAPICH, MPICH, and Intel MPI all seem to feature this kind of async progress thread, and I found it odd that scarcely any information is available for Open MPI.
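
For reference, a minimal sketch of the overlap pattern described above (the peer rank, message size, and do_compute() are placeholders, not our actual code):

/* Hypothetical sketch: non-blocking exchange intended to overlap with compute. */
#include <mpi.h>

void do_compute(void);   /* placeholder for the application's compute phase */

void exchange_and_compute(int peer, double *sendbuf, double *recvbuf, int n)
{
  MPI_Request reqs[2];

  MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

  do_compute();   /* expected to overlap with the transfers above */

  /* Without background progress, large (fragmented) messages may only
     advance once we enter MPI_Waitall. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}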

@alex--m
Contributor

alex--m commented Dec 11, 2021

@bosilca - Would you know if generic progress threads (as opposed to BTL-specific) are still functional somehow?

@bosilca
Member

bosilca commented Dec 13, 2021

I do not recall the code being removed, so it might still be there, albeit non-functional. I will try to take a look.

@bosilca bosilca self-assigned this Dec 13, 2021
@Yiltan
Author

Yiltan commented Jan 31, 2022

Did anybody work this one out?

@arstgr

arstgr commented Feb 25, 2022

Did anyone come up with a solution for this?

@janjust
Contributor

janjust commented Feb 25, 2022

Hi - as far as UCX is concerned, it only progresses asynchronous events (e.g. connect/disconnect) in the background; message progress for Isend/Irecv is done via calls to ucp_worker_progress(). So the asynchronous thread would have to come from the upper opal/ompi layer, and it doesn't look like we support that.
To add a bit more: if HW resources are available and a message can be posted immediately, then HW progression is "asynchronous" with respect to the software layer, but there is no UCX progress thread to check the HW queue and keep it full (on the sender side) when there is pending send activity queued up on the software side.
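
For illustration, a hedged sketch of what explicit UCX-level progress looks like at the API level (it assumes an already-initialized ucp_worker_h named worker; nothing here is Open MPI specific):

#include <ucp/api/ucp.h>

static void drain_ucx_progress(ucp_worker_h worker)
{
  /* ucp_worker_progress() returns non-zero while it made progress;
     keep polling until there is nothing left to advance right now. */
  while (ucp_worker_progress(worker) != 0) {
    /* busy-poll */
  }
}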

@alex--m
Contributor

alex--m commented Feb 25, 2022

@janjust - your answer is regarding progress threads on the UCX level, right?
I think the original approach was to have a progress thread on OMPI's level, which would periodically call internal MPI progress (e.g. opal_progress) - which in turn would also call UCX's explicit progress function (i.e. ucp_worker_progress).

In the absence of such an OMPI-level progress thread - how can async. communication (e.g. MPI_Isend) overlap with computation?
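
For context, one user-level workaround that gets mentioned (just a sketch under my own assumptions, not something Open MPI provides: it needs MPI_THREAD_MULTIPLE and a core to spare) is a helper thread that periodically pokes the progress engine, e.g. via MPI_Iprobe:

#include <mpi.h>
#include <pthread.h>
#include <unistd.h>

static volatile int keep_progressing = 1;

/* Hypothetical helper thread; requires MPI_Init_thread() with MPI_THREAD_MULTIPLE. */
static void *progress_loop(void *arg)
{
  (void)arg;
  while (keep_progressing) {
    int flag;
    /* Any call that enters the MPI progress engine will do; MPI_Iprobe is a
       cheap, non-blocking choice. */
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
    usleep(100);   /* back off a little to limit contention */
  }
  return NULL;
}

But that burns a core, can hurt latency, and still requires modifying the application, so I wouldn't call it transparent overlap either.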

@janjust
Contributor

janjust commented Feb 25, 2022

It would be up to the user to implement calls to OMPI wait routines, which I think would call into opal_progress.

@alex--m
Contributor

alex--m commented Feb 25, 2022

If communication progress only happens during wait calls, does it still qualify as computation/communication overlap? Maybe it's just me, but I think the answer is no. :(

@Yiltan
Author

Yiltan commented Feb 25, 2022

I think you could also use MPI_Test to engage the progress engine.

Something like

int flag = 0;
while (!flag) {
  do_work();
  MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
}

Still non-ideal, but it would allow for some more overlap than just an MPI_Wait.

@alex--m
Contributor

alex--m commented Feb 25, 2022

Yes, it looks like using MPI_Test will make OMPI call opal_progress if the request is not ready... though it won't help applications that were not written/modified to use this technique, unfortunately.

With OMPI seemingly having dropped support for async progress threads, it looks like this is the next best thing.

@yosefe
Contributor

yosefe commented Feb 25, 2022

@MichaelLaufer in your example with MPI_Isend/Irecv, what is the message size?
Can you try to set UCX_RC_MLX5_TM_ENABLE=y (assuming ConnectX-5 or newer NIC)?

@jsquyres
Member

Just to clarify, Open MPI has not dropped support for progress threads.

The old macro OPAL_ENABLE_PROGRESS_THREADS is a red herring. There are several low-impact progress threads in MPI processes today (e.g., PMIx, libevent, etc.). Network layers are free to implement their own progress threads if they want to.

Be aware that there are costs associated with progress threads on low-latency networks. These costs don't typically impact large-message performance, but they can definitely affect short-message performance (e.g., latency). Hence, such progress threads need to be implemented quite carefully: make sure, for example, that these threads do not impact the latency of any other network transport.

Finally, it's somewhat of a misnomer to state that Open MPI with UCX doesn't have "asynchronous progress". There are multiple types of asynchronous progress, at least some of which are implemented by firmware and/or hardware. I think that this issue is specifically asking about asynchronous progress of large messages that have to be fragmented by the software layer. Perhaps NVIDIA people can explain how that works in UCX, and/or whether there is asynchronous progress for all fragments in a large message in UCX and/or Open MPI's support for UCX.

@yosefe
Contributor

yosefe commented Feb 25, 2022

> Finally, it's somewhat of a misnomer to state that Open MPI with UCX doesn't have "asynchronous progress". There are multiple types of asynchronous progress, at least some of which are implemented by firmware and/or hardware. I think that this issue is specifically asking about asynchronous progress of large messages that have to be fragmented by the software layer. Perhaps NVIDIA people can explain how that works in UCX, and/or whether there is asynchronous progress for all fragments in a large message in UCX and/or Open MPI's support for UCX.

👍
With UCX and Mellanox NICs, an Open MPI progress thread is probably not the right approach; we'd like to offload to HW as much as possible.
Large messages use a rendezvous protocol that is fragmented into packets by the RDMA HW, but there can be a problem when the RTS is received after MPI_Irecv ("expected") and the data fetch has not started before the application reaches MPI_Wait. That's what UCX_RC_MLX5_TM_ENABLE=y is aimed at addressing.

@alex--m
Contributor

alex--m commented Feb 25, 2022

@yosefe @jsquyres Thank you both for your answers, it really cleared things up; delegating the progress to the networking library does make sense (my bad for mistaking it for a lack of support in OMPI).

If @MichaelLaufer does see a significant improvement as a result of UCX_RC_MLX5_TM_ENABLE=y, it might also be helpful for others if this kind of tuning could be suggested when applicable. I mean, if this kind of "missing progress" could be detected (e.g. when rendezvous control messages arrive during wait, as opposed to during other blocking P2P calls), I'm sure users would appreciate a hint from UCX about UCX_RC_MLX5_TM_ENABLE...

@jladd-mlnx
Member

> If communication progress only happens during wait calls, does it still qualify as computation/communication overlap? Maybe it's just me, but I think the answer is no. :(

No. Asynchronous communication progress works best (i.e., enables 100% overlap while sustaining high performance) when dedicated DMA and compute engines are available and leveraged by the middleware. The NIC-based HW tag matching available in ConnectX-5 and later allows UCX to fully offload the implementation of the rendezvous protocol: the HW tag-matching engine completes the RTS handling and lets the sender initiate the RDMA transfer without explicit calls into the MPI or UCX library on either the send or the recv side. Once you post Isend/Irecv, the entire protocol is handled by the network; no further calls into library progress are needed, simply test for completion sometime later.

Open MPI + UCX + UCC is designed to leverage network offloads (RDMA, HW tag-matching, HW multicast, SHARP, DPUs) to the greatest extent possible in order to realize true asynchronous progress in the context of non-blocking data transfers and collective operations overlapped with application workload.

@drmichaelt7777

How do UCX + Mellanox HW currently handle "smaller" non-blocking messages? Is tag matching involved for smaller messages?

@janjust
Contributor

janjust commented Apr 12, 2022

I think 1K is the threshold; anything smaller is handled in SW. @yosefe

#
# Threshold for using tag matching offload capabilities.
# Smaller buffers will not be posted to the transport.
#
# syntax: memory units: <number>[b|kb|mb|gb], "inf", or "auto"
#
UCX_TM_THRESH=1K

@brminich
Member

Yes, by default messages smaller than 1kb are handled in SW. To enable HW tag matching for small messages, this threshold needs to be set to the desired value. But this may degrade performance, because there is some extra overhead when using HW TM, and the benefit of avoiding an extra memory copy with small messages is negligible.

@drmichaelt7777

drmichaelt7777 commented Apr 12, 2022 via email

@benmenadue
Contributor

A question similar to this one has been raised by one of the users on our cluster, but this time in reference to processes on the same node, where communication happens through UCX via shared memory. Their application uses an Isend+Irecv -> compute -> Wait model, and while they see the expected compute/communication overlap when the processes are on different nodes (so through UCX+IB), it seems there's no overlap in the same-node case: the communication only happens at the end.

Is there a way to get similar asynchronous progress for the shared memory case with OpenMPI+UCX? Our UCX libraries are configured with all of the "accelerated" shared memory transports (cma, knem, xpmem). Or is the best way to just periodically call MPI_Test as part of the compute period?
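
For what it's worth, the chunked-compute variant we'd otherwise fall back to looks roughly like this (a sketch; the chunking and MPI_Testall usage are our own placeholders, not something recommended earlier in this thread):

#include <mpi.h>

void do_compute_chunk(int chunk);   /* placeholder for one slice of the compute phase */

void compute_with_polling(MPI_Request reqs[2], int nchunks)
{
  int done = 0;

  for (int chunk = 0; chunk < nchunks; chunk++) {
    do_compute_chunk(chunk);
    if (!done) {
      /* Give the progress engine a chance to run between compute chunks. */
      MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
    }
  }
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete anything still pending */
}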

In comparison, MPICH+UCX, while normally significantly slower than OpenMPI, becomes faster when MPICH_ASYNC_PROGRESS is enabled.

@Yiltan Yiltan closed this as completed Jun 18, 2022
@drmichaelt7777

drmichaelt7777 commented Jun 19, 2022 via email

@hominhquan

Dear all,

We have a POC that revives a SW async progress thread in OMPI. An RFC has been opened in #13074 to share the diff and gather comments and feedback from the community. Please feel free to give it a look; we are open to further discussion.
