
Does Open MPI support asynchronous message progression? #9759

Closed
Yiltan opened this issue Dec 8, 2021 · 23 comments

@Yiltan

Yiltan commented Dec 8, 2021

Is there a way to configure OMPI to launch a helper process to aid with message progression?

I see this old comment here:

dnl We no longer support the old OPAL_ENABLE_PROGRESS_THREADS. At

But I can't seem to work out what OMPI's new way of doing this is. I'm specifically looking to use UCX, if possible.

@MichaelLaufer

MichaelLaufer commented Dec 9, 2021

I was also researching this for a few hours earlier today and could not find an answer, and I would also be very interested in the current status of this. We have just written a new code designed for compute/communication overlap using MPI_Isend/MPI_Irecv + MPI_Waitall, only to find out that no background progress is occurring (OMPI + UCX w/ Mellanox NICs).

From what I found, MVAPICH, MPICH, and Intel MPI all seem to feature this kind of async progress thread, and I found it odd that scarcely any information is available for Open MPI.
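
For reference, a minimal sketch of the overlap pattern described above (the peer rank, message size, and do_compute() are placeholders, not our actual code):

/* Hypothetical sketch: non-blocking exchange intended to overlap with compute. */
#include <mpi.h>

void do_compute(void);   /* placeholder for the application's compute phase */

void exchange_and_compute(int peer, double *sendbuf, double *recvbuf, int n)
{
  MPI_Request reqs[2];

  MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

  do_compute();   /* expected to overlap with the transfers above */

  /* Without background progress, large (fragmented) messages may only
     advance once we enter MPI_Waitall. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}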

@alex--m
Contributor

alex--m commented Dec 11, 2021

@bosilca - Would you know if generic progress threads (as opposed to BTL-specific) are still functional somehow?

@bosilca
Member

bosilca commented Dec 13, 2021

I do not recall the code being removed, so it might still be there, albeit non-functional. I will try to take a look.

@bosilca bosilca self-assigned this Dec 13, 2021
@Yiltan
Author

Yiltan commented Jan 31, 2022

Did anybody work this one out?

@arstgr

arstgr commented Feb 25, 2022

Did anyone come up with a solution for this?

@janjust
Contributor

janjust commented Feb 25, 2022

Hi - as far as UCX is concerned, it only progresses asynchronous events (e.g. connect/disconnect) in the background; message progress for Isend/Irecv is done via calls to ucp_worker_progress(). So the asynchronous thread would have to come from the upper opal/ompi layer, and it doesn't look like we support that.
To add a bit more: if HW resources are available and a message can be posted immediately, then HW progression is "asynchronous" with respect to the software layer, but there is no UCX progress thread to check the HW queue and keep it full (on the sender side) when there is pending send activity queued up on the software side.
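
For illustration, a hedged sketch of what explicit UCX-level progress looks like at the API level (it assumes an already-initialized ucp_worker_h named worker; nothing here is Open MPI specific):

#include <ucp/api/ucp.h>

static void drain_ucx_progress(ucp_worker_h worker)
{
  /* ucp_worker_progress() returns non-zero while it made progress;
     keep polling until there is nothing left to advance right now. */
  while (ucp_worker_progress(worker) != 0) {
    /* busy-poll */
  }
}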

@alex--m
Contributor

alex--m commented Feb 25, 2022

@janjust - your answer is regarding progress threads on the UCX level, right?
I think the original approach was to have a progress thread on OMPI's level, which would periodically call internal MPI progress (e.g. opal_progress) - which in turn would also call UCX's explicit progress function (i.e. ucp_worker_progress).

In the absence of such an OMPI-level progress thread - how can async. communication (e.g. MPI_Isend) overlap with computation?
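
For context, one user-level workaround that gets mentioned (just a sketch under my own assumptions, not something Open MPI provides: it needs MPI_THREAD_MULTIPLE and a core to spare) is a helper thread that periodically pokes the progress engine, e.g. via MPI_Iprobe:

#include <mpi.h>
#include <pthread.h>
#include <unistd.h>

static volatile int keep_progressing = 1;

/* Hypothetical helper thread; requires MPI_Init_thread() with MPI_THREAD_MULTIPLE. */
static void *progress_loop(void *arg)
{
  (void)arg;
  while (keep_progressing) {
    int flag;
    /* Any call that enters the MPI progress engine will do; MPI_Iprobe is a
       cheap, non-blocking choice. */
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
    usleep(100);   /* back off a little to limit contention */
  }
  return NULL;
}

But that burns a core, can hurt latency, and still requires modifying the application, so I wouldn't call it transparent overlap either.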

@janjust
Contributor

janjust commented Feb 25, 2022

It would be up to the user to implement calls to OMPI wait routines, which I think would call into opal_progress.

@alex--m
Contributor

alex--m commented Feb 25, 2022

If communication progress only happens during wait calls, does it still qualify as computation/communication overlap? Maybe it's just me, but I think the answer is no. :(

@Yiltan
Author

Yiltan commented Feb 25, 2022

I think you could also use MPI_Test to engage the progress engine.

Something like

int flag = 0;
while (!flag) {
  do_work();
  MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
}

Still non-ideal, but it would allow for some more overlap than just an MPI_Wait.

@alex--m
Contributor

alex--m commented Feb 25, 2022

Yes, it looks like using MPI_Test will make OMPI call opal_progress if the request is not ready... though it won't help applications that were not written/modified to use this technique, unfortunately.

With OMPI seemingly having dropped support for async progress threads, it looks like this is the next best thing.

@yosefe
Contributor

yosefe commented Feb 25, 2022

@MichaelLaufer in your example with MPI_Isend/Irecv, what is the message size?
Can you try to set UCX_RC_MLX5_TM_ENABLE=y (assuming ConnectX-5 or newer NIC)?

@jsquyres
Member

Just to clarify, Open MPI has not dropped support for progress threads.

The old macro OPAL_ENABLE_PROGRESS_THREADS is a red herring. There are several low-impact progress threads in MPI processes today (e.g., PMIx, libevent, etc.). Network layers are free to implement their own progress threads if they want to.

Be aware that there are costs associated with progress threads on low-latency networks. These costs don't typically impact large-message performance, but they can definitely affect short-message performance (e.g., latency). Hence, such progress threads need to be implemented quite carefully: make sure, for example, that these threads do not impact the latency of any other network transport.

Finally, it's somewhat of a misnomer to state that Open MPI with UCX doesn't have "asynchronous progress". There are multiple types of asynchronous progress, at least some of which are implemented by firmware and/or hardware. I think that this issue is specifically asking about asynchronous progress of large messages that have to be fragmented by the software layer. Perhaps NVIDIA people can explain how that works in UCX, and/or whether there is asynchronous progress for all fragments in a large message in UCX and/or Open MPI's support for UCX.

@yosefe
Contributor

yosefe commented Feb 25, 2022

> Finally, it's somewhat of a misnomer to state that Open MPI with UCX doesn't have "asynchronous progress". There are multiple types of asynchronous progress, at least some of which are implemented by firmware and/or hardware. I think that this issue is specifically asking about asynchronous progress of large messages that have to be fragmented by the software layer. Perhaps NVIDIA people can explain how that works in UCX, and/or whether there is asynchronous progress for all fragments in a large message in UCX and/or Open MPI's support for UCX.

👍
With UCX and Mellanox NICs, an Open MPI progress thread is probably not the right approach; we'd like to offload to HW as much as possible.
Large messages use a rendezvous protocol that is fragmented into packets by the RDMA HW, but there can be a problem when the RTS is received after MPI_Irecv ("expected") and the data fetch has not started before the application reaches MPI_Wait. That's what UCX_RC_MLX5_TM_ENABLE=y is aimed at addressing.

@alex--m
Contributor

alex--m commented Feb 25, 2022

@yosefe @jsquyres Thank you both for your answers, it really cleared things up; delegating the progress to the networking library does make sense (my bad for mistaking it for a lack of support in OMPI).

If @MichaelLaufer does see a significant improvement as a result of UCX_RC_MLX5_TM_ENABLE=y, it might also be helpful for others if this kind of tuning could be suggested when applicable. I mean, if this kind of "missing progress" could be detected (e.g. when rendezvous control messages arrive during wait, as opposed to during other blocking P2P calls), I'm sure users would appreciate a hint from UCX about UCX_RC_MLX5_TM_ENABLE...

@jladd-mlnx
Member

> If communication progress only happens during wait calls, does it still qualify as computation/communication overlap? Maybe it's just me, but I think the answer is no. :(

No. Asynchronous communication progress works best (i.e., enables 100% overlap while sustaining high performance) when dedicated DMA and compute engines are available and leveraged by the middleware. The NIC-based HW tag matching available in ConnectX-5 and later allows UCX to fully offload the implementation of the rendezvous protocol: the HW tag-matching engine completes the RTS handling and lets the sender initiate the RDMA transfer without explicit calls into the MPI or UCX library on either the send or the recv side. Once you post Isend/Irecv, the entire protocol is handled by the network; no further calls into library progress are needed, simply test for completion sometime later.

Open MPI + UCX + UCC is designed to leverage network offloads (RDMA, HW tag-matching, HW multicast, SHARP, DPUs) to the greatest extent possible in order to realize true asynchronous progress in the context of non-blocking data transfers and collective operations overlapped with application workload.

@drmichaelt7777

How do UCX + Mellanox HW currently handle "smaller" non-blocking messages? Is tag matching involved for smaller messages?

@janjust
Contributor

janjust commented Apr 12, 2022

I think 1K is the threshold; anything smaller is handled in SW. @yosefe

#
# Threshold for using tag matching offload capabilities.
# Smaller buffers will not be posted to the transport.
#
# syntax: memory units: <number>[b|kb|mb|gb], "inf", or "auto"
#
UCX_TM_THRESH=1K

@brminich
Member

Yes, by default messages smaller than 1kb are handled in SW. To enable HW tag matching for small messages, this threshold needs to be set to the desired value. But this may degrade performance, because there is some extra overhead when using HW TM, and the benefit of avoiding an extra memory copy with small messages is negligible.

@drmichaelt7777

drmichaelt7777 commented Apr 12, 2022 via email

@benmenadue
Contributor

A question similar to this one has been raised by one of the users on our cluster, but this time in reference to processes on the same node, where communication happens through UCX via shared memory. Their application uses an Isend+Irecv -> compute -> Wait model, and while they see the expected compute/communication overlap when the processes are on different nodes (so through UCX+IB), it seems there's no overlap in the same-node case: the communication only happens at the end.

Is there a way to get similar asynchronous progress for the shared memory case with OpenMPI+UCX? Our UCX libraries are configured with all of the "accelerated" shared memory transports (cma, knem, xpmem). Or is the best way to just periodically call MPI_Test as part of the compute period?
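
For what it's worth, the chunked-compute variant we'd otherwise fall back to looks roughly like this (a sketch; the chunking and MPI_Testall usage are our own placeholders, not something recommended earlier in this thread):

#include <mpi.h>

void do_compute_chunk(int chunk);   /* placeholder for one slice of the compute phase */

void compute_with_polling(MPI_Request reqs[2], int nchunks)
{
  int done = 0;

  for (int chunk = 0; chunk < nchunks; chunk++) {
    do_compute_chunk(chunk);
    if (!done) {
      /* Give the progress engine a chance to run between compute chunks. */
      MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
    }
  }
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete anything still pending */
}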

In comparison, MPICH+UCX, while normally significantly slower than OpenMPI, becomes faster when MPICH_ASYNC_PROGRESS is enabled.

@Yiltan Yiltan closed this as completed Jun 18, 2022
@drmichaelt7777

drmichaelt7777 commented Jun 19, 2022 via email

@hominhquan

Dear all,

We have a POC that revives a SW async progress thread in OMPI. An RFC has been opened in #13074 to share the diff and gather comments and feedback from the community. Please feel free to give it a look; we are open to further discussion.
