Does Open MPI support asynchronous message progression? #9759
I was also researching this for a few hours earlier today and could not find an answer, and would be very interested in the current status of this. We have just written a new code that was designed for compute/communication overlap using MPI_Isend/MPI_Irecv + MPI_Waitall, only to find that no background progress is occurring (OMPI + UCX w/ Mellanox NICs). In my findings, MVAPICH, MPICH, and Intel MPI all seem to feature this async progress thread, and I found it odd that scarcely any information was available for Open MPI. |
@bosilca - Would you know if generic progress threads (as opposed to BTL-specific ones) are still functional somehow? |
I do not recall the code being removed, so it might still be there, albeit non-functional. I will try to take a look. |
Did anybody work this one out? |
Did anyone come up with a solution for this? |
Hi - as far as UCX is concerned, it progresses only asynchronous events (e.g. connect/disconnect); message progress for Isend/Irecv happens via calls to ucp_worker_progress(). So the asynchronous thread would have to come from the upper opal/ompi level, and it doesn't look like we support that. |
@janjust - your answer is regarding progress threads at the UCX level, right? In the absence of such an OMPI-level progress thread, how can async communication (e.g. MPI_Isend/MPI_Irecv) actually make progress? |
It would be up to the user to implement calls to OMPI wait routines, which I think call into opal progress. |
If communication progress only happens during wait calls, does it still qualify as computation/communication overlap? Maybe it's just me, but I think the answer is no. :( |
I think you could also use MPI_Test to engage the progress engine. Something like the sketch below.
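(A minimal sketch of that idea; the chunked loop, buffer, and names below are illustrative, not from the original comment.)

```c
#include <mpi.h>

#define NUM_CHUNKS 16

/* Stand-in for the application's real computation. */
static void compute_chunk(int chunk) { (void)chunk; /* ... */ }

/* Overlap a nonblocking receive with chunked computation, poking the
 * progress engine with MPI_Test between chunks. */
static void overlapped_recv(double *buf, int count, int src, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, comm, &req);

    for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
        compute_chunk(chunk);
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* drives progress */
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```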
Still non-ideal but would allow for some more overlap than just an MPI_Wait |
Yes, looks like using MPI_Test helps. With OMPI seeming to have dropped support for async progress threads, this looks like the next best thing. |
@MichaelLaufer in your example with MPI_Isend/Irecv, what is the message size? |
Just to clarify, Open MPI has not dropped support for progress threads; the old macro referenced in the original post is simply stale.

Be aware that there are costs associated with low-latency-network progression threads. These costs don't typically impact large-message performance, but they can definitely affect short-message performance (e.g., latency). Hence, such progress threads need to be implemented quite carefully. Make sure that these threads do not impact the latency of any other network transport, for example.

Finally, it's somewhat of a misnomer to state that Open MPI with UCX doesn't have "asynchronous progress". There are multiple types of asynchronous progress, at least some of which are implemented by firmware and/or hardware. I think this issue is specifically asking about asynchronous progress of large messages that have to be fragmented by the software layer. Perhaps NVIDIA people can explain how that works in UCX, and/or whether there is asynchronous progress for all fragments of a large message in UCX and/or Open MPI's support for UCX. |
@yosefe @jsquyres Thank you both for your answers, it really cleared things up, and delegating the progress to the networking library does make sense (my bad for mistaking it for lack of support in OMPI). If @MichaelLaufer does see a significant improvement as a result of enabling HW tag matching, it would be great to hear about it here. Out of curiosity: is a software progress thread still needed to get full overlap? |
No. Asynchronous communication progress works best (i.e. enables 100% overlap while sustaining high performance) when dedicated DMA and compute engines are available and leveraged by the middleware.

NIC-based HW tag matching, available in ConnectX-5 and later, allows UCX to fully offload the rendezvous protocol: the HW tag-matching engine completes the RTS handling and lets the sender initiate the RDMA transfer without explicit calls into the MPI or UCX library on either the send or the receive side. Once you post Isend/Irecv, the entire protocol is handled by the network; no further calls into library progress are needed, simply test for completion some time later.

Open MPI + UCX + UCC is designed to leverage network offloads (RDMA, HW tag matching, HW multicast, SHARP, DPUs) to the greatest extent possible in order to realize true asynchronous progress for non-blocking data transfers and collective operations overlapped with the application workload. |
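For illustration, the fully offloaded pattern described above is just the plain nonblocking exchange below; with HW tag matching handling the rendezvous, the compute phase contains no MPI calls at all (peer rank, message size, and function names are made up for this sketch):

```c
#include <mpi.h>

#define N (1 << 22) /* a "large" message, above the rendezvous threshold */

/* Stand-in for the application's compute phase: no MPI calls inside. */
static void compute(void) { /* ... */ }

/* Post the exchange, compute, then complete. With the rendezvous protocol
 * offloaded to the NIC, the transfer can proceed during compute() without
 * any software progress calls from this process. */
static void exchange_and_compute(double *sendbuf, double *recvbuf,
                                 int peer, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    compute(); /* pure computation; the network handles the protocol */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```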
How do UCX + Mellanox HW currently handle "smaller" non-blocking messages? Is tag matching involved for smaller messages? |
I think 1K is the threshold; anything smaller is handled in SW. @yosefe, can you confirm? |
Yes, by default messages smaller than 1 KB are handled in SW. To enable HW tag matching for small messages, this threshold needs to be set to the desired value. But this may degrade performance, because there is some extra overhead when using HW TM, and the benefit of avoiding an extra memory copy with small messages is negligible. |
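For anyone who wants to experiment with that threshold: my understanding is that the relevant UCX variable is UCX_TM_THRESH (treat the name and the value below as assumptions to check against your UCX version's ucx_info output). Since UCX reads its environment when the UCP context is created inside MPI_Init, a sketch of setting it from the application itself could look like this:

```c
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Assumption: UCX_TM_THRESH is the size above which HW tag matching
     * is used (default around 1 KB). Lowering it pushes smaller messages
     * onto the HW TM path, which may degrade performance as noted above.
     * Must be set before MPI_Init so UCX sees it at context creation;
     * setting it in the launch environment works just as well. */
    setenv("UCX_TM_THRESH", "256", 1);

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}
```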
Thank you guys for the feedback!

The bigger question is whether smaller messages in non-blocking MPI calls progress regardless of the h/w assist at the HCAs. Suppose an app needs to send out small status or action messages and have them delivered as soon as possible. There could be two modes, one favoring BW and another favoring shortest delivery time. I understand that the smaller the message, the higher the per-byte overhead to deliver it, but on occasion apps benefit from immediate delivery.

Another issue for apps is the ability to overlap compute with non-blocking communication as much as possible, even for small messages over shared-memory transports. Is shared-memory progress handled differently from inter-node communication progress?

Thanks
|
A similar question to this one has been raised by one of the users on our cluster, but this time in reference to processes on the same node, where the communication goes through UCX via shared memory. Their application uses an Isend+Irecv -> compute -> Wait model, and while they see the expected compute/communication overlap when the processes are on different nodes (so through UCX+IB), there seems to be no overlap in the same-node case: the communication only happens at the end. Is there a way to get similar asynchronous progress for the shared-memory case with OpenMPI+UCX? Our UCX libraries are configured with all of the "accelerated" shared-memory transports (cma, knem, xpmem). Or is the best way to just periodically call MPI_Test during the compute phase? In comparison, MPICH+UCX, while significantly slower than OpenMPI normally, becomes the faster of the two when its asynchronous progress thread is enabled. |
Actually this is our experience too, and we would like to let any communication advance as early as possible, without waiting for some global event to push it forward.
We are also looking into inter-GPU communication via OpenMPI/UCX, hence progress on shorter messages is important.
Regards
|
Dear all, we have a POC that revives a SW async progress thread in OMPI. An RFC has been opened in #13074 to share the diff and gather comments and feedback from the community. Please feel free to give it a look; we are open to further discussion. |
Is there a way to configure OMPI to launch a helper process to aid with message progression?
I see this old comment here: ompi/config/opal_configure_options.m4, line 513 at commit f6c84cd.
But I can't seem to work out what OMPI's current way to do this is. I'm specifically looking to use UCX if possible.