Is your feature request related to a problem? Please describe.
When a new endpoint is created with fi_endpoint(), the verbs provider blocks the caller while the destination address is being resolved in rdma_resolve_addr(). If the destination address is not reachable, the caller has to wait 2 seconds for the timeout passed to rdma_resolve_addr() (VERBS_RESOLVE_TIMEOUT) to expire. This is a killer for scalability.
    /* TODO convert this call to non-blocking (use event channel) as well:
     * This may likely be needed for better scaling when running large
     * MPI jobs.
     * Making this non-blocking would mean we can't create QP at EP enable
     * time. We need to wait for RDMA_CM_EVENT_ADDR_RESOLVED event before
     * creating the QP using rdma_create_qp. It would also require a SW
     * receive queue to store recvs posted by app after enabling the EP.
     */
    if (rdma_resolve_addr(*id, rai->ai_src_addr, rai->ai_dst_addr,
                          VERBS_RESOLVE_TIMEOUT)) {
        ret = -errno;
        VRB_WARN_ERRNO(FI_LOG_EP_CTRL, "rdma_resolve_addr");
        ofi_straddr_log(&vrb_prov, FI_LOG_WARN, FI_LOG_EP_CTRL,
                        "src addr", rai->ai_src_addr);
        ofi_straddr_log(&vrb_prov, FI_LOG_WARN, FI_LOG_EP_CTRL,
                        "dst addr", rai->ai_dst_addr);
        goto err2;
    }
Describe the solution you'd like
The solution is described in the TODO comment above: rdma_resolve_addr() should use an event channel, so it doesn't block the caller.
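To make the consequences spelled out in the TODO more concrete, here is a rough sketch of the control flow it implies. This is only a model, not provider code: every type and function name in it (model_ep, model_post_recv, model_on_cm_event, the EV_* values, SW_RQ_DEPTH) is hypothetical and does not exist in libfabric or librdmacm. The comments mark where I would expect the real calls (rdma_create_qp(), ibv_post_recv()) to go, assuming the rdma_cm_id is bound to an event channel so that rdma_resolve_addr() returns immediately and the result arrives later as an event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an endpoint whose QP creation is deferred until
 * address resolution completes. None of these names exist in libfabric. */
enum ep_state { EP_RESOLVING, EP_READY, EP_FAILED };

/* Stand-ins for RDMA_CM_EVENT_ADDR_RESOLVED / RDMA_CM_EVENT_ADDR_ERROR. */
enum cm_event { EV_ADDR_RESOLVED, EV_ADDR_ERROR };

#define SW_RQ_DEPTH 8 /* arbitrary depth chosen for the sketch */

struct model_ep {
    enum ep_state state;
    void *sw_rq[SW_RQ_DEPTH]; /* recvs posted before the QP exists */
    size_t sw_rq_len;
    size_t hw_posted;         /* recvs handed to the (modeled) QP */
};

void model_ep_init(struct model_ep *ep)
{
    assert(ep != NULL);
    /* Real code would issue the non-blocking rdma_resolve_addr() here. */
    ep->state = EP_RESOLVING;
    ep->sw_rq_len = 0;
    ep->hw_posted = 0;
}

/* Returns 0 on success, -1 if the EP failed or the SW queue is full. */
int model_post_recv(struct model_ep *ep, void *buf)
{
    assert(ep != NULL);
    switch (ep->state) {
    case EP_READY:
        ep->hw_posted++; /* real code: ibv_post_recv() on the QP */
        return 0;
    case EP_RESOLVING:
        if (ep->sw_rq_len == SW_RQ_DEPTH)
            return -1;
        ep->sw_rq[ep->sw_rq_len++] = buf; /* park it until the QP exists */
        return 0;
    default:
        return -1;
    }
}

/* Called by whichever thread drains the event channel. */
void model_on_cm_event(struct model_ep *ep, enum cm_event ev)
{
    assert(ep != NULL);
    if (ep->state != EP_RESOLVING)
        return;
    if (ev == EV_ADDR_ERROR) {
        /* Real code would complete the parked recvs with an error. */
        ep->state = EP_FAILED;
        ep->sw_rq_len = 0;
        return;
    }
    /* Real code: rdma_create_qp() here, then replay the parked recvs. */
    ep->state = EP_READY;
    ep->hw_posted += ep->sw_rq_len;
    ep->sw_rq_len = 0;
}
```

With this shape, an unreachable destination only costs the caller a later error event rather than a 2 second stall inside endpoint creation. How the parked receives should be bounded and how errors get reported to the app are open questions that the sketch does not try to answer.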
Describe alternatives you've considered
AFAIK there is no alternative.
Additional context
More context can be provided if needed
The code above is from vrb_create_ep() in libfabric/prov/verbs/src/verbs_init.c (line 338 in c5cace8).