core: Introduce new tag FI_TAG_RPC #10792

Open · wants to merge 1 commit into base: main
Conversation


@jxiong jxiong commented Feb 15, 2025

This new mem_tag_format tells libfabric that tags are being used to match RPC requests and replies. Therefore, messages with unmatched tags are stale and can be discarded by libfabric.
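
For illustration only, a minimal sketch of the intended client-side pattern, assuming FI_TAG_RPC is adopted as proposed (the helper name, tag value, and omitted endpoint setup are hypothetical):

```c
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Sketch: ep and peer_addr are assumed to come from the usual
 * fi_fabric/fi_domain/fi_endpoint/fi_av setup (not shown). */
static ssize_t send_rpc(struct fid_ep *ep, fi_addr_t peer_addr,
                        const void *req, size_t req_len,
                        void *reply, size_t reply_len, uint64_t rpc_tag)
{
    ssize_t ret;

    /* Under FI_TAG_RPC, the reply buffer must be posted with the RPC's tag
     * before the request goes out, so a matching receive always exists. */
    ret = fi_trecv(ep, reply, reply_len, NULL, peer_addr, rpc_tag, 0, NULL);
    if (ret)
        return ret;

    return fi_tsend(ep, req, req_len, NULL, peer_addr, rpc_tag, NULL);
}
```

Any reply arriving with a tag for which no such receive is posted would, under this proposal, be treated as stale and dropped.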

*FI_TAG_RPC*
: The FI_TAG_RPC flag is used to indicate that tags are being utilized to match
RPC requests and replies. When specified via fi_getinfo, the caller ensures that
Member

match RPC replies with requests.

: The FI_TAG_RPC flag is used to indicate that tags are being utilized to match
RPC requests and replies. When specified via fi_getinfo, the caller ensures that
a reply buffer with the corresponding tag is registered when sending a request.
Member

a receive buffer with a corresponding tag to receive an RPC reply has been posted prior sending an RPC request. As a result, unexpected message handling may be optimized.

This mechanism enables libfabric to identify and discard stale replies, preventing
them from interfering with new communications. This is crucial to avoid blocking
a restarting endpoint, which may otherwise lack sufficient metadata to process
incoming messages with unmatched tags.
Member

When FI_TAG_RPC is specified, unmatched received messages should be identified as stale replies and discarded.

Member

We need to decide if the correct operation is to silently discard the message or report a failed receive operation to the user. It's worth noting that this new tag format may end up defining behavior requirements which differ from the generic tag formats. That is, depending on how this is defined FI_TAG_RPC may not be converted by the provider to FI_TAG_BITS behavior.

I'm okay with this sort of change. It just differs from how FI_TAG_MPI and FI_TAG_CCL are defined. FI_TAG_RPC basically becomes a new feature which a provider needs to support. Either DAOS would need to be responsible for falling back to FI_TAG_BITS, if we allow fi_getinfo() to fail. Or, we can decide to allow fi_getinfo() to succeed, but return FI_TAG_BITS in place of FI_TAG_RPC, and require the caller to check the results. (I would lean towards fi_getinfo() failing.)

Author

> We need to decide if the correct operation is to silently discard the message or report a failed receive operation to the user. It's worth noting that this new tag format may end up defining behavior requirements which differ from the generic tag formats. That is, depending on how this is defined FI_TAG_RPC may not be converted by the provider to FI_TAG_BITS behavior.
>
> I'm okay with this sort of change. It just differs from how FI_TAG_MPI and FI_TAG_CCL are defined. FI_TAG_RPC basically becomes a new feature which a provider needs to support. Either DAOS would need to be responsible for falling back to FI_TAG_BITS, if we allow fi_getinfo() to fail. Or, we can decide to allow fi_getinfo() to succeed, but return FI_TAG_BITS in place of FI_TAG_RPC, and require the caller to check the results. (I would lean towards fi_getinfo() failing.)

Yeah, we can check if this flag is correctly returned in the mercury library.
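
A hedged sketch of that check on the caller's side, assuming fi_getinfo() is allowed to fail for unsupported formats (the helper name and fallback policy are illustrative, not actual DAOS/mercury code):

```c
#include <rdma/fabric.h>

/* Sketch: request the proposed FI_TAG_RPC format and fall back to plain
 * FI_TAG_BITS if the provider rejects it. */
static int get_rpc_info(struct fi_info *hints, struct fi_info **info)
{
    int ret;

    hints->ep_attr->mem_tag_format = FI_TAG_RPC;
    ret = fi_getinfo(fi_version(), NULL, NULL, 0, hints, info);
    if (ret == -FI_ENODATA) {
        /* Provider does not support FI_TAG_RPC; the app must then handle
         * stale replies itself. */
        hints->ep_attr->mem_tag_format = FI_TAG_BITS;
        ret = fi_getinfo(fi_version(), NULL, NULL, 0, hints, info);
    }
    return ret;
}
```

If the other option were chosen (fi_getinfo() succeeds but substitutes FI_TAG_BITS), the caller would instead inspect (*info)->ep_attr->mem_tag_format after the call.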

Member

Without a matching user-posted receive, there's no great way to report an error. (I was thinking about truncating the message and reporting some unmatched error, which could be useful for debugging.) But there's no awesome way to do this.

So, I think the message needs to be silently discarded (except for logging an error).

: The FI_TAG_RPC flag is used to indicate that tags are being utilized to match RPC
replies with requests. When specified via fi_getinfo, the caller ensures that
a receive buffer with a corresponding tag to receive an RPC reply has been posted
prior sending an RPC request. As a result, unexpected message handling may be
Contributor

prior to

Contributor

Also please squash the commits and do a force push.

@soumagne (Contributor)

@j-xiong can we hold off on this PR for now? I think we need more time to think it through. I am not too keen on pushing for a new libfabric feature if we don't end up using it in the future and instead end up simply using fi_msg for both RPC requests and responses. Also, it sounds like there could potentially be more to it if FI_DIRECTED_RECV is also used?
We should prototype something first and see what makes more sense.

@iziemba iziemba (Contributor) left a comment

Thanks for opening this PR. It highlights part of a bigger issue: libfabric RDM EPs may not be optimized for storage usage. I have created issue #10798 for the bigger problem.

I think this PR is related to issue 2 in #10798. The following is content for issue 2.

Storage may use tagged buffers to identify unique RPCs. Due to the async nature of posting tagged receive buffers, there is a potential race where a source issues a tagged send before the tagged receive buffer is processed by hardware/provider. Providers need to implement some mechanism to handle this unexpected message case.

Because these operations are async, immediately failing the libfabric unexpected message is the wrong answer. If this is the only option a provider could support, it seems like a synchronous tagged send/recv API is needed by the provider and this would come at a performance cost (e.g., no more pipelining tagged receive operations).

In addition, buffering unexpected tagged sends at a target may be problematic due to resource leaks. For example, a client that unexpectedly reboots will lose all context (tag bits) of in-flight RPCs. But the server is still aware of the RPC and issues a tagged send as an RPC response back to the client. Since the client has no idea about this tagged operation, the stale tagged send would be buffered "indefinitely".

What is worse, if this operation eventually matched a posted tagged receive, unexpected behavior may happen. If the libfabric user carefully manages tag bits, tagged receives should not match a stale unexpected tagged send.

I think this PR is on the right track to solving issue 2 in #10798. But, I think we want to ensure the solution handles this async nature of send/recv APIs.

In addition, is using a new mem_tag_format the best option? It seems like this behavior could apply to all existing mem tags.

Comment on lines +925 to +927
replies with requests. When specified via fi_getinfo, the caller ensures that
a receive buffer with a corresponding tag to receive an RPC reply has been posted
prior to sending an RPC request. As a result, unexpected message handling may be
Contributor

How can callers ensure that a receive buffer is posted? The data transfer APIs are asynchronous. In theory, there is a race here which could result in fi_tsend being issued before the provider/HW has processed fi_trecv. In addition, should this race be hit with FI_ORDER_SAS, the API contract may be broken.

@jxiong jxiong (Author) Feb 19, 2025

I didn't know the trecv APIs were async. As far as I can see, they are not async for the TCP provider.

This sounds like a serious issue to me, because if there exist providers that require this for zero-copy or DMA, this behavior will make zero-copy/DMA impossible since the buffer is not ready when the data arrives. @shefty @iziemba can you share some more insights? It's unlikely for me to work on that, but it is still interesting to know how you would approach it.

Member

The assumption is that once fi_trecv() or fi_recv() return, the buffers are available. This is needed by apps trying to implement flow control. If fi_trecv() or fi_recv() just mean that the buffer may someday in the future be made available, we're almost certainly making life impossible for apps. No credit-based flow control will ever work.

The completion of a posted receive buffer is asynchronous. The posting should not be.
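
A rough sketch of that contract (the retry loop and polling policy are illustrative; a real app would use its own progress and credit logic):

```c
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>
#include <rdma/fi_eq.h>

/* Sketch: once fi_trecv() returns 0 the buffer counts as posted, e.g. a
 * flow-control credit can be advertised; only the completion of the
 * receive is asynchronous and is reaped from the CQ later. */
static int post_and_wait(struct fid_ep *ep, struct fid_cq *rx_cq,
                         void *buf, size_t len, uint64_t tag,
                         struct fi_context *ctx)
{
    ssize_t ret;

    do {
        ret = fi_trecv(ep, buf, len, NULL, FI_ADDR_UNSPEC, tag, 0, ctx);
    } while (ret == -FI_EAGAIN);            /* RX queue full: retry */
    if (ret)
        return (int) ret;

    /* Buffer is posted synchronously; its completion arrives later. */
    struct fi_cq_tagged_entry comp;
    do {
        ret = fi_cq_read(rx_cq, &comp, 1);
    } while (ret == -FI_EAGAIN);
    return ret < 0 ? (int) ret : 0;
}
```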

Contributor

I think unexpected messaging closes the race between a buffer receive post and an incoming message.

Author

The concern is that the unexpected-message buffering would be unbounded, and nobody knows how long messages would stay in that buffer.

Coming back to this issue, if the fi_addr is not configured when the unexpected message comes, it will disable polling of the socket fd, which will choke the tcp connection.

Contributor

I can envision an RPC which receives multiple responses for a single request. Using tags for that might be difficult if the response sizes differ greatly. (So, gRPC probably shouldn't be force fit into MPI semantics.)

For a given RPC, if the server is streaming multiple responses, libfabric tagged messaging with FI_TAGGED_MULTI_RECV seems like a good fit. If a user does not want to associate an RPC with a tag, FI_MSG with FI_MULTI_RECV may be the best.

Having the app request unexpected messages be dropped exposes provider implementation details.
That feels off. The API isn't capturing the application semantics correctly but is instead offering some work-around.

Looking at this from a different angle, resource mgmt discusses what to do if no RX buffer exists. Could this be viewed as an extension to resource mgmt?

Consider that 1 of the issues is that an unexpected message may be matched with the wrong request. Dropping the unexpected message helps, but isn't there a race here? What if the 'unexpected' message didn't arrive until it was 'expected'?

A provider does not have enough context to know if a tag is stale or not. This seems to be a user problem. For example, if Mercury uses an incrementing counter to assign tags to RPCs, it seems like there is a potential race of a "stale" in-flight tag being reassigned (e.g., client DAOS reloads and the counter starts at zero). If they were to also include a unique ID (or something similar) along with the 32-bit incrementing counter, this may protect against stale tags.
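
A tiny sketch of that idea (a purely hypothetical bit split; a real layout would be sized to the tag bits the provider reports):

```c
#include <stdint.h>

/* Hypothetical tag layout: the upper 32 bits carry a per-incarnation id
 * chosen at client startup, the lower 32 bits carry the incrementing RPC
 * counter. A stale reply tagged with an old incarnation id can never
 * match a receive posted by the restarted client. */
static uint64_t make_rpc_tag(uint32_t incarnation_id, uint32_t rpc_counter)
{
    return ((uint64_t) incarnation_id << 32) | rpc_counter;
}
```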

@jxiong jxiong (Author) Feb 28, 2025

I understand there might be other complications, but this is certainly helping us. I wonder if we could do it the current way and always come back to improve it when new information is discovered. @shefty @iziemba as long as there is something to debate, it wouldn't be terribly wrong or an obvious mistake, lol

Contributor

If mem_tag_format is to be used, part of the FI_TAG_RPC definition needs to address the format. I think Mercury today uses FI_TAG_BITS. FI_TAG_RPC could be defined to be the following:

FI_TAG_RPC extends FI_TAG_BITS by indicating that tags are being utilized to match RPC replies with requests. When specified, the caller ensures that a receive buffer with a corresponding tag to receive an RPC reply has been posted prior to sending an RPC request. Unmatched tagged received messages must be discarded.

The number of supported tag bits follows the FI_TAG_BITS definition.

Thoughts?

Member

I've been out for about a week. IMO, it would be nice if tag_rpc could also define more semantic meaning around the tag format. For example, ignore bits aren't being used.

I also think it's worth considering this from the angle of resource management, but that's harder to fit a solution into without redefining resource_mgmt (it's currently an enum that's a boolean for practical purposes).

Contributor

@iziemba I think FI_TAG_RPC is closer to FI_TAG_CCL in the sense that exact match is required. Note that both FI_TAG_MPI and FI_TAG_CCL are derivatives of FI_TAG_BITS. They all use leading 0s to indicate the number of reserved bits.

Comment on lines 930 to 931
When FI_TAG_RPC is specified, unmatched received messages should be identified
as stale replies and discarded.
Contributor

At the target, I think discarding is the right behavior. What should be the return status at the source? I think this needs to be considered a failure.

Author

Right now it just silently discards the message. If we can reliably identify a stale RPC, discarding it seems acceptable, no?

If it returns an error to the upper layer, there is pretty much nothing the upper layer can do.

@iziemba thank you for raising all those questions. What do you suggest to improve or address them? Maybe limiting this to the tcp provider?

Contributor

At the target, discarding the receive and not generating an event seems like the right behavior.

At the source, the fi_tsend() must generate a completion event (assuming events are not suppressed). Because the target unexpectedly dropped the receive, I think a CQ error event should be generated for this fi_tsend().

Member

The primary focus for FI_TAG_RPC should be relative to the desired application semantics and requirements. I would expand the definition to indicate if any source is used or not (guessing no) and the use of wild cards (also guessing not needed -- meaning exact match on tag only).

If a provider can't optimize for the semantic or support it, that's acceptable. But the semantic should be driven by the application.

I believe there are 2 issues, only one of which is related to tcp. The first issue is that the tcp provider may not detect if a socket has been disconnected. This needs to be handled within the provider. Maybe periodically re-enabling polling for cleanup purposes would work. I'm not sure, as there may still be data sitting on the socket waiting to be read.

The other issue is that the MPI API is ludicrous and requires infinite storage of out-of-order unexpected messages. The problem is that the provider has no way of knowing if a pile of unexpected messages will someday be needed or not. Even if the tcp provider could detect that a socket has been disconnected, it still wouldn't be able to discard received data. The app may still call for it, as happens with mpi.

Because the second issue leads to the first, I don't think this can be handled internally in the provider without application guidance. AFAICT, all providers would hit this issue.

Contributor

@iziemba @jxiong Send-side completion depends on the completion flag. If delivery completion is asked for, an error CQE should be generated. However, if only transmission completion is asked for, the send could have already completed before this unexpected message handling happens. For the RPC case, delivery completion is most likely needed.

Contributor

I agree with @shefty that the definition should clarify the any source and wild card usage.

Member

A failed match requires FI_MATCH_COMPLETE. FI_DELIVERY_COMPLETE allows buffering at the receiver. In general, I think the send operation will end up reporting success.

Contributor

Makes sense on FI_TRANSMIT_COMPLETE, FI_DELIVERY_COMPLETE, and FI_MATCH_COMPLETE.
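
To illustrate the send-side behavior discussed above, here is a hedged sketch that requests FI_MATCH_COMPLETE and checks for a CQ error entry; whether a dropped unmatched message actually surfaces as an error is exactly what is being debated, so treat the error path as an assumption:

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>
#include <rdma/fi_eq.h>

/* Sketch: send an RPC request with FI_MATCH_COMPLETE so the completion is
 * deferred until the target matches (or, under this proposal, drops) the
 * message, then poll the TX CQ for a success or error entry. */
static int tsend_match_complete(struct fid_ep *ep, struct fid_cq *tx_cq,
                                fi_addr_t dest, const void *buf, size_t len,
                                uint64_t tag, void *ctx)
{
    struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
    struct fi_msg_tagged msg = {
        .msg_iov = &iov, .desc = NULL, .iov_count = 1,
        .addr = dest, .tag = tag, .ignore = 0, .context = ctx, .data = 0,
    };
    ssize_t ret = fi_tsendmsg(ep, &msg, FI_MATCH_COMPLETE | FI_COMPLETION);

    if (ret)
        return (int) ret;

    struct fi_cq_tagged_entry comp;
    for (;;) {
        ret = fi_cq_read(tx_cq, &comp, 1);
        if (ret > 0)
            return 0;                        /* matched and completed */
        if (ret == -FI_EAVAIL) {             /* error completion queued */
            struct fi_cq_err_entry err = { 0 };
            fi_cq_readerr(tx_cq, &err, 0);
            return -err.err;                 /* e.g., target dropped it */
        }
        if (ret != -FI_EAGAIN)
            return (int) ret;
    }
}
```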

This new mem_tag_format tells libfabric that tags are being used
to match RPC requests and replies. Therefore, messages with unmatched
tags are stale and can be discarded by libfabric.

Signed-off-by: Jinshan Xiong <[email protected]>
@j-xiong (Contributor) commented Mar 4, 2025

Ideally, the unmatched "stale" message should not exist in the first place. The reason it happens is that the provider cannot distinguish new and old peers that happen to use the same address. Imagine the following sequence:
(1) the server has received a bunch of requests from the old client but has not processed them.
(2) the old client dies, and the underlying bsock breaks.
(3) a new client starts and gets the same address as the old client.
(4) the server processes the old requests. It will connect and send the replies to the new client.

The current sockaddr-based AV implementation doesn't carry a "generation-id". The CM protocol does use the pid for sanity checking, but that information is not exposed to the AV, and therefore there is no way to know that a request is from the old client.

This is harder to fix. However, it would be a more proper fix if it can be done.

@soumagne (Contributor) commented Mar 5, 2025

@j-xiong Unfortunately, I believe this is not only about identifying connections. For instance, even without having processes or connections interrupted, you could imagine a scenario where:

  1. Client sends RPC to server and preposts a recv for the response with tag = 1.
  2. For some reason, the RPC times out (server could be busy, network slow, etc.) and gets canceled, meaning the recv for the response gets canceled and there is no more recv preposted with tag = 1.
  3. Server wakes up and ultimately sends a tagged response back with tag = 1.
  4. Client receives the response as an unexpected message that remains stale. If it were to retry the RPC, it would post a new RPC with tag = 2, so the previous response remains in the queue with no way to grab/drop it.

So really, the issue in that particular case is about dropping tagged messages for which no recv has been pre-posted.
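
Roughly, the client-side sequence above would look like this (tag values taken from the example; error handling omitted for brevity):

```c
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Sketch of the scenario above: the pre-posted reply receive for tag 1 is
 * canceled when the RPC times out, so a late server reply with tag 1 has
 * no matching receive and lingers as a stale unexpected message. */
static void rpc_timeout_scenario(struct fid_ep *ep, fi_addr_t server,
                                 void *req, size_t req_len,
                                 void *reply, size_t reply_len)
{
    struct fi_context rx_ctx;

    /* 1. Send the RPC and pre-post the response receive with tag = 1. */
    fi_trecv(ep, reply, reply_len, NULL, server, 1, 0, &rx_ctx);
    fi_tsend(ep, req, req_len, NULL, server, 1, NULL);

    /* 2. The RPC times out; cancel the tag = 1 response receive. */
    fi_cancel(&ep->fid, &rx_ctx);

    /* 3./4. The server's late tag = 1 reply now arrives unexpected; the
     * retried RPC uses tag = 2, so nothing ever matches the stale reply. */
    fi_trecv(ep, reply, reply_len, NULL, server, 2, 0, &rx_ctx);
    fi_tsend(ep, req, req_len, NULL, server, 2, NULL);
}
```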

@j-xiong (Contributor) commented Mar 5, 2025

@soumagne Thanks. I agree this is a valid case as well.

@shefty (Member) commented Mar 5, 2025

I would ignore the tcp provider implementation as part of this discussion. The provider could be using a connectionless transport. In that case, there isn't an underlying generation-id that can be used. The peer restarts, gets the same address, and just starts sending.

It may be that the app should include a generation-id as part of its protocol (e.g. embedded into the tag), but that's unknown to the provider, because it's possible that a restarted peer may functionally be the same peer as the previous process (e.g. using some save-update-restart mechanism).

If RPC tag matching used a best match algorithm instead of first match, an unmatched message could be reported to the receiver, which could then decide whether or not to drop it. By best match, I mean try an exact match first, and a wildcard match second. This would essentially give the app direct control over an 'unexpected message queue'.
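
A hypothetical sketch of that best-match order over a provider's posted-receive list (the data structure and names are invented for illustration, not provider code):

```c
#include <stddef.h>
#include <stdint.h>

/* Invented posted-receive entry, only to show the matching order: try an
 * exact tag match first, and fall back to wildcard (ignore-bit) matches
 * second. An unmatched message can then be reported to the receiver,
 * which decides whether to drop it. */
struct posted_recv {
    uint64_t tag;
    uint64_t ignore;                  /* bits the receiver ignores */
    struct posted_recv *next;
};

static struct posted_recv *best_match(struct posted_recv *head, uint64_t tag)
{
    struct posted_recv *rx;

    for (rx = head; rx; rx = rx->next)            /* pass 1: exact match */
        if (rx->ignore == 0 && rx->tag == tag)
            return rx;

    for (rx = head; rx; rx = rx->next)            /* pass 2: wildcard match */
        if ((rx->tag & ~rx->ignore) == (tag & ~rx->ignore))
            return rx;

    return NULL;                                  /* unmatched message */
}
```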
