core: Introduce new tag FI_TAG_RPC #10792

Open · wants to merge 1 commit into base: main
Conversation


@jxiong jxiong commented Feb 15, 2025

This new mem_tag_format tells libfabric that tags are being used to match RPC requests and replies. Therefore, messages with unmatched tags are stale and can be discarded by libfabric.
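
For illustration only, a minimal sketch of the intended client-side pattern, assuming FI_TAG_RPC is adopted as proposed (the helper name, tag value, and omitted endpoint setup are hypothetical):

```c
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Sketch: ep and peer_addr are assumed to come from the usual
 * fi_fabric/fi_domain/fi_endpoint/fi_av setup (not shown). */
static ssize_t send_rpc(struct fid_ep *ep, fi_addr_t peer_addr,
                        const void *req, size_t req_len,
                        void *reply, size_t reply_len, uint64_t rpc_tag)
{
    ssize_t ret;

    /* Under FI_TAG_RPC, the reply buffer must be posted with the RPC's tag
     * before the request goes out, so a matching receive always exists. */
    ret = fi_trecv(ep, reply, reply_len, NULL, peer_addr, rpc_tag, 0, NULL);
    if (ret)
        return ret;

    return fi_tsend(ep, req, req_len, NULL, peer_addr, rpc_tag, NULL);
}
```

Any reply arriving with a tag for which no such receive is posted would, under this proposal, be treated as stale and dropped.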

*FI_TAG_RPC*
: The FI_TAG_RPC flag is used to indicate that tags are being utilized to match
RPC requests and replies. When specified via fi_getinfo, the caller ensures that
Member

match RPC replies with requests.

: The FI_TAG_RPC flag is used to indicate that tags are being utilized to match
RPC requests and replies. When specified via fi_getinfo, the caller ensures that
a reply buffer with the corresponding tag is registered when sending a request.
Member

a receive buffer with a corresponding tag to receive an RPC reply has been posted prior sending an RPC request. As a result, unexpected message handling may be optimized.

This mechanism enables libfabric to identify and discard stale replies, preventing
them from interfering with new communications. This is crucial to avoid blocking
a restarting endpoint, which may otherwise lack sufficient metadata to process
incoming messages with unmatched tags.
Member

When FI_TAG_RPC is specified, unmatched received messages should be identified as stale replies and discarded.

Member

We need to decide if the correct operation is to silently discard the message or report a failed receive operation to the user. It's worth noting that this new tag format may end up defining behavior requirements which differ from the generic tag formats. That is, depending on how this is defined FI_TAG_RPC may not be converted by the provider to FI_TAG_BITS behavior.

I'm okay with this sort of change. It just differs from how FI_TAG_MPI and FI_TAG_CCL are defined. FI_TAG_RPC basically becomes a new feature which a provider needs to support. Either DAOS would need to be responsible for falling back to FI_TAG_BITS, if we allow fi_getinfo() to fail. Or, we can decide to allow fi_getinfo() to succeed, but return FI_TAG_BITS in place of FI_TAG_RPC, and require the caller to check the results. (I would lean towards fi_getinfo() failing.)

Author

> We need to decide if the correct operation is to silently discard the message or report a failed receive operation to the user. It's worth noting that this new tag format may end up defining behavior requirements which differ from the generic tag formats. That is, depending on how this is defined FI_TAG_RPC may not be converted by the provider to FI_TAG_BITS behavior.
>
> I'm okay with this sort of change. It just differs from how FI_TAG_MPI and FI_TAG_CCL are defined. FI_TAG_RPC basically becomes a new feature which a provider needs to support. Either DAOS would need to be responsible for falling back to FI_TAG_BITS, if we allow fi_getinfo() to fail. Or, we can decide to allow fi_getinfo() to succeed, but return FI_TAG_BITS in place of FI_TAG_RPC, and require the caller to check the results. (I would lean towards fi_getinfo() failing.)

Yeah, we can check if this flag is correctly returned in the mercury library.
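
A hedged sketch of that check on the caller's side, assuming fi_getinfo() is allowed to fail for unsupported formats (the helper name and fallback policy are illustrative, not actual DAOS/mercury code):

```c
#include <rdma/fabric.h>

/* Sketch: request the proposed FI_TAG_RPC format and fall back to plain
 * FI_TAG_BITS if the provider rejects it. */
static int get_rpc_info(struct fi_info *hints, struct fi_info **info)
{
    int ret;

    hints->ep_attr->mem_tag_format = FI_TAG_RPC;
    ret = fi_getinfo(fi_version(), NULL, NULL, 0, hints, info);
    if (ret == -FI_ENODATA) {
        /* Provider does not support FI_TAG_RPC; the app must then handle
         * stale replies itself. */
        hints->ep_attr->mem_tag_format = FI_TAG_BITS;
        ret = fi_getinfo(fi_version(), NULL, NULL, 0, hints, info);
    }
    return ret;
}
```

If the other option were chosen (fi_getinfo() succeeds but substitutes FI_TAG_BITS), the caller would instead inspect (*info)->ep_attr->mem_tag_format after the call.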

Member

Without a matching user-posted receive, there's no great way to report an error. (I was thinking about truncating the message and reporting some unmatched error, which could be useful for debugging.) But there's no awesome way to do this.

So, I think the message needs to be silently discarded (except for logging an error).

: The FI_TAG_RPC flag is used to indicate that tags are being utilized to match RPC
replies with requests. When specified via fi_getinfo, the caller ensures that
a receive buffer with a corresponding tag to receive an RPC reply has been posted
prior sending an RPC request. As a result, unexpected message handling may be
Contributor

prior to

Contributor

Also please squash the commits and do a force push.

@soumagne (Contributor)

@j-xiong can we hold off on this PR for now? I think we need more time to think it through. I am not too keen on pushing for a new libfabric feature if we don't end up using it in the future and instead end up simply using fi_msg for both RPC requests and responses. Also, it sounds like there could potentially be more to it if FI_DIRECTED_RECV is also used?
We should prototype something first and see what makes more sense.

@iziemba iziemba (Contributor) left a comment

Thanks for opening this PR. It highlights part of a bigger issue: libfabric RDM EPs may not be optimized for storage usage. I have created issue #10798 for the bigger problem.

I think this PR is related to issue 2 in #10798. The following is content for issue 2.

Storage may use tagged buffers to identify unique RPCs. Due to the async nature of posting tagged receive buffers, there is a potential race where a source issues a tagged send before the tagged receive buffer is processed by hardware/provider. Providers need to implement some mechanism to handle this unexpected message case.

Because these operations are async, immediately failing the libfabric unexpected message is the wrong answer. If this is the only option a provider could support, it seems like a synchronous tagged send/recv API is needed by the provider and this would come at a performance cost (e.g., no more pipelining tagged receive operations).

In addition, buffering unexpected tagged sends at a target may be problematic due to resource leaks. For example, a client that unexpectedly reboots will lose all context (tag bits) of in-flight RPCs. But the server is still aware of the RPC and issues a tagged send as an RPC response back to the client. Since the client has no idea about this tagged operation, the stale tagged send would be buffered "indefinitely".

What is worse, if this operation eventually matched a posted tagged receive, unexpected behavior may happen. If the libfabric user carefully manages tag bits, tagged receives should not match a stale unexpected tagged send.

I think this PR is on the right track to solving issue 2 in #10798. But, I think we want to ensure the solution handles this async nature of send/recv APIs.

In addition, is using a new mem_tag_format the best option? It seems like this behavior could apply to all existing mem tags.

Comment on lines +925 to +927
replies with requests. When specified via fi_getinfo, the caller ensures that
a receive buffer with a corresponding tag to receive an RPC reply has been posted
prior to sending an RPC request. As a result, unexpected message handling may be
Contributor

How can callers ensure that a receive buffer is posted? The data transfer APIs are asynchronous. In theory, there is a race here which could result in fi_tsend being issued before the provider/HW has processed fi_trecv. In addition, should this race be hit with FI_ORDER_SAS, the API contract may be broken.

@jxiong jxiong (Author) Feb 19, 2025

I didn't know the trecv APIs were async. As far as I can see, they are not async for the TCP provider.

This sounds like a serious issue to me, because if there exist providers that require this for zero-copy or DMA, this behavior will make zero-copy/DMA impossible since the buffer is not ready when the data arrives. @shefty @iziemba can you share some more insights? It's unlikely for me to work on that, but it is still interesting to know how you would approach it.

Member

The assumption is that once fi_trecv() or fi_recv() return, the buffers are available. This is needed by apps trying to implement flow control. If fi_trecv() or fi_recv() just mean that the buffer may someday in the future be made available, we're almost certainly making life impossible for apps. No credit-based flow control will ever work.

The completion of a posted receive buffer is asynchronous. The posting should not be.
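
A rough sketch of that contract (the retry loop and polling policy are illustrative; a real app would use its own progress and credit logic):

```c
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>
#include <rdma/fi_eq.h>

/* Sketch: once fi_trecv() returns 0 the buffer counts as posted, e.g. a
 * flow-control credit can be advertised; only the completion of the
 * receive is asynchronous and is reaped from the CQ later. */
static int post_and_wait(struct fid_ep *ep, struct fid_cq *rx_cq,
                         void *buf, size_t len, uint64_t tag,
                         struct fi_context *ctx)
{
    ssize_t ret;

    do {
        ret = fi_trecv(ep, buf, len, NULL, FI_ADDR_UNSPEC, tag, 0, ctx);
    } while (ret == -FI_EAGAIN);            /* RX queue full: retry */
    if (ret)
        return (int) ret;

    /* Buffer is posted synchronously; its completion arrives later. */
    struct fi_cq_tagged_entry comp;
    do {
        ret = fi_cq_read(rx_cq, &comp, 1);
    } while (ret == -FI_EAGAIN);
    return ret < 0 ? (int) ret : 0;
}
```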

Contributor

I think unexpected messaging closes the race between a buffer receive post and an incoming message.

Author

The concern is that the unexpected-message buffering would be unbounded, and nobody knows how long messages would stay in that buffer.

Coming back to this issue, if the fi_addr is not configured when the unexpected message comes, it will disable polling of the socket fd, which will choke the tcp connection.

Contributor

I can envision an RPC which receives multiple responses for a single request. Using tags for that might be difficult if the response sizes differ greatly. (So, gRPC probably shouldn't be force fit into MPI semantics.)

For a given RPC, if the server is streaming multiple responses, libfabric tagged messaging with FI_TAGGED_MULTI_RECV seems like a good fit. If a user does not want to associate an RPC with a tag, FI_MSG with FI_MULTI_RECV may be the best.

Having the app request unexpected messages be dropped exposes provider implementation details.
That feels off. The API isn't capturing the application semantics correctly but is instead offering some work-around.

Looking at this from a different angle, resource mgmt discusses what to do if no RX buffer exists. Could this be viewed as an extension to resource mgmt?

Consider that 1 of the issues is that an unexpected message may be matched with the wrong request. Dropping the unexpected message helps, but isn't there a race here? What if the 'unexpected' message didn't arrive until it was 'expected'?

A provider does not have enough context to know if a tag is stale or not. This seems to be a user problem. For example, if Mercury uses an incrementing counter to assign tags to RPCs, it seems like there is a potential race of a "stale" in-flight tag being reassigned (e.g., client DAOS reloads and the counter starts at zero). If they were to also include a unique ID (or something similar) along with the 32-bit incrementing counter, this may protect against stale tags.
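
A tiny sketch of that idea (a purely hypothetical bit split; a real layout would be sized to the tag bits the provider reports):

```c
#include <stdint.h>

/* Hypothetical tag layout: the upper 32 bits carry a per-incarnation id
 * chosen at client startup, the lower 32 bits carry the incrementing RPC
 * counter. A stale reply tagged with an old incarnation id can never
 * match a receive posted by the restarted client. */
static uint64_t make_rpc_tag(uint32_t incarnation_id, uint32_t rpc_counter)
{
    return ((uint64_t) incarnation_id << 32) | rpc_counter;
}
```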

@jxiong jxiong (Author) Feb 28, 2025

I understand there might be other complications, but this is certainly helping us. I wonder if we could do it the current way and always come back to improve it when new information is discovered. @shefty @iziemba as long as there is something to debate, it wouldn't be terribly wrong or an obvious mistake, lol

Contributor

If mem_tag_format is to be used, part of the FI_TAG_RPC definition needs to address the format. I think Mercury today uses FI_TAG_BITS. FI_TAG_RPC could be defined to be the following:

FI_TAG_RPC extends FI_TAG_BITS by indicating that tags are being utilized to match RPC replies with requests. When specified, the caller ensures that a receive buffer with a corresponding tag to receive an RPC reply has been posted prior to sending an RPC request. Unmatched tagged received messages must be discarded.

The number of supported tag bits follows the FI_TAG_BITS definition.

Thoughts?

Member

I've been out for about a week. IMO, it would be nice if tag_rpc could also define more semantic meaning around the tag format. For example, ignore bits aren't being used.

I also think it's worth considering this from the angle of resource management, but that's harder to fit a solution into without redefining resource_mgmt (it's currently an enum that's a boolean for practical purposes).

Contributor

@iziemba I think FI_TAG_RPC is closer to FI_TAG_CCL in the sense that exact match is required. Note that both FI_TAG_MPI and FI_TAG_CCL are derivatives of FI_TAG_BITS. They all use leading 0s to indicate the number of reserved bits.

Comment on lines 930 to 931
When FI_TAG_RPC is specified, unmatched received messages should be identified
as stale replies and discarded.
Contributor

At the target, I think discarding is the right behavior. What should be the return status at the source? I think this needs to be considered a failure.

Author

Right now it just silently discards the message. If we can reliably identify a stale RPC, discarding it seems acceptable, no?

If it returns an error to the upper layer, there is pretty much nothing the upper layer can do.

@iziemba thank you for raising all those questions. What do you suggest to improve or address them? Maybe limiting this to the tcp provider?

Contributor

At the target, discarding the receive and not generating an event seems like the right behavior.

At the source, the fi_tsend() must generate a completion event (assuming events are not suppressed). Because the target unexpectedly dropped the receive, I think a CQ error event should be generated for this fi_tsend().

Member

The primary focus for FI_TAG_RPC should be relative to the desired application semantics and requirements. I would expand the definition to indicate if any source is used or not (guessing no) and the use of wild cards (also guessing not needed -- meaning exact match on tag only).

If a provider can't optimize for the semantic or support it, that's acceptable. But the semantic should be driven by the application.

I believe there are 2 issues, only one of which is related to tcp. The first issue is that the tcp provider may not detect if a socket has been disconnected. This needs to be handled within the provider. Maybe periodically re-enabling polling for cleanup purposes would work. I'm not sure, as there may still be data sitting on the socket waiting to be read.

The other issue is that the MPI API is ludicrous and requires infinite storage of out-of-order unexpected messages. The problem is that the provider has no way of knowing if a pile of unexpected messages will someday be needed or not. Even if the tcp provider could detect that a socket has been disconnected, it still wouldn't be able to discard received data. The app may still call for it, as happens with mpi.

Because the second issue leads to the first, I don't think this can be handled internally in the provider without application guidance. AFAICT, all providers would hit this issue.

Contributor

@iziemba @jxiong Send-side completion depends on the completion flag. If delivery completion is asked for, an error CQE should be generated. However, if only transmission completion is asked for, the send could have already completed before this unexpected message handling happens. For the RPC case, delivery completion is most likely needed.

Contributor

I agree with @shefty that the definition should clarify the any source and wild card usage.

Member

A failed match requires FI_MATCH_COMPLETE. FI_DELIVERY_COMPLETE allows buffering at the receiver. In general, I think the send operation will end up reporting success.

Contributor

Makes sense on FI_TRANSMIT_COMPLETE, FI_DELIVERY_COMPLETE, and FI_MATCH_COMPLETE.
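
To illustrate the send-side behavior discussed above, here is a hedged sketch that requests FI_MATCH_COMPLETE and checks for a CQ error entry; whether a dropped unmatched message actually surfaces as an error is exactly what is being debated, so treat the error path as an assumption:

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>
#include <rdma/fi_eq.h>

/* Sketch: send an RPC request with FI_MATCH_COMPLETE so the completion is
 * deferred until the target matches (or, under this proposal, drops) the
 * message, then poll the TX CQ for a success or error entry. */
static int tsend_match_complete(struct fid_ep *ep, struct fid_cq *tx_cq,
                                fi_addr_t dest, const void *buf, size_t len,
                                uint64_t tag, void *ctx)
{
    struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
    struct fi_msg_tagged msg = {
        .msg_iov = &iov, .desc = NULL, .iov_count = 1,
        .addr = dest, .tag = tag, .ignore = 0, .context = ctx, .data = 0,
    };
    ssize_t ret = fi_tsendmsg(ep, &msg, FI_MATCH_COMPLETE | FI_COMPLETION);

    if (ret)
        return (int) ret;

    struct fi_cq_tagged_entry comp;
    for (;;) {
        ret = fi_cq_read(tx_cq, &comp, 1);
        if (ret > 0)
            return 0;                        /* matched and completed */
        if (ret == -FI_EAVAIL) {             /* error completion queued */
            struct fi_cq_err_entry err = { 0 };
            fi_cq_readerr(tx_cq, &err, 0);
            return -err.err;                 /* e.g., target dropped it */
        }
        if (ret != -FI_EAGAIN)
            return (int) ret;
    }
}
```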

This new mem_tag_format tells libfabric that tags are being used
to match RPC requests and replies. Therefore, messages with unmatched
tags are stale and can be discarded by libfabric.

Signed-off-by: Jinshan Xiong <[email protected]>
@j-xiong (Contributor) commented Mar 4, 2025

Ideally, the unmatched "stale" message should not exist in the first place. The reason it happens is that the provider cannot distinguish new and old peers that happen to use the same address. Imagine the following sequence:
(1) the server has received a bunch of requests from the old client but has not processed them.
(2) the old client dies, and the underlying bsock breaks.
(3) a new client starts and gets the same address as the old client.
(4) the server processes the old requests. It will connect and send the replies to the new client.

The current sockaddr-based AV implementation doesn't carry a "generation-id". The CM protocol does use the pid for sanity checking, but that information is not exposed to the AV, and therefore there is no way to know that a request is from the old client.

This is harder to fix. However, it would be a more proper fix if it can be done.

@soumagne (Contributor) commented Mar 5, 2025

@j-xiong Unfortunately, I believe this is not only about identifying connections. For instance, even without having processes or connections interrupted, you could imagine a scenario where:

  1. Client sends RPC to server and preposts a recv for the response with tag = 1.
  2. For some reason, the RPC times out (server could be busy, network slow, etc.) and gets canceled, meaning the recv for the response gets canceled and there is no more recv preposted with tag = 1.
  3. Server wakes up and ultimately sends a tagged response back with tag = 1.
  4. Client receives the response as an unexpected message that remains stale. If it were to retry the RPC, it would post a new RPC with tag = 2, so the previous response remains in the queue with no way to grab/drop it.

So really, the issue in that particular case is about dropping tagged messages for which no recv has been pre-posted.
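
Roughly, the client-side sequence above would look like this (tag values taken from the example; error handling omitted for brevity):

```c
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Sketch of the scenario above: the pre-posted reply receive for tag 1 is
 * canceled when the RPC times out, so a late server reply with tag 1 has
 * no matching receive and lingers as a stale unexpected message. */
static void rpc_timeout_scenario(struct fid_ep *ep, fi_addr_t server,
                                 void *req, size_t req_len,
                                 void *reply, size_t reply_len)
{
    struct fi_context rx_ctx;

    /* 1. Send the RPC and pre-post the response receive with tag = 1. */
    fi_trecv(ep, reply, reply_len, NULL, server, 1, 0, &rx_ctx);
    fi_tsend(ep, req, req_len, NULL, server, 1, NULL);

    /* 2. The RPC times out; cancel the tag = 1 response receive. */
    fi_cancel(&ep->fid, &rx_ctx);

    /* 3./4. The server's late tag = 1 reply now arrives unexpected; the
     * retried RPC uses tag = 2, so nothing ever matches the stale reply. */
    fi_trecv(ep, reply, reply_len, NULL, server, 2, 0, &rx_ctx);
    fi_tsend(ep, req, req_len, NULL, server, 2, NULL);
}
```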

@j-xiong (Contributor) commented Mar 5, 2025

@soumagne Thanks. I agree this is a valid case as well.

@shefty (Member) commented Mar 5, 2025

I would ignore the tcp provider implementation as part of this discussion. The provider could be using a connectionless transport. In that case, there isn't an underlying generation-id that can be used. The peer restarts, gets the same address, and just starts sending.

It may be that the app should include a generation-id as part of its protocol (e.g. embedded into the tag), but that's unknown to the provider, because it's possible that a restarted peer may functionally be the same peer as the previous process (e.g. using some save-update-restart mechanism).

If RPC tag matching used a best match algorithm instead of first match, an unmatched message could be reported to the receiver, which could then decide whether or not to drop it. By best match, I mean try an exact match first, and a wildcard match second. This would essentially give the app direct control over an 'unexpected message queue'.
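
A hypothetical sketch of that best-match order over a provider's posted-receive list (the data structure and names are invented for illustration, not provider code):

```c
#include <stddef.h>
#include <stdint.h>

/* Invented posted-receive entry, only to show the matching order: try an
 * exact tag match first, and fall back to wildcard (ignore-bit) matches
 * second. An unmatched message can then be reported to the receiver,
 * which decides whether to drop it. */
struct posted_recv {
    uint64_t tag;
    uint64_t ignore;                  /* bits the receiver ignores */
    struct posted_recv *next;
};

static struct posted_recv *best_match(struct posted_recv *head, uint64_t tag)
{
    struct posted_recv *rx;

    for (rx = head; rx; rx = rx->next)            /* pass 1: exact match */
        if (rx->ignore == 0 && rx->tag == tag)
            return rx;

    for (rx = head; rx; rx = rx->next)            /* pass 2: wildcard match */
        if ((rx->tag & ~rx->ignore) == (tag & ~rx->ignore))
            return rx;

    return NULL;                                  /* unmatched message */
}
```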
