[v1.22.x] prov/efa: backport changes #10673

Merged
5 changes: 5 additions & 0 deletions man/fi_efa.7.md
@@ -338,6 +338,11 @@ for details.
: Use device's unsolicited write recv functionality when it's available. (Default: 1).
Setting this environment variable to 0 can disable this feature.

*FI_EFA_INTERNAL_RX_REFILL_THRESHOLD*
: The minimum number of internal rx pkts that must be waiting to be posted before the EFA provider refills the internal rx pkt pool. (Default: 8).
When the number of internal rx pkts to post is lower than this threshold,
the refill will be skipped.

# SEE ALSO

[`fabric`(7)](fabric.7.html),
18 changes: 18 additions & 0 deletions prov/efa/docs/efa_rdm_protocol_v4.md
@@ -68,6 +68,12 @@ Chapter 4 "extra features/requests" describes the extra features/requests define

* Section 4.6 describes the extra feature: RDMA-Write based message transfer.

* Section 4.7 describes the extra feature: Long read and runting read nack protocol.

* Section 4.8 describes the extra feature: User receive QP.

* Section 4.9 describes the extra feature: Unsolicited write recv.

Chapter 5 "What's not covered?" describes the contents that are intentionally left out of
this document because they are considered "implementation details".

@@ -323,6 +329,7 @@ Table: 2.1 a list of extra features/requests
| 5 | RDMA-Write based data transfer | extra feature | libfabric 1.18.0 | Section 4.6 |
| 6 | Read nack packets | extra feature | libfabric 1.20.0 | Section 4.7 |
| 7 | User recv QP | extra feature & request| libfabric 1.22.0 | Section 4.8 |
| 8 | Unsolicited write recv | extra feature | libfabric 1.22.0 | Section 4.9 |

How does protocol v4 maintain backward compatibility when extra features/requests are introduced?

@@ -1611,6 +1618,17 @@ zero-copy receive mode.
If a receiver gets RTM packets delivered to its default QP, it raises an error
because it requires that all RTM packets be delivered to its user recv QP.

### 4.9 Unsolicited write recv

"Unsolicited write recv" is an extra feature that was
introduced in libfabric 1.22.0. When this feature is on, an RDMA write
with immediate data does not consume an rx buffer on the responder side. It is
defined as an extra feature because a set of requirements (firmware,
EFA kernel module, and rdma-core) must be met before an endpoint can use the unsolicited
write recv capability; therefore, an endpoint cannot assume that the other party supports
unsolicited write recv. An RDMA write with immediate data cannot be issued if the
local endpoint and the peer differ in their support for this feature.

## 5. What's not covered?

The purpose of this document is to define the communication protocol. Therefore, it is intentionally written
4 changes: 4 additions & 0 deletions prov/efa/src/efa_env.c
@@ -39,6 +39,7 @@ struct efa_env efa_env = {
.use_sm2 = false,
.huge_page_setting = EFA_ENV_HUGE_PAGE_UNSPEC,
.use_unsolicited_write_recv = 1,
.internal_rx_refill_threshold = 8,
};

/**
@@ -132,6 +133,7 @@ void efa_env_param_get(void)
&efa_mr_max_cached_size);
fi_param_get_size_t(&efa_prov, "tx_size", &efa_env.tx_size);
fi_param_get_size_t(&efa_prov, "rx_size", &efa_env.rx_size);
fi_param_get_size_t(&efa_prov, "internal_rx_refill_threshold", &efa_env.internal_rx_refill_threshold);
fi_param_get_bool(&efa_prov, "rx_copy_unexp",
&efa_env.rx_copy_unexp);
fi_param_get_bool(&efa_prov, "rx_copy_ooo",
@@ -232,6 +234,8 @@ void efa_env_define()
"will use huge page unless FI_EFA_FORK_SAFE is set to 1/on/true.");
fi_param_define(&efa_prov, "use_unsolicited_write_recv", FI_PARAM_BOOL,
"Use device's unsolicited write recv functionality when it's available. (Default: true)");
fi_param_define(&efa_prov, "internal_rx_refill_threshold", FI_PARAM_SIZE_T,
"The threshold that EFA provider will refill the internal rx pkt pool. (Default: %zu)", efa_env.internal_rx_refill_threshold);
}
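
The provider reads this parameter through `fi_param_get_size_t`. Purely to illustrate the "default of 8 unless overridden" behaviour, here is a rough standalone sketch that parses the environment variable by hand. It is an assumption-laden stand-in, not libfabric's parser: the `sketch_` functions are hypothetical, and falling back to the default on malformed input is a choice made for this example only.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a non-negative decimal size; return default_value on any malformed input. */
static size_t sketch_parse_size(const char *text, size_t default_value)
{
	char *end = NULL;
	unsigned long long parsed;

	/* Require a leading digit so that signs and whitespace are rejected. */
	if (text == NULL || !isdigit((unsigned char) text[0]))
		return default_value;

	errno = 0;
	parsed = strtoull(text, &end, 10);
	assert(end != NULL);
	if (errno != 0 || *end != '\0' || parsed > SIZE_MAX)
		return default_value;

	return (size_t) parsed;
}

/* Hypothetical helper: threshold from the environment, defaulting to 8. */
static size_t sketch_refill_threshold_from_env(void)
{
	return sketch_parse_size(getenv("FI_EFA_INTERNAL_RX_REFILL_THRESHOLD"), 8);
}
```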


6 changes: 6 additions & 0 deletions prov/efa/src/efa_env.h
@@ -79,6 +79,12 @@ struct efa_env {
int use_sm2;
enum efa_env_huge_page_setting huge_page_setting;
int use_unsolicited_write_recv;
/**
* The minimum number of internal rx pkts that must be waiting to be posted
* before the EFA provider refills the internal rx pkt pool.
* When the number of internal rx pkts to post is lower than this threshold,
* the refill will be skipped.
*/
size_t internal_rx_refill_threshold;
};

extern struct efa_env efa_env;
8 changes: 8 additions & 0 deletions prov/efa/src/rdm/efa_rdm_ep.h
@@ -263,6 +263,8 @@ struct efa_domain *efa_rdm_ep_domain(struct efa_rdm_ep *ep)

void efa_rdm_ep_post_internal_rx_pkts(struct efa_rdm_ep *ep);

int efa_rdm_ep_bulk_post_internal_rx_pkts(struct efa_rdm_ep *ep);

/**
* @brief return whether this endpoint should write error cq entry for RNR.
*
@@ -446,4 +448,10 @@ static inline int efa_rdm_attempt_to_sync_memops_ioc(struct efa_rdm_ep *ep, stru
return err;
}

static inline
bool efa_rdm_ep_support_unsolicited_write_recv(struct efa_rdm_ep *ep)
{
return ep->extra_info[0] & EFA_RDM_EXTRA_FEATURE_UNSOLICITED_WRITE_RECV;
}

#endif
3 changes: 3 additions & 0 deletions prov/efa/src/rdm/efa_rdm_ep_fiops.c
@@ -1022,6 +1022,9 @@ void efa_rdm_ep_set_extra_info(struct efa_rdm_ep *ep)

ep->extra_info[0] |= EFA_RDM_EXTRA_FEATURE_DELIVERY_COMPLETE;

if (efa_rdm_use_unsolicited_write_recv())
ep->extra_info[0] |= EFA_RDM_EXTRA_FEATURE_UNSOLICITED_WRITE_RECV;

if (ep->use_zcpy_rx) {
/*
* When zcpy rx is enabled, an extra QP is created to
6 changes: 5 additions & 1 deletion prov/efa/src/rdm/efa_rdm_ep_utils.c
@@ -743,7 +743,11 @@ int efa_rdm_ep_bulk_post_internal_rx_pkts(struct efa_rdm_ep *ep)
{
int i, err;

- if (ep->efa_rx_pkts_to_post == 0)
+ /**
+ * When efa_env.internal_rx_refill_threshold > efa_rdm_ep_get_rx_pool_size(ep),
+ * we should always refill when the pool is empty.
+ */
+ if (ep->efa_rx_pkts_to_post < MIN(efa_env.internal_rx_refill_threshold, efa_rdm_ep_get_rx_pool_size(ep)))
return 0;

assert(ep->efa_rx_pkts_to_post + ep->efa_rx_pkts_posted <= ep->efa_max_outstanding_rx_ops);
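
As a reading aid for the new early-return condition, here is a minimal standalone sketch of the refill decision. The names `sketch_min` and `sketch_should_refill` are hypothetical and not part of the provider, and the sketch assumes the number of packets waiting to be posted never exceeds the pool size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static size_t sketch_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Returns true when a bulk refill should be posted.
 * Capping the threshold at the pool size means a fully drained pool is
 * always refilled, even if the configured threshold exceeds the pool size.
 */
static bool sketch_should_refill(size_t pkts_to_post, size_t refill_threshold,
				 size_t rx_pool_size)
{
	/* Assumption of this sketch: we never owe more packets than the pool holds. */
	assert(pkts_to_post <= rx_pool_size);
	return pkts_to_post >= sketch_min(refill_threshold, rx_pool_size);
}
```

With the default threshold of 8 and a hypothetical pool of 64 packets, `sketch_should_refill(7, 8, 64)` is false and `sketch_should_refill(8, 8, 64)` is true. With a threshold of 100 and a pool of 16, only a fully drained pool (16 packets to post) triggers a refill.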
17 changes: 17 additions & 0 deletions prov/efa/src/rdm/efa_rdm_peer.h
@@ -109,6 +109,23 @@ bool efa_rdm_peer_support_rdma_write(struct efa_rdm_peer *peer)
(peer->extra_info[0] & EFA_RDM_EXTRA_FEATURE_RDMA_WRITE);
}

/**
* @brief check for peer's unsolicited write support, assuming HANDSHAKE has already occurred
*
* @param[in] peer A peer which we have already received a HANDSHAKE from
* @return bool The peer's unsolicited write recv support
*/
static inline
bool efa_rdm_peer_support_unsolicited_write_recv(struct efa_rdm_peer *peer)
{
/* Unsolicited write recv is an extra feature defined in version 4 (the base version).
* Because it is an extra feature, an EP will assume the peer does not support
* it before a handshake packet was received.
*/
return (peer->flags & EFA_RDM_PEER_HANDSHAKE_RECEIVED) &&
(peer->extra_info[0] & EFA_RDM_EXTRA_FEATURE_UNSOLICITED_WRITE_RECV);
}

static inline
bool efa_rdm_peer_support_delivery_complete(struct efa_rdm_peer *peer)
{
3 changes: 2 additions & 1 deletion prov/efa/src/rdm/efa_rdm_protocol.h
@@ -40,7 +40,8 @@ struct efa_ep_addr {
#define EFA_RDM_EXTRA_FEATURE_RDMA_WRITE BIT_ULL(5)
#define EFA_RDM_EXTRA_FEATURE_READ_NACK BIT_ULL(6)
#define EFA_RDM_EXTRA_FEATURE_REQUEST_USER_RECV_QP BIT_ULL(7)
- #define EFA_RDM_NUM_EXTRA_FEATURE_OR_REQUEST 8
+ #define EFA_RDM_EXTRA_FEATURE_UNSOLICITED_WRITE_RECV BIT_ULL(8)
+ #define EFA_RDM_NUM_EXTRA_FEATURE_OR_REQUEST 9
/*
* The length of 64-bit extra_info array used in efa_rdm_ep
* and efa_rdm_peer
18 changes: 18 additions & 0 deletions prov/efa/src/rdm/efa_rdm_rma.c
@@ -366,6 +366,24 @@ ssize_t efa_rdm_rma_post_write(struct efa_rdm_ep *ep, struct efa_rdm_ope *txe)
return efa_rdm_ep_enforce_handshake_for_txe(ep, txe);

if (efa_rdm_rma_should_write_using_rdma(ep, txe, txe->peer)) {
/**
* Unsolicited write recv is a feature that makes rdma-write with
* imm not consume an rx buffer on the responder side, and this
* feature requires consistent support status on both sides.
*/
if ((txe->fi_flags & FI_REMOTE_CQ_DATA) &&
(efa_rdm_ep_support_unsolicited_write_recv(ep) != efa_rdm_peer_support_unsolicited_write_recv(txe->peer))) {
(void) efa_rdm_construct_msg_with_local_and_peer_information(ep, txe->addr, ep->err_msg, "", EFA_RDM_ERROR_MSG_BUFFER_LENGTH);
EFA_WARN(FI_LOG_EP_DATA,
"Inconsistent support status detected on unsolicited write recv.\n"
"My support status: %d, peer support status: %d. %s.\n"
"This is usually caused by inconsistent efa driver, libfabric, or rdma-core versions.\n"
"Please use consistent software versions on both hosts, or disable the unsolicited write "
"recv feature by setting environment variable FI_EFA_USE_UNSOLICITED_WRITE_RECV=0\n",
efa_rdm_use_unsolicited_write_recv(), efa_rdm_peer_support_unsolicited_write_recv(txe->peer),
ep->err_msg);
return -FI_EOPNOTSUPP;
}
efa_rdm_ope_prepare_to_post_write(txe);
return efa_rdm_ope_post_remote_write(txe);
}
91 changes: 57 additions & 34 deletions prov/efa/src/rdm/efa_rdm_util.c
@@ -97,6 +97,53 @@ void efa_rdm_get_desc_for_shm(int numdesc, void **efa_desc, void **shm_desc)
}
}

/**
* @brief Construct a message that contains the local and peer information,
* including the efa address and the host id.
*
* @param ep EFA RDM endpoint
* @param addr Remote peer fi_addr_t
* @param msg the ptr of the msg to be constructed (needs to be allocated already!)
* @param base_msg ptr to the base msg that will show at the beginning of msg
* @param msg_len the length of the message
* @return int 0 on success, negative integer on failure
*/
int efa_rdm_construct_msg_with_local_and_peer_information(struct efa_rdm_ep *ep, fi_addr_t addr, char *msg, const char *base_msg, size_t msg_len)
{
char ep_addr_str[OFI_ADDRSTRLEN] = {0}, peer_addr_str[OFI_ADDRSTRLEN] = {0};
char peer_host_id_str[EFA_HOST_ID_STRING_LENGTH + 1] = {0};
char local_host_id_str[EFA_HOST_ID_STRING_LENGTH + 1] = {0};
size_t len = 0;
int ret;
struct efa_rdm_peer *peer = efa_rdm_ep_get_peer(ep, addr);

len = sizeof(ep_addr_str);
efa_rdm_ep_raw_addr_str(ep, ep_addr_str, &len);
len = sizeof(peer_addr_str);
efa_rdm_ep_get_peer_raw_addr_str(ep, addr, peer_addr_str, &len);

if (!ep->host_id || EFA_HOST_ID_STRING_LENGTH != snprintf(local_host_id_str, EFA_HOST_ID_STRING_LENGTH + 1, "i-%017lx", ep->host_id)) {
strcpy(local_host_id_str, "N/A");
}

if (!peer->host_id || EFA_HOST_ID_STRING_LENGTH != snprintf(peer_host_id_str, EFA_HOST_ID_STRING_LENGTH + 1, "i-%017lx", peer->host_id)) {
strcpy(peer_host_id_str, "N/A");
}

ret = snprintf(msg, msg_len, "%s My EFA addr: %s My host id: %s Peer EFA addr: %s Peer host id: %s",
base_msg, ep_addr_str, local_host_id_str, peer_addr_str, peer_host_id_str);

if (ret < 0 || ret > msg_len - 1) {
return -FI_EINVAL;
}

if (strlen(msg) >= msg_len) {
return -FI_ENOBUFS;
}

return FI_SUCCESS;
}
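
The helper above relies on two `snprintf` properties: the return value is the length the full output would have had, and the output is always NUL-terminated when the buffer size is nonzero. The following standalone sketch shows the same pattern in a reduced form. It is not the provider's function: the `sketch_` names, the shortened message format, and the `-1` error return are assumptions made for illustration, and `SKETCH_HOST_ID_STRLEN` is defined locally as the length of `"i-"` plus 17 hex digits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_HOST_ID_STRLEN 19 /* "i-" plus 17 hex digits */

/* Writes "i-<17 hex digits>" into out, or "N/A" when the id is unset or malformed. */
static void sketch_format_host_id(uint64_t host_id, char out[SKETCH_HOST_ID_STRLEN + 1])
{
	int n = -1;

	if (host_id != 0)
		n = snprintf(out, SKETCH_HOST_ID_STRLEN + 1, "i-%017" PRIx64, host_id);
	if (n != SKETCH_HOST_ID_STRLEN)
		strcpy(out, "N/A");
}

/* Returns 0 on success, -1 if formatting failed or the message did not fit. */
static int sketch_build_msg(char *msg, size_t msg_len, const char *base_msg,
			    uint64_t local_id, uint64_t peer_id)
{
	char local_str[SKETCH_HOST_ID_STRLEN + 1];
	char peer_str[SKETCH_HOST_ID_STRLEN + 1];
	int n;

	assert(msg != NULL && base_msg != NULL);
	if (msg_len == 0)
		return -1;

	sketch_format_host_id(local_id, local_str);
	sketch_format_host_id(peer_id, peer_str);

	n = snprintf(msg, msg_len, "%s My host id: %s Peer host id: %s",
		     base_msg, local_str, peer_str);

	/* n counts the characters needed, excluding the terminating NUL. */
	if (n < 0 || (size_t) n >= msg_len)
		return -1;

	return 0;
}
```

Casting the return value to `size_t` only after checking that it is non-negative keeps the length comparison free of signed/unsigned surprises.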

/**
* @brief Write the error message and return its byte length
* @param[in] ep EFA RDM endpoint
@@ -108,42 +155,18 @@ void efa_rdm_get_desc_for_shm(int numdesc, void **efa_desc, void **shm_desc)
*/
int efa_rdm_write_error_msg(struct efa_rdm_ep *ep, fi_addr_t addr, int prov_errno, void **buf, size_t *buflen)
{
- char ep_addr_str[OFI_ADDRSTRLEN] = {0}, peer_addr_str[OFI_ADDRSTRLEN] = {0};
- char peer_host_id_str[EFA_HOST_ID_STRING_LENGTH + 1] = {0};
- char local_host_id_str[EFA_HOST_ID_STRING_LENGTH + 1] = {0};
- const char *base_msg = efa_strerror(prov_errno);
- size_t len = 0;
- struct efa_rdm_peer *peer = efa_rdm_ep_get_peer(ep, addr);
-
- *buf = NULL;
- *buflen = 0;
-
- len = sizeof(ep_addr_str);
- efa_rdm_ep_raw_addr_str(ep, ep_addr_str, &len);
- len = sizeof(peer_addr_str);
- efa_rdm_ep_get_peer_raw_addr_str(ep, addr, peer_addr_str, &len);
-
- if (!ep->host_id || EFA_HOST_ID_STRING_LENGTH != snprintf(local_host_id_str, EFA_HOST_ID_STRING_LENGTH + 1, "i-%017lx", ep->host_id)) {
- strcpy(local_host_id_str, "N/A");
- }
-
- if (!peer->host_id || EFA_HOST_ID_STRING_LENGTH != snprintf(peer_host_id_str, EFA_HOST_ID_STRING_LENGTH + 1, "i-%017lx", peer->host_id)) {
- strcpy(peer_host_id_str, "N/A");
- }
-
- int ret = snprintf(ep->err_msg, EFA_RDM_ERROR_MSG_BUFFER_LENGTH, "%s My EFA addr: %s My host id: %s Peer EFA addr: %s Peer host id: %s",
- base_msg, ep_addr_str, local_host_id_str, peer_addr_str, peer_host_id_str);
+ const char *base_msg = efa_strerror(prov_errno);
+ int ret;

- if (ret < 0 || ret > EFA_RDM_ERROR_MSG_BUFFER_LENGTH - 1) {
- return -FI_EINVAL;
- }
+ *buf = NULL;
+ *buflen = 0;

- if (strlen(ep->err_msg) >= EFA_RDM_ERROR_MSG_BUFFER_LENGTH) {
- return -FI_ENOBUFS;
- }
+ ret = efa_rdm_construct_msg_with_local_and_peer_information(ep, addr, ep->err_msg, base_msg, EFA_RDM_ERROR_MSG_BUFFER_LENGTH);
+ if (ret)
+ return ret;

- *buf = ep->err_msg;
- *buflen = EFA_RDM_ERROR_MSG_BUFFER_LENGTH;
+ *buf = ep->err_msg;
+ *buflen = EFA_RDM_ERROR_MSG_BUFFER_LENGTH;

- return 0;
+ return 0;
}
2 changes: 2 additions & 0 deletions prov/efa/src/rdm/efa_rdm_util.h
@@ -19,6 +19,8 @@ bool efa_rdm_get_use_device_rdma(uint32_t fabric_api_version);

void efa_rdm_get_desc_for_shm(int numdesc, void **efa_desc, void **shm_desc);

int efa_rdm_construct_msg_with_local_and_peer_information(struct efa_rdm_ep *ep, fi_addr_t addr, char *msg, const char *base_msg, size_t msg_len);

int efa_rdm_write_error_msg(struct efa_rdm_ep *ep, fi_addr_t addr, int prov_errno, void **buf, size_t *buflen);

#ifdef ENABLE_EFA_POISONING