prov/efa: Do not use shm peer provider when neuron or synapseai devices exist #9624

Closed

Conversation

a-szegel
Contributor

@a-szegel a-szegel commented Dec 6, 2023

UT Results on trn1:

cat /home/ec2-user/PortaFiducia/tests/test_suites/libfabric/efa_unit_test.xml
<?xml version="1.0" encoding="UTF-8" ?>
<testsuites>
  <testsuite name="efa unit tests" time="9.869" tests="63" failures="0" errors="0" skipped="2" >
    <testcase name="test_av_insert_duplicate_raw_addr" time="0.097" >
    </testcase>
    <testcase name="test_av_insert_duplicate_gid" time="0.017" >
    </testcase>
    <testcase name="test_efa_device_construct_error_handling" time="0.001" >
    </testcase>
    <testcase name="test_efa_rdm_ep_ignore_missing_host_id_file" time="0.004" >
    </testcase>
    <testcase name="test_efa_rdm_ep_has_valid_host_id" time="0.004" >
    </testcase>
    <testcase name="test_efa_rdm_ep_ignore_short_host_id" time="0.004" >
    </testcase>
    <testcase name="test_efa_rdm_ep_ignore_non_hex_host_id" time="0.004" >
    </testcase>
    <testcase name="test_efa_rdm_ep_handshake_receive_and_send_valid_host_ids_with_connid" time="0.033" >
    </testcase>
    <testcase name="test_efa_rdm_ep_handshake_receive_and_send_valid_host_ids_without_connid" time="0.034" >
    </testcase>
    <testcase name="test_efa_rdm_ep_handshake_receive_valid_peer_host_id_and_do_not_send_local_host_id" time="0.034" >
    </testcase>
    <testcase name="test_efa_rdm_ep_handshake_receive_without_peer_host_id_and_do_not_send_local_host_id" time="0.034" >
    </testcase>
    <testcase name="test_efa_rdm_ep_getopt_undersized_optlen" time="0.004" >
    </testcase>
    <testcase name="test_efa_rdm_ep_getopt_oversized_optlen" time="0.004" >
    </testcase>
    <testcase name="test_efa_rdm_ep_cq_create_error_handling" time="0.001" >
    </testcase>
    <testcase name="test_efa_rdm_ep_pkt_pool_flags" time="0.003" >
    </testcase>
    <testcase name="test_efa_rdm_ep_pkt_pool_page_alignment" time="0.036" >
    </testcase>
    <testcase name="test_efa_rdm_ep_dc_atomic_error_handling" time="0.021" >
    </testcase>
    <testcase name="test_efa_rdm_ep_send_with_shm_no_copy" time="0.016" >
    </testcase>
    <testcase name="test_efa_rdm_ep_rma_without_caps" time="0.016" >
    </testcase>
    <testcase name="test_efa_rdm_ep_atomic_without_caps" time="0.015" >
    </testcase>
    <testcase name="test_dgram_cq_read_empty_cq" time="0.002" >
    </testcase>
    <testcase name="test_ibv_cq_ex_read_empty_cq" time="0.019" >
    </testcase>
    <testcase name="test_ibv_cq_ex_read_failed_poll" time="0.018" >
    </testcase>
    <testcase name="test_rdm_cq_read_bad_send_status_unresponsive_receiver" time="0.043" >
    </testcase>
    <testcase name="test_rdm_cq_read_bad_send_status_unresponsive_receiver_missing_peer_host_id" time="0.041" >
    </testcase>
    <testcase name="test_rdm_cq_read_bad_send_status_invalid_qpn" time="0.038" >
    </testcase>
    <testcase name="test_rdm_cq_read_bad_send_status_message_too_long" time="0.041" >
    </testcase>
    <testcase name="test_ibv_cq_ex_read_bad_recv_status" time="0.015" >
    </testcase>
    <testcase name="test_ibv_cq_ex_read_recover_forgotten_peer_ah" time="0.033" >
    </testcase>
    <testcase name="test_ibv_cq_ex_read_ignore_removed_peer" time="0.029" >
    </testcase>
    <testcase name="test_rdm_fallback_to_ibv_create_cq_ex_cq_read_ignore_forgotton_peer" time="0.029" >
    </testcase>
    <testcase name="test_info_open_ep_with_wrong_info" time="0.000" >
    </testcase>
    <testcase name="test_info_open_ep_with_api_1_1_info" time="0.001" >
    </testcase>
    <testcase name="test_info_check_shm_info_hmem" time="0.000" >
    </testcase>
    <testcase name="test_info_check_shm_info_op_flags" time="0.000" >
    </testcase>
    <testcase name="test_info_check_shm_info_threading" time="0.000" >
    </testcase>
    <testcase name="test_info_check_hmem_cuda_support_on_api_lt_1_18" time="0.000" >
      <skipped/>
    </testcase>
    <testcase name="test_info_check_hmem_cuda_support_on_api_ge_1_18" time="0.000" >
      <skipped/>
    </testcase>
    <testcase name="test_info_check_no_hmem_support_when_not_requested" time="0.000" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_env1_opt1" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_env0_opt0" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_env1_opt0" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_env0_opt1" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_env1" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_env0" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_opt1" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_opt0" time="0.001" >
    </testcase>
    <testcase name="test_efa_use_device_rdma_opt_old" time="0.003" >
    </testcase>
    <testcase name="test_efa_hmem_info_update_neuron" time="0.000" >
    </testcase>
    <testcase name="test_efa_srx_min_multi_recv_size" time="0.004" >
    </testcase>
    <testcase name="test_efa_srx_cq" time="0.003" >
    </testcase>
    <testcase name="test_efa_srx_lock" time="0.003" >
    </testcase>
    <testcase name="test_efa_rnr_queue_and_resend" time="0.027" >
    </testcase>
    <testcase name="test_efa_rdm_ope_prepare_to_post_send_with_no_enough_tx_pkts" time="0.017" >
    </testcase>
    <testcase name="test_efa_rdm_ope_prepare_to_post_send_host_memory" time="0.017" >
    </testcase>
    <testcase name="test_efa_rdm_ope_prepare_to_post_send_host_memory_align128" time="0.016" >
    </testcase>
    <testcase name="test_efa_rdm_ope_prepare_to_post_send_cuda_memory" time="0.015" >
    </testcase>
    <testcase name="test_efa_rdm_ope_prepare_to_post_send_cuda_memory_align128" time="0.015" >
    </testcase>
    <testcase name="test_efa_rdm_ope_post_write_0_byte" time="0.022" >
    </testcase>
    <testcase name="test_efa_rdm_msg_send_to_local_peer_with_null_desc" time="0.016" >
    </testcase>
    <testcase name="test_efa_fork_support_request_initialize_when_ibv_fork_support_is_needed" time="0.000" >
    </testcase>
    <testcase name="test_efa_fork_support_request_initialize_when_ibv_fork_support_is_unneeded" time="0.000" >
    </testcase>
    <testcase name="test_efa_hmem_neuron_no_shm" time="9.010" >  <!-- THIS IS WHAT WE CARE ABOUT -->
    </testcase>
  </testsuite>
</testsuites>

@a-szegel a-szegel requested a review from a team December 6, 2023 00:20
Neuron/SynapseAI do not use SHM for intranode communication. Disable
SHM when Neuron/SynapseAI devices are initialized to save resources
and avoid unnecessary polling.

Signed-off-by: Seth Zegelstein <[email protected]>
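
The decision described in this commit message can be sketched roughly as follows. This is a minimal illustration only, not the code from this PR: `demo_hmem_iface`, `demo_hmem_info`, and `demo_should_use_shm` are hypothetical names invented for the example, and the real provider tracks device initialization in its own structures.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; these are not libfabric's types. */
enum demo_hmem_iface {
	DEMO_HMEM_SYSTEM = 0,
	DEMO_HMEM_CUDA,
	DEMO_HMEM_NEURON,
	DEMO_HMEM_SYNAPSEAI,
	DEMO_HMEM_COUNT
};

/* Records which memory interfaces were successfully initialized. */
struct demo_hmem_info {
	bool initialized[DEMO_HMEM_COUNT];
};

/*
 * Return true if the shm peer provider should be used.
 * Assumption: shm is skipped whenever a Neuron or SynapseAI device
 * is initialized, because those devices do not use shm for
 * intranode communication.
 */
static bool demo_should_use_shm(const struct demo_hmem_info *info,
				bool shm_requested)
{
	if (!shm_requested)
		return false;

	if (info->initialized[DEMO_HMEM_NEURON] ||
	    info->initialized[DEMO_HMEM_SYNAPSEAI])
		return false;

	return true;
}
```

With only system memory initialized the helper returns true when shm is requested; once either accelerator interface is marked initialized it returns false, so no shm resources would be created or polled.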
@a-szegel a-szegel force-pushed the do-not-use-shm-with-neuron-or-synapseai branch from 24fe222 to ce5320d Compare December 6, 2023 01:30
Add assert statements to the existing Neuron unit tests to make sure
that the SHM provider is not used.

Signed-off-by: Seth Zegelstein <[email protected]>
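
The kind of assertion this commit adds might look like the sketch below. It is a simplified, self-contained stand-in rather than the actual unit test: `demo_ep`, `demo_ep_open`, and `demo_test_neuron_no_shm` are hypothetical, and the assumption is that an endpoint without an shm peer provider holds a NULL shm handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical endpoint: shm_ep is NULL when no shm peer provider is attached. */
struct demo_ep {
	void *shm_ep;
};

/* Placeholder object that a non-NULL shm handle can point at. */
static int demo_shm_resource;

/* Open the endpoint; skip the shm peer provider if Neuron is initialized. */
static void demo_ep_open(struct demo_ep *ep, bool neuron_initialized)
{
	ep->shm_ep = neuron_initialized ? NULL : (void *)&demo_shm_resource;
}

/* Test body: with Neuron initialized, the endpoint must not hold an shm handle. */
static void demo_test_neuron_no_shm(void)
{
	struct demo_ep ep;

	demo_ep_open(&ep, true);
	assert(ep.shm_ep == NULL);
}
```

Calling `demo_test_neuron_no_shm()` aborts if the endpoint still carries an shm handle, which is the regression the added asserts are meant to catch.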
@a-szegel a-szegel force-pushed the do-not-use-shm-with-neuron-or-synapseai branch from ce5320d to 4d0d297 Compare December 6, 2023 06:16
@a-szegel a-szegel closed this Dec 6, 2023
@a-szegel
Contributor Author

a-szegel commented Dec 6, 2023

We need to update our internal testing framework to be able to run unit tests which require Neuron (if this gets re-opened).
