[ENHANCEMENT]: Enable packed_cas codepath using 16B CAS on sm_90+ architectures #547

Open
sleeepyjack opened this issue Jul 16, 2024 · 2 comments
Labels
topic: performance (Performance related issue)
type: improvement (Improvement / enhancement to an existing function)

Comments

@sleeepyjack
Collaborator

sleeepyjack commented Jul 16, 2024

Is your feature request related to a problem? Please describe.

The packed_cas update routine performs better than back_to_back_cas and cas_dependent_write.

sm_90 and higher provide hardware support for 16B atomic CAS, which we currently don't use.

Describe the solution you'd like

16B atomicCAS was introduced with CUDA 12.3 (see docs).

Idea: Add a dedicated codepath for sm_90+ by adding something like

NV_IF_ELSE_TARGET(some_target_that_means_sm_90_or_higher,  // e.g. NV_PROVIDES_SM_90
                  (atomicCAS(...);),                        // 16B CAS
                  (/* pre-sm_90 code path */));
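
The idea above could be sketched as follows. This is only an illustration under stated assumptions, not cuco's actual code: `slot16`, `empty_slot`, `try_claim`, `claim_kernel`, and `count_winners` are hypothetical names, the sentinel value is made up, and the pre-sm_90 branch is a simplified stand-in for the existing fallback routines. It relies on `NV_PROVIDES_SM_90` from `<nv/target>` and on the templated 128-bit `atomicCAS` overload, which expects a 16-byte, 16-byte-aligned, trivially copyable type.

```cuda
#include <nv/target>       // NV_IF_ELSE_TARGET, NV_PROVIDES_SM_90
#include <cuda_runtime.h>
#include <type_traits>

// Hypothetical 16-byte slot: 8-byte key plus 8-byte payload.
struct alignas(16) slot16 {
  unsigned long long key;
  unsigned long long value;
};
static_assert(sizeof(slot16) == 16, "128-bit atomicCAS needs a 16-byte type");
static_assert(std::is_trivially_copyable<slot16>::value, "must be trivially copyable");

// Assumed sentinel marking an empty slot.
__host__ __device__ inline slot16 empty_slot() { return slot16{~0ull, ~0ull}; }

// Returns true if this thread moved the slot from "empty" to `desired`.
__device__ bool try_claim(slot16* slot, slot16 desired)
{
  slot16 const expected = empty_slot();
  NV_IF_ELSE_TARGET(
    NV_PROVIDES_SM_90,
    (// sm_90+: a single 16-byte CAS updates key and payload together.
     slot16 const old = atomicCAS(slot, expected, desired);
     return old.key == expected.key && old.value == expected.value;),
    (// pre-sm_90 stand-in: CAS the key, then publish the payload separately.
     unsigned long long const old = atomicCAS(&slot->key, expected.key, desired.key);
     if (old != expected.key) { return false; }
     atomicExch(&slot->value, desired.value);
     return true;));
}

// Every thread races for the same slot and successful threads are counted.
__global__ void claim_kernel(slot16* slot, int* winners)
{
  unsigned long long const id = blockIdx.x * blockDim.x + threadIdx.x;
  if (try_claim(slot, slot16{id, id * 2})) { atomicAdd(winners, 1); }
}

// Host helper: launches one block of `num_threads` threads and returns the
// number of threads that claimed the slot, or -1 if a CUDA call failed.
int count_winners(int num_threads)
{
  slot16* slot = nullptr;
  int* winners = nullptr;
  if (cudaMalloc(&slot, sizeof(slot16)) != cudaSuccess) { return -1; }
  if (cudaMalloc(&winners, sizeof(int)) != cudaSuccess) {
    cudaFree(slot);
    return -1;
  }
  slot16 const init = empty_slot();
  cudaMemcpy(slot, &init, sizeof(init), cudaMemcpyHostToDevice);
  cudaMemset(winners, 0, sizeof(int));
  claim_kernel<<<1, num_threads>>>(slot, winners);
  int result = -1;
  if (cudaMemcpy(&result, winners, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) {
    result = -1;
  }
  cudaFree(slot);
  cudaFree(winners);
  return result;
}
```

Building with `nvcc -arch=sm_90` should select the 16B branch; older targets compile the stand-in branch instead. In both cases only one thread is expected to claim the slot.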

Describe alternatives you've considered

Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types ;)
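
For context, this is roughly what a packed CAS looks like through `cuda::atomic_ref` when key and payload fit into 8 bytes; the alternative above would allow the same call shape for a 16-byte type. It is a sketch, not cuco's implementation: `pack`, `packed_try_claim`, `packed_claim_kernel`, and `count_packed_winners` are hypothetical names and the sentinel is an assumption.

```cuda
#include <cuda/atomic>
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical helper: pack a 4-byte key and a 4-byte payload into one 8-byte word.
__host__ __device__ inline std::uint64_t pack(std::uint32_t key, std::uint32_t value)
{
  return (static_cast<std::uint64_t>(value) << 32) | key;
}

// Hypothetical helper: claim an empty slot with a single 8-byte CAS.
__device__ bool packed_try_claim(std::uint64_t* slot, std::uint64_t empty, std::uint64_t desired)
{
  cuda::atomic_ref<std::uint64_t, cuda::thread_scope_device> ref{*slot};
  // `empty` is a by-value copy, so the overwrite on failure stays local to this call.
  return ref.compare_exchange_strong(empty, desired, cuda::std::memory_order_relaxed);
}

// Every thread races for the same slot and successful threads are counted.
__global__ void packed_claim_kernel(std::uint64_t* slot, std::uint64_t empty, int* winners)
{
  std::uint32_t const id = blockIdx.x * blockDim.x + threadIdx.x;
  if (packed_try_claim(slot, empty, pack(id, id * 2u))) { atomicAdd(winners, 1); }
}

// Host helper: returns the number of threads that claimed the slot,
// or -1 if a CUDA call failed.
int count_packed_winners(int num_threads)
{
  std::uint64_t* slot = nullptr;
  int* winners = nullptr;
  if (cudaMalloc(&slot, sizeof(std::uint64_t)) != cudaSuccess) { return -1; }
  if (cudaMalloc(&winners, sizeof(int)) != cudaSuccess) {
    cudaFree(slot);
    return -1;
  }
  std::uint64_t const empty = pack(0xFFFFFFFFu, 0xFFFFFFFFu);  // assumed sentinel
  cudaMemcpy(slot, &empty, sizeof(empty), cudaMemcpyHostToDevice);
  cudaMemset(winners, 0, sizeof(int));
  packed_claim_kernel<<<1, num_threads>>>(slot, empty, winners);
  int result = -1;
  if (cudaMemcpy(&result, winners, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) {
    result = -1;
  }
  cudaFree(slot);
  cudaFree(winners);
  return result;
}
```

Because key and payload change in one atomic operation, no other thread can observe a half-written slot, which is the property a 16B `compare_exchange_*` would extend to larger pairs.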

Additional context

No response

sleeepyjack added the labels topic: performance and type: improvement on Jul 16, 2024

@PointKernel
Member

> Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types

+1

@sleeepyjack
Collaborator Author

> Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types

Discussion thread (NVIDIA internal): https://nvidia.slack.com/archives/CCP05T27R/p1721095033011529
