[ENHANCEMENT]: Enable packed_cas codepath using 16B CAS on sm_90+ architectures
#547
Labels
topic: performance
Performance related issue
type: improvement
Improvement / enhancement to an existing function
Is your feature request related to a problem? Please describe.
The packed_cas update routine shows better performance compared to back_to_back_cas and cas_dependent_write. On sm_90 and higher we have hardware support for 16B atomic CAS, which we currently don't make use of.
Describe the solution you'd like
16B atomicCAS was introduced with CUDA 12.3 (see docs).
Idea: Add a dedicated codepath for sm_90+ by adding something like
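As one possible shape for such a codepath (this is not the snippet from the original issue), here is a minimal sketch. It assumes the 16B `atomicCAS` overload accepts any trivially copyable type with `sizeof(T) == 16` and `alignof(T) >= 16`, which is how I read the CUDA programming guide. The names `packed_slot`, `fits_16b_cas`, `SKETCH_HAS_16B_CAS` and `packed_cas_16b` are hypothetical and not part of any library.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical slot layout: an 8B key and an 8B payload packed into 16 bytes.
struct alignas(16) packed_slot {
  std::uint64_t key;
  std::uint64_t payload;
};

// Type requirements of the 16B atomicCAS overload (assumption, see lead-in).
template <typename T>
constexpr bool fits_16b_cas =
  sizeof(T) == 16 && alignof(T) >= 16 && std::is_trivially_copyable<T>::value;

// Hypothetical feature macro: true only when compiling device code for sm_90+
// with an nvcc version of at least 12.3.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(__CUDACC_VER_MAJOR__) && \
  ((__CUDACC_VER_MAJOR__ > 12) || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 3))
#define SKETCH_HAS_16B_CAS 1
#else
#define SKETCH_HAS_16B_CAS 0
#endif

#if defined(__CUDACC__)
// Attempts to swap `expected` for `desired` in a single 16B CAS.
// Returns true if the slot held `expected` and was updated.
__device__ inline bool packed_cas_16b(packed_slot* slot,
                                      packed_slot expected,
                                      packed_slot desired)
{
  static_assert(fits_16b_cas<packed_slot>, "slot type must be 16B and 16B-aligned");
#if SKETCH_HAS_16B_CAS
  packed_slot const old = atomicCAS(slot, expected, desired);
  return old.key == expected.key && old.payload == expected.payload;
#else
  // Placeholder: a real implementation would dispatch to the existing
  // update path (e.g. back_to_back_cas) on older architectures.
  return false;
#endif
}
#endif
```

The compile-time trait keeps the eligibility check separate from the architecture check, so slot types that are not 16B, or not 16B-aligned, can be routed to the existing routines regardless of the target architecture.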
Describe alternatives you've considered
Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types ;)
Additional context
No response