[DRAFT][s4xbf16] Add shmem swizzling heuristics for loading into LinearLayouts #23
+52
−1
Do not merge until rebased on upstream Triton up to triton-lang@c1ed673
This is one of the two patches needed to improve int4xbf16 GEMM performance.
This change improves shared-memory (shmem) swizzling when loading into LinearLayouts. Efficient int4 upcasting relies on join/reshape, and with those ops the propagated layout is a LinearLayout rather than a DotOp layout. Triton currently falls back to an unswizzled shmem layout in this case, which is suboptimal.
This PR adds high-level heuristics to generate a swizzled layout for the above case.
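For readers unfamiliar with swizzling, here is a minimal Python sketch of the generic XOR swizzle that a swizzled shmem layout describes. It is not the heuristic added in this PR. The helper names `swizzled_offset` and `banks_hit` are hypothetical, the parameters `vec`, `per_phase`, and `max_phase` are assumed to follow the usual meaning in Triton's swizzled shared layout, and the bank model (32 banks, one 4-byte element per bank word) is an assumption for illustration.

```python
def swizzled_offset(row, col, row_stride, vec, per_phase, max_phase):
    """Element offset of (row, col) in a row-major tile after XOR swizzling.

    Columns are grouped into vectors of `vec` elements. Each group of
    `per_phase` consecutive rows shares a phase, and the vector index is
    XOR-ed with that phase. max_phase=1 yields the unswizzled layout.
    """
    phase = (row // per_phase) % max_phase
    col_vec = col // vec
    return row * row_stride + (col_vec ^ phase) * vec + col % vec


def banks_hit(offsets, num_banks=32):
    """Set of banks touched, assuming one 4-byte element per bank word."""
    return {off % num_banks for off in offsets}


ROWS, COLS = 8, 32  # illustrative tile: 8 rows of 32 four-byte elements

# Eight threads read column 0 of eight different rows.
unswizzled = [swizzled_offset(r, 0, COLS, vec=1, per_phase=1, max_phase=1)
              for r in range(ROWS)]
swizzled = [swizzled_offset(r, 0, COLS, vec=1, per_phase=1, max_phase=8)
            for r in range(ROWS)]

print(len(banks_hit(unswizzled)), len(banks_hit(swizzled)))  # prints "1 8"
```

Under these assumptions, the unswizzled column access lands entirely in one bank, while the swizzled one spreads across eight banks. The XOR only permutes vectors within a row, so every element still has a unique offset.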
cc @gflegar