
ROCm Sparse Marlin Kernels #1206

Open — wants to merge 32 commits into main from rocm_sparse_marlin
Conversation

petrex
Collaborator

@petrex petrex commented Oct 31, 2024

Built on top of #1201. This pull request introduces ROCm (Radeon Open Compute) support for the sparse Marlin kernel in addition to CUDA, enabling the code to run on AMD GPUs.

The main changes involve conditional compilation to handle differences between CUDA and ROCm, as well as adding ROCm-specific intrinsics for MI300x.

Co-author: @lcskrishna


Key changes include:

ROCm Support in setup.py:

  • HIP kernel generation

Conditional Compilation in CUDA Source Files:

  • Added conditional compilation directives to exclude certain code for ROCm and include ROCm-specific implementations.

ROCm-specific Implementations:

  • Implemented ROCm-specific versions of functions and macros that are different from their CUDA counterparts, ensuring compatibility and performance on AMD GPUs.

Next:

  • Validation and benchmarking across workloads on MIxxx GPUs
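The setup.py side of the changes above could be sketched roughly as follows. This is an illustration, not the PR's actual code: the function name, the `-DUSE_ROCM` define, and the `gfx942` (MI300X) offload target are assumptions about how such a flag switch is typically wired; in a real setup.py, `is_rocm` would usually be derived from `torch.version.hip is not None`.

```python
# Illustrative sketch (names and flags assumed, not the PR's actual setup.py):
# pick extension compile flags depending on a ROCm vs CUDA PyTorch build.
def extension_compile_args(is_rocm: bool) -> list:
    if is_rocm:
        # Define USE_ROCM so the .cu sources can select HIP-specific
        # intrinsics under #ifdef USE_ROCM, and target MI300X (gfx942).
        return ["-DUSE_ROCM", "--offload-arch=gfx942"]
    # CUDA path keeps the usual nvcc optimization flags.
    return ["-O3"]
```

The same macro then drives the conditional compilation in the CUDA source files, so one source tree serves both backends.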


pytorch-bot bot commented Oct 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1206

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit f18043d with merge base 98c4e2e:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 31, 2024
@msaroufim msaroufim requested review from msaroufim and removed request for msaroufim November 2, 2024 22:51
@msaroufim
Member

Do you have performance numbers by any chance relative to fp16? wanna make sure the performance improvements are competitive with CUDA

@petrex
Collaborator Author

petrex commented Nov 5, 2024

Do you have performance numbers by any chance relative to fp16? wanna make sure the performance improvements are competitive with CUDA

Still WIP, but could you share the benchmark you're using? I'll try that on MI300X when the PR is ready.

@msaroufim
Member

Ok holler at me again whenever you need a review. Really excited to see this land

@drisspg
Contributor

drisspg commented Nov 5, 2024

For benchmarking it is a little ad hoc; the best place for this today would be to verify with: https://github.com/pytorch/ao/blob/main/torchao/_models/llama/generate.py

@jcaip jcaip mentioned this pull request Nov 11, 2024
1 task
@petrex petrex force-pushed the rocm_sparse_marlin branch from 00bc94d to d2c7ce4 Compare January 6, 2025 22:06
@petrex petrex added the topic: new feature Use this tag if this PR adds a new feature label Jan 6, 2025
@petrex
Collaborator Author

petrex commented Jan 6, 2025

@pytorchbot rebase


pytorch-bot bot commented Jan 6, 2025

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@petrex petrex self-assigned this Jan 6, 2025
@petrex petrex marked this pull request as ready for review January 7, 2025 17:22

pytorch-bot bot commented Jan 7, 2025

Unknown label ciflow/rocm.
Currently recognized labels are

  • ciflow/benchmark

@petrex petrex requested a review from msaroufim January 8, 2025 15:59
Member

@msaroufim msaroufim left a comment


Seems good to me. I'll lean on @atalman and @jcaip for the final merge, since the error you're seeing in CI does seem like an underlying infra issue. It's not a flake, though; I tried rerunning it and it still fails.

@petrex petrex requested review from jcaip and atalman January 8, 2025 20:18
@petrex petrex force-pushed the rocm_sparse_marlin branch from 662bfe7 to a4e8c30 Compare January 8, 2025 23:06
Contributor

@jcaip jcaip left a comment


Looks good! Did you get a chance to try this and get benchmarking numbers? Curious to see how it compares. We should probably update the testing framework too for AMD

@@ -19,6 +19,28 @@
#include "base.h"

namespace torchao {

#ifdef USE_ROCM
Contributor


Should we gate on a specific ROCm version like we do for CUDA?

Collaborator Author

@petrex petrex Jan 9, 2025


Good point! What we need is a GPU arch check rather than a ROCm version check. I have added a GPU architecture check in setup.py. As a result, the kernel will now only be built for the MI300X architecture.
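An architecture gate like the one described here could be sketched as below. The helper name is hypothetical; on ROCm builds of PyTorch, `torch.cuda.get_device_properties(0).gcnArchName` reports strings of the form `"gfx942:sramecc+:xnack-"`, so the check keeps only the base architecture token before comparing.

```python
# Hypothetical sketch of a setup.py architecture gate (name assumed):
# build the sparse Marlin kernel only for MI300-series (gfx942) targets.
def is_mi300_arch(gcn_arch_name: str) -> bool:
    """gcnArchName strings look like 'gfx942:sramecc+:xnack-';
    strip the feature suffixes and compare the base architecture."""
    return gcn_arch_name.split(":")[0] == "gfx942"
```

A gate like this keys on the GPU architecture rather than the ROCm version, matching the reasoning in the comment above.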

Contributor


Sounds good, I think setup.py was recently updated by #1490, so you may have to pull in the new changes.

#if defined(USE_ROCM)
#if ROCM_VERSION >= 60200
auto BF16_BIAS = __bfloat162bfloat162(__hip_bfloat16(__hip_bfloat16_raw{0xC308}));
auto BF16_ONE = __bfloat162bfloat162(__hip_bfloat16(__hip_bfloat16_raw{0x3F80}));
Contributor


What does BF16_ONE refer to here?

Collaborator Author


Thanks, let me clean up a little. I'd like this PR to focus on sparse_marlin; tensor_core_tile_layout.cu should go to #1201 instead.

Collaborator Author

@petrex petrex Jan 9, 2025


0x3F80 in BF16: sign bit (0) + exponent (01111111) + mantissa (0000000) = 1.0
Just renamed it in #1201 to reflect this.
see : 26fa19c
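The bit layout above can be checked directly: bfloat16 is simply the top 16 bits of an IEEE-754 float32 (1 sign bit, 8 exponent bits, 7 mantissa bits), so shifting the 16-bit pattern left by 16 and reinterpreting it as a float32 recovers the value. A small sketch (the helper name is my own, not from the PR):

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """Decode a raw bfloat16 bit pattern by widening it to float32.

    bfloat16 keeps the float32 sign and exponent and truncates the
    mantissa to 7 bits, so bits << 16 is the matching float32 pattern.
    """
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

# The two constants from the diff above:
# 0x3F80 -> 1.0    (BF16_ONE)
# 0xC308 -> -136.0 (BF16_BIAS)
```

This confirms the naming: 0x3F80 is 1.0, and 0xC308 is the -136.0 bias constant.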

@petrex
Collaborator Author

petrex commented Jan 9, 2025

Looks good! Did you get a chance to try this and get benchmarking numbers? Curious to see how it compares. We should probably update the testing framework too for AMD

Thanks, it is planned. I will update the benchmark PR.

@petrex petrex force-pushed the rocm_sparse_marlin branch from 1f3b773 to 08d1cfb Compare January 9, 2025 22:34
@petrex petrex requested a review from jcaip January 10, 2025 00:25
Contributor

@jcaip jcaip left a comment


LGTM, should be good to merge once we fix the setup.py conflicts.

petrex and others added 2 commits January 15, 2025 15:03
@petrex petrex force-pushed the rocm_sparse_marlin branch from 3185c9d to aea9d81 Compare January 15, 2025 23:52
@petrex
Collaborator Author

petrex commented Jan 16, 2025

LGTM, should be good to merge once we fix the setup.py conflicts.

done.

Labels
ciflow/rocm CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: rocm topic: new feature Use this tag if this PR adds a new feature
6 participants