
[Issue]: Test failing with ROCm 6.3.1 on MI250X #120

Open
al-rigazzi opened this issue Jan 29, 2025 · 4 comments

@al-rigazzi

Problem Description

I have built flash-attention in a fresh environment with ROCm 6.3.1, running on an MI250X, and I am confused by the test results.

I believe that the test file to be used is `tests/test_flash_attn_ck.py`, as in the non-CK one a very large portion of the tests fails.

Nevertheless, this is the only failure in the output of `pytest tests/test_flash_attn_ck.py`:

```
FAILED tests/test_flash_attn_ck.py::test_flash_attn_bwd_overflow[5-16-False-dtype0] - AssertionError: assert 0.0750732421875 <= ((5 * 0.01171875) + 0.001)
```
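For completeness, the single failure can be re-run in isolation with standard pytest node-ID selection (the quotes are needed because of the brackets in the parametrized test ID):

```
python3 -m pytest "tests/test_flash_attn_ck.py::test_flash_attn_bwd_overflow[5-16-False-dtype0]"
```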

I have two questions:

  1. Is it normal for this test to fail?
  2. I see that, compared to the standard `test_flash_attn.py` tests, the tolerance has been raised from a factor of 2 to a factor of 10, with a note that bwd needs to be fixed. Does this impact the performance of the library when used in production? (See the tolerance sketch after this list.)
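For context, the assertion in the failure message follows the relative-tolerance pattern these tests use: the kernel's maximum error against an fp32 reference must stay within a multiple of the error of a plain PyTorch baseline, plus a small absolute term. A minimal sketch (the function name and structure here are illustrative, not the actual test code):

```python
import torch

def within_tolerance(out, out_ref, out_pt, factor=5.0, atol=1e-3):
    # out:     flash-attention kernel output (or gradient)
    # out_ref: fp32 reference result
    # out_pt:  same-dtype plain PyTorch result, used as the error baseline
    err = (out - out_ref).abs().max().item()
    baseline = (out_pt - out_ref).abs().max().item()
    # Mirrors the failing check: 0.0750... <= (5 * 0.0117...) + 0.001
    return err <= factor * baseline + atol
```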

Operating System

SLES 15-SP5

CPU

AMD EPYC 7A53 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

Torch was installed with

```
python3 -m pip install --no-cache-dir --pre torch==2.7.0.dev20250128+rocm6.3 --index-url https://download.pytorch.org/whl/nightly/rocm6.3
```

and the repo is at

```
22c0358 (HEAD -> main, tag: v2.7.3-cktile, origin/main, origin/HEAD)
```
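Not part of the original report, but a quick generic check that the installed torch build actually targets ROCm can help rule out environment mix-ups:

```python
import torch

print(torch.__version__)              # e.g. 2.7.0.dev20250128+rocm6.3
print(torch.version.hip)              # HIP/ROCm version this torch build targets (None on CUDA builds)
print(torch.cuda.is_available())      # True if the MI250X is visible via HIP
print(torch.cuda.get_device_name(0))  # should report an AMD Instinct MI250X
```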

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @al-rigazzi. An internal ticket has been created to investigate your issue. Thanks!

@schung-amd

Hi @al-rigazzi, thanks for reporting this!

> Is it normal for this test to fail?

Looking into this; I wouldn't say it's normal for it to fail (in that we don't intend for it to fail and it's not a known issue), but I don't think we run these tests against the nightly torch builds as part of CI. Is this failing for you with other torch wheels? In particular, we have stable torch wheels in https://repo.radeon.com/rocm/manylinux/ that are more likely to have been tested for this than the wheels on pytorch.org.

> I see that, compared to the standard test_flash_attn.py tests, the tolerance has been raised from a factor of 2 to a factor of 10, with a note that bwd needs to be fixed. Does this impact the performance of the library when used in production?

This fix is still pending; I'm not aware of any timeline for it. There is no impact on inference time. In theory there could be some impact on training time (more epochs required), but we haven't heard any reports to this effect so far.

@al-rigazzi
Author

Thanks, I will try the wheels you pointed me to (I will have to downgrade to PyTorch 2.5) and report back!

@schung-amd

Sorry for the delay; I finally had time to try this myself. I reproduced the single test failure with nightly torch 2.7, ROCm 6.3.2, an MI210, Ubuntu 24.04, and Python 3.12. On the same system, all tests pass with the stable wheels `pytorch_triton_rocm-3.0.0+rocm6.3.2.75cc27c26a-cp312-cp312-linux_x86_64.whl` and `torch-2.4.0+rocm6.3.2-cp312-cp312-linux_x86_64.whl` from repo.radeon.com.
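For anyone else hitting this, installing those stable wheels directly would look roughly like the following. The `rocm-rel-6.3.2` path segment is an assumption based on the usual repo.radeon.com directory layout (and `+` is percent-encoded as `%2B`), so check the index at https://repo.radeon.com/rocm/manylinux/ for the exact location:

```
python3 -m pip install \
  "https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3.2/pytorch_triton_rocm-3.0.0%2Brocm6.3.2.75cc27c26a-cp312-cp312-linux_x86_64.whl" \
  "https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3.2/torch-2.4.0%2Brocm6.3.2-cp312-cp312-linux_x86_64.whl"
```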

As this test failure appears to be related to the CK precision issue you've noted, I suspect using the nightly torch wheel is fine for the purposes of flash attention, but you can also fall back to the stable wheels where the test passes if you wish.

I'll check to see if this failure is already known internally, but if this is caused by the CK precision issue then there isn't much to do until that is addressed.
