
[Issue]: Test failing with ROCm 6.3.1 on MI250X #120

Open
al-rigazzi opened this issue Jan 29, 2025 · 4 comments

@al-rigazzi

Problem Description

I have built flash-attention in a fresh environment with ROCm 6.3.1, running on an MI250X, and I am confused by the test results.

I believe that the test file to be used is `tests/test_flash_attn_ck.py`, as in the non-CK one a very large portion of the tests fails.

Nevertheless, this is the only failure in the output of `pytest tests/test_flash_attn_ck.py`:

```
FAILED tests/test_flash_attn_ck.py::test_flash_attn_bwd_overflow[5-16-False-dtype0] - AssertionError: assert 0.0750732421875 <= ((5 * 0.01171875) + 0.001)
```
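For completeness, the single failure can be re-run in isolation with standard pytest node-ID selection (the quotes are needed because of the brackets in the parametrized test ID):

```
python3 -m pytest "tests/test_flash_attn_ck.py::test_flash_attn_bwd_overflow[5-16-False-dtype0]"
```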

I have two questions:

  1. Is it normal for this test to fail?
  2. I see that, compared to the standard `test_flash_attn.py` tests, the tolerance has been raised from a factor of 2 to a factor of 10, with a note that bwd needs to be fixed. Does this impact the performance of the library when used in production? (See the tolerance sketch after this list.)
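For context, the assertion in the failure message follows the relative-tolerance pattern these tests use: the kernel's maximum error against an fp32 reference must stay within a multiple of the error of a plain PyTorch baseline, plus a small absolute term. A minimal sketch (the function name and structure here are illustrative, not the actual test code):

```python
import torch

def within_tolerance(out, out_ref, out_pt, factor=5.0, atol=1e-3):
    # out:     flash-attention kernel output (or gradient)
    # out_ref: fp32 reference result
    # out_pt:  same-dtype plain PyTorch result, used as the error baseline
    err = (out - out_ref).abs().max().item()
    baseline = (out_pt - out_ref).abs().max().item()
    # Mirrors the failing check: 0.0750... <= (5 * 0.0117...) + 0.001
    return err <= factor * baseline + atol
```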

Operating System

SLES 15-SP5

CPU

AMD EPYC 7A53 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

Torch was installed with

```
python3 -m pip install --no-cache-dir --pre torch==2.7.0.dev20250128+rocm6.3 --index-url https://download.pytorch.org/whl/nightly/rocm6.3
```

and the repo is at

```
22c0358 (HEAD -> main, tag: v2.7.3-cktile, origin/main, origin/HEAD)
```
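Not part of the original report, but a quick generic check that the installed torch build actually targets ROCm can help rule out environment mix-ups:

```python
import torch

print(torch.__version__)              # e.g. 2.7.0.dev20250128+rocm6.3
print(torch.version.hip)              # HIP/ROCm version this torch build targets (None on CUDA builds)
print(torch.cuda.is_available())      # True if the MI250X is visible via HIP
print(torch.cuda.get_device_name(0))  # should report an AMD Instinct MI250X
```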

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @al-rigazzi. An internal ticket has been created to investigate your issue. Thanks!

@schung-amd

Hi @al-rigazzi, thanks for reporting this!

> Is it normal for this test to fail?

Looking into this; I wouldn't say it's normal for it to fail (in that we don't intend for it to fail and it's not a known issue), but I don't think we run these tests against the nightly torch builds as part of CI. Is this failing for you with other torch wheels? In particular, we have stable torch wheels in https://repo.radeon.com/rocm/manylinux/ that are more likely to have been tested for this than the wheels on pytorch.org.

> I see that, compared to the standard test_flash_attn.py tests, the tolerance has been raised from a factor of 2 to a factor of 10, with a note that bwd needs to be fixed. Does this impact the performance of the library when used in production?

This fix is still pending; I'm not aware of any timeline for it. There is no impact on inference time. In theory there could be some impact on training time (more epochs required), but we haven't heard any reports to this effect so far.

@al-rigazzi
Author

Thanks, I will try the wheels you pointed me to (I will have to downgrade to PyTorch 2.5) and report back!

@schung-amd

Sorry for the delay; I finally had time to try this myself. I reproduced the single test failure with nightly torch 2.7, ROCm 6.3.2, an MI210, Ubuntu 24.04, and Python 3.12. On the same system, all tests pass with the stable wheels `pytorch_triton_rocm-3.0.0+rocm6.3.2.75cc27c26a-cp312-cp312-linux_x86_64.whl` and `torch-2.4.0+rocm6.3.2-cp312-cp312-linux_x86_64.whl` from repo.radeon.com.
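For anyone else hitting this, installing those stable wheels directly would look roughly like the following. The `rocm-rel-6.3.2` path segment is an assumption based on the usual repo.radeon.com directory layout (and `+` is percent-encoded as `%2B`), so check the index at https://repo.radeon.com/rocm/manylinux/ for the exact location:

```
python3 -m pip install \
  "https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3.2/pytorch_triton_rocm-3.0.0%2Brocm6.3.2.75cc27c26a-cp312-cp312-linux_x86_64.whl" \
  "https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3.2/torch-2.4.0%2Brocm6.3.2-cp312-cp312-linux_x86_64.whl"
```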

As this test failure appears to be related to the CK precision issue you've noted, I suspect using the nightly torch wheel is fine for the purposes of flash attention, but you can also fall back to the stable wheels where the test passes if you wish.

I'll check to see if this failure is already known internally, but if this is caused by the CK precision issue then there isn't much to do until that is addressed.
