[Issue]: Test failing with ROCm 6.3.1 on MI250X #120
Comments
Hi @al-rigazzi. An internal ticket has been created to investigate your issue. Thanks!
Hi @al-rigazzi, thanks for reporting this!
Looking into this; I wouldn't say it's normal for it to fail (in that we don't intend for it to fail and it's not a known issue), but I don't think we run these tests against the nightly torch builds as part of CI. Is this failing for you with other torch wheels? In particular, we have stable torch wheels in https://repo.radeon.com/rocm/manylinux/ that are more likely to have been tested for this than the wheels on pytorch.org.
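As a side note for anyone comparing wheels: a quick, generic way to confirm which torch build is actually active in an environment (a nightly vs. one of the stable ROCm wheels) is to inspect the version metadata. This is a general-purpose check, not something specific to this issue, and the version strings in the comments are only examples.

```python
# Generic check of which torch build is installed; the version strings shown
# in the comments are examples, not values from this issue.
import torch

print(torch.__version__)          # ROCm wheels carry a "+rocm..." local version suffix
print(torch.version.hip)          # HIP/ROCm version the wheel was built against (None on CUDA builds)
print(torch.cuda.is_available())  # True when the ROCm runtime can see the GPU
```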
This fix is still pending; I'm not aware of any timeline for it. There is no impact on inference time. In theory there could be some impact on training time (more epochs required), but we haven't heard any reports to this effect thus far.
Thanks, I will try with the wheels you pointed me to (will have to downgrade to PyTorch 2.5) and report back!
Sorry for the delay, I finally had time to try this myself. I reproduced the single test failure with nightly torch 2.7, ROCm 6.3.2, MI210, Ubuntu 24.04, Python 3.12. On the same system, all tests pass with the stable wheels.
As this test failure appears to be related to the CK precision issue you've noted, I suspect using the nightly torch wheel is fine for the purposes of flash attention, but you can also fall back to the stable wheels, where the test passes, if you wish. I'll check whether this failure is already known internally, but if it is caused by the CK precision issue then there isn't much to do until that is addressed.
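If it helps to sanity-check a build outside the test suite, a minimal forward/backward pass through the public flash_attn_func API is enough to confirm the extension loads and runs on the GPU. This is a generic sketch (shapes and dtype chosen arbitrarily), not a command taken from this issue.

```python
# Minimal sanity check of a flash-attention build (generic sketch, not from
# this issue): one forward and one backward pass through flash_attn_func.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 512, 8, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim, device="cuda",
                       dtype=torch.float16, requires_grad=True)
           for _ in range(3))

out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)
out.sum().backward()
print(out.shape, q.grad.abs().max().item())
```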
Problem Description
I have built flash-attention in a fresh environment with ROCm 6.3.1, running on MI250X, and I am confused by the test results.
I believe the test file to be used is tests/test_flash_attn_ck.py, since in the non-CK one a very large portion of the tests fails. Nevertheless, this is the output of pytest tests/test_flash_attn_ck.py:
I have two questions:
1. Is it normal for one of the tests to fail?
2. In the test_flash_attn.py tests, the tolerance has been raised from a factor of 2 to a factor of 10, with a note that bwd needs to be fixed. Does this impact the performance of the library when used in production?
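For readers unfamiliar with how these tolerances are applied, the flash-attention tests bound the kernel's error by a multiple of the error of a plain PyTorch reference implementation. The sketch below is illustrative only (the helper name and argument layout are mine, not the repository's test code), with the factor in question exposed as a parameter.

```python
# Illustrative sketch, not the actual test code: the tests compare the kernel's
# error against the error of a pure-PyTorch reference run in low precision.
import torch

def check_within_factor(out, out_ref, out_pt, factor=2.0):
    # out:     output of the flash-attention kernel (fp16/bf16)
    # out_ref: reference attention computed in fp32
    # out_pt:  same reference computed in the kernel's dtype (fp16/bf16)
    kernel_err = (out - out_ref).abs().max().item()
    baseline_err = (out_pt - out_ref).abs().max().item()
    # Raising the factor from 2 to 10 loosens this bound fivefold.
    assert kernel_err <= factor * baseline_err, (kernel_err, baseline_err)
```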
Operating System
SLES 15-SP5
CPU
AMD EPYC 7A53 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 6.3.0
ROCm Component
No response
Steps to Reproduce
Torch was installed with:
and the repo is at:
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response