FA3 forward performance regression on H200 #1438
What TFLOPS do you get?
I'm having some difficulty using ncu to compute the FLOPs right now, but I have the runtime results:
I used commit 3cea2fb.
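Because ncu was proving awkward, TFLOPS can also be derived from the measured runtime. The sketch below assumes the common attention FLOP-counting convention, which FlashAttention's benchmark scripts appear to follow: 4 · batch · nheads · seqlen_q · seqlen_k · headdim for the forward pass (two matmuls), halved for causal masking, and 2.5× the forward count for the backward pass (five matmuls). The function names and the example shape and runtime are hypothetical, for illustration only.

```python
def attention_fwd_flops(batch, seqlen_q, seqlen_k, nheads, headdim, causal=False):
    # Forward = two matmuls per head: S = Q @ K^T and O = P @ V,
    # each costing 2 * seqlen_q * seqlen_k * headdim FLOPs (multiply + add).
    flops = 4 * batch * nheads * seqlen_q * seqlen_k * headdim
    # Causal masking skips roughly half of the score matrix
    # (an approximation that assumes seqlen_q == seqlen_k).
    return flops // 2 if causal else flops


def attention_bwd_flops(batch, seqlen_q, seqlen_k, nheads, headdim, causal=False):
    # Backward has 5 matmuls (recompute S, dV, dP, dQ, dK) vs. 2 in forward,
    # hence the conventional 2.5x factor.
    return 2.5 * attention_fwd_flops(batch, seqlen_q, seqlen_k, nheads, headdim, causal)


def tflops_per_sec(flops, runtime_ms):
    # Convert a measured runtime in milliseconds into achieved TFLOPS.
    return flops / (runtime_ms * 1e-3) / 1e12


if __name__ == "__main__":
    # Hypothetical shape and runtime, not measured data.
    f = attention_fwd_flops(batch=4, seqlen_q=8192, seqlen_k=8192, nheads=16, headdim=128)
    print(f"fwd: {tflops_per_sec(f, runtime_ms=4.0):.1f} TFLOPS")
```

Running the same shapes on H100 and H200 and comparing achieved TFLOPS makes the comparison independent of problem size.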
Can you try the latest commit?
Hi @tridao, I tried the latest commit, but I hit an error when running the setup.
I have posted the error message as a separate issue.
Hi @tridao, after fixing the setup issue, I benchmarked the latest build.
However, on the latest commit H200 is consistently slower than H100 for bf16 forward, bf16 backward, and fp8 forward. The slowdown on H200 ranges from 1% to 3%.
That's great to hear! By the way, you can set these environment variables to make compilation faster (by disabling features):
I ran some benchmarks on H200 at bf16 and fp8 precision.
I found that in the forward pass H200 is slightly slower (4% on average), while in the backward pass H200 is slightly faster (3.5% on average).
Is the slower forward pass expected, given that H200 is the higher-end GPU compared to H100? Would we need something like an "FA3.5" that adapts to and exploits H200?
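An "on average" slowdown like the 4% and 3.5% figures above can be computed from paired per-configuration runtimes in more than one way. A minimal sketch, assuming each benchmark configuration yields one (H100, H200) runtime pair in milliseconds; the function names and the sample pairs in the test are hypothetical, not the measurements from this thread.

```python
from statistics import geometric_mean


def percent_slower(h100_ms, h200_ms):
    # Positive => H200 is slower than H100 by this percentage of H100's runtime;
    # negative => H200 is faster.
    return (h200_ms - h100_ms) / h100_ms * 100.0


def mean_percent_slower(pairs):
    # Arithmetic mean of per-configuration percentage differences.
    diffs = [percent_slower(h100, h200) for h100, h200 in pairs]
    return sum(diffs) / len(diffs)


def geomean_percent_slower(pairs):
    # Geometric mean of runtime ratios (H200 / H100), expressed as a percentage.
    ratios = [h200 / h100 for h100, h200 in pairs]
    return (geometric_mean(ratios) - 1.0) * 100.0
```

For small differences like 1% to 4% the two averages agree closely; the geometric mean of ratios is the more principled choice when per-configuration ratios vary widely.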