feat: group-query-attention implementation #74
Conversation
Force-pushed from cb6e816 to 179052b (Compare)
Short update after today's changes: following our decision to try out https://github.com/Dao-AILab/flash-attention/tree/main instead of our own (Group Query) Attention implementation, we provided a first draft; two things remain to be done. For reference, a minimal sketch of the flash-attn call with grouped KV heads follows below.
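This is only a sketch (not the draft from this PR), assuming the packaged `flash_attn_func` interface, illustrative head counts, and a CUDA device with fp16 tensors:

```python
# Minimal sketch (not the PR's draft): flash-attn handles GQA natively when
# the KV tensors carry fewer heads than the query tensor, as long as the
# query head count is a multiple of the KV head count.
import torch
from flash_attn import flash_attn_func

batch, seqlen, head_dim = 2, 1024, 64
n_q_heads, n_kv_heads = 32, 8  # illustrative GQA ratio of 4

# flash_attn_func expects (batch, seqlen, nheads, head_dim) in fp16/bf16 on CUDA.
q = torch.randn(batch, seqlen, n_q_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, seqlen, n_kv_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, seqlen, n_kv_heads, head_dim, device="cuda", dtype=torch.float16)

# No manual repetition of KV heads is needed; the kernel groups query heads
# onto the shared KV heads internally.
out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, n_q_heads, head_dim)
```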
I won't be able to have a look at this until Thursday this week.
Regarding your first point: I ran a benchmark, and it is even faster than the previous PyTorch flash attention implementation at the 3B parameter scale. Regarding your second point:
Re-opened version of #41.
Potential solution for handling the combination of GQA and FlashAttention: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
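A rough sketch of what that could look like, assuming query/key/value tensors in the (batch, heads, seq, head_dim) layout SDPA expects and illustrative head counts: the simplest portable route is to repeat the KV heads up to the query head count before calling `scaled_dot_product_attention` (recent PyTorch releases also expose an `enable_gqa` flag that skips the manual repeat):

```python
# Sketch only: GQA on top of torch.nn.functional.scaled_dot_product_attention
# by expanding the KV heads to match the query heads.
import torch
import torch.nn.functional as F

batch, seqlen, head_dim = 2, 1024, 64
n_q_heads, n_kv_heads = 32, 8          # illustrative; each KV head serves 4 query heads
group_size = n_q_heads // n_kv_heads

# SDPA expects (batch, n_heads, seqlen, head_dim).
q = torch.randn(batch, n_q_heads, seqlen, head_dim)
k = torch.randn(batch, n_kv_heads, seqlen, head_dim)
v = torch.randn(batch, n_kv_heads, seqlen, head_dim)

# Repeat each KV head group_size times along the head dimension so the
# shapes line up with the query tensor; SDPA then dispatches to its
# flash-attention backend where available.
k_rep = k.repeat_interleave(group_size, dim=1)
v_rep = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k_rep, v_rep, is_causal=True)
# out: (batch, n_q_heads, seqlen, head_dim)

# Newer PyTorch versions (2.5+, if available) can skip the manual repeat:
# out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
```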