
feat: group-query-attention implementation #74

Merged: 24 commits into main from GQA_2 on Mar 20, 2024

Conversation

@flxst (Member) commented Mar 12, 2024

Re-opened version of #41.

Potential solution for handling the combination of GQA and FlashAttention: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
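
For illustration only (not necessarily the approach taken in this PR): with scaled_dot_product_attention, GQA can be emulated by repeating the key/value heads so that each group of query heads shares one K/V head, after which SDPA can dispatch to a FlashAttention kernel when the inputs allow it. The shapes and the helper name below are made up for the example.

```python
import torch
import torch.nn.functional as F

def gqa_sdpa(q, k, v, is_causal=True):
    """Sketch of grouped-query attention on top of SDPA.

    q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d), with n_q_heads % n_kv_heads == 0.
    """
    n_rep = q.shape[1] // k.shape[1]
    # Expand each K/V head n_rep times so the head count matches the queries.
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    # PyTorch picks a fused (FlashAttention) kernel here when shapes/dtypes permit.
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)

# Toy usage: 8 query heads sharing 2 key/value heads (group size 4).
B, T, d = 2, 16, 64
q = torch.randn(B, 8, T, d)
k = torch.randn(B, 2, T, d)
v = torch.randn(B, 2, T, d)
out = gqa_sdpa(q, k, v)  # (B, 8, T, d)
```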

@flxst added the enhancement label on Mar 12, 2024
@fromm-m previously approved these changes on Mar 12, 2024
@fromm-m dismissed their stale review on March 12, 2024 14:01

wrong PR

@le1nux force-pushed the main branch 3 times, most recently from cb6e816 to 179052b on March 13, 2024 22:14
@lhahn-iis (Contributor) commented

Short update after our changes from today:

After our decision today to try out https://github.com/Dao-AILab/flash-attention/tree/main instead of our own (Group Query) Attention implementation, we provided a first draft.
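
As a point of reference, a minimal sketch of how flash-attn exposes grouped-query attention: flash_attn_func accepts K/V tensors with fewer heads than Q and handles the grouping internally. This is illustrative only, not the draft in this PR; it assumes a CUDA device and fp16/bf16 tensors, with the layout (batch, seqlen, heads, head_dim).

```python
import torch
from flash_attn import flash_attn_func

B, T, d = 2, 128, 64
q = torch.randn(B, T, 8, d, device="cuda", dtype=torch.bfloat16)  # 8 query heads
k = torch.randn(B, T, 2, d, device="cuda", dtype=torch.bfloat16)  # 2 key/value heads
v = torch.randn(B, T, 2, d, device="cuda", dtype=torch.bfloat16)

# GQA: the 2 K/V heads are shared across the 8 query heads by the kernel itself.
out = flash_attn_func(q, k, v, causal=True)  # (B, T, 8, d)
```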

Now two things remain to be done:

  • Benchmark the new implementation against the previous one. For this, @fromm-m agreed to launch some test setups on Leonardo and check the throughput.
  • Add a remark about the installation of flash-attn. The issue is that flash-attn has some prerequisites: at least CUDA 11.6 must be installed, and on top of that some build dependencies are needed. The authors mention here that a non-isolated build environment should be used (probably so the compilation step can access the installed CUDA version). Unfortunately, this is not trivial to express in our own pyproject.toml. My suggestion would be either to check whether this is achievable with a clever trick (e.g. using this, though I have real doubts that it works), or alternatively to add a corresponding remark in the README.md.

I won't be able to have a look at this until Thursday this week.

@fromm-m (Member) commented Mar 19, 2024

(Quoting @lhahn-iis's update above.)

Regarding your first point:

I did a benchmark, and the new implementation is even faster than the previous PyTorch flash attention implementation at the 3B parameter scale.
(Screenshot: benchmark results, 2024-03-19 14:32)

Regarding your second point:
We opened a new issue #86 to refactor the README, where we will also describe the installation of FlashAttention.

@le1nux self-requested a review on March 20, 2024 13:03
@fromm-m requested a review from mali-git on March 20, 2024 13:15
@mali-git merged commit adb1f28 into main on Mar 20, 2024
@fromm-m deleted the GQA_2 branch on June 17, 2024 12:22