flexattn with qwen2 #81

NonvolatileMemory · 2024-11-18T13:07:51Z

It seems flex_attention does not support num_heads=28?

drisspg · 2024-11-18T22:07:05Z

Do you have a repro? I just tried this and it appears to work for me. Note that I'm on a nightly build of PyTorch.

import torch

from torch.nn.attention.flex_attention import flex_attention, create_block_mask


def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx


b, h, s, d = 1, 28, 256, 64
tens = torch.rand(b, h, s, d, device="cuda")

flex = torch.compile(flex_attention)

bm = create_block_mask(causal_mask, None, None, s, s)

print(flex(tens, tens, tens, block_mask=bm))

NonvolatileMemory · 2024-11-20T06:27:10Z

Hi!

Here is my code

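```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# flex_attn is not defined in the snippet as posted; assumed to be the compiled op.
flex_attn = torch.compile(flex_attention)


def diff(bsz=4, seq_len=1024, d_head=128, num_heads=28, block_size=4):
    # GQA layout: 28 query heads, 4 key/value heads.
    Q = torch.randn(bsz, num_heads, seq_len, d_head)#.cuda()
    K = torch.randn(bsz, 4, seq_len, d_head)#.cuda()
    V = torch.randn(bsz, 4, seq_len, d_head)#.cuda()

    # torch_attn reference (disabled): torch_mask is not defined in this
    # snippet, and K has 4 heads, so this matmul cannot broadcast against
    # the 28 heads of Q as written.
    # scores = torch.matmul(Q, K.permute(0, 1, 3, 2)) / (Q.size(-1) ** 0.5)
    # q_idx = torch.arange(seq_len).view(-1, 1)
    # kv_idx = torch.arange(seq_len).view(1, -1)
    # mask = torch_mask(q_idx, kv_idx, block_size)[None, None, :, :].cuda()
    # scores = scores.masked_fill(~mask, float('-inf'))
    # attn_weights = F.softmax(scores, dim=-1)
    # torch_out = torch.matmul(attn_weights, V)

    sub_block_mask = create_block_mask(block_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, _compile=True)
    flex_out = flex_attn(Q, K, V, block_mask=sub_block_mask, enable_gqa=True)
    return flex_out
    # return (flex_out[:, :, 16:] - torch_out[:, :, 16:]).max()


def block_mask(b, h, q_idx, kv_idx):
    # Allow attention only to keys in strictly earlier blocks of 4 tokens.
    q_block = q_idx // 4
    kv_block = kv_idx // 4
    return q_block > kv_block
```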
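For a dense comparison against the flex output, the 4 key/value heads have to be expanded to match the 28 query heads before the matmul. The following is a minimal sketch, not the poster's code: `dense_block_mask` and `dense_gqa_reference` are illustrative names, and the head grouping (query head `h` reading key/value head `h // 7`) is an assumption about how `enable_gqa=True` pairs heads. It also shows that with `q_block > kv_block` the first block of queries sees no keys at all, so those rows come out as NaN in a dense softmax.

```python
import torch
import torch.nn.functional as F


def dense_block_mask(seq_len, block_size=4):
    # True where the query's block index is strictly greater than the key's.
    q_idx = torch.arange(seq_len).view(-1, 1)
    kv_idx = torch.arange(seq_len).view(1, -1)
    return (q_idx // block_size) > (kv_idx // block_size)


def dense_gqa_reference(Q, K, V, mask):
    # Assumption: query head h reads key/value head h // group_size.
    group_size = Q.size(1) // K.size(1)
    K = K.repeat_interleave(group_size, dim=1)
    V = V.repeat_interleave(group_size, dim=1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), V)


# Small CPU-sized shapes, chosen only for illustration.
mask = dense_block_mask(seq_len=16)
Q = torch.randn(1, 28, 16, 8)
K = torch.randn(1, 4, 16, 8)
V = torch.randn(1, 4, 16, 8)
out = dense_gqa_reference(Q, K, V, mask)

print(out.shape)                       # torch.Size([1, 28, 16, 8])
print(out[:, :, :4].isnan().all())     # first block: every row fully masked
print(out[:, :, 4:].isnan().any())     # later rows: at least one visible key
```

Any comparison with the flex output would therefore need to skip the fully masked leading rows.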

NonvolatileMemory · 2024-11-20T06:28:02Z

Maybe it's because I am using torch 2.5.0 instead of nightly?

drisspg · 2024-11-20T17:16:37Z

Yeah, potentially. Would you mind trying nightly?
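To confirm which build is actually being imported, the version string can be checked before rerunning. This is a small sketch, assuming nightly builds carry a `.dev` date tag in `torch.__version__` while release builds such as 2.5.0 do not; `is_nightly` is just an illustrative variable.

```python
import torch

# Release builds report a plain version; nightly builds are assumed to
# include a ".dev" date tag in the version string.
version = torch.__version__
is_nightly = "dev" in version
print(version, "nightly" if is_nightly else "release")
```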

NonvolatileMemory · 2025-01-03T10:11:07Z

Another possibility: maybe your snippet does not use GQA, but mine does?

drisspg · 2025-01-04T01:31:59Z

Do you have an exact repro?
