The fact that it's possible to create arbitrary score mod / mask mod patterns is really powerful!
I'm wondering if there is any way to reason about the efficiency of different masking patterns (if this is a relevant consideration)?
For example, is a 'full' score_mod (e.g. one returning bias[b, h, i, j], where bias is an explicitly materialised attention bias tensor) going to yield any efficiency gains over manually adding the bias to the attention logits? And what are the relative efficiencies of, say, structured vs. random sparsity patterns in mask_mod?
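To make the comparison concrete, here is a minimal sketch of the two variants I have in mind (shapes, dtypes and tensor names are just for illustration):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 4, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# Fully materialised [B, H, S, S] attention bias.
bias = torch.randn(B, H, S, S, device="cuda", dtype=torch.float16)

# Variant 1: a 'full' score_mod that looks the bias up per (b, h, q_idx, kv_idx).
def bias_score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[b, h, q_idx, kv_idx]

out_flex = flex_attention(q, k, v, score_mod=bias_score_mod)

# Variant 2: the baseline of adding the bias to the logits by hand,
# here via scaled_dot_product_attention's additive attn_mask.
out_ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```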
Thanks
@alex-hh Generally speaking, the less memory you have to access from outside the kernel, the better. So loading from a full bias (i.e. of size S^2) is going to be slower than loading from a 1D bias (i.e. of size S), which in turn is going to be slower than not loading anything at all and computing the modification directly from the indices.
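For concreteness, a rough sketch of the three regimes (tensor names and shapes here are illustrative assumptions, not part of the library API):

```python
import torch

S = 1024

# (a) Full 2D bias: O(S^2) values streamed in from global memory.
full_bias = torch.randn(S, S, device="cuda")
def full_bias_mod(score, b, h, q_idx, kv_idx):
    return score + full_bias[q_idx, kv_idx]

# (b) 1D bias: O(S) values, e.g. a learned bias indexed by relative distance.
rel_bias = torch.randn(2 * S - 1, device="cuda")
def rel_bias_mod(score, b, h, q_idx, kv_idx):
    return score + rel_bias[q_idx - kv_idx + S - 1]

# (c) No extra loads at all: the modification is computed from values already
#     available in the kernel (here, tanh soft-capping of the logits).
def softcap_mod(score, b, h, q_idx, kv_idx):
    return 30.0 * torch.tanh(score / 30.0)
```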
For sparsity, FlexAttention is fundamentally block-sparse: it only skips work when an entire block of the attention mask is masked out. So pure random sparsity, which rarely leaves whole blocks empty, is unlikely to help much.
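As a quick illustration (a sketch with made-up sizes), create_block_mask reports how many blocks can actually be skipped; a structured mask skips far more of them than a random mask with a similar fraction of masked entries:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

S, WINDOW = 4096, 256

# Structured sparsity: a causal sliding window empties whole tiles far from
# the diagonal, so those blocks are skipped entirely.
def sliding_window(b, h, q_idx, kv_idx):
    return (q_idx >= kv_idx) & (q_idx - kv_idx < WINDOW)

# Unstructured sparsity: entries are masked independently, so almost every
# tile still contains at least one live entry and must be computed.
keep = torch.rand(S, S, device="cuda") > 0.9  # ~90% of entries masked
def random_mask(b, h, q_idx, kv_idx):
    return keep[q_idx, kv_idx]

structured = create_block_mask(sliding_window, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
unstructured = create_block_mask(random_mask, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")

# Roughly the same fraction of entries is masked in both, but only the
# structured mask translates into a high fraction of skippable blocks.
print(structured.sparsity(), unstructured.sparsity())
```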
Regarding block sparsity - does this mean that given a particular mask_mod pattern, there is potentially an optimal way of permuting the inputs before applying flex attention?
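For illustration (a toy sketch, with document ids and sizes made up by me): a document-packing mask over interleaved tokens has poor block sparsity, but sorting the tokens by document id (and permuting q/k/v the same way) turns the same logical mask into dense diagonal blocks:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

S = 4096
# Tokens from 8 hypothetical "documents", interleaved; attention is only
# allowed within a document.
doc_id = torch.randint(0, 8, (S,), device="cuda")

def doc_mask(ids):
    def mask_mod(b, h, q_idx, kv_idx):
        return ids[q_idx] == ids[kv_idx]
    return mask_mod

# Same logical mask, but with tokens permuted so each document is contiguous.
perm = torch.argsort(doc_id)
scattered = create_block_mask(doc_mask(doc_id), None, None, S, S, device="cuda")
grouped = create_block_mask(doc_mask(doc_id[perm]), None, None, S, S, device="cuda")

# The grouped layout lets far more blocks be skipped; q, k and v would need
# the same permutation, and the output would be un-permuted afterwards.
print(scattered.sparsity(), grouped.sparsity())
```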