[QST] masking steps in flash decoding #1449

aws-jiadingg · 2025-01-17T23:05:59Z

Flash decoding divides the sequence blocks into a series of splits, with each split assigned to a thread block. However, in the masking-steps loop (code), every split goes through the same masking process, even though only the final split might actually need it. Is this the intended behavior? Should there be control logic so that only the final split goes through this "masking steps" loop? Thanks!

tridao · 2025-01-18T04:13:46Z

The outputs are still correct with the extra masking iterations: the mask takes m_block and n_block as inputs, so for blocks that don't go out of bounds the masking code leaves the elements unchanged.
Having separate masking iterations is just a speed optimization. You could add a check so that only the final split masks, but that seems more complicated, and it likely isn't faster during decoding, where the bottleneck is loading KV rather than computation.
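
To make the "extra masking iterations are harmless" point concrete, here is a minimal pure-Python sketch. It is not the library's CUDA code: `apply_mask`, `block_m`, `block_n`, and the tile layout are hypothetical names chosen for illustration, and the mask is assumed to be a bounds check on the key index plus a right-aligned causal limit. Under those assumptions, a tile that lies entirely in bounds comes back unchanged, and only the tile that crosses `seqlen_k` is modified.

```python
import math

def apply_mask(scores, m_block, n_block, block_m, block_n, seqlen_q, seqlen_k):
    """Return a copy of one score tile with hidden entries set to -inf.

    Hypothetical helper: the tile covers query rows starting at
    m_block * block_m and key columns starting at n_block * block_n.
    """
    out = []
    for i, row in enumerate(scores):
        q_idx = m_block * block_m + i
        # Assumed right-aligned causal limit: last key this query may attend to.
        col_limit = q_idx + seqlen_k - seqlen_q
        new_row = []
        for j, s in enumerate(row):
            k_idx = n_block * block_n + j
            hidden = k_idx >= seqlen_k or k_idx > col_limit
            new_row.append(-math.inf if hidden else s)
        out.append(new_row)
    return out

# Decoding-like setup: one query token, 10 cached keys, key tiles of width 4.
# Key tiles 0 and 1 are fully in bounds; tile 2 covers keys 8..11.
tile = [[1.0, 2.0, 3.0, 4.0]]
for n_block in range(3):
    masked = apply_mask(tile, 0, n_block, 1, 4, seqlen_q=1, seqlen_k=10)
    print(n_block, masked == tile)
```

In this setup only the last key tile changes, regardless of which split runs the masking loop, which is why restricting the loop to the final split would save computation but not affect correctness.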
