Checklist
Motivation
Hi folks, I was wondering if there have been any thoughts or investigations on supporting star attention in SGLang (https://arxiv.org/abs/2411.17116, https://github.com/NVIDIA/Star-Attention), which shows significant performance speedups while preserving accuracy.
From the paper:
in the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention.
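To make the two phases concrete, here is a minimal single-process, single-head PyTorch sketch of the attention pattern only (this is not SGLang code; all names are illustrative, and in the real scheme the context blocks live on different hosts and phase 1 runs in parallel across them):

```python
import torch
import torch.nn.functional as F

def phase1_blockwise_local(x_blocks, wq, wk, wv):
    """Phase 1: each context block attends causally to the anchor block
    plus itself (blockwise-local attention). Returns per-block hidden
    states and the per-block KV cache."""
    anchor = x_blocks[0]  # the first context block doubles as the anchor
    outs, kv_cache = [], []
    for i, blk in enumerate(x_blocks):
        local = blk if i == 0 else torch.cat([anchor, blk], dim=0)
        q, k, v = local @ wq, local @ wk, local @ wv
        o = F.scaled_dot_product_attention(
            q[None], k[None], v[None], is_causal=True)[0]
        outs.append(o[-blk.shape[0]:])  # hidden states for this block's tokens
        # cache only this block's own KV; the anchor copy is discarded
        kv_cache.append((k[-blk.shape[0]:], v[-blk.shape[0]:]))
    return outs, kv_cache

def phase2_sequence_global(query, kv_cache, wq, wk, wv):
    """Phase 2: query tokens attend to every cached context token plus,
    causally, to themselves (sequence-global attention)."""
    q, kq, vq = query @ wq, query @ wk, query @ wv
    k = torch.cat([k for k, _ in kv_cache] + [kq], dim=0)
    v = torch.cat([v for _, v in kv_cache] + [vq], dim=0)
    n_q = q.shape[0]
    n_ctx = k.shape[0] - n_q
    # allow all context positions; causal mask over the query's own tokens
    mask = torch.ones(n_q, n_ctx + n_q, dtype=torch.bool)
    mask[:, n_ctx:] = torch.tril(torch.ones(n_q, n_q, dtype=torch.bool))
    return F.scaled_dot_product_attention(
        q[None], k[None], v[None], attn_mask=mask)[0]
```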
From my own investigation, this seems like it would be challenging to integrate into SGLang. For example:
Changes to the top-level API to support passing context and query
A different type of parallel process group (e.g., a star-attention process group whose ranks each load the full model and support distributed communication across processes)
The two phases of star attention seem like sub-phases of the EXTEND forward mode, which would require significant changes to the scheduler and additional forked logic in the model forward-pass code.
Star attention phase 2 requires performing an online softmax. This would require changes to the attention backends to return the log-sum-exp (LSE), which is probably less performant (see the sketch after this list).
Handling the KV cache correctly (e.g., discarding the anchor block's and the other context blocks' KV cache in phase 2) and edge cases around prefix caching (e.g., if the cached prefix extends into the query, phase 1 would be skipped entirely; or if multiple requests in a batch need to undergo different phases of star attention).
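To make the LSE requirement concrete, here is a hedged sketch (plain PyTorch with hypothetical helper names, not SGLang's actual backend API) of how two partial attention results over disjoint KV slices can be merged using their log-sum-exps, which is the kind of merge phase 2 would need when each host holds only its own context block's KV:

```python
import torch

def partial_attn_with_lse(q, k, v, scale):
    """Attention over one KV slice, returning both the output and the
    log-sum-exp of the scores so partial results can be merged later."""
    scores = (q @ k.transpose(-1, -2)) * scale   # (Lq, Lk)
    lse = torch.logsumexp(scores, dim=-1)        # (Lq,)
    out = torch.softmax(scores, dim=-1) @ v      # (Lq, d)
    return out, lse

def merge_partials(out_a, lse_a, out_b, lse_b):
    """Numerically stable online-softmax merge of two attention partials."""
    lse = torch.logaddexp(lse_a, lse_b)          # combined normalizer
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)
    return w_a * out_a + w_b * out_b, lse

# sanity check: merging two halves equals attention over the full KV
q = torch.randn(4, 64); k = torch.randn(256, 64); v = torch.randn(256, 64)
scale = 64 ** -0.5
o1, l1 = partial_attn_with_lse(q, k[:128], v[:128], scale)
o2, l2 = partial_attn_with_lse(q, k[128:], v[128:], scale)
merged, _ = merge_partials(o1, l1, o2, l2)
full = torch.softmax((q @ k.T) * scale, dim=-1) @ v
assert torch.allclose(merged, full, atol=1e-5)
```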
Since I’m still getting familiar with SGLang’s codebase, I’d love to hear any insights from the team on whether this would be feasible to integrate. Thank you!
Related resources
GitHub: https://github.com/NVIDIA/Star-Attention
Paper: https://arxiv.org/abs/2411.17116
Thanks @zhyncs! If you have a specific timeline in mind, please let me know. We are also willing to help expedite this effort in case you need it. Thanks!