Checklist
Motivation
Hi folks, I was wondering if there have been any thoughts or investigations on supporting star attention in SGLang (https://arxiv.org/abs/2411.17116, https://github.com/NVIDIA/Star-Attention), which shows significant performance speedups while preserving accuracy.
From the paper:
in the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention.
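To make the two phases concrete, here is a minimal single-process, single-head PyTorch sketch of the attention pattern only (this is not SGLang code; all names are illustrative, and in the real scheme the context blocks live on different hosts and phase 1 runs in parallel across them):

```python
import torch
import torch.nn.functional as F

def phase1_blockwise_local(x_blocks, wq, wk, wv):
    """Phase 1: each context block attends causally to the anchor block
    plus itself (blockwise-local attention). Returns per-block hidden
    states and the per-block KV cache."""
    anchor = x_blocks[0]  # the first context block doubles as the anchor
    outs, kv_cache = [], []
    for i, blk in enumerate(x_blocks):
        local = blk if i == 0 else torch.cat([anchor, blk], dim=0)
        q, k, v = local @ wq, local @ wk, local @ wv
        o = F.scaled_dot_product_attention(
            q[None], k[None], v[None], is_causal=True)[0]
        outs.append(o[-blk.shape[0]:])  # hidden states for this block's tokens
        # cache only this block's own KV; the anchor copy is discarded
        kv_cache.append((k[-blk.shape[0]:], v[-blk.shape[0]:]))
    return outs, kv_cache

def phase2_sequence_global(query, kv_cache, wq, wk, wv):
    """Phase 2: query tokens attend to every cached context token plus,
    causally, to themselves (sequence-global attention)."""
    q, kq, vq = query @ wq, query @ wk, query @ wv
    k = torch.cat([k for k, _ in kv_cache] + [kq], dim=0)
    v = torch.cat([v for _, v in kv_cache] + [vq], dim=0)
    n_q = q.shape[0]
    n_ctx = k.shape[0] - n_q
    # allow all context positions; causal mask over the query's own tokens
    mask = torch.ones(n_q, n_ctx + n_q, dtype=torch.bool)
    mask[:, n_ctx:] = torch.tril(torch.ones(n_q, n_q, dtype=torch.bool))
    return F.scaled_dot_product_attention(
        q[None], k[None], v[None], attn_mask=mask)[0]
```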
From my own investigation, this seems like it would be challenging to integrate into SGLang. For example:
Changes to the top-level API to support passing context and query
A different type of parallel process group (e.g., a star-attention process group whose ranks each load the full model and support distributed communication across processes)
The two phases of star attention seem like sub-phases of the EXTEND forward mode, which would require significant changes to the scheduler and additional forked logic in the model forward-pass code.
Star attention phase 2 requires performing an online softmax. This would require changes to the attention backends to return the log-sum-exp (LSE), which is probably less performant (see the sketch after this list).
Handling the KV cache correctly (e.g., discarding the anchor block's and the other context blocks' KV cache in phase 2) and edge cases around prefix caching (e.g., if the cached prefix extends into the query, phase 1 would be skipped entirely; or if multiple requests in a batch need to undergo different phases of star attention).
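To make the LSE requirement concrete, here is a hedged sketch (plain PyTorch with hypothetical helper names, not SGLang's actual backend API) of how two partial attention results over disjoint KV slices can be merged using their log-sum-exps, which is the kind of merge phase 2 would need when each host holds only its own context block's KV:

```python
import torch

def partial_attn_with_lse(q, k, v, scale):
    """Attention over one KV slice, returning both the output and the
    log-sum-exp of the scores so partial results can be merged later."""
    scores = (q @ k.transpose(-1, -2)) * scale   # (Lq, Lk)
    lse = torch.logsumexp(scores, dim=-1)        # (Lq,)
    out = torch.softmax(scores, dim=-1) @ v      # (Lq, d)
    return out, lse

def merge_partials(out_a, lse_a, out_b, lse_b):
    """Numerically stable online-softmax merge of two attention partials."""
    lse = torch.logaddexp(lse_a, lse_b)          # combined normalizer
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)
    return w_a * out_a + w_b * out_b, lse

# sanity check: merging two halves equals attention over the full KV
q = torch.randn(4, 64); k = torch.randn(256, 64); v = torch.randn(256, 64)
scale = 64 ** -0.5
o1, l1 = partial_attn_with_lse(q, k[:128], v[:128], scale)
o2, l2 = partial_attn_with_lse(q, k[128:], v[128:], scale)
merged, _ = merge_partials(o1, l1, o2, l2)
full = torch.softmax((q @ k.T) * scale, dim=-1) @ v
assert torch.allclose(merged, full, atol=1e-5)
```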
Since I’m still getting familiar with SGLang’s codebase, I’d love to hear any insights from the team on whether this would be feasible to integrate. Thank you!
Related resources
GitHub: https://github.com/NVIDIA/Star-Attention
Paper: https://arxiv.org/abs/2411.17116
Thanks @zhyncs! If you have a specific timeline in mind, please let me know. We are also willing to help expedite this effort in case you need it. Thanks!