[Feature] Star attention support #3131

shimizust opened this issue Jan 25, 2025 · 3 comments


Motivation

Hi folks, I was wondering whether there have been any thoughts or investigations into supporting star attention in SGLang (https://arxiv.org/abs/2411.17116, https://github.com/NVIDIA/Star-Attention), which shows significant performance speedups while preserving accuracy.

From the paper:

in the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention.
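
To make the two phases concrete, here is a minimal single-device sketch using PyTorch's scaled_dot_product_attention. The function names and tensor layout are illustrative assumptions on my part, not SGLang's or NVIDIA's API, and the actual scheme shards the context blocks across hosts rather than looping over them on one device:

```python
import torch
import torch.nn.functional as F


def _phase_mask(q_len, prefix_len, device):
    """Attend fully to the prefix (anchor/context) and causally within the new block."""
    mask = torch.zeros(q_len, prefix_len + q_len, dtype=torch.bool, device=device)
    mask[:, :prefix_len] = True
    mask[:, prefix_len:] = torch.tril(torch.ones(q_len, q_len, dtype=torch.bool, device=device))
    return mask


def phase1_blockwise_context(q, k, v, block_size):
    """Phase 1: each context block attends to the anchor (first) block plus itself."""
    # q, k, v: [batch, heads, ctx_len, head_dim]
    anchor_k, anchor_v = k[:, :, :block_size], v[:, :, :block_size]
    outputs = []
    for start in range(0, q.shape[2], block_size):
        q_blk = q[:, :, start:start + block_size]
        if start == 0:
            # The anchor block attends only to itself, causally.
            k_loc, v_loc = anchor_k, anchor_v
            mask = _phase_mask(q_blk.shape[2], 0, q.device)
        else:
            k_loc = torch.cat([anchor_k, k[:, :, start:start + block_size]], dim=2)
            v_loc = torch.cat([anchor_v, v[:, :, start:start + block_size]], dim=2)
            mask = _phase_mask(q_blk.shape[2], anchor_k.shape[2], q.device)
        outputs.append(F.scaled_dot_product_attention(q_blk, k_loc, v_loc, attn_mask=mask))
    return torch.cat(outputs, dim=2)


def phase2_global_query(q, k_all, v_all):
    """Phase 2: query/response tokens attend to all cached context KV plus themselves."""
    # k_all / v_all hold the phase-1 context KV followed by the query tokens' KV.
    ctx_len = k_all.shape[2] - q.shape[2]
    mask = _phase_mask(q.shape[2], ctx_len, q.device)
    return F.scaled_dot_product_attention(q, k_all, v_all, attn_mask=mask)
```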

From my own investigation, this seems like it would be challenging to integrate into SGLang, for example:

  • Changes to the top-level API to support passing the context and query separately
  • A different type of parallel process group (e.g., a star-attention process group in which each rank loads the full model and supports distributed communication across processes)
  • The two phases of star attention look like sub-phases of the EXTEND forward mode, which would require significant changes to the scheduler and additional forked logic in the model forward-pass code.
  • Star attention phase 2 requires performing an online softmax. This would require the attention backends to return the log-sum-exp (LSE) of the attention scores, which would probably be less performant (see the sketch after this list).
  • Handling the KV cache correctly (e.g., discarding the anchor block's and the other context blocks' KV cache in phase 2) and edge cases around prefix caching (e.g., skipping phase 1 entirely if the cached prefix extends into the query, or handling a batch in which different requests need to undergo different phases of star attention)
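
On the online-softmax point above: if each attention backend can return the LSE of its local scores alongside its local output, phase 2 can combine the per-host partial results into the exact global softmax. A hedged sketch of that combine step, with made-up names rather than an SGLang API:

```python
import torch


def merge_attention_outputs(outputs, lses):
    """Combine per-host partial attention results into the exact global softmax.

    outputs[h]: [num_q, num_heads, head_dim] local attention output from host h
    lses[h]:    [num_q, num_heads] log-sum-exp of host h's attention scores
    """
    out = torch.stack(outputs, dim=0)             # [H, num_q, heads, dim]
    lse = torch.stack(lses, dim=0)                # [H, num_q, heads]
    global_lse = torch.logsumexp(lse, dim=0)      # [num_q, heads]
    # Each host's output is normalized by its local softmax denominator, so
    # rescaling by exp(lse_h - lse_global) and summing reproduces attention
    # over the full (global) KV sequence.
    weights = torch.exp(lse - global_lse)         # [H, num_q, heads]
    return (weights.unsqueeze(-1) * out).sum(dim=0)
```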

Since I’m still getting familiar with SGLang’s codebase, I’d love to hear any insights from the team on whether this would be feasible to integrate. Thank you!

Related resources

GitHub: https://github.com/NVIDIA/Star-Attention
Paper: https://arxiv.org/abs/2411.17116


zhyncs commented Jan 25, 2025

I plan to integrate MInference into SGLang. Please stay tuned. I will begin after landing the sgl-kernel.

@shimizust (Author)

Awesome, that's great to hear! Thanks


momochen commented Feb 3, 2025

Thanks @zhyncs! If you have a specific timeline in mind, please let me know. We are also willing to help expedite this effort if you need it. Thanks!
