@terrytangyuan Thanks! Assigned to you! I will be needing some vLLM-based batch chat_completion soon to unblock evaluations, and would love your help (with PR review or the implementation)!
🚀 Describe the new functionality needed
Running inference for evaluations on large datasets can take hours. Enabling batch inference reduces the time needed to run inference over these datasets.
Quick vLLM inline batch inference benchmark numbers using https://gist.github.com/yanxi0830/4e424f5cfc9a736af800f662c68d0b76:
Llama3.1-70B with 80 prompts on 4 GPUs
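For context, a minimal sketch of what an inline vLLM batch run looks like (this is not the gist above; the model name, prompt count, and sampling settings are illustrative):

```python
# Sketch: batching prompts through vLLM's offline generate() API.
# Model, prompts, and sampling settings are placeholders for an eval workload.
import time
from vllm import LLM, SamplingParams

prompts = [f"Summarize document {i}." for i in range(80)]  # stand-in for eval prompts
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)

start = time.time()
outputs = llm.generate(prompts, sampling_params)  # one batched call instead of 80 sequential ones
print(f"{len(outputs)} completions in {time.time() - start:.1f}s")
```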
Providers Support
API
The API should expose the capability to run inference over a batch of prompts or dialogs in a single call (see the sketch below).
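A hedged sketch of what such an API could look like; the method names `batch_completion` / `batch_chat_completion` and their parameters are assumptions for illustration, not a confirmed Llama Stack signature:

```python
# Hypothetical interface sketch; names and fields are illustrative only.
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Message:
    role: str
    content: str


class BatchInference(Protocol):
    async def batch_completion(
        self,
        model: str,
        content_batch: List[str],
        sampling_params: Optional[dict] = None,
    ) -> List[str]:
        """Run text completion over a batch of prompts in a single call."""
        ...

    async def batch_chat_completion(
        self,
        model: str,
        messages_batch: List[List[Message]],
        sampling_params: Optional[dict] = None,
    ) -> List[Message]:
        """Run chat completion over a batch of dialogs in a single call."""
        ...
```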
💡 Why is this needed? What if we don't build it?
Batch inference support reduces the time required to run evals and improves inference efficiency.
Other thoughts
No response