
Batch Inference Support #1071

Open
yanxi0830 opened this issue Feb 13, 2025 · 2 comments

Labels: enhancement (New feature or request)

yanxi0830 (Contributor):
🚀 Describe the new functionality needed

Running inference for evaluations on large datasets can take hours. Batch inference support would substantially reduce that time.

Quick benchmark numbers for inline vLLM batch inference, using https://gist.github.com/yanxi0830/4e424f5cfc9a736af800f662c68d0b76:

On Llama3.1-70B with 80 prompts, 4 GPUs:

  • w/ batch inference: 2297.87 toks/s
  • w/o batch inference: 47.65 toks/s
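The difference largely comes from vLLM being able to schedule all prompts together rather than one at a time. A minimal sketch of the two call patterns using vLLM's offline LLM.generate API (this is not the benchmark gist linked above; the model id, sampling parameters, and prompts are placeholders):

from vllm import LLM, SamplingParams

# Load the model across 4 GPUs (tensor parallelism), mirroring the setup above.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
params = SamplingParams(max_tokens=256)
prompts = [f"Summarize document {i}" for i in range(80)]  # placeholder prompts

# Batched: one call; vLLM's continuous batching keeps the GPUs saturated.
batched_outputs = llm.generate(prompts, params)

# Unbatched: 80 sequential calls, each waiting for the previous one to finish.
sequential_outputs = [llm.generate([p], params)[0] for p in prompts]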

Providers Support

  • Inline
    • vLLM
  • Remote
    • We will need to experiment with and benchmark sending parallel requests (see the sketch after this list)
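A rough sketch of what that parallel-request experiment could look like, assuming an async remote_chat_completion placeholder that stands in for the actual remote provider client and an arbitrary concurrency cap (none of these names are part of the existing API):

import asyncio

async def remote_chat_completion(messages):
    # Placeholder for the remote provider's chat-completion call.
    ...

async def batch_via_parallel_requests(messages_batch, max_concurrency=16):
    # Fan out one request per item, bounded by a semaphore so we do not
    # overwhelm the remote endpoint; gather preserves input order.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(messages):
        async with sem:
            return await remote_chat_completion(messages)

    return await asyncio.gather(*(one(m) for m in messages_batch))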

API

The API should expose the capability to:

  • run batch inference inline
  • integrate with job management to create batch inference jobs

from typing import List, Optional, Protocol

from pydantic import BaseModel

# Message, SamplingParams, ToolDefinition, ToolChoice, ToolPromptFormat,
# ResponseFormat, LogProbConfig, ChatCompletionResponse, Job, and JobStatus,
# along with json_schema_type and webmethod, come from the existing
# llama-stack API types.


@json_schema_type
class BatchChatCompletionResponse(BaseModel):
    batch: List[ChatCompletionResponse]

class BatchInference(Protocol):
    @webmethod(route="/batch-inference/chat-completion", method="POST")
    async def batch_chat_completion(
        self,
        model: str,
        messages_batch: List[List[Message]],
        sampling_params: Optional[SamplingParams] = SamplingParams(),
        # zero-shot tool definitions as input to the model
        tools: Optional[List[ToolDefinition]] = None,
        tool_choice: Optional[ToolChoice] = ToolChoice.auto,
        tool_prompt_format: Optional[ToolPromptFormat] = None,
        response_format: Optional[ResponseFormat] = None,
        logprobs: Optional[LogProbConfig] = None,
    ) -> BatchChatCompletionResponse: ...

   @webmethod(route="/batch-inference/jobs", method="POST")
    async def schedule_batch(
        self,
        # we will need to define schema for file format
        file_id: str,
    ) -> Job: ...

    @webmethod(route="/batch-inference/jobs/{job_id}", method="GET")
    async def job_status(self, job_id: str) -> Optional[JobStatus]: ...

    @webmethod(route="/batch-inference/jobs/{job_id}", method="DELETE")
    async def job_cancel(self, job_id: str) -> None: ...

    @webmethod(route="/batch-inference/jobs/{job_id}/result", method="GET")
    async def job_result(self, job_id: str) -> BatchChatCompletionResponse: ...
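For illustration, a sketch of how a caller might drive this protocol end to end, assuming batch_inference is some provider implementing BatchInference; the model id, the Job attribute name (job_id), the JobStatus members, and the polling interval are all placeholders rather than settled API details:

import asyncio

async def run_batch(batch_inference: BatchInference, messages_batch, file_id: str):
    # Inline path: a single call that returns once every completion is done.
    response = await batch_inference.batch_chat_completion(
        model="Llama3.1-70B-Instruct",  # placeholder model id
        messages_batch=messages_batch,
    )
    print(f"{len(response.batch)} completions")

    # Job path: schedule from an uploaded file, poll until terminal, fetch results.
    job = await batch_inference.schedule_batch(file_id=file_id)
    while await batch_inference.job_status(job.job_id) not in (
        JobStatus.completed,  # assumed terminal states; actual names may differ
        JobStatus.failed,
    ):
        await asyncio.sleep(5)
    return await batch_inference.job_result(job.job_id)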

💡 Why is this needed? What if we don't build it?

Batch inference support reduces the time needed to run evals and improves overall inference efficiency.

Other thoughts

No response

yanxi0830 added the enhancement (New feature or request) label on Feb 13, 2025
terrytangyuan (Collaborator):

@yanxi0830 Is this something we can pick up? If so, feel free to assign. We've been looking to add a batch inference API as well.

yanxi0830 (Contributor, Author):

@terrytangyuan Thanks! Assigned to you! I will need a vLLM-based batch chat_completion soon to unblock evaluations, and would love your help (with PR review or implementation)!
