
Batch Inference Support #1071

Open
yanxi0830 opened this issue Feb 13, 2025 · 2 comments

Labels: enhancement (New feature or request)

yanxi0830 (Contributor):
🚀 Describe the new functionality needed

Running inference for evaluations on large datasets can take hours. Batch inference support would substantially reduce that time.

Quick benchmark numbers for inline vLLM batch inference, using https://gist.github.com/yanxi0830/4e424f5cfc9a736af800f662c68d0b76:

On Llama3.1-70B with 80 prompts, 4 GPUs:

  • w/ batch inference: 2297.87 toks/s
  • w/o batch inference: 47.65 toks/s
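The difference largely comes from vLLM being able to schedule all prompts together rather than one at a time. A minimal sketch of the two call patterns using vLLM's offline LLM.generate API (this is not the benchmark gist linked above; the model id, sampling parameters, and prompts are placeholders):

from vllm import LLM, SamplingParams

# Load the model across 4 GPUs (tensor parallelism), mirroring the setup above.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
params = SamplingParams(max_tokens=256)
prompts = [f"Summarize document {i}" for i in range(80)]  # placeholder prompts

# Batched: one call; vLLM's continuous batching keeps the GPUs saturated.
batched_outputs = llm.generate(prompts, params)

# Unbatched: 80 sequential calls, each waiting for the previous one to finish.
sequential_outputs = [llm.generate([p], params)[0] for p in prompts]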

Providers Support

  • Inline
    • vLLM
  • Remote
    • We will need to experiment with and benchmark sending parallel requests (see the sketch after this list)
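A rough sketch of what that parallel-request experiment could look like, assuming an async remote_chat_completion placeholder that stands in for the actual remote provider client and an arbitrary concurrency cap (none of these names are part of the existing API):

import asyncio

async def remote_chat_completion(messages):
    # Placeholder for the remote provider's chat-completion call.
    ...

async def batch_via_parallel_requests(messages_batch, max_concurrency=16):
    # Fan out one request per item, bounded by a semaphore so we do not
    # overwhelm the remote endpoint; gather preserves input order.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(messages):
        async with sem:
            return await remote_chat_completion(messages)

    return await asyncio.gather(*(one(m) for m in messages_batch))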

API

The API should expose the capability to:

  • run batch inference inline
  • integrate with job management to create batch inference jobs

from typing import List, Optional, Protocol

from pydantic import BaseModel

# Message, SamplingParams, ToolDefinition, ToolChoice, ToolPromptFormat,
# ResponseFormat, LogProbConfig, ChatCompletionResponse, Job, and JobStatus,
# along with json_schema_type and webmethod, come from the existing
# llama-stack API types.


@json_schema_type
class BatchChatCompletionResponse(BaseModel):
    batch: List[ChatCompletionResponse]

class BatchInference(Protocol):
    @webmethod(route="/batch-inference/chat-completion", method="POST")
    async def batch_chat_completion(
        self,
        model: str,
        messages_batch: List[List[Message]],
        sampling_params: Optional[SamplingParams] = SamplingParams(),
        # zero-shot tool definitions as input to the model
        tools: Optional[List[ToolDefinition]] = None,
        tool_choice: Optional[ToolChoice] = ToolChoice.auto,
        tool_prompt_format: Optional[ToolPromptFormat] = None,
        response_format: Optional[ResponseFormat] = None,
        logprobs: Optional[LogProbConfig] = None,
    ) -> BatchChatCompletionResponse: ...

   @webmethod(route="/batch-inference/jobs", method="POST")
    async def schedule_batch(
        self,
        # we will need to define schema for file format
        file_id: str,
    ) -> Job: ...

    @webmethod(route="/batch-inference/jobs/{job_id}", method="GET")
    async def job_status(self, job_id: str) -> Optional[JobStatus]: ...

    @webmethod(route="/batch-inference/jobs/{job_id}", method="DELETE")
    async def job_cancel(self, job_id: str) -> None: ...

    @webmethod(route="/batch-inference/jobs/{job_id}/result", method="GET")
    async def job_result(self, job_id: str) -> BatchChatCompletionResponse: ...
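For illustration, a sketch of how a caller might drive this protocol end to end, assuming batch_inference is some provider implementing BatchInference; the model id, the Job attribute name (job_id), the JobStatus members, and the polling interval are all placeholders rather than settled API details:

import asyncio

async def run_batch(batch_inference: BatchInference, messages_batch, file_id: str):
    # Inline path: a single call that returns once every completion is done.
    response = await batch_inference.batch_chat_completion(
        model="Llama3.1-70B-Instruct",  # placeholder model id
        messages_batch=messages_batch,
    )
    print(f"{len(response.batch)} completions")

    # Job path: schedule from an uploaded file, poll until terminal, fetch results.
    job = await batch_inference.schedule_batch(file_id=file_id)
    while await batch_inference.job_status(job.job_id) not in (
        JobStatus.completed,  # assumed terminal states; actual names may differ
        JobStatus.failed,
    ):
        await asyncio.sleep(5)
    return await batch_inference.job_result(job.job_id)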

💡 Why is this needed? What if we don't build it?

Batch inference support reduces the time needed to run evals and improves overall inference efficiency.

Other thoughts

No response

yanxi0830 added the enhancement (New feature or request) label on Feb 13, 2025
terrytangyuan (Collaborator):

@yanxi0830 Is this something we can pick up? If so, feel free to assign. We've been looking to add a batch inference API as well.

yanxi0830 (Contributor, Author):

@terrytangyuan Thanks! Assigned to you! I will need a vLLM-based batch chat_completion soon to unblock evaluations, and would love your help (with PR review or implementation)!
