Replies: 1 comment

The draft implementation of this (very incomplete and buggy) can be found at: booxter#1 (posted as a PR in my fork for an easier reading experience); hopefully it's helpful when trying to make sense of the write-up above.
Hi all.
This is a proposal to streamline jobs management through the API, as well as to give providers tools to manage asynchronous and even remote jobs without blocking access to the API.
I converted it from a Google Document using its Markdown copy-paste feature. I hope it worked (it seems to), but if you find any issues, please let me know and I will confirm whether they are conversion problems.
Context
Providers need to be able to run long running tasks (jobs) asynchronously, letting the user interact with the API, monitor the status of the job, etc. This is currently not possible, though asynchronous behavior is modeled in *some* Llama Stack APIs.
For example, this results in API client timeouts when executing training actions against the `torchtune` provider. This is because the provider doesn’t return control to the API caller until the training run is complete, which may take longer than the API timeout. This proposal attempts to address this issue of blocking calls to inline providers. It also attempts to streamline the API used to interact with long running Jobs.
Note: the proposal does not attempt to force all providers to use the same scheduler backend; the choice of a backend per provider is baked into the proposed design. That said, the proposal attempts to build a common jobs interface and useful tools to work with long running tasks (jobs) that can be shared by providers. The actual adoption of the common patterns will depend on the usefulness of the proposed tools.
Note: the proposal doesn’t affect APIs that do not implement Jobs behavior. If an API doesn’t need to return a Job entity to the user for later status monitoring, artifact extraction, etc., then this document doesn’t apply to it at all.
Terminology
Requirements
Out of Scope
UX
Jobs are a general concept that can be used by different APIs. We’d like to have a uniform way of dealing with jobs. Hence the proposal to introduce a new `/jobs` endpoint to llama-server.

Note: a new `/jobs` API is not aligned with the current approach to managing jobs as declared for the `post-training` API, where a job status is extracted using the `post-training` API itself. This proposal suggests that we deprecate and retire the `post-training` jobs API endpoints, but it doesn’t require this to happen in the near future (or at all).

Additional Job states may be introduced to cover additional Scheduler states we may need (e.g. “new” or “cancelled” or “scheduled” or “paused”...).
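To make this concrete, the consolidated state set could be modeled as a single shared enum along the lines of the sketch below; the member names beyond those mentioned above are assumptions rather than a finalized list.

```python
from enum import Enum


class JobStatus(str, Enum):
    """Hypothetical consolidated Job state set; exact members are assumptions."""

    new = "new"              # accepted but not yet queued
    scheduled = "scheduled"  # queued, waiting for the Scheduler to pick it up
    running = "running"
    paused = "paused"
    completed = "completed"
    failed = "failed"
    cancelled = "cancelled"
```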
Python SDK
From a Python SDK user’s perspective, the interaction with the server could look as follows:
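A rough, hypothetical sketch of such an interaction (the `client.jobs` resource, its method names, and the returned fields are assumptions, not an existing llama-stack-client surface):

```python
# Hypothetical sketch: client.jobs and its methods are assumptions made
# for illustration; they are not an existing llama-stack-client API.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# A provider call returns a Job handle immediately instead of blocking
# until the long running work (e.g. a training run) is complete.
job = client.post_training.supervised_fine_tune(
    model="example-model",            # arguments are illustrative only
    training_config={"n_epochs": 1},
)

# The generic Jobs API is then used to monitor and manage the Job.
status = client.jobs.status(job.uuid)        # e.g. "scheduled", "running", ...
artifacts = client.jobs.artifacts(job.uuid)  # artifacts of a completed Job
client.jobs.cancel(job.uuid)                 # stop a running Job
for event in client.jobs.tail(job.uuid):     # attach and watch Job activity
    print(event)
```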
“REST” API
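As a rough illustration only, the corresponding HTTP surface could look something like the sketch below; the exact paths, verbs, and payloads are assumptions based on the phases described later, not a finalized spec.

```python
# Hypothetical sketch of the proposed Jobs endpoints, exercised with plain
# HTTP calls; the paths and payloads are assumptions, not a finalized spec.
import requests

BASE = "http://localhost:8321/v1"
job_uuid = "00000000-0000-0000-0000-000000000000"  # placeholder

requests.get(f"{BASE}/jobs")                       # list known Jobs
requests.get(f"{BASE}/jobs/{job_uuid}")            # status of a single Job
requests.get(f"{BASE}/jobs/{job_uuid}/artifacts")  # artifacts of a completed Job
requests.post(f"{BASE}/jobs/{job_uuid}/cancel")    # cancel a running Job
requests.delete(f"{BASE}/jobs/{job_uuid}")         # remove a Job record

# Tail API (Phase 2): attach to a Job and stream its activity.
with requests.get(f"{BASE}/jobs/{job_uuid}/tail", stream=True) as resp:
    for line in resp.iter_lines():
        print(line)
```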
Implementation
Scheduler
The Scheduler will run as long as the llama-server process runs. It is not a separate service and will not maintain its own lifecycle. E.g. it may be implemented as an async loop that receives requests from providers to run Jobs, decides which Jobs to run at any particular time, establishes duplex communication between Jobs on one hand and the execution environment on the other, monitors their health, reports Job status updates, resumes Jobs as needed, harvests logs, etc.
The Scheduler is meant to be a common backend service available to all providers that need long running job execution, including remote providers.
Interface
To represent a particular Job, a separate class will be introduced. The provider that needs to run a long running task will pass a Job handler to the Scheduler to execute. The Scheduler will initialize a Job object used to track its execution.
The main handler for a Job will receive a set of callbacks that the Job implementation can trigger to communicate with the Scheduler. These are useful to be able to update the Scheduler about a status change, or new logs, artifacts, etc. (TBD exact set of callbacks needed.)
The Python scheduler interface with providers is sketched out in this branch. (See the `scheduler.py` file and `post_training.py` to understand what the interaction would look like.) Whenever a provider needs to start a Job, it can, conceptually, do something like the sketch below:
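(A minimal hypothetical sketch; the `Scheduler` and `Job` classes and the callback names below are assumptions modeled on the description above, not the code from the referenced branch.)

```python
# Hypothetical sketch of the provider-facing Scheduler interface; the class
# shapes and callback names are assumptions, not the referenced branch code.
import asyncio
import uuid as uuid_lib
from dataclasses import dataclass, field


@dataclass
class Job:
    uuid: str = field(default_factory=lambda: str(uuid_lib.uuid4()))
    status: str = "scheduled"
    logs: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)
    closure: dict = field(default_factory=dict)  # see "Resuming a Job" below


class Scheduler:
    """Toy in-process scheduler that runs Job handlers as asyncio tasks."""

    def schedule(self, handler) -> Job:
        job = Job()

        # Callbacks the Job handler triggers to communicate with the Scheduler.
        def on_status_change(status: str) -> None:
            job.status = status

        def on_log_message(message: str) -> None:
            job.logs.append(message)

        def on_artifact(artifact: dict) -> None:
            job.artifacts.append(artifact)

        asyncio.create_task(handler(on_status_change, on_log_message, on_artifact))
        return job


# A provider that needs to start a long running task could, conceptually, do:
async def finetune_handler(on_status_change, on_log_message, on_artifact):
    on_status_change("running")
    on_log_message("starting the training run")
    await asyncio.sleep(0)  # stand-in for the actual long running work
    on_artifact({"type": "checkpoint", "path": "/tmp/checkpoint-0"})
    on_status_change("completed")


async def handle_training_request(scheduler: Scheduler) -> str:
    job = scheduler.schedule(finetune_handler)
    return job.uuid  # handed back to the API caller immediately
```

The important property is that `schedule()` returns immediately with a Job carrying a UUID, while the handler keeps running in the background.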
It is assumed that all providers will schedule their long running tasks through this interface, which will guarantee a consistent user experience (e.g. a consistent Job UUID format, accessibility of job status and artifacts via a global Jobs API, the ability to cancel or remove Jobs, etc.).
Depending on the type of provider and the job needed to maintain a backend workload, the actual implementation of the job handler may be as simple as waiting for a provider job ID to complete on a remote service endpoint (for remote providers).
Backend
This proposal doesn’t assume a particular scheduler backend, as long as it can implement the Scheduler interface. Each provider will be able to choose a particular scheduler backend as needed (with the caveat that a new backend should only be introduced when there’s a good reason not to reuse an existing one).
The particular backends of choice are to be decided. Possible options include:
The particular backend will be hidden behind the Scheduler facade and will be designed to be swappable (driven by the configuration file).
I expect the first scheduler implementation will either be completely in-tree or use a lightweight library to schedule tasks. We’ll need to look more closely at other available options to make a decision.
Some of the backends listed may support non-local executor modes (e.g. kubernetes). While the scheduler described here is local, it may be worth it to eventually align with other needs. For example, the ongoing Pipelines API discussion may require a remote executor to run pipelines, and if these other initiatives have a preference for a remote executor backend, it may be worth considering using the same backend technology for the local Job Scheduler too.
As long as providers use the Scheduler interface to schedule and monitor jobs, their backend choice is not constrained by policy.
State Persistence
To survive a `llama-server` restart or node reboot, the job scheduler state should be persisted by the backend. On start, all running Jobs are assumed to be Not Running. The Scheduler will attempt to Resume each of them. If Resume is not supported or fails, the Job is deemed Failed.
Each backend will have its own persistence mechanism. For a “naive” scheduler implementation, a simple local file with a lock should suffice.
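As an illustration of that naive approach, persistence could be as small as the sketch below (the file layout and the use of the third-party `filelock` package are assumptions):

```python
# Sketch of "naive" scheduler state persistence: a JSON file guarded by a
# lock file. The paths and the filelock dependency are assumptions.
import json
from filelock import FileLock  # third-party package, assumed for this sketch

STATE_FILE = "/var/lib/llama-stack/scheduler_state.json"
LOCK_FILE = STATE_FILE + ".lock"


def save_state(jobs: dict) -> None:
    """Persist the Job table so it survives a llama-server restart."""
    with FileLock(LOCK_FILE):
        with open(STATE_FILE, "w") as f:
            json.dump(jobs, f)


def load_state() -> dict:
    """Load the persisted Job table on start; empty if nothing was saved."""
    with FileLock(LOCK_FILE):
        try:
            with open(STATE_FILE) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
```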
Resuming a Job
If a provider wants a Job to be resumable, it has to cooperate with the Scheduler as follows:

- Use the Job’s `.closure` attribute to capture any input that will be necessary to resume the Job after a server restart. The input should be serializable to JSON. (It’s expected that the input will be a subset of the arguments that the provider received during the initial call that triggered a new Job definition.)
- Implement an `on_resume` method. This method will be triggered by the Scheduler on server start for each Job that is Scheduled or Not Running. The method will receive the `.closure` attribute as input. The `on_resume` handler is expected to return a Job object that will be put back into the queue to be scheduled (with the Job UUID retained).

For example:
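A hypothetical sketch, reusing the toy `Scheduler` and `Job` classes from the Interface section; the `.closure` handling and the `on_resume` signature follow the description above but are otherwise assumptions:

```python
# Hypothetical sketch of a resumable Job; reuses the toy Scheduler/Job
# classes sketched in the Interface section. Signatures are assumptions.
class ResumableTrainingProvider:
    def __init__(self, scheduler):
        self.scheduler = scheduler

    async def _train(self, model, on_status_change, on_log_message, on_artifact):
        on_status_change("running")
        on_log_message(f"training {model}")
        on_status_change("completed")

    def start_training(self, model: str):
        job = self.scheduler.schedule(
            lambda *callbacks: self._train(model, *callbacks)
        )
        # JSON-serializable input needed to recreate the Job after a restart.
        job.closure = {"model": model}
        return job

    def on_resume(self, closure: dict):
        # Triggered by the Scheduler on server start for each Job that is
        # Scheduled or Not Running; the returned Job is put back into the
        # queue (the Scheduler retains the original Job UUID).
        return self.start_training(closure["model"])
```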
Procedure
Phases
Phase 0 - Introduce a new Jobs API: update server API definitions; update client code. Allow the Job UUID to be omitted on the `post-training` call. Write a testing stub provider that uses the Scheduler interface. Implement a test set utilizing the stub. At this point a distro can be built to enable the `/jobs/` endpoint, and requests can execute against it (though no actual meaningful work can be done).

Phase 1 - Implement the Scheduler “naive” backend: use it for an existing provider (`torchtune`?). Switch the stub test provider to use it. At this point the `torchtune` provider should return a Job UUID to the API caller before the training is complete. The user is now able to consult the Jobs API to check the Job status, and to delete or cancel a Job. They can also pull artifacts for a completed Job.

Phase 2 - Implement the Tail API: the user is now able to attach to a Job via the API and watch its activity (log messages, any other events).

Phase 3 - Implement Resuming: TBD whether this can be implemented for an existing provider; if so, do that. Expand the testing stub provider to use it too.

Phase 4 - Clean up the old API: (tech debt) remove the `job_uuid` attribute from `post-training` endpoints (and any other endpoints that may have it). The user now always relies on the generic Jobs mechanisms to generate and track Job UUIDs.

Phase 5 - Job dependencies: implement dependencies (a `depends` field when POSTing a Job) to require that another Job has completed before proceeding.

Risks
The project is aware of a need for some form of asynchronous job management, e.g. as noted in this TODO.
Potential contentions can be anticipated, for example:

- whether to keep `job_uuid` as an input field for `post-training`;
- whether Jobs should be served from `/v1/jobs/` and not `/v1/post-training/jobs/`.