Replies: 1 comment

The draft implementation of this (very incomplete and buggy) can be found at: booxter#1 (posted as a PR in my fork for an easier reading experience); hopefully it's helpful when trying to make sense of the write-up above.
Hi all.
This is a proposal to streamline jobs management through the API, as well as to give providers tools to manage asynchronous and even remote jobs without blocking access to the API.
I converted it from a Google Document using its Markdown copy-paste feature. I hope it worked (it seems to), but if you find any issues, please let me know and I will confirm whether they are conversion problems.
Context
Providers need to be able to run long running tasks (jobs) asynchronously, letting the user interact with the API, monitor the status of the job, etc. This is currently not possible, though asynchronous behavior is modeled in *some* Llama Stack APIs.
For example, this results in API client timeouts when executing training actions against the `torchtune` provider. This is because the provider doesn’t return control to the API caller until the training run is complete, which may take longer than the API timeout. This proposal attempts to address this issue of blocking calls to inline providers. It also attempts to streamline the API used to interact with long running Jobs.
Note: the proposal does not attempt to force all providers to use the same scheduler backend; the choice of a backend per provider is baked into the proposed design. That said, the proposal attempts to build a common jobs interface and useful tools to work with long running tasks (jobs) that can be shared by providers. The actual adoption of the common patterns will depend on the usefulness of the proposed tools.
Note: the proposal doesn’t affect APIs that do not implement Jobs behavior. If an API doesn’t need to return a Job entity to the user for later status monitoring, artifact extraction, etc., then this document doesn’t apply to it at all.
Terminology
Requirements
Out of Scope
UX
Jobs are a general concept that can be used by different APIs. We’d like to have a uniform way of dealing with jobs. Hence the proposal to introduce a new `/jobs` endpoint to llama-server.

Note: a new `/jobs` API is not aligned with the current approach to managing jobs as declared for the `post-training` API, where a job status is extracted using the `post-training` API itself. This proposal suggests that we deprecate and retire the `post-training` jobs API endpoints, but it doesn’t require this to happen in the near future (or at all).

Additional Job states may be introduced to cover additional Scheduler states we may need (e.g. “new” or “cancelled” or “scheduled” or “paused”...).
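To make this concrete, the consolidated state set could be modeled as a single shared enum along the lines of the sketch below; the member names beyond those mentioned above are assumptions rather than a finalized list.

```python
from enum import Enum


class JobStatus(str, Enum):
    """Hypothetical consolidated Job state set; exact members are assumptions."""

    new = "new"              # accepted but not yet queued
    scheduled = "scheduled"  # queued, waiting for the Scheduler to pick it up
    running = "running"
    paused = "paused"
    completed = "completed"
    failed = "failed"
    cancelled = "cancelled"
```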
Python SDK
From a Python SDK user’s perspective, the interaction with the server could look as follows:
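A rough, hypothetical sketch of such an interaction (the `client.jobs` resource, its method names, and the returned fields are assumptions, not an existing llama-stack-client surface):

```python
# Hypothetical sketch: client.jobs and its methods are assumptions made
# for illustration; they are not an existing llama-stack-client API.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# A provider call returns a Job handle immediately instead of blocking
# until the long running work (e.g. a training run) is complete.
job = client.post_training.supervised_fine_tune(
    model="example-model",            # arguments are illustrative only
    training_config={"n_epochs": 1},
)

# The generic Jobs API is then used to monitor and manage the Job.
status = client.jobs.status(job.uuid)        # e.g. "scheduled", "running", ...
artifacts = client.jobs.artifacts(job.uuid)  # artifacts of a completed Job
client.jobs.cancel(job.uuid)                 # stop a running Job
for event in client.jobs.tail(job.uuid):     # attach and watch Job activity
    print(event)
```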
“REST” API
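As a rough illustration only, the corresponding HTTP surface could look something like the sketch below; the exact paths, verbs, and payloads are assumptions based on the phases described later, not a finalized spec.

```python
# Hypothetical sketch of the proposed Jobs endpoints, exercised with plain
# HTTP calls; the paths and payloads are assumptions, not a finalized spec.
import requests

BASE = "http://localhost:8321/v1"
job_uuid = "00000000-0000-0000-0000-000000000000"  # placeholder

requests.get(f"{BASE}/jobs")                       # list known Jobs
requests.get(f"{BASE}/jobs/{job_uuid}")            # status of a single Job
requests.get(f"{BASE}/jobs/{job_uuid}/artifacts")  # artifacts of a completed Job
requests.post(f"{BASE}/jobs/{job_uuid}/cancel")    # cancel a running Job
requests.delete(f"{BASE}/jobs/{job_uuid}")         # remove a Job record

# Tail API (Phase 2): attach to a Job and stream its activity.
with requests.get(f"{BASE}/jobs/{job_uuid}/tail", stream=True) as resp:
    for line in resp.iter_lines():
        print(line)
```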
Implementation
Scheduler
The Scheduler will run as long as the llama-server process runs. It is not a separate service and will not maintain its own lifecycle. E.g. it may be implemented as an async loop that receives requests from providers to run Jobs, decides which Jobs to run at any particular time, establishes duplex communication between Jobs on one hand and the execution environment on the other, monitors their health, reports Job status updates, resumes Jobs as needed, harvests logs, etc.
The Scheduler is meant to be a common backend service available to all providers that need long running job execution, including remote providers.
Interface
To represent a particular Job, a separate class will be introduced. The provider that needs to run a long running task will pass a Job handler to the Scheduler to execute. The Scheduler will initialize a Job object used to track its execution.
The main handler for a Job will receive a set of callbacks that the Job implementation can trigger to communicate with the Scheduler. These are useful to be able to update the Scheduler about a status change, or new logs, artifacts, etc. (TBD exact set of callbacks needed.)
The Python scheduler interface with providers is sketched out in this branch. (See the `scheduler.py` file and `post_training.py` to understand what the interaction would look like.) Whenever a provider needs to start a Job, it can, conceptually, do something like the sketch below:
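(A minimal hypothetical sketch; the `Scheduler` and `Job` classes and the callback names below are assumptions modeled on the description above, not the code from the referenced branch.)

```python
# Hypothetical sketch of the provider-facing Scheduler interface; the class
# shapes and callback names are assumptions, not the referenced branch code.
import asyncio
import uuid as uuid_lib
from dataclasses import dataclass, field


@dataclass
class Job:
    uuid: str = field(default_factory=lambda: str(uuid_lib.uuid4()))
    status: str = "scheduled"
    logs: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)
    closure: dict = field(default_factory=dict)  # see "Resuming a Job" below


class Scheduler:
    """Toy in-process scheduler that runs Job handlers as asyncio tasks."""

    def schedule(self, handler) -> Job:
        job = Job()

        # Callbacks the Job handler triggers to communicate with the Scheduler.
        def on_status_change(status: str) -> None:
            job.status = status

        def on_log_message(message: str) -> None:
            job.logs.append(message)

        def on_artifact(artifact: dict) -> None:
            job.artifacts.append(artifact)

        asyncio.create_task(handler(on_status_change, on_log_message, on_artifact))
        return job


# A provider that needs to start a long running task could, conceptually, do:
async def finetune_handler(on_status_change, on_log_message, on_artifact):
    on_status_change("running")
    on_log_message("starting the training run")
    await asyncio.sleep(0)  # stand-in for the actual long running work
    on_artifact({"type": "checkpoint", "path": "/tmp/checkpoint-0"})
    on_status_change("completed")


async def handle_training_request(scheduler: Scheduler) -> str:
    job = scheduler.schedule(finetune_handler)
    return job.uuid  # handed back to the API caller immediately
```

The important property is that `schedule()` returns immediately with a Job carrying a UUID, while the handler keeps running in the background.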
It is assumed that all providers will schedule their long running tasks through this interface, which will guarantee a consistent user experience (e.g. a consistent Job UUID format, accessibility of job status and artifacts via a global Jobs API, the ability to cancel or remove Jobs, etc.).
Depending on the type of provider and the job needed to maintain a backend workload, the actual implementation of the job handler may be as simple as waiting for a provider job ID to complete on a remote service endpoint (for remote providers).
Backend
This proposal doesn’t assume a particular scheduler backend, as long as it can implement the Scheduler interface. Each provider will be able to choose a particular scheduler backend as needed (with the caveat that a new backend should only be introduced when there’s a good reason not to reuse an existing one).
The particular backends of choice are to be decided. Possible options include:
The particular backend will be hidden behind the Scheduler facade and will be designed to be swappable (driven by the configuration file).
I expect the first scheduler implementation will either be completely in-tree or use a lightweight library to schedule tasks. We’ll need to look more closely at other available options to make a decision.
Some of the backends listed may support non-local executor modes (e.g. kubernetes). While the scheduler described here is local, it may be worth it to eventually align with other needs. For example, the ongoing Pipelines API discussion may require a remote executor to run pipelines, and if these other initiatives have a preference for a remote executor backend, it may be worth considering using the same backend technology for the local Job Scheduler too.
As long as providers use the Scheduler interface to schedule and monitor jobs, their backend choice is not constrained by policy.
State Persistence
To survive a `llama-server` restart or node reboot, the job scheduler state should be persisted by the backend. On start, all running Jobs are assumed to be Not Running. The Scheduler will attempt to Resume each of them. If Resume is not supported or fails, the Job is deemed Failed.
Each backend will have its own persistence mechanism. For a “naive” scheduler implementation, a simple local file with a lock should suffice.
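As an illustration of that naive approach, persistence could be as small as the sketch below (the file layout and the use of the third-party `filelock` package are assumptions):

```python
# Sketch of "naive" scheduler state persistence: a JSON file guarded by a
# lock file. The paths and the filelock dependency are assumptions.
import json
from filelock import FileLock  # third-party package, assumed for this sketch

STATE_FILE = "/var/lib/llama-stack/scheduler_state.json"
LOCK_FILE = STATE_FILE + ".lock"


def save_state(jobs: dict) -> None:
    """Persist the Job table so it survives a llama-server restart."""
    with FileLock(LOCK_FILE):
        with open(STATE_FILE, "w") as f:
            json.dump(jobs, f)


def load_state() -> dict:
    """Load the persisted Job table on start; empty if nothing was saved."""
    with FileLock(LOCK_FILE):
        try:
            with open(STATE_FILE) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
```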
Resuming a Job
If a provider wants a Job to be resumable, it has to cooperate with the Scheduler as follows:

- Use the Job’s `.closure` attribute to capture any input that will be necessary to resume the Job after a server restart. The input should be serializable to JSON. (It’s expected that the input will be a subset of the arguments that the provider received during the initial call that triggered a new Job definition.)
- Implement an `on_resume` method. This method will be triggered by the Scheduler on server start for each Job that is Scheduled or Not Running. The method will receive the `.closure` attribute as input. The `on_resume` handler is expected to return a Job object that will be put back into the queue to be scheduled (with the Job UUID retained).

For example:
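A hypothetical sketch, reusing the toy `Scheduler` and `Job` classes from the Interface section; the `.closure` handling and the `on_resume` signature follow the description above but are otherwise assumptions:

```python
# Hypothetical sketch of a resumable Job; reuses the toy Scheduler/Job
# classes sketched in the Interface section. Signatures are assumptions.
class ResumableTrainingProvider:
    def __init__(self, scheduler):
        self.scheduler = scheduler

    async def _train(self, model, on_status_change, on_log_message, on_artifact):
        on_status_change("running")
        on_log_message(f"training {model}")
        on_status_change("completed")

    def start_training(self, model: str):
        job = self.scheduler.schedule(
            lambda *callbacks: self._train(model, *callbacks)
        )
        # JSON-serializable input needed to recreate the Job after a restart.
        job.closure = {"model": model}
        return job

    def on_resume(self, closure: dict):
        # Triggered by the Scheduler on server start for each Job that is
        # Scheduled or Not Running; the returned Job is put back into the
        # queue (the Scheduler retains the original Job UUID).
        return self.start_training(closure["model"])
```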
Procedure
Phases
Phase 0 - Introduce a new Jobs API: update server API definitions; update client code. Allow the Job UUID to be omitted on the `post-training` call. Write a testing stub provider that uses the Scheduler interface. Implement a test set utilizing the stub. At this point a distro can be built to enable the `/jobs/` endpoint, and requests can execute against it (though no actual meaningful work can be done).

Phase 1 - Implement the Scheduler “naive” backend: use it for an existing provider (`torchtune`?). Switch the stub test provider to use it. At this point the `torchtune` provider should return a Job UUID to the API caller before the training is complete. The user is now able to consult the Jobs API to check the Job status, and to delete or cancel a Job. They can also pull artifacts for a completed Job.

Phase 2 - Implement the Tail API: the user is now able to attach to a Job via the API and watch its activity (log messages, any other events).

Phase 3 - Implement Resuming: TBD whether this can be implemented for an existing provider; if so, do that. Expand the testing stub provider to use it too.

Phase 4 - Clean up the old API: (tech debt) remove the `job_uuid` attribute from `post-training` endpoints (and any other endpoints that may have it). The user now always relies on the generic Jobs mechanisms to generate and track Job UUIDs.

Phase 5 - Job dependencies: implement dependencies (a `depends` field when POSTing a Job) to require that another Job has completed before proceeding.

Risks
The project is aware of a need for some form of asynchronous job management, e.g. as noted in this TODO.
Potential contentions can be anticipated, for example:

- whether to keep `job_uuid` as an input field for `post-training`;
- whether Jobs should be served from `/v1/jobs/` and not `/v1/post-training/jobs/`.