Large blob (e.g. model checkpoint) management for stateless LLS client #1130
-
The third and possibly most common scenario is networked storage on the same internal network. In Kubernetes/OpenShift land, this is often handled with PVCs (i.e. networked storage mounted on the local filesystem). That network-attached storage may or may not be backed by cloud storage. Where it is, the data transfer generally never goes over public pipes, either because the application runtime and storage are co-located, or because the backing object storage has the same multi-zone replication configuration as the application itself and service discovery / DNS stays within-zone. Security and data privacy concerns alone would, I'd guess, stop most production implementations from always downloading datasets and models over the public internet, which would indeed be the slowest, least reliable, and most expensive way to do it. In a cloud-native application, all of these concerns (should) effectively disappear. (And, as you point out, for a single instance of a standalone application running on a single machine, these concerns usually never come up in the first place.)
-
I think this is relevant: #1070?
-
@JamesKunstle Seems to me that the remote provider case adds another level of complexity to the artifact management story.
-
One of the philosophical objectives of LLS, stated in the documentation, is that it is "Service-Oriented: REST APIs enforce clean interfaces and enable seamless transitions across different environments." For this to work in practice, the LLS server must be able to manage large data artifacts in a trackable, automated way.
LLS should be a black box from the user's perspective. There ought to be a division of responsibility between logical artifact management and cache storage management.
An anti-pattern is present for the `post_training/supervised_fine_tuning` API via the Torchtune inline provider. Two blobs are pulled from the cloud or from the host filesystem: a model checkpoint (currently Llama only) and a dataset. During fine-tuning, model checkpoints are written to the host filesystem. To recover the final model, a user has to manually extract a checkpoint from the tuning run from the host FS.

With this artifact management story, it's not possible to easily implement the following in the SDK:
Logical blob management
Assume that fine-tuning a model via LLS is entirely stateless. The LLS instance may maintain references to cloud storage locations of model checkpoints.
These model references could later be used by inline or remote providers.
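As a rough illustration of that stateless, reference-passing flow, here is a sketch of how chained SDK calls could look. Everything below (the method names, parameters, and the idea of passing object-storage URIs directly) is a hypothetical API shape for illustration, not the current llama-stack-client interface.

```python
# Hypothetical sketch: chaining fine-tuning, registration, and inference
# through a stateless LLS client, where only *references* to blobs flow
# through the API. All method/parameter names below are assumptions.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# 1. Start fine-tuning; inputs are referenced by URI, not uploaded.
job = client.post_training.supervised_fine_tune(
    job_uuid="sft-demo",
    model="s3://my-bucket/checkpoints/llama-3.2-1b/",       # assumed blob reference
    dataset_id="s3://my-bucket/datasets/sft-subset.jsonl",  # assumed blob reference
    training_config={"n_epochs": 1},
)

# 2. The finished job returns references to produced checkpoints rather than
#    writing them to the caller's filesystem.
artifacts = client.post_training.job.artifacts(job_uuid=job.job_uuid)  # assumed
checkpoint_uri = artifacts.checkpoints[-1].uri

# 3. Register the new checkpoint so inference providers can resolve it later.
client.models.register(
    model_id="llama-3.2-1b-sft-demo",
    provider_model_id=checkpoint_uri,   # assumed: the provider resolves the blob
)

# 4. Use the tuned model without ever touching the host disk directly.
client.inference.chat_completion(
    model_id="llama-3.2-1b-sft-demo",
    messages=[{"role": "user", "content": "Hello from the tuned model"}],
)
```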
This naive workflow is fine: it allows for logically chaining SDK invocations without packing the host disk with checkpoints. However, it has a couple of problems, one of which concerns how outputs are tracked via the `job_id` metadata key.

A possible direction: a `modelsio`, `blobio`, or `artifactio` provider type associated with a "frontend" resource (e.g., the `models` resource could have a `blobio` remote provider that handles pulling, uncompressing, and converting pulled blobs).
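To make the `blobio` idea a bit more concrete, here is a minimal sketch of what such a provider interface might look like, assuming a Protocol-style spec. The names (`BlobIO`, `BlobRef`, the method signatures) are hypothetical, not an existing LLS API.

```python
# Hypothetical sketch of a "blobio" provider protocol. Names, methods, and the
# BlobRef dataclass are illustrative assumptions, not an existing LLS API.
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class BlobRef:
    """Logical handle to a large artifact (checkpoint, dataset, ...)."""
    uri: str                 # e.g. "s3://bucket/checkpoints/llama-3.2-1b.tar.gz"
    media_type: str          # e.g. "application/x-model-checkpoint"
    checksum: str | None = None


class BlobIO(Protocol):
    """Backend responsibility: move bytes and keep the local cache honest."""

    async def pull(self, ref: BlobRef) -> Path:
        """Download (or reuse from cache), uncompress, and convert the blob;
        return a local path an inline provider can consume."""
        ...

    async def push(self, local_path: Path, dest_uri: str) -> BlobRef:
        """Upload a locally produced artifact (e.g. a fine-tuned checkpoint)
        and return a reference the frontend resource can store."""
        ...

    async def evict(self, ref: BlobRef) -> None:
        """Drop the cached copy; the logical reference stays valid."""
        ...
```

The frontend resource (e.g. `models`) would store only `BlobRef`s, while the `blobio` provider owns the pull/uncompress/convert/evict lifecycle, matching the logical-versus-cache split described above.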
Discussion

This is currently an unaddressed problem space. There are many levels of sophistication possible, from fully cloud-object-storage-backed (no local caching) to intelligent blob eviction. How should this be handled in a way that fits with the LLS philosophy?
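On the "intelligent blob eviction" end of that spectrum, here is a minimal sketch of size-bounded, least-recently-used eviction for a local blob cache; the cache path and size budget are arbitrary placeholders.

```python
# Minimal sketch: evict least-recently-used files from a local blob cache
# until it fits under a size budget. Path and budget are placeholders.
from pathlib import Path

CACHE_DIR = Path("~/.cache/lls/blobs").expanduser()  # hypothetical cache location
MAX_CACHE_BYTES = 50 * 1024**3                       # 50 GiB budget


def evict_lru(cache_dir: Path = CACHE_DIR, max_bytes: int = MAX_CACHE_BYTES) -> None:
    files = [p for p in cache_dir.rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    # Oldest access time first, i.e. least recently used first.
    for path in sorted(files, key=lambda p: p.stat().st_atime):
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
```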