Large blob (e.g. model checkpoint) management for stateless LLS client #1130
-
The third and possibly most common scenario is networked storage on the same internal network. In Kubernetes/OpenShift land, this is often handled with PVCs (i.e. networked storage mounted on the local filesystem). That network-attached storage may or may not be backed by cloud storage. Where it is, the data transfer generally never goes over public pipes, either because the application runtime and storage are co-located, or because the backing object storage has the same multi-zone replication configuration as the application itself and service discovery / DNS stays within-zone. Security and data privacy concerns alone would, I'd guess, stop most production implementations from always downloading datasets and models over the public internet, which would indeed be the slowest, least reliable, and most expensive way to do it. In a cloud-native application, all of these concerns (should) effectively disappear. (And, as you point out, for a single instance of a standalone application running on a single machine, these concerns usually never come up in the first place.)
-
I think this is relevant: #1070?
-
@JamesKunstle Seems to me that the remote provider case adds another level of complexity to the artifact management story.
-
One of the philosophical objectives of LLS, stated in the documentation, is that it is "Service-Oriented: REST APIs enforce clean interfaces and enable seamless transitions across different environments." For this to work in practice, the LLS server must be able to manage large data artifacts in a trackable, automated way.
LLS should be a black box from the user's perspective. There ought to be a division of responsibility between logical artifact management and cache storage management.
An anti-pattern is present for the `post_training/supervised_fine_tuning` API via the Torchtune inline provider. Two blobs are pulled from the cloud or from the host filesystem: a model checkpoint (currently Llama only) and a dataset. During fine-tuning, model checkpoints are written to the host filesystem. To recover the final model, a user has to manually extract a checkpoint from the tuning run from the host FS.

With this artifact management story, it's not possible to easily implement the following in the SDK:
Logical blob management
Assume that fine-tuning a model via LLS is entirely stateless. The LLS instance may maintain references to cloud storage locations of model checkpoints.
These model references could later be used by inline or remote providers.
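As a rough illustration of that stateless, reference-passing flow, here is a sketch of how chained SDK calls could look. Everything below (the method names, parameters, and the idea of passing object-storage URIs directly) is a hypothetical API shape for illustration, not the current llama-stack-client interface.

```python
# Hypothetical sketch: chaining fine-tuning, registration, and inference
# through a stateless LLS client, where only *references* to blobs flow
# through the API. All method/parameter names below are assumptions.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# 1. Start fine-tuning; inputs are referenced by URI, not uploaded.
job = client.post_training.supervised_fine_tune(
    job_uuid="sft-demo",
    model="s3://my-bucket/checkpoints/llama-3.2-1b/",       # assumed blob reference
    dataset_id="s3://my-bucket/datasets/sft-subset.jsonl",  # assumed blob reference
    training_config={"n_epochs": 1},
)

# 2. The finished job returns references to produced checkpoints rather than
#    writing them to the caller's filesystem.
artifacts = client.post_training.job.artifacts(job_uuid=job.job_uuid)  # assumed
checkpoint_uri = artifacts.checkpoints[-1].uri

# 3. Register the new checkpoint so inference providers can resolve it later.
client.models.register(
    model_id="llama-3.2-1b-sft-demo",
    provider_model_id=checkpoint_uri,   # assumed: the provider resolves the blob
)

# 4. Use the tuned model without ever touching the host disk directly.
client.inference.chat_completion(
    model_id="llama-3.2-1b-sft-demo",
    messages=[{"role": "user", "content": "Hello from the tuned model"}],
)
```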
This naive workflow is fine: it allows for logically chaining SDK invocations without packing the host disk with checkpoints. However, it has a couple of problems, one of which concerns how outputs are tracked via the `job_id` metadata key.

A possible direction: a `modelsio`, `blobio`, or `artifactio` provider type associated with a "frontend" resource (e.g., the `models` resource could have a `blobio` remote provider that handles pulling, uncompressing, and converting pulled blobs).
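To make the `blobio` idea a bit more concrete, here is a minimal sketch of what such a provider interface might look like, assuming a Protocol-style spec. The names (`BlobIO`, `BlobRef`, the method signatures) are hypothetical, not an existing LLS API.

```python
# Hypothetical sketch of a "blobio" provider protocol. Names, methods, and the
# BlobRef dataclass are illustrative assumptions, not an existing LLS API.
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class BlobRef:
    """Logical handle to a large artifact (checkpoint, dataset, ...)."""
    uri: str                 # e.g. "s3://bucket/checkpoints/llama-3.2-1b.tar.gz"
    media_type: str          # e.g. "application/x-model-checkpoint"
    checksum: str | None = None


class BlobIO(Protocol):
    """Backend responsibility: move bytes and keep the local cache honest."""

    async def pull(self, ref: BlobRef) -> Path:
        """Download (or reuse from cache), uncompress, and convert the blob;
        return a local path an inline provider can consume."""
        ...

    async def push(self, local_path: Path, dest_uri: str) -> BlobRef:
        """Upload a locally produced artifact (e.g. a fine-tuned checkpoint)
        and return a reference the frontend resource can store."""
        ...

    async def evict(self, ref: BlobRef) -> None:
        """Drop the cached copy; the logical reference stays valid."""
        ...
```

The frontend resource (e.g. `models`) would store only `BlobRef`s, while the `blobio` provider owns the pull/uncompress/convert/evict lifecycle, matching the logical-versus-cache split described above.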
Discussion

This is currently an unaddressed problem space. There are many levels of sophistication possible, from fully cloud-object-storage-backed (no local caching) to intelligent blob eviction. How should this be handled in a way that fits with the LLS philosophy?
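On the "intelligent blob eviction" end of that spectrum, here is a minimal sketch of size-bounded, least-recently-used eviction for a local blob cache; the cache path and size budget are arbitrary placeholders.

```python
# Minimal sketch: evict least-recently-used files from a local blob cache
# until it fits under a size budget. Path and budget are placeholders.
from pathlib import Path

CACHE_DIR = Path("~/.cache/lls/blobs").expanduser()  # hypothetical cache location
MAX_CACHE_BYTES = 50 * 1024**3                       # 50 GiB budget


def evict_lru(cache_dir: Path = CACHE_DIR, max_bytes: int = MAX_CACHE_BYTES) -> None:
    files = [p for p in cache_dir.rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    # Oldest access time first, i.e. least recently used first.
    for path in sorted(files, key=lambda p: p.stat().st_atime):
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
```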