feat(remote_model): support variable remote backend for model loader #13809
Conversation
If the number of ranks is different, do we need to store multiple copies of the model, one per rank?
For a file-like remote backend such as s3, it does not matter. For a KV database, we can simply regard models saved with different rank counts as different models. Here is an example: […] Then, run […]. The reason why we do this is […] (inspired by […]).
Thanks, got it.
This pull request has merge conflicts that must be resolved before it can be merged.
This PR relates to #12250.
Background
Currently, the most common way to load a model is from local disk, which means the user must first download the model files from HF or cloud storage to the local machine. This obviously wastes a lot of time, especially for huge models.
Of course, there are ways to load directly from remote storage, such as a remote filesystem like NFS, or using `runai-model-streamer` to load safetensor files from s3 or s3-compatible storage. Those methods have their own drawbacks in network speed and flexibility. For example, `runai-model-streamer` has several environment variables to configure (there are many issues about this in vLLM and in its own repo), and a company with its own remote storage such as HDFS cannot use `runai-model-streamer` at all.

Besides, some organizations want to use a KV database such as Redis to accelerate model loading. Our team has implemented an RDMA-based KV database which is much faster (benchmark figure omitted).

What this PR does
In order to provide more flexibility, I add a new `ModelLoader` class named `RemoteModelLoader` and introduce a new module named `Connector`. `RemoteModelLoader` creates a `Connector` as its member: it first builds the model and then fetches the weight tensors one by one from the `Connector`. `Connector` has two types: `KV` for KV databases and `FS` for remote file storage. Both types must implement `weight_iterator()` to yield weight tensors and `pull_files()` to download the model config files. I have implemented `RedisConnector` as an example (most of the serde part is copied from LMCache) and moved most of the original `S3Model` into `S3Connector`. To keep integrity, the original `RunaiModelStreamerLoader` is reserved for the local filesystem only. `Connector` could also be used for a remote prefix cache in the future, as LMCache does.
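To make the shape of the new module concrete, here is a minimal sketch of what a `Connector` could look like. Only the two connector roles and the `weight_iterator()` / `pull_files()` methods come from the description above; the class layout and the `load_weights` helper are illustrative assumptions, not the code in this PR.

```python
# Minimal sketch of the Connector abstraction -- names other than
# weight_iterator() and pull_files() are illustrative, not the PR's code.
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

import torch


class Connector(ABC):
    """A remote backend that RemoteModelLoader pulls weights and configs from."""

    def __init__(self, url: str):
        # e.g. "s3://bucket/path/model/" or "redis://host:port/model-name"
        self.url = url

    @abstractmethod
    def weight_iterator(self) -> Iterator[Tuple[str, torch.Tensor]]:
        """Yield (parameter_name, tensor) pairs one by one."""
        ...

    @abstractmethod
    def pull_files(self, dst_dir: str) -> None:
        """Download non-weight files (config.json, tokenizer files, ...) to dst_dir."""
        ...


def load_weights(model: torch.nn.Module, connector: Connector) -> None:
    """Illustrative loading loop: copy each remote tensor into the built model."""
    params = dict(model.named_parameters())
    for name, tensor in connector.weight_iterator():
        if name in params:
            params[name].data.copy_(tensor)
```

With this shape, `RemoteModelLoader` only needs the URL scheme to pick a concrete connector, and the weights are streamed straight into the model without being staged on local disk first.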
Usage
For a file-like remote backend such as s3, just replace the `--model` argument:

```bash
vllm serve s3://bucketname/path/to/model/ --port 8000 -tp 4
```
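The PR description only shows the CLI; assuming the same "just replace the model argument" behavior carries over to the offline `LLM` API (an assumption, not something stated in this PR), usage would look roughly like this:

```python
# Sketch: offline inference with the model argument pointed at an s3 URL.
# The bucket/path and tensor_parallel_size are placeholders; whether the
# offline API accepts remote URLs the same way as `vllm serve` is assumed.
from vllm import LLM, SamplingParams

llm = LLM(model="s3://bucketname/path/to/model/", tensor_parallel_size=4)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```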
For a KV-like database such as Redis, the user needs to load the weights into the database first. I have already provided a script under examples/offline_inference (inspired by `ShardedStateLoader`):

```bash
python3 examples/offline_inference/save_remote_state.py --model /data01/models/Meta-Llama-3-8B/ --remote-model-save-url redis://IP:PORT/Meta-Llama-3-8B -tp 4
```
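For a feel of what such a save script does, here is a rough, hypothetical outline modeled on vLLM's existing save_sharded_state.py example. The `save_remote_state()` call is a placeholder for whatever the script in this PR actually invokes, not an existing vLLM API.

```python
# Hypothetical outline of "save model weights to a remote KV store".
# The real examples/offline_inference/save_remote_state.py may differ;
# save_remote_state() below is a placeholder method name.
import argparse

from vllm import LLM

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--remote-model-save-url", required=True)
parser.add_argument("-tp", "--tensor-parallel-size", type=int, default=1)
args = parser.parse_args()

# Build the model locally with the same tensor-parallel layout that will be
# used at serving time, then push each rank's weights to the remote backend.
llm = LLM(model=args.model, tensor_parallel_size=args.tensor_parallel_size)
llm.llm_engine.model_executor.save_remote_state(  # placeholder method name
    url=args.remote_model_save_url)
```

The key point is that the weights are saved per rank, which is why the database copy is tied to a specific `tp` value.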
After the tensors are loaded, replace the `--model` argument with the Redis URL and use the same `tp` value:

```bash
vllm serve redis://IP:PORT/Meta-Llama-3-8B -tp 4
```
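To sanity-check that the weights actually landed in Redis before serving, something like the following works; the key layout is an assumption (one key per tensor, keys containing the model name from the URL), since the serde format here follows LMCache.

```python
# Quick check that weight keys exist in Redis before starting `vllm serve`.
# Adjust the MATCH pattern to whatever layout RedisConnector actually uses.
import redis

client = redis.Redis(host="IP", port=6379)
keys = list(client.scan_iter(match="*Meta-Llama-3-8B*", count=1000))
print(f"found {len(keys)} keys, e.g. {keys[:3]}")
```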
You can introduce your own remote backend to `Connector`, such as HDFS, Amazon DynamoDB, etc.; a sketch of what that could look like is shown below.
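As an illustration of extending `Connector`, here is a hedged sketch of an FS-type connector for HDFS. The base-class shape matches the sketch earlier in this description; the pyarrow-based HDFS access, the URL layout, and the safetensors file layout are assumptions, not part of this PR.

```python
# Hypothetical FS-type connector for HDFS -- an extension example, not code
# from this PR. Uses pyarrow for HDFS access and assumes the model directory
# holds standard safetensors shards plus config/tokenizer files.
import os

from pyarrow import fs
from safetensors.torch import load as load_safetensors


class HDFSConnector(Connector):  # Connector as sketched above
    def __init__(self, url: str):
        super().__init__(url)
        # url like "hdfs://namenode:8020/path/to/model" (placeholder layout)
        _, rest = url.split("://", 1)
        authority, path = rest.split("/", 1)
        host, port = authority.split(":")
        self.base_dir = "/" + path
        self.hdfs = fs.HadoopFileSystem(host=host, port=int(port))

    def weight_iterator(self):
        # Stream each safetensors shard and yield its tensors one by one.
        for info in self.hdfs.get_file_info(fs.FileSelector(self.base_dir)):
            if info.path.endswith(".safetensors"):
                with self.hdfs.open_input_stream(info.path) as f:
                    for name, tensor in load_safetensors(f.read()).items():
                        yield name, tensor

    def pull_files(self, dst_dir: str) -> None:
        # Copy non-weight files (config.json, tokenizer files, ...) locally.
        for info in self.hdfs.get_file_info(fs.FileSelector(self.base_dir)):
            if info.type == fs.FileType.File and not info.path.endswith(".safetensors"):
                local_path = os.path.join(dst_dir, os.path.basename(info.path))
                with self.hdfs.open_input_stream(info.path) as src, open(local_path, "wb") as dst:
                    dst.write(src.read())
```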
TBD
I have not yet considered lint checks and unit tests in detail. If this PR proves to be helpful, I will fill in this part soon.