
feat(remote_model): support variable remote backend for model loader #13809

Open · wants to merge 1 commit into main
Conversation


@DellCurry commented Feb 25, 2025

This PR addresses #12250.

Background

Currently, one of the most common ways to load a model is from local disk, which means users must first download the model files from HF or cloud storage to local storage. This obviously wastes a lot of time, especially for huge models.

Of course, there are ways to load directly from remote storage, such as a remote filesystem like NFS, or using runai-model-streamer to load safetensors files from S3 or S3-compatible remote storage. These methods have their own drawbacks in network speed and flexibility. For example, runai-model-streamer requires configuring several environment variables (there are many issues about this in vLLM and in its own repo). And if a company uses its own remote storage such as HDFS, it cannot use runai-model-streamer at all.

Besides, some organizations want to use a KV database such as Redis to accelerate model loading. Our team has implemented an RDMA-based KV database which is much faster, as shown below:
[image: loading-speed benchmark of the RDMA-based KV database]

What this PR does

To provide more flexibility, this PR adds a new ModelLoader class named RemoteModelLoader and introduces a new module named Connector. RemoteModelLoader creates a Connector as its member; it first builds the model and then fetches the weight tensors one by one from the Connector.

Connector has two types: KV for KV databases and FS for remote file storage. Both types must implement weight_iterator() to yield weight tensors and pull_files() to download the model config files. I have implemented RedisConnector as an example (most of the serde part is copied from LMCache) and moved most of the original S3Model into S3Connector. To keep things intact, the original RunaiModelStreamerLoader is kept for the local filesystem only.
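
For reference, a minimal sketch of what this interface could look like. Only weight_iterator() and pull_files() are named in this PR; the surrounding structure (the ConnectorType enum, signatures, and the load_weights helper) is an assumption for illustration, not the actual implementation:

```python
# Sketch only: weight_iterator() and pull_files() are the methods named in
# this PR; the surrounding structure (ConnectorType, signatures) is assumed.
from abc import ABC, abstractmethod
from enum import Enum, auto
from typing import Generator, Tuple

import torch


class ConnectorType(Enum):
    KV = auto()   # KV databases such as Redis
    FS = auto()   # remote file storage such as S3


class Connector(ABC):
    def __init__(self, url: str):
        self.url = url

    @abstractmethod
    def weight_iterator(
        self,
    ) -> Generator[Tuple[str, torch.Tensor], None, None]:
        """Yield (parameter_name, tensor) pairs for the current rank."""

    @abstractmethod
    def pull_files(self, dst_dir: str) -> None:
        """Download model config files (config.json, tokenizer, ...)."""


def load_weights(model: torch.nn.Module, connector: Connector) -> None:
    # RemoteModelLoader builds the model skeleton first, then copies weights
    # streamed from the connector into the named parameters one by one.
    params = dict(model.named_parameters())
    for name, tensor in connector.weight_iterator():
        params[name].data.copy_(tensor)
```

Both RedisConnector (KV) and S3Connector (FS) would plug into this same interface.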

Connector could also be used for a remote prefix cache in the future, similar to LMCache.

Usage

For a file-like remote backend such as S3, just replace the --model argument:
vllm serve s3://bucketname/path/to/model/ --port 8000 -tp 4
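
With this PR applied, the same S3 path should also work from the Python API. A hedged example (the s3:// handling depends on this PR's loader; LLM and SamplingParams are standard vLLM APIs, and the bucket/path is the same placeholder as above):

```python
from vllm import LLM, SamplingParams

# Assumes this PR's RemoteModelLoader is picked up for s3:// model paths.
llm = LLM(model="s3://bucketname/path/to/model/", tensor_parallel_size=4)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```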

For a KV-like database such as Redis, users need to load the weights into the database first. I have provided a script under examples/offline_inference (inspired by ShardedStateLoader):
python3 examples/offline_inference/save_remote_state.py --model /data01/models/Meta-Llama-3-8B/ --remote-model-save-url redis://IP:PORT/Meta-Llama-3-8B -tp 4
After loading the tensors, replace the --model argument with the Redis URL and use the same tp value:
vllm serve redis://IP:PORT/Meta-Llama-3-8B -tp 4

You can introduce your own remote backend to Connector, such as HDFS, Amazon DynamoDB, etc.
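
For example, a custom backend could provide the same two methods. This skeleton is purely illustrative (the class name and structure are assumptions, not part of this PR), and it would subclass the Connector base sketched earlier:

```python
# Purely illustrative skeleton for an internal HDFS backend; class name and
# structure are assumptions. It would subclass the Connector base sketched
# above and be registered for the hdfs:// URL scheme.
from typing import Generator, Tuple

import torch


class HDFSConnector:
    def __init__(self, url: str):
        # e.g. hdfs://namenode:8020/models/Meta-Llama-3-8B
        self.url = url
        # Open your HDFS client here (pyarrow, hdfs3, an internal SDK, ...).

    def weight_iterator(
        self,
    ) -> Generator[Tuple[str, torch.Tensor], None, None]:
        # Stream safetensors files from HDFS and yield (name, tensor) pairs.
        raise NotImplementedError("wire up your HDFS client here")

    def pull_files(self, dst_dir: str) -> None:
        # Copy config.json / tokenizer files from HDFS into dst_dir.
        raise NotImplementedError("wire up your HDFS client here")
```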

TBD

I have not yet addressed code style checks and unit tests. If this PR proves to be helpful, I will fill in this part soon.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gaogaoSpark

If the number of ranks differs, do we need to store multiple copies of the model, one per rank configuration?
For example, DeepSeek may run on two H20s, or on one H200.

@DellCurry (Author) commented Feb 25, 2025

If the number of ranks differs, do we need to store multiple copies of the model, one per rank configuration? For example, DeepSeek may run on two H20s, or on one H200.

For a file-like remote backend such as S3, it does not matter.

For a KV database, we can simply treat models with different tp sizes as different models, because we split each tensor by rank and then store the shards in the database. That is to say, we need to save the model weights twice under different names.

Here is an example:
python3 examples/offline_inference/save_remote_state.py --model /path/to/model --remote-model-save-url redis://IP:PORT/deepseek_tp_2 -tp 2
and
python3 examples/offline_inference/save_remote_state.py --model /path/to/model --remote-model-save-url redis://IP:PORT/deepseek_tp_1 -tp 1

Then, run vllm serve redis://IP:PORT/deepseek_tp_1 -tp 1 or vllm serve redis://IP:PORT/deepseek_tp_2 -tp 2

We do this (inspired by ShardedStateLoader) for two reasons; a sketch of the resulting key layout follows the list:

  1. Splitting the tensors saves a lot of loading time: each rank only needs to read its own shard.
  2. We hope to use GDR to load weights directly into HBM, which is often not large enough to hold the entire checkpoint of a huge model.
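
To make this concrete, here is a hedged sketch of how rank-sharded weights might be keyed in the KV store. The key format and serialization are assumptions; the real layout used by save_remote_state.py / RedisConnector may differ:

```python
# Illustrative only: how rank-sharded weights might be keyed in a KV store.
# The real key format used by save_remote_state.py / RedisConnector may differ.
import io

import torch


def make_key(model_name: str, tp_rank: int, param_name: str) -> str:
    # Different tp sizes produce different key sets, so the same checkpoint
    # saved with -tp 1 and -tp 2 is effectively two distinct models.
    return f"{model_name}/rank_{tp_rank}/{param_name}"


def serialize(tensor: torch.Tensor) -> bytes:
    # Each rank serializes only its own shard before pushing it to the store.
    buf = io.BytesIO()
    torch.save(tensor.cpu(), buf)
    return buf.getvalue()


# Example: rank 1 of a tp-2 save, stored under an explicit tp-2 model name.
key = make_key("deepseek_tp_2", tp_rank=1,
               param_name="model.embed_tokens.weight")
value = serialize(torch.empty(0))  # placeholder tensor for illustration
print(key, len(value))
```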

@gaogaoSpark

Thanks, got it.

@mergify bot added the documentation label Feb 28, 2025

mergify bot commented Feb 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @DellCurry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Feb 28, 2025