
feat(remote_model): support variable remote backend for model loader #13809

Open · wants to merge 1 commit into main
Conversation


@DellCurry commented Feb 25, 2025

This PR addresses #12250.

Background

Currently, one of the most common ways to load a model is from local disk, which means users must first download the model files from HF or cloud storage to local storage. This obviously wastes a lot of time, especially for huge models.

Of course, there are ways to load directly from remote storage, such as a remote filesystem like NFS, or using runai-model-streamer to load safetensors files from S3 or S3-compatible remote storage. These methods have their own drawbacks in network speed and flexibility. For example, runai-model-streamer requires configuring several environment variables (there are many issues about this in vLLM and in its own repo). And if a company uses its own remote storage such as HDFS, it cannot use runai-model-streamer at all.

Besides, some organizations want to use a KV database such as Redis to accelerate model loading. Our team has implemented an RDMA-based KV database which is much faster, as shown below:
[image: loading-speed benchmark of the RDMA-based KV database]

What this PR does

To provide more flexibility, this PR adds a new ModelLoader class named RemoteModelLoader and introduces a new module named Connector. RemoteModelLoader creates a Connector as its member; it first builds the model and then fetches the weight tensors one by one from the Connector.

Connector has two types: KV for KV databases and FS for remote file storage. Both types must implement weight_iterator() to yield weight tensors and pull_files() to download the model config files. I have implemented RedisConnector as an example (most of the serde part is copied from LMCache) and moved most of the original S3Model into S3Connector. To keep things intact, the original RunaiModelStreamerLoader is kept for the local filesystem only.
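
For reference, a minimal sketch of what this interface could look like. Only weight_iterator() and pull_files() are named in this PR; the surrounding structure (the ConnectorType enum, signatures, and the load_weights helper) is an assumption for illustration, not the actual implementation:

```python
# Sketch only: weight_iterator() and pull_files() are the methods named in
# this PR; the surrounding structure (ConnectorType, signatures) is assumed.
from abc import ABC, abstractmethod
from enum import Enum, auto
from typing import Generator, Tuple

import torch


class ConnectorType(Enum):
    KV = auto()   # KV databases such as Redis
    FS = auto()   # remote file storage such as S3


class Connector(ABC):
    def __init__(self, url: str):
        self.url = url

    @abstractmethod
    def weight_iterator(
        self,
    ) -> Generator[Tuple[str, torch.Tensor], None, None]:
        """Yield (parameter_name, tensor) pairs for the current rank."""

    @abstractmethod
    def pull_files(self, dst_dir: str) -> None:
        """Download model config files (config.json, tokenizer, ...)."""


def load_weights(model: torch.nn.Module, connector: Connector) -> None:
    # RemoteModelLoader builds the model skeleton first, then copies weights
    # streamed from the connector into the named parameters one by one.
    params = dict(model.named_parameters())
    for name, tensor in connector.weight_iterator():
        params[name].data.copy_(tensor)
```

Both RedisConnector (KV) and S3Connector (FS) would plug into this same interface.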

Connector could also be used for a remote prefix cache in the future, similar to LMCache.

Usage

For a file-like remote backend such as S3, just replace the --model argument:
vllm serve s3://bucketname/path/to/model/ --port 8000 -tp 4
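
With this PR applied, the same S3 path should also work from the Python API. A hedged example (the s3:// handling depends on this PR's loader; LLM and SamplingParams are standard vLLM APIs, and the bucket/path is the same placeholder as above):

```python
from vllm import LLM, SamplingParams

# Assumes this PR's RemoteModelLoader is picked up for s3:// model paths.
llm = LLM(model="s3://bucketname/path/to/model/", tensor_parallel_size=4)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```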

For a KV-like database such as Redis, users need to load the weights into the database first. I have provided a script under examples/offline_inference (inspired by ShardedStateLoader):
python3 examples/offline_inference/save_remote_state.py --model /data01/models/Meta-Llama-3-8B/ --remote-model-save-url redis://IP:PORT/Meta-Llama-3-8B -tp 4
After loading the tensors, replace the --model argument with the Redis URL and use the same tp value:
vllm serve redis://IP:PORT/Meta-Llama-3-8B -tp 4

You can introduce your own remote backend to Connector, such as HDFS, Amazon DynamoDB, etc.
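
For example, a custom backend could provide the same two methods. This skeleton is purely illustrative (the class name and structure are assumptions, not part of this PR), and it would subclass the Connector base sketched earlier:

```python
# Purely illustrative skeleton for an internal HDFS backend; class name and
# structure are assumptions. It would subclass the Connector base sketched
# above and be registered for the hdfs:// URL scheme.
from typing import Generator, Tuple

import torch


class HDFSConnector:
    def __init__(self, url: str):
        # e.g. hdfs://namenode:8020/models/Meta-Llama-3-8B
        self.url = url
        # Open your HDFS client here (pyarrow, hdfs3, an internal SDK, ...).

    def weight_iterator(
        self,
    ) -> Generator[Tuple[str, torch.Tensor], None, None]:
        # Stream safetensors files from HDFS and yield (name, tensor) pairs.
        raise NotImplementedError("wire up your HDFS client here")

    def pull_files(self, dst_dir: str) -> None:
        # Copy config.json / tokenizer files from HDFS into dst_dir.
        raise NotImplementedError("wire up your HDFS client here")
```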

TBD

I have not yet addressed code style checks and unit tests. If this PR proves to be helpful, I will fill in this part soon.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gaogaoSpark

If the number of ranks differs, do we need to store multiple copies of the model, one per rank configuration?
For example, DeepSeek may run on two H20s, or on one H200.

@DellCurry (Author) commented Feb 25, 2025

If the number of ranks differs, do we need to store multiple copies of the model, one per rank configuration? For example, DeepSeek may run on two H20s, or on one H200.

For a file-like remote backend such as S3, it does not matter.

For a KV database, we can simply treat models with different tp sizes as different models, because we split each tensor by rank and then store the shards in the database. That is to say, we need to save the model weights twice under different names.

Here is an example:
python3 examples/offline_inference/save_remote_state.py --model /path/to/model --remote-model-save-url redis://IP:PORT/deepseek_tp_2 -tp 2
and
python3 examples/offline_inference/save_remote_state.py --model /path/to/model --remote-model-save-url redis://IP:PORT/deepseek_tp_1 -tp 1

Then, run vllm serve redis://IP:PORT/deepseek_tp_1 -tp 1 or vllm serve redis://IP:PORT/deepseek_tp_2 -tp 2

We do this (inspired by ShardedStateLoader) for two reasons; a sketch of the resulting key layout follows the list:

  1. Splitting the tensors saves a lot of loading time: each rank only needs to read its own shard.
  2. We hope to use GDR to load weights directly into HBM, which is often not large enough to hold the entire checkpoint of a huge model.
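
To make this concrete, here is a hedged sketch of how rank-sharded weights might be keyed in the KV store. The key format and serialization are assumptions; the real layout used by save_remote_state.py / RedisConnector may differ:

```python
# Illustrative only: how rank-sharded weights might be keyed in a KV store.
# The real key format used by save_remote_state.py / RedisConnector may differ.
import io

import torch


def make_key(model_name: str, tp_rank: int, param_name: str) -> str:
    # Different tp sizes produce different key sets, so the same checkpoint
    # saved with -tp 1 and -tp 2 is effectively two distinct models.
    return f"{model_name}/rank_{tp_rank}/{param_name}"


def serialize(tensor: torch.Tensor) -> bytes:
    # Each rank serializes only its own shard before pushing it to the store.
    buf = io.BytesIO()
    torch.save(tensor.cpu(), buf)
    return buf.getvalue()


# Example: rank 1 of a tp-2 save, stored under an explicit tp-2 model name.
key = make_key("deepseek_tp_2", tp_rank=1,
               param_name="model.embed_tokens.weight")
value = serialize(torch.empty(0))  # placeholder tensor for illustration
print(key, len(value))
```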

@gaogaoSpark

Thanks, got it.

@mergify bot added the documentation label Feb 28, 2025

mergify bot commented Feb 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @DellCurry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Feb 28, 2025