[Core] Loading model from S3 using RunAI Model Streamer as optional loader #10192
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from e4138c7 to 9d3cfa4
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a8d45b4 to 0e5335d
Add it to the docs as well Signed-off-by: OmerD <[email protected]>
Force-pushed from 29aa9d8 to 2a9c864
Signed-off-by: OmerD <[email protected]>
Force-pushed from 2a9c864 to a8d174e
This pull request has merge conflicts that must be resolved before it can be merged.
@pandyamarut can you help review this PR?
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/config.py (Outdated)
@@ -191,6 +192,18 @@ def __init__(
f"'Please instead use `--hf-overrides '{hf_override!r}'`")
warnings.warn(DeprecationWarning(msg), stacklevel=2)

if is_s3(model):
In addition to the weights files, loading a model requires config.json and the tokenizer files.
If they are stored in S3 as well, the program needs to read them from there, which was not possible before this change.
There are 3 options for implementing this:
(An image here shows, at a high level, the relevant chain of calls in the code.)
Option 1 means implementing it in vllm/config.py: a single place of change, as small as possible (the current implementation).
Option 2 means implementing it in 2 files separately, in the tokenizer and the config file. One may argue this is the preferred way, since the ModelScope hub integration is in this layer.
Option 3 means implementing it in the HuggingFace library itself: make it able to take a path to an S3 bucket and read the content from there.
In my opinion option 3 is the most transparent and the most correct one; however, they are not interested in expanding their support (huggingface/transformers#19834 (comment)).
WDYT?
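A minimal sketch of what option 1 could look like, assuming boto3 is available and AWS credentials are configured; the helper name `pull_config_files` and the skipped extensions are illustrative, not the PR's actual code:

```python
import os
import tempfile

import boto3


def pull_config_files(s3_uri: str) -> str:
    """Download config/tokenizer files (no weights) to a tmpfs-backed dir."""
    bucket, _, prefix = s3_uri.removeprefix("s3://").partition("/")
    local_dir = tempfile.mkdtemp(dir="/dev/shm")
    s3 = boto3.client("s3")
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        key = obj["Key"]
        # Skip the (large) weight files; only the small metadata files are
        # needed for the HF config/tokenizer loading path.
        if key.endswith((".safetensors", ".bin", ".pt")):
            continue
        dest = os.path.join(local_dir, os.path.basename(key))
        s3.download_file(bucket, key, dest)
    return local_dir
```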
This pull request has merge conflicts that must be resolved before it can be merged.
@simon-mo is this good to be merged?
Autoscaling LLMs could be a whole lot better with this addition! I'm happy to see this PR moving forward.
Looking into this PR now. A quick question: how does this work with a model on Huggingface Hub? Does the user need to manually mirror it to S3?
Overall LGTM. Thanks for the work!
Looks like it will fall back to the current model loader if the model name is in HF format (org/model).
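A rough sketch of that fallback behavior (not the PR's actual logic): the `is_s3` name mirrors the helper seen in the diff above, while `resolve_load_format` is a hypothetical illustration.

```python
def is_s3(model_name_or_path: str) -> bool:
    # Anything with an s3:// scheme is treated as an S3 URI.
    return model_name_or_path.lower().startswith("s3://")


def resolve_load_format(model: str, requested_format: str) -> str:
    # If the streamer was requested but the model is a plain HF-style
    # "org/model" name, fall back to the regular loader.
    if requested_format == "runai_streamer" and not is_s3(model):
        return "auto"
    return requested_format
```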
Force-pushed from 36c3f32 to 30af43e
@omer-dayan out of curiosity, do you think it's possible to implement something to read directly from the hub in a streaming fashion? Are there limitations around this?
Approved to unblock. My main comment is about the docstring. Others LGTM.
Force-pushed from 2c2b9f2 to e5fae51
Signed-off-by: OmerD <[email protected]>
Force-pushed from e5fae51 to 4ea40d1
@simon-mo I leave it to you.
@comaniac Thanks a lot for the review!
Technically I don't see a reason why not. However, notice that
LGTM as well, thanks for iterating and making the deps optional
[Core] Loading model from S3 using RunAI Model Streamer as optional loader (vllm-project#10192) Signed-off-by: OmerD <[email protected]> Signed-off-by: lucast2021 <[email protected]>
This PR adds an option to load a model from S3, as well as from other storage backends, using the RunAI Model Streamer as an optional loader.
The RunAI Model Streamer is an open-source model loader that can stream tensors from any storage (NFS / local dir / S3 / object store) with concurrency (https://github.com/run-ai/runai-model-streamer).
Performance benchmarks and further reading can be found here: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf
In this PR we have made the following changes:
- Added a new option to the `--load-format` flag - `runai_streamer` (+ help description)
- When specifying `runai_streamer`, vLLM will load the model using the `RunaiModelStreamerLoader`
- The `RunaiModelStreamerLoader` works only with Safetensors files
- The `RunaiModelStreamerLoader` can be initialized with tunable parameters (concurrency and CPU memory limit); see the sketch after this list
- The loader uses the `runai-model-streamer` package
- Added `runai-model-streamer` to the requirements
- To read `config.json` and the tokenizer files, we pull the model (no weights files) from S3 to a temporary memory-fs-backed directory under `/dev/shm`
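A minimal sketch (not the actual `RunaiModelStreamerLoader`) of how the two tunable knobs and the Safetensors-only constraint could be wired up; the config keys `concurrency` and `memory_limit` are assumptions for illustration:

```python
import glob
import os
from typing import Optional


class StreamerLoaderSketch:
    """Illustrative stand-in for a streamer-based loader."""

    def __init__(self, extra_config: Optional[dict] = None):
        extra_config = extra_config or {}
        # Number of concurrent streams used to read tensor data.
        self.concurrency = int(extra_config.get("concurrency", 16))
        # CPU memory budget for staging tensors; -1 means unlimited.
        self.memory_limit = int(extra_config.get("memory_limit", -1))

    def list_weight_files(self, model_dir: str) -> list[str]:
        # Only Safetensors files are supported; other weight formats
        # (e.g. *.bin) are ignored by this loader.
        return sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
```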
After this PR, given a directory on AWS S3 with the model files, one can run the following command:
vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
(Authorization to the S3 endpoint is done through the regular AWS S3 authorization mechanisms - `~/.aws/credentials` / environment variables / etc.)
The tensors will be streamed directly from S3 into GPU memory, without going through local storage.
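For completeness, a hedged sketch of the equivalent offline-inference call through the Python API, reusing the example bucket path above and assuming the extra-config keys `concurrency` and `memory_limit`:

```python
from vllm import LLM

# Assumes AWS credentials are available via ~/.aws/credentials or env vars.
llm = LLM(
    model="s3://core-llm/Llama-3-8b",
    load_format="runai_streamer",
    # Hypothetical tuning knobs for the streamer (key names are assumptions).
    model_loader_extra_config={"concurrency": 32, "memory_limit": -1},
)
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```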