
[Core] Loading model from S3 using RunAI Model Streamer as optional loader #10192

Merged
merged 17 commits into vllm-project:main from omer/run-loader on Dec 20, 2024

Conversation

omer-dayan
Contributor

@omer-dayan omer-dayan commented Nov 10, 2024

This PR adds an option to load a model from S3, as well as from other storage backends, using the RunAI Model Streamer as a loader.

The RunAI Model Streamer is an open-source model loader that can stream tensors concurrently from any storage (NFS / local directory / S3 / object store): https://github.com/run-ai/runai-model-streamer.
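
For context, a minimal sketch of what consuming the streamer directly from Python might look like (based on the project's README; the `SafetensorsStreamer` class and its `stream_file`/`get_tensors` methods are assumptions, not code from this PR):

```python
# Sketch only: API names assumed from the runai-model-streamer README.
from runai_model_streamer import SafetensorsStreamer

file_path = "s3://core-llm/Llama-3-8b/model-00001-of-00004.safetensors"  # example path

with SafetensorsStreamer() as streamer:
    # Request the file; chunks are fetched concurrently in the background.
    streamer.stream_file(file_path)
    # Iterate over tensors as they become ready, without waiting for the
    # whole file to be written to local disk first.
    for name, tensor in streamer.get_tensors():
        print(name, tuple(tensor.shape))
```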

Performance benchmarks:
[Image: performance benchmark results]

Further reading can be found here: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf

In this PR we have made the following changes:

  1. Added a new option for the --load-format flag: runai_streamer (plus a help description)
  2. When runai_streamer is used, vLLM loads the model with RunaiModelStreamerLoader
  3. The RunaiModelStreamerLoader works only with Safetensors files
  4. The RunaiModelStreamerLoader can be initialized with tunable parameters (concurrency and CPU memory limit); see the sketch after this list
  5. Lazy loading of the runai-model-streamer package
  6. Added runai-model-streamer to the requirements
  7. For config.json and the tokenizer files, we pull the model (without the weight files) from S3 into a temporary memory-fs-backed directory under /dev/shm
  8. Added documentation on how to use it
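
As mentioned in item 4, the tunable parameters can be passed through vLLM's model loader extra config. A hedged sketch using the offline API (the exact config keys, "concurrency" and "memory_limit", mirror the streamer's tunables and are assumptions here, not confirmed names):

```python
# Sketch only: the model_loader_extra_config keys are assumed, not confirmed.
from vllm import LLM

llm = LLM(
    model="s3://core-llm/Llama-3-8b",      # S3 prefix holding the Safetensors files
    load_format="runai_streamer",          # the new --load-format value
    model_loader_extra_config={
        "concurrency": 16,                 # number of concurrent storage readers
        "memory_limit": 5 * 1024**3,       # CPU buffer cap, in bytes
    },
)
```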

After this PR, given a directory on AWS S3 with the model files:
[Image: S3 bucket listing showing the model files]

One can run the following command:
vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
(Authorization to the S3 endpoint is handled through the regular AWS S3 authorization mechanisms: ~/.aws/credentials, environment variables, etc.)
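
For example, credentials can be supplied through the standard AWS environment variables before launching the server (a sketch; AWS_ENDPOINT_URL is only needed for S3-compatible object stores):

```python
# Standard AWS SDK environment variables; set these (or use ~/.aws/credentials)
# in the environment that launches `vllm serve`.
import os

os.environ["AWS_ACCESS_KEY_ID"] = "<access-key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
# os.environ["AWS_ENDPOINT_URL"] = "http://minio:9000"  # only for non-AWS, S3-compatible stores
```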

The tensors are streamed directly from S3 into GPU memory, without first being written to local storage.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation ci/build labels Nov 10, 2024
@omer-dayan omer-dayan changed the title Add RunAI Model Streamer as optional loader. [Core] Add RunAI Model Streamer as optional loader. Nov 10, 2024
@omer-dayan omer-dayan changed the title [Core] Add RunAI Model Streamer as optional loader. [Core] Add RunAI Model Streamer as optional loader Nov 11, 2024

mergify bot commented Nov 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @omer-dayan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@omer-dayan omer-dayan changed the title [Core] Add RunAI Model Streamer as optional loader [Core] Loading model from S3 using RunAI Model Streamer as optional loader Nov 14, 2024
@omer-dayan omer-dayan force-pushed the omer/run-loader branch 2 times, most recently from a8d45b4 to 0e5335d Compare November 14, 2024 09:14
Add it to the docs as well

Signed-off-by: OmerD <[email protected]>
@omer-dayan omer-dayan force-pushed the omer/run-loader branch 6 times, most recently from 29aa9d8 to 2a9c864 Compare November 15, 2024 14:42
Signed-off-by: OmerD <[email protected]>

mergify bot commented Nov 15, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @omer-dayan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 15, 2024
@simon-mo
Collaborator

@pandyamarut can you help review this PR?

@mergify mergify bot removed the needs-rebase label Nov 16, 2024

mergify bot commented Nov 17, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @omer-dayan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 17, 2024
vllm/config.py Outdated
@@ -191,6 +192,18 @@ def __init__(
f"'Please instead use `--hf-overrides '{hf_override!r}'`")
warnings.warn(DeprecationWarning(msg), stacklevel=2)

if is_s3(model):
Contributor Author

@omer-dayan omer-dayan Nov 17, 2024


In addition to the weight files, loading a model requires config.json and the tokenizer files.
If these are stored in S3 as well, the program needs to read them from there, which was not possible before this change.

There are three options for implementing this:

[Image: diagram showing, at a high level, the relevant chain of calls in the code]

Option 1, implementing it in vllm/config.py, means a single, minimal place of change (the current implementation).
Option 2 means implementing it separately in two files, the tokenizer handling and the config handling. One may argue this is the preferred way, since the ModelScope hub integration lives in that layer.
Option 3 means implementing it in the HuggingFace library itself, making it able to take a path to an S3 bucket and read the content from there.

In my opinion, option 3 is the most transparent and the most correct, but they are not interested in expanding their support (huggingface/transformers#19834 (comment)).

WDYT?
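
To make option 1 more concrete, here is a rough sketch of the idea (illustrative only, not the code in this PR; the helper names are hypothetical and only standard boto3 calls are used):

```python
# Illustrative sketch of option 1: when the model path is an S3 URL, pull only
# the small non-weight files (config.json, tokenizer files) into a memory-backed
# temp dir so the existing config/tokenizer code can read them from a local path.
import os
import tempfile
from urllib.parse import urlparse

import boto3


def is_s3(model: str) -> bool:
    return model.lower().startswith("s3://")


def pull_non_weight_files(model: str) -> str:
    parsed = urlparse(model)
    bucket, prefix = parsed.netloc, parsed.path.lstrip("/")
    local_dir = tempfile.mkdtemp(dir="/dev/shm")  # memory-fs backed directory
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith((".safetensors", ".bin")):
                continue  # weight files are streamed separately, not downloaded
            dest = os.path.join(local_dir, os.path.relpath(key, prefix))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(bucket, key, dest)
    return local_dir
```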


mergify bot commented Dec 17, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @omer-dayan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 17, 2024
@mergify mergify bot removed the needs-rebase label Dec 17, 2024
@kouroshHakha

@simon-mo is this good to be merged?

@YaliEkstein

Autoscaling LLMs could be a whole lot better with this addition! I'm happy to see this PR moving forward.

@simon-mo
Collaborator

Looking into this PR now. A quick question: how does this work with a model on the HuggingFace Hub? Does the user need to manually mirror it to S3?

Collaborator

@comaniac comaniac left a comment


Overall LGTM. Thanks for the work!

vllm/config.py (outdated; resolved)
vllm/transformers_utils/s3_utils.py (resolved)
vllm/transformers_utils/s3_utils.py (outdated; resolved)
vllm/model_executor/model_loader/loader.py (outdated; resolved)
vllm/model_executor/model_loader/loader.py (resolved)
@comaniac
Collaborator

Looking into this PR now. A quick question: how does this work with a model on the HuggingFace Hub? Does the user need to manually mirror it to S3?

Looks like it will fall back to the current model loader if the model name is in HF format (org/model).

@omer-dayan omer-dayan force-pushed the omer/run-loader branch 2 times, most recently from 36c3f32 to 30af43e Compare December 19, 2024 18:08
@omer-dayan
Contributor Author

@simon-mo

Looking into this PR now. A quick question: how does this work with a model on the HuggingFace Hub? Does the user need to manually mirror it to S3?

No. If the model is on the HuggingFace Hub, we download it locally; basically, we fall back to the default behavior, as @comaniac said.

@simon-mo
Collaborator

@omer-dayan out of curiosity, do you think it's possible to implement a direct read from the Hub in a streaming fashion? Are there any limitations around this?

Collaborator

@comaniac comaniac left a comment


Approved to unblock. My main comment is about the docstring. Others LGTM.

vllm/transformers_utils/s3_utils.py (resolved)
vllm/config.py (resolved)
docs/source/serving/runai_model_streamer.rst (outdated; resolved)
docs/source/serving/runai_model_streamer.rst (outdated; resolved)
@omer-dayan omer-dayan force-pushed the omer/run-loader branch 2 times, most recently from 2c2b9f2 to e5fae51 Compare December 19, 2024 21:28
Signed-off-by: OmerD <[email protected]>
@comaniac
Collaborator

@simon-mo leaving it to you

@omer-dayan
Contributor Author

@comaniac Thanks a lot for the review!

@simon-mo

out of curiosity, do you think it's possible to implement a direct read from the Hub in a streaming fashion? Are there any limitations around this?

Technically, I don't see a reason why not.
That said, downloading the weights from the HuggingFace Hub is not good practice for production, where you would be looking for better loading times.
The first thing one would do to improve loading time is make sure the deployment is not coupled to the public internet and that the weights are stored close by.

However, notice that the HuggingFace Hub is just a git server, and every model is a git repository.
I do think a good solution would be to stream, as we do in this PR, from any git repo.
That way we would implicitly get "streaming from the HuggingFace Hub without needing a filesystem", and in production people could store their models on nearby git servers.

@simon-mo simon-mo enabled auto-merge (squash) December 19, 2024 23:32
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 19, 2024
Member

@mgoin mgoin left a comment


LGTM as well, thanks for iterating and making the deps optional

@simon-mo simon-mo merged commit 995f562 into vllm-project:main Dec 20, 2024
76 checks passed
lucas-tucker pushed a commit to lucas-tucker/vllm-lucas-tucker that referenced this pull request Dec 21, 2024