
Loading models from an S3 location instead of local path #3090

Open
simon-mo opened this issue Feb 28, 2024 Discussed in #3072 · 6 comments

@simon-mo
Collaborator

Discussed in #3072

Originally posted by petrosbaltzis February 28, 2024
Hello,

The vLLM library can load the model and the tokenizer either from a local folder or directly from Hugging Face, for example:

["python", "-m", "vllm.entrypoints.openai.api_server", \
"--host=0.0.0.0", \
"--port=8080", \
"--model=<local_path>", \
"--tokenizer=<local_path>",
]

I wonder whether this functionality could be extended to support S3 locations, so that when we initialize the API server we can pass an S3 URI instead:

["python", "-m", "vllm.entrypoints.openai.api_server", \
"--host=0.0.0.0", \
"--port=8080", \
"--model=<s3://bucket/prefix>", \
"--tokenizer=<s3://bucket/prefix>",
]

Petros
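
For context, a common interim workaround (not part of the proposal above) is to copy the weights from S3 to local disk first and point --model at the local copy. A minimal sketch, assuming the AWS CLI is available; the bucket and paths are illustrative:

# Illustrative workaround: sync the model files from S3 to a local directory,
# then start the server against the local path.
aws s3 sync s3://my-bucket/my-model /models/my-model
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8080 \
    --model /models/my-model \
    --tokenizer /models/my-model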

@ywang96
Member

ywang96 commented Feb 29, 2024

Similar to what @ikalista mentioned in the original discussion, IMO a better way is to mount model storage into the container for model loading, unless we want to rewrite the model loader to "stream" directly from S3 into the GPU buffer, as Anyscale did.
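
One hypothetical way to do the mounting (the thread does not specify a mechanism; s3fs-fuse is just one option, and the bucket and paths are illustrative) is to FUSE-mount the bucket and point vLLM at the mount point:

# Sketch: mount the bucket with s3fs-fuse (assumes credentials in ~/.passwd-s3fs),
# then load the model from the mounted path.
s3fs my-bucket /mnt/models -o passwd_file=${HOME}/.passwd-s3fs
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8080 \
    --model /mnt/models/my-model \
    --tokenizer /mnt/models/my-model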

@drawnwren

Sorry to bump an old issue here, but does this mean that --download-dir does not load weights? The docs say "Directory to download and load the weights, default to the default cache dir of huggingface.", which makes me think that when I specify --download-dir s3://my-bucket the bucket is used as a cache. But this issue makes me think that my interpretation is incorrect?

@ashvinnihalani

@ywang96 Is anybody working on the direct model loading? Do we have a benchmark comparing mounting against loading directly into memory? Happy to work on this if nobody else is.

@ywang96
Member

ywang96 commented Sep 24, 2024

@ywang96 Is anybody working on the direct model loading? Do we have a benchmark comparing mounting against loading directly into memory? Happy to work on this if nobody else is.

Not to my knowledge. Feel free to work on this, and thanks for your interest!

@samos123
Contributor

samos123 commented Nov 7, 2024

@ashvinnihalani are you still working on this? This would also be helpful for loading large models in environments where local disk space isn't sufficient.

The issue with mounting object storage is that it requires the platform operator to provide it. For example, in certain K8s setups the user deploying vLLM may not have the permissions required to mount object storage in their container.

So that's why this would be a very valuable feature.

@omer-dayan
Contributor

omer-dayan commented Nov 17, 2024

Hey,
At RunAI we have published an open-source tool to stream model weights from an object store like S3 to GPU memory, called the RunAI Model Streamer (https://github.com/run-ai/runai-model-streamer).

The Streamer provides two main advantages:

  1. Concurrent reads from storage
  2. Integration with object storage such as S3

You can read further in the whitepaper: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf

We have proposed a way to integrate it into vLLM.
#10192
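
As a rough sketch of how the proposed integration might be invoked (the load-format name is taken from the linked PR and may change; the bucket path is illustrative):

# Sketch only: stream weights directly from S3 via the RunAI Model Streamer,
# per the integration proposed in #10192.
python -m vllm.entrypoints.openai.api_server \
    --model s3://my-bucket/my-model \
    --load-format runai_streamer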
