Develop #9

Merged
60 commits, merged Aug 29, 2024
Commits
3729b61
Update poetry lock
XkunW May 28, 2024
fc49b78
Add docker image for default environment
XkunW May 30, 2024
e597136
Update docker image to not create a virtual env
XkunW Jun 4, 2024
18647db
Update version
XkunW Jun 4, 2024
b1571a0
Test container with single node llama 3
XkunW Jun 5, 2024
1dfd1c8
Add vllm-nccl-cu12 as dependency
XkunW Jun 5, 2024
00c469c
Update Dockerfile
XkunW Jun 5, 2024
966ed93
Move nccl file location
XkunW Jun 6, 2024
f986875
Update poetry lock, add mistral models, update default env to use sin…
XkunW Jun 6, 2024
8731c93
Update README installation instructions
XkunW Jun 6, 2024
2b9bdf4
Update env var name
XkunW Jun 6, 2024
af0ad0c
Move Poetry cache dir to working dir
XkunW Jun 11, 2024
8109795
Clone from main
XkunW Jun 12, 2024
b669354
Update to use vLLM 0.5.0
XkunW Jun 13, 2024
d28f03f
Add vim installation, remove cache directory as it is unused
XkunW Jun 13, 2024
c4dbed0
Update examples to include VLM completions, add profiling scripts
XkunW Jun 13, 2024
c9bd432
Added support for VLMs - llava-1.5 and llava-next, updated default en…
XkunW Jun 13, 2024
f60c3f1
Fixed data type override logic, added --time argument
XkunW Jun 13, 2024
045fc81
Accidentally removed variant argument in previous commits, adding it …
XkunW Jun 17, 2024
07fbe33
Set default image input args for VLM models
XkunW Jun 17, 2024
57087f9
Update Llava 1.5 README
XkunW Jun 17, 2024
e88da1f
Update models README
XkunW Jun 17, 2024
2e465e1
Update README.md to reflect refactoring in examples folder
XkunW Jun 17, 2024
65bf554
Update README.md to reflect factored changes
XkunW Jun 17, 2024
4b608be
refactoring v1.
Jun 20, 2024
9e79c31
removed launched server from each models directory.
Jun 20, 2024
96a7233
removed MODEL_EXT
Jun 20, 2024
9e42483
Update config files, consolidate all job launching bash scripts to sa…
XkunW Jun 21, 2024
7e64ecb
Fix file path issues with the consolidated launch script
XkunW Jun 24, 2024
4054b3b
Update README according to refactor
XkunW Jun 24, 2024
fc84a0b
Update model variant names for llama2, added CodeLlama
XkunW Jul 6, 2024
1f1cec7
Bump version
XkunW Jul 6, 2024
6b116e8
Update version
XkunW Jul 25, 2024
b5ad503
Add CLI, update repo into a package, added llama 3.1 and gemma 2
XkunW Jul 30, 2024
a558b96
Bump version to 1.0.0
XkunW Jul 30, 2024
3dbbcb5
Merge branch 'develop' into feature/cli
XkunW Jul 30, 2024
2bde47b
Deleted old files unresolved from merge, delete a comment
XkunW Jul 30, 2024
f16b837
Merge pull request #8 from VectorInstitute/feature/cli
XkunW Jul 30, 2024
ec5dd56
Don't create venv when building docker image
XkunW Jul 30, 2024
6bb439e
Minor bug fixes for CLI, added Phi-3, updated VLM launching logic for…
XkunW Jul 31, 2024
6783f12
Add phi-3 vision, update multi-node launch with the same command as s…
XkunW Jul 31, 2024
a9f04cf
Remove old comments, use pipeline parallel for multinode
XkunW Jul 31, 2024
70b6aef
Remove pipeline parallelism as only few architectures supported, limi…
XkunW Aug 1, 2024
e12be31
Turn down gpu utilization to 95% as 100% seems more likely to hit CUD…
XkunW Aug 1, 2024
1e1619b
Add missing brackets
XkunW Aug 1, 2024
9b42ed3
Update model family name extraction logic
XkunW Aug 1, 2024
d1126f1
Update vllm version to 0.5.4, change vec-inf version to 0.3.0
XkunW Aug 6, 2024
e6904ad
Update default variant to chat variants, add max_model_len option, re…
XkunW Aug 8, 2024
8282cc5
Add missing input param to launch for max_model_len
XkunW Aug 8, 2024
124e898
Add served model name to replace the full model weights path when sen…
XkunW Aug 8, 2024
46fa15b
Add available server status
XkunW Aug 8, 2024
4c66ec7
Move default config from model family level to model variant level, u…
XkunW Aug 21, 2024
782ed34
Configure default to model variant level, launch command now takes fu…
XkunW Aug 26, 2024
465445b
Enable pipeline parallelism
XkunW Aug 27, 2024
90a4a1c
Add error handling and append FAILED reason
XkunW Aug 27, 2024
dcb7d1c
Retrieve URL for every model instance
XkunW Aug 27, 2024
d317a25
Change local import structure to package mode, ignore built files
XkunW Aug 27, 2024
406bba6
Remove old config files, update README
XkunW Aug 27, 2024
f9248f4
Add comment on where to find the slurm logs
XkunW Aug 28, 2024
254df3b
Update installation commands in Dockerfile
XkunW Aug 29, 2024
7 changes: 5 additions & 2 deletions .gitignore
@@ -142,7 +142,7 @@ dmypy.json
*.err

# Server url files
.vLLM*
*_url

logs/

@@ -151,4 +151,7 @@ slurm/
scripts/

# vLLM bug reporting files
collect_env.py
collect_env.py

# build files
dist/
12 changes: 3 additions & 9 deletions Dockerfile
@@ -53,20 +53,14 @@ RUN python3.10 -m pip install --upgrade pip
# Install Poetry using Python 3.10
RUN python3.10 -m pip install poetry

# Clone the repository
RUN git clone https://github.com/VectorInstitute/vector-inference /vec-inf

# Set the working directory
WORKDIR /vec-inf

# Configure Poetry to not create virtual environments
# Don't create venv
RUN poetry config virtualenvs.create false

# Update Poetry lock file if necessary
RUN poetry lock

# Install project dependencies via Poetry
RUN poetry install
# Install vec-inf
RUN python3.10 -m pip install vec-inf[dev]

# Install Flash Attention 2 backend
RUN python3.10 -m pip install flash-attn --no-build-isolation
76 changes: 36 additions & 40 deletions README.md
@@ -1,62 +1,58 @@
# Vector Inference: Easy inference on Slurm clusters
This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the config files in the `models` folder and the environment variables in the model launching scripts accordingly.
This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](vec-inf/launch_server.sh), [`vllm.slurm`](vec-inf/vllm.slurm), [`multinode_vllm.slurm`](vec-inf/multinode_vllm.slurm) and [`models.csv`](vec-inf/models/models.csv) accordingly.

## Installation
If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, you can go to the next section as we have a default container environment in place. Otherwise, you might need up to 10GB of storage to setup your own virtual environment. The following steps needs to be run only once for each user.

1. Setup the virtual environment for running inference servers, run
If you are using the Vector cluster environment and you don't need any customization to the inference server environment, run the following to install the package:
```bash
bash venv.sh
pip install vec-inf
```
More details can be found in [venv.sh](venv.sh), make sure to adjust the commands to your environment if you're not using the Vector cluster.
Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package.

2. Locate your virtual environment by running
## Launch an inference server
We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
```bash
poetry env info --path
vec-inf launch Meta-Llama-3.1-8B-Instruct
```
You should see an output like the following:

1. OPTIONAL: It is recommended to enable [FlashAttention](https://github.com/Dao-AILab/flash-attention) backend for better performance, run the following commands inside your environment to install:
```bash
pip install wheel

# Change the path according to your environment, this is an example for the Vector cluster
export CUDA_HOME=/pkgs/cuda-12.3
<img width="450" alt="launch_img" src="https://github.com/user-attachments/assets/557eb421-47db-4810-bccd-c49c526b1b43">

pip install flash-attn --no-build-isolation
pip install vllm-flash-attn
```
The model will be launched using the [default parameters](vec-inf/models/models.csv); you can override these values by providing additional options (use `--help` to see the full list).
If you'd like to see the Slurm logs, they are located in the `.vec-inf-logs` folder in your home directory. The log folder path can be modified by using the `--log-dir` option.
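For instance, here is a hypothetical launch that overrides the log directory (the `--log-dir` option is mentioned above; the names of other override options should be confirmed with `--help`):
```bash
# Launch with a custom Slurm log directory instead of the default ~/.vec-inf-logs.
# Other defaults from models.csv can be overridden in the same way; see
# `vec-inf launch --help` for the exact option names.
vec-inf launch Meta-Llama-3.1-8B-Instruct --log-dir "$HOME/vec-inf-logs"
```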

## Launch an inference server
We will use the Llama 3 model as example, to launch an inference server for Llama 3 8B, run
You can check the inference server status by providing the Slurm job ID to the `status` command:
```bash
bash src/launch_server.sh --model-family llama3
vec-inf status 13014393
```

You should see an output like the following:
> Job Name: vLLM/Meta-Llama-3-8B
>
> Partition: a40
>
> Generic Resource Scheduling: gpu:1
>
> Data Type: auto
>
> Submitted batch job 12217446

If you want to use your own virtual environment, you can run this instead:
```bash
bash src/launch_server.sh --model-family llama3 --venv $(poetry env info --path)
```
By default, the `launch_server.sh` script is set to use the 8B variant for Llama 3 based on the config file in `models/llama3` folder, you can switch to other variants with the `--model-variant` argument, and make sure to change the requested resource accordingly. More information about the flags and customizations can be found in the [`models`](models) folder. The inference server is compatible with the OpenAI `Completion` and `ChatCompletion` API. You can inspect the Slurm output files to check the inference server status.

Here is a more complicated example that launches a model variant using multiple nodes, say we want to launch Mixtral 8x22B, run
<img width="450" alt="status_img" src="https://github.com/user-attachments/assets/7385b9ca-9159-4ca9-bae2-7e26d80d9747">

There are 5 possible states:

* **PENDING**: Job submitted to Slurm, but not executed yet. Job pending reason will be shown.
* **LAUNCHING**: Job is running but the server is not ready yet.
* **READY**: Inference server running and ready to take requests.
* **FAILED**: Inference server in an unhealthy state. Job failed reason will be shown.
* **SHUTDOWN**: Inference server is shutdown/cancelled.

Note that the base URL is only available when the model is in the `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.
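For example, a sketch assuming the `status` command accepts the same `--log-dir` option as `launch`:
```bash
# Check a server that was launched with a non-default log directory;
# the job ID is the example one used above.
vec-inf status 13014393 --log-dir "$HOME/vec-inf-logs"
```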

Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
```bash
bash src/launch_server.sh --model-family mixtral --model-variant 8x22B-v0.1 --num-nodes 2 --num-gpus 4
vec-inf shutdown 13014393

> Shutting down model with Slurm Job ID: 13014393
```

And for launching a multimodal model, here is an example for launching LLaVa-NEXT Mistral 7B (default variant)
You can view the full list of available models by running the `list` command:
```bash
bash src/launch_server.sh --model-family llava-next --is-vlm
vec-inf list
```
<img width="1200" alt="list_img" src="https://github.com/user-attachments/assets/a4f0d896-989d-43bf-82a2-6a6e5d0d288f">

The `launch`, `list`, and `status` commands support `--json-mode`, where the command output is structured as a JSON string.
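For example, a sketch of how the JSON output could be used for scripting (the exact fields in the JSON string are not shown here and may differ between commands):
```bash
# Capture the launch output as JSON for later use in scripts.
vec-inf launch Meta-Llama-3.1-8B-Instruct --json-mode > launch_info.json

# Pretty-print the model list (requires jq).
vec-inf list --json-mode | jq
```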

## Send inference requests
Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
@@ -69,4 +65,4 @@ If you want to run inference from your local device, you can open a SSH tunnel t
```bash
ssh -L 8081:172.17.8.29:8081 [email protected] -N
```
The example provided above is for the vector cluster, change the variables accordingly for your environment
The last number in the URL corresponds to the GPU node (gpu029 in this case). The example provided above is for the Vector cluster; change the variables accordingly for your environment.
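With the tunnel above in place, requests can then be sent from your local device to `localhost:8081`. Below is a minimal sketch using `curl` against the OpenAI-compatible completions endpoint; the port, model name, prompt, and token limit are example values, so substitute the base URL and served model name reported for your own server:
```bash
# Send a completion request through the SSH tunnel to the OpenAI-compatible server.
curl http://localhost:8081/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "prompt": "The quick brown fox",
        "max_tokens": 20
      }'
```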
48 changes: 0 additions & 48 deletions models/README.md

This file was deleted.

5 changes: 0 additions & 5 deletions models/codellama/config.sh

This file was deleted.

5 changes: 0 additions & 5 deletions models/command-r/config.sh

This file was deleted.

5 changes: 0 additions & 5 deletions models/dbrx/config.sh

This file was deleted.

5 changes: 0 additions & 5 deletions models/llama2/config.sh

This file was deleted.

5 changes: 0 additions & 5 deletions models/llama3/config.sh

This file was deleted.

23 changes: 0 additions & 23 deletions models/llava-1.5/chat_template.jinja

This file was deleted.

10 changes: 0 additions & 10 deletions models/llava-1.5/config.sh

This file was deleted.

23 changes: 0 additions & 23 deletions models/llava-next/chat_template.jinja

This file was deleted.

10 changes: 0 additions & 10 deletions models/llava-next/config.sh

This file was deleted.

7 changes: 0 additions & 7 deletions models/mistral/README.md

This file was deleted.

5 changes: 0 additions & 5 deletions models/mistral/config.sh

This file was deleted.

5 changes: 0 additions & 5 deletions models/mixtral/config.sh

This file was deleted.
