Merge pull request #18 from VectorInstitute/develop
v0.4.0
XkunW authored Nov 28, 2024
2 parents 97f22a6 + f74c4f6 commit d221dae
Showing 11 changed files with 367 additions and 143 deletions.
10 changes: 8 additions & 2 deletions Dockerfile
@@ -48,19 +48,25 @@ RUN wget https://bootstrap.pypa.io/get-pip.py && \
rm get-pip.py

# Ensure pip for Python 3.10 is used
RUN python3.10 -m pip install --upgrade pip
RUN python3.10 -m pip install --upgrade pip setuptools wheel

# Install Poetry using Python 3.10
RUN python3.10 -m pip install poetry

# Don't create venv
RUN poetry config virtualenvs.create false

# Set working directory
WORKDIR /vec-inf

# Copy current directory
COPY . /vec-inf

# Update Poetry lock file if necessary
RUN poetry lock

# Install vec-inf
RUN python3.10 -m pip install vec-inf[dev]
RUN poetry install --extras "dev"

# Install Flash Attention 2 backend
RUN python3.10 -m pip install flash-attn --no-build-isolation
27 changes: 23 additions & 4 deletions README.md
@@ -9,16 +9,23 @@ pip install vec-inf
Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package
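
For example, a minimal build of that image might look like the following (the `vec-inf` image tag is an illustrative name, not something defined by the repo):
```bash
docker build -t vec-inf .
```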

## Launch an inference server
### `launch` command
We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct
```
You should see an output like the following:

<img width="400" alt="launch_img" src="https://github.com/user-attachments/assets/557eb421-47db-4810-bccd-c49c526b1b43">
<img width="700" alt="launch_img" src="https://github.com/user-attachments/assets/ab658552-18b2-47e0-bf70-e539c3b898d5">

The model would be launched using the [default parameters](vec_inf/models/models.csv), you can override these values by providing additional options, use `--help` to see the full list. You can also launch your own customized model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), you'll need to specify all model launching related options to run a successful run.
The model will be launched using the [default parameters](vec_inf/models/models.csv); you can override these values by providing additional parameters (use `--help` to see the full list). You can also launch your own customized model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), and make sure to follow the instructions below:
* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
* Your model weights directory should contain HF format weights.
* The following launch parameters fall back to default values if not specified: `--max-num-seqs`, `--partition`, `--data-type`, `--venv`, `--log-dir`, `--model-weights-parent-dir`, `--pipeline-parallelism`, `--enforce-eager`. All other launch parameters need to be specified for custom models.
* Example for setting the model weights parent directory: `--model-weights-parent-dir /h/user_name/my_weights`.
* For other model launch parameters you can reference the default values of similar models using the [`list` command](#list-command); a hypothetical custom launch is sketched below.
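
For example, a custom model launch might look like the following (the model name, paths, and values are illustrative only; depending on your model you may need additional options, see `--help`):
```bash
vec-inf launch my-family-my-variant \
    --model-weights-parent-dir /h/user_name/my_weights \
    --num-nodes 1 \
    --num-gpus 2 \
    --max-model-len 4096 \
    --vocab-size 32000 \
    --qos m2 \
    --time 08:00:00
```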

### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
```bash
vec-inf status 13014393
@@ -38,24 +45,36 @@ There are 5 possible states:

Note that the base URL is only available when the model is in the `READY` state. If you've changed the Slurm log directory path, you also need to specify it when using the `status` command.
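
For example, assuming `status` accepts the same `--log-dir` option used at launch (the path below is illustrative):
```bash
vec-inf status 13014393 --log-dir /h/user_name/custom_logs
```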

### `metrics` command
Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
```bash
vec-inf metrics 13014393
```

You will see the performance metrics streamed to your console; note that the metrics are updated at a 10-second interval.

<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/e5ff2cd5-659b-4c88-8ebc-d8f3fdc023a4">

### `shutdown` command
Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
```bash
vec-inf shutdown 13014393

> Shutting down model with Slurm Job ID: 13014393
```

### `list` command
You can view the full list of available models by running the `list` command:
```bash
vec-inf list
```
<img width="1200" alt="list_img" src="https://github.com/user-attachments/assets/a4f0d896-989d-43bf-82a2-6a6e5d0d288f">
<img width="900" alt="list_img" src="https://github.com/user-attachments/assets/7cb2b2ac-d30c-48a8-b773-f648c27d9de2">

You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
```bash
vec-inf list Meta-Llama-3.1-70B-Instruct
```
<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/5dec7a33-ba6b-490d-af47-4cf7341d0b42">
<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/30e42ab7-dde2-4d20-85f0-187adffefc3d">

The `launch`, `list`, and `status` commands support `--json-mode`, where the command output is structured as a JSON string.
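
For example, to get the available model names as a JSON string:
```bash
vec-inf list --json-mode
```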

7 changes: 4 additions & 3 deletions pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "vec-inf"
version = "0.3.3"
version = "0.4.0"
description = "Efficient LLM inference on Slurm clusters using vLLM."
authors = ["Marshall Wang <[email protected]>"]
license = "MIT license"
@@ -11,8 +11,9 @@ python = "^3.10"
requests = "^2.31.0"
click = "^8.1.0"
rich = "^13.7.0"
pandas = "^2.2.2"
vllm = { version = "^0.5.0", optional = true }
pandas = "^1.15.0"
numpy = "^1.24.0"
vllm = { version = "^0.6.0", optional = true }
vllm-nccl-cu12 = { version = ">=2.18,<2.19", optional = true }
ray = { version = "^2.9.3", optional = true }
cupy-cuda12x = { version = "12.1.0", optional = true }
3 changes: 2 additions & 1 deletion vec_inf/README.md
@@ -1,7 +1,8 @@
# `vec-inf` Commands

* `launch`: Specify a model family and other optional parameters to launch an OpenAI compatible inference server, `--json-mode` supported. Check [`here`](./models/README.md) for complete list of available options.
* `list`: List all available model names, `--json-mode` supported.
* `list`: List all available model names, or append a supported model name to view the default configuration, `--json-mode` supported.
* `metrics`: Streams performance metrics to the console.
* `status`: Check the model status by providing its Slurm job ID, `--json-mode` supported.
* `shutdown`: Shutdown a model by providing its Slurm job ID.
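
A typical end-to-end session might look like the following (the Slurm job ID shown is illustrative):
```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct   # start an inference server
vec-inf status 13014393                     # poll until the server reports READY
vec-inf metrics 13014393                    # stream performance metrics
vec-inf shutdown 13014393                   # stop the server when finished
```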

182 changes: 151 additions & 31 deletions vec_inf/cli/_cli.py
@@ -1,9 +1,13 @@
import os
from typing import Optional
import time
from typing import Optional, cast

import click

import polars as pl
from rich.columns import Columns
from rich.console import Console
from rich.live import Live
from rich.panel import Panel

import vec_inf.cli._utils as utils
@@ -24,9 +28,19 @@ def cli():
@click.option(
"--max-model-len",
type=int,
help="Model context length. If unspecified, will be automatically derived from the model config.",
help="Model context length. Default value set based on suggested resource allocation.",
)
@click.option(
"--max-num-seqs",
type=int,
help="Maximum number of sequences to process in a single request",
)
@click.option(
"--partition",
type=str,
default="a40",
help="Type of compute partition, default to a40",
)
@click.option("--partition", type=str, help="Type of compute partition, default to a40")
@click.option(
"--num-nodes",
type=int,
@@ -40,24 +54,48 @@ def cli():
@click.option(
"--qos",
type=str,
help="Quality of service, default depends on suggested resource allocation required for the model",
help="Quality of service",
)
@click.option(
"--time",
type=str,
help="Time limit for job, this should comply with QoS, default to max walltime of the chosen QoS",
help="Time limit for job, this should comply with QoS limits",
)
@click.option(
"--vocab-size",
type=int,
help="Vocabulary size, this option is intended for custom models",
)
@click.option("--data-type", type=str, help="Model data type, default to auto")
@click.option("--venv", type=str, help="Path to virtual environment")
@click.option(
"--data-type", type=str, default="auto", help="Model data type, default to auto"
)
@click.option(
"--venv",
type=str,
default="singularity",
help="Path to virtual environment, default to preconfigured singularity container",
)
@click.option(
"--log-dir",
type=str,
help="Path to slurm log directory, default to .vec-inf-logs in home directory",
default="default",
help="Path to slurm log directory, default to .vec-inf-logs in user home directory",
)
@click.option(
"--model-weights-parent-dir",
type=str,
default="/model-weights",
help="Path to parent directory containing model weights, default to '/model-weights' for supported models",
)
@click.option(
"--pipeline-parallelism",
type=str,
help="Enable pipeline parallelism, accepts 'True' or 'False', default to 'True' for supported models",
)
@click.option(
"--enforce-eager",
type=str,
help="Always use eager-mode PyTorch, accepts 'True' or 'False', default to 'False' for custom models if not set",
)
@click.option(
"--json-mode",
@@ -69,6 +107,7 @@ def launch(
model_family: Optional[str] = None,
model_variant: Optional[str] = None,
max_model_len: Optional[int] = None,
max_num_seqs: Optional[int] = None,
partition: Optional[str] = None,
num_nodes: Optional[int] = None,
num_gpus: Optional[int] = None,
@@ -78,30 +117,40 @@ def launch(
data_type: Optional[str] = None,
venv: Optional[str] = None,
log_dir: Optional[str] = None,
model_weights_parent_dir: Optional[str] = None,
pipeline_parallelism: Optional[str] = None,
enforce_eager: Optional[str] = None,
json_mode: bool = False,
) -> None:
"""
Launch a model on the cluster
"""

if isinstance(pipeline_parallelism, str):
pipeline_parallelism = (
"True" if pipeline_parallelism.lower() == "true" else "False"
)

launch_script_path = os.path.join(
os.path.dirname(os.path.dirname(os.path.realpath(__file__))), "launch_server.sh"
)
launch_cmd = f"bash {launch_script_path}"

models_df = utils.load_models_df()

if model_name in models_df["model_name"].values:
if model_name in models_df["model_name"].to_list():
default_args = utils.load_default_args(models_df, model_name)
for arg in default_args:
if arg in locals() and locals()[arg] is not None:
default_args[arg] = locals()[arg]
renamed_arg = arg.replace("_", "-")
launch_cmd += f" --{renamed_arg} {default_args[arg]}"
else:
model_args = models_df.columns.tolist()
excluded_keys = ["model_name", "pipeline_parallelism"]
model_args = models_df.columns
model_args.remove("model_name")
model_args.remove("model_type")
for arg in model_args:
if arg not in excluded_keys and locals()[arg] is not None:
if locals()[arg] is not None:
renamed_arg = arg.replace("_", "-")
launch_cmd += f" --{renamed_arg} {locals()[arg]}"

@@ -225,40 +274,111 @@ def shutdown(slurm_job_id: int) -> None:
is_flag=True,
help="Output in JSON string",
)
def list(model_name: Optional[str] = None, json_mode: bool = False) -> None:
def list_models(model_name: Optional[str] = None, json_mode: bool = False) -> None:
"""
List all available models, or get default setup of a specific model
"""
models_df = utils.load_models_df()

if model_name:
if model_name not in models_df["model_name"].values:
def list_model(model_name: str, models_df: pl.DataFrame, json_mode: bool):
if model_name not in models_df["model_name"].to_list():
raise ValueError(f"Model name {model_name} not found in available models")

excluded_keys = {"venv", "log_dir", "pipeline_parallelism"}
model_row = models_df.loc[models_df["model_name"] == model_name]
excluded_keys = {"venv", "log_dir"}
model_row = models_df.filter(models_df["model_name"] == model_name)

if json_mode:
# click.echo(model_row.to_json(orient='records'))
filtered_model_row = model_row.drop(columns=excluded_keys, errors="ignore")
click.echo(filtered_model_row.to_json(orient="records"))
filtered_model_row = model_row.drop(excluded_keys, strict=False)
click.echo(filtered_model_row.to_dicts()[0])
return
table = utils.create_table(key_title="Model Config", value_title="Value")
for _, row in model_row.iterrows():
for row in model_row.to_dicts():
for key, value in row.items():
if key not in excluded_keys:
table.add_row(key, str(value))
CONSOLE.print(table)
return

if json_mode:
click.echo(models_df["model_name"].to_json(orient="records"))
return
panels = []
for _, row in models_df.iterrows():
styled_text = f"[magenta]{row['model_family']}[/magenta]-{row['model_variant']}"
panels.append(Panel(styled_text, expand=True))
CONSOLE.print(Columns(panels, equal=True))
def list_all(models_df: pl.DataFrame, json_mode: bool):
if json_mode:
click.echo(models_df["model_name"].to_list())
return
panels = []
model_type_colors = {
"LLM": "cyan",
"VLM": "bright_blue",
"Text Embedding": "purple",
"Reward Modeling": "bright_magenta",
}

models_df = models_df.with_columns(
pl.when(pl.col("model_type") == "LLM")
.then(0)
.when(pl.col("model_type") == "VLM")
.then(1)
.when(pl.col("model_type") == "Text Embedding")
.then(2)
.when(pl.col("model_type") == "Reward Modeling")
.then(3)
.otherwise(-1)
.alias("model_type_order")
)

models_df = models_df.sort("model_type_order")
models_df = models_df.drop("model_type_order")

for row in models_df.to_dicts():
panel_color = model_type_colors.get(row["model_type"], "white")
styled_text = (
f"[magenta]{row['model_family']}[/magenta]-{row['model_variant']}"
)
panels.append(Panel(styled_text, expand=True, border_style=panel_color))
CONSOLE.print(Columns(panels, equal=True))

models_df = utils.load_models_df()

if model_name:
list_model(model_name, models_df, json_mode)
else:
list_all(models_df, json_mode)


@cli.command("metrics")
@click.argument("slurm_job_id", type=int, nargs=1)
@click.option(
"--log-dir",
type=str,
help="Path to slurm log directory. This is required if --log-dir was set in model launch",
)
def metrics(slurm_job_id: int, log_dir: Optional[str] = None) -> None:
"""
Stream performance metrics to the console
"""
status_cmd = f"scontrol show job {slurm_job_id} --oneliner"
output = utils.run_bash_command(status_cmd)
slurm_job_name = output.split(" ")[1].split("=")[1]

with Live(refresh_per_second=1, console=CONSOLE) as live:
while True:
out_logs = utils.read_slurm_log(
slurm_job_name, slurm_job_id, "out", log_dir
)
# if out_logs is a string, then it is an error message
if isinstance(out_logs, str):
live.update(out_logs)
break
out_logs = cast(list, out_logs)
latest_metrics = utils.get_latest_metric(out_logs)
# if latest_metrics is a string, then it is an error message
if isinstance(latest_metrics, str):
live.update(latest_metrics)
break
latest_metrics = cast(dict, latest_metrics)
table = utils.create_table(key_title="Metric", value_title="Value")
for key, value in latest_metrics.items():
table.add_row(key, value)

live.update(table)

time.sleep(2)


if __name__ == "__main__":