[Serving Catalog] Add llama3.1-405b vLLM GKE support with LWS #11

Open: wants to merge 9 commits into base: main.
Changes from 4 commits
6 changes: 6 additions & 0 deletions serving-catalog/core/lws/README.md
@@ -0,0 +1,6 @@
# LeaderWorkerSet (lws)

To run the workloads in this directory, you will need to install the lws controller. Installation instructions can be found here:
@skonto commented on Oct 17, 2024:

Should this be the only way, or should we support StatefulSets as well, so that we provide more options for folks who do not want to deploy LWS?

@Edwinhr716 (Author) replied:

Adding StatefulSets as well is outside the scope of this PR. I'll edit the title of the PR to reflect that it only covers an example using lws.


https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md

Edwinhr716 marked this conversation as resolved.
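A minimal sketch of that installation, assuming a released lws version (substitute the version recommended in the linked guide):

```
VERSION=v0.4.0   # placeholder; use the version from the install guide
kubectl apply --server-side -f "https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml"
```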
6 changes: 6 additions & 0 deletions serving-catalog/core/lws/base/kustomization.yaml
@@ -0,0 +1,6 @@
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- leaderworkerset.yaml
22 changes: 22 additions & 0 deletions serving-catalog/core/lws/base/leaderworkerset.yaml
@@ -0,0 +1,22 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: multihost-base
spec:
leaderWorkerTemplate:
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
metadata:
labels:
app: multihost-inference-server
role: leader
spec:
containers:
- name: multihost-leader-base
workerTemplate:
metadata:
labels:
app: multihost-inference-server
spec:
containers:
- name: multihost-worker-base
@@ -0,0 +1,29 @@
# Kustomize lacks easy support for strategic merge patches on CRDs, so JSON6902 ops are used here
# Leader
- op: add
path: /spec/leaderWorkerTemplate/leaderTemplate/spec/nodeSelector
value:
cloud.google.com/gke-accelerator: nvidia-h100-80gb
- op: add
path: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/resources
value:
limits:
nvidia.com/gpu: "8"
memory: 1770Gi
ephemeral-storage: 800Gi
requests:
ephemeral-storage: 800Gi
cpu: 125
- op: add
path: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/env/-
value:
name: TENSOR_PARALLEL_SIZE
value: "8"

# Worker
- op: copy
from: /spec/leaderWorkerTemplate/leaderTemplate/spec/nodeSelector
path: /spec/leaderWorkerTemplate/workerTemplate/spec/nodeSelector
- op: copy
from: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/resources
path: /spec/leaderWorkerTemplate/workerTemplate/spec/containers/0/resources
@@ -0,0 +1,8 @@
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

patches:
- target:
kind: LeaderWorkerSet
path: h100.patch.yaml
31 changes: 31 additions & 0 deletions serving-catalog/core/lws/vllm/base/configmap.yaml
@@ -0,0 +1,31 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: vllm-multihost-config
data:
ray_status_check.sh: |-
#!/usr/bin/bash -x
# Verify ray head status
until ray status --address $LWS_LEADER_ADDRESS:6380; do
sleep 5;
done
entrypoint.sh: |-
A reviewer commented:

@jjk-g can we propose to vllm to create containers that embed this logic? We can create an issue on the vllm repo.

A contributor replied:

+1, vLLM support for creating the Ray cluster via their container is preferred.

@Edwinhr716 (Author) replied:

An issue has been created on the vllm repo: vllm-project/vllm#8302.

@Edwinhr716 (Author) replied:

Since the issue hasn't been addressed yet, would it be better to merge this PR now and modify it later once a multihost vLLM image is created?

#!/usr/bin/bash -x
# Launch vLLM Inference server

export PYTHONPATH="/workspace/"
if [[ -n "$1" ]]; then
ray start --head --port=6380
num_accelerators=`python3 -c 'import ray; ray.init(); print(int(sum([ray.cluster_resources().get("GPU", 0), ray.cluster_resources().get("TPU", 0)])))'`
total_accelerators=$(($TENSOR_PARALLEL_SIZE * $PIPELINE_PARALLEL_SIZE ))
until [ $num_accelerators -eq $total_accelerators ]; do
num_accelerators=`python3 -c 'import ray; ray.init(); print(int(sum([ray.cluster_resources().get("GPU", 0), ray.cluster_resources().get("TPU", 0)])))'`
sleep 5
done
python3 -m vllm.entrypoints.openai.api_server --port 8080 --model $MODEL_ID --tensor_parallel_size $TENSOR_PARALLEL_SIZE --pipeline_parallel_size $PIPELINE_PARALLEL_SIZE
else
until ray start --address="$LWS_LEADER_ADDRESS":6380 --block; do
sleep 5
done
fi

Edwinhr716 marked this conversation as resolved.
13 changes: 13 additions & 0 deletions serving-catalog/core/lws/vllm/base/kustomization.yaml
@@ -0,0 +1,13 @@
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base
- service.yaml
- configmap.yaml

patches:
- path: leaderworkerset.patch.yaml
target:
kind: LeaderWorkerSet
82 changes: 82 additions & 0 deletions serving-catalog/core/lws/vllm/base/leaderworkerset.patch.yaml
@@ -0,0 +1,82 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: vllm-multihost-base
spec:
leaderWorkerTemplate:
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
spec:
containers:
- name: inference-server-leader
image: vllm/vllm-openai:latest
command:
- /scripts/entrypoint.sh
args: ["--head"]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: hf_api_token
- name: MODEL_ID
valueFrom:
configMapKeyRef:
name: vllm-multihost-config
key: model_id
- name: PIPELINE_PARALLEL_SIZE
value: $(LWS_GROUP_SIZE)
A reviewer commented:

Should the user use leaderworkerset.sigs.k8s.io/size to inject it instead? Is there a recommended value for this specific deployment?

@Edwinhr716 (Author) replied on Oct 17, 2024:

I'm not sure I understand the question correctly. PIPELINE_PARALLEL_SIZE corresponds to the number of nodes the model is deployed on, so it is the same value as leaderworkerset.sigs.k8s.io/size.

$(LWS_GROUP_SIZE) resolves to the same value as leaderworkerset.sigs.k8s.io/size.

volumeMounts:
- mountPath: "/scripts"
name: scripts-volume
readOnly: true
- mountPath: /dev/shm
name: dshm
ports:
- containerPort: 8080
volumes:
- name: scripts-volume
configMap:
defaultMode: 0700
name: vllm-multihost-config
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 30Gi
workerTemplate:
spec:
initContainers:
- name: ray-head-check
image: vllm/vllm-openai:latest
command:
- /scripts/ray_status_check.sh
volumeMounts:
- mountPath: "/scripts"
name: scripts-volume
readOnly: true
containers:
- name: inference-server-worker
image: vllm/vllm-openai:latest
command:
- /scripts/entrypoint.sh
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: hf_api_token
volumeMounts:
- mountPath: "/scripts"
name: scripts-volume
readOnly: true
- mountPath: /dev/shm
name: dshm
volumes:
- name: scripts-volume
configMap:
defaultMode: 0700
name: vllm-multihost-config
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 30Gi
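On the review question above about leaderworkerset.sigs.k8s.io/size: a minimal sketch of injecting the group size through the downward API instead of $(LWS_GROUP_SIZE). This assumes the lws controller exposes leaderworkerset.sigs.k8s.io/size on the pod metadata as an annotation; verify the exact key, and whether it is a label or an annotation, for the installed lws version.

```
# Hypothetical alternative to PIPELINE_PARALLEL_SIZE: $(LWS_GROUP_SIZE);
# assumes the size is exposed as a pod annotation by the lws controller.
- name: PIPELINE_PARALLEL_SIZE
  valueFrom:
    fieldRef:
      fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
```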
13 changes: 13 additions & 0 deletions serving-catalog/core/lws/vllm/base/service.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: Service
metadata:
name: vllm-leader
spec:
ports:
- name: http
port: 8080
protocol: TCP
targetPort: 8080
selector:
role: leader
type: ClusterIP
14 changes: 14 additions & 0 deletions serving-catalog/core/lws/vllm/llama3-405b/gke/README.md
@@ -0,0 +1,14 @@
# Llama3.1-405b

## Configuration
| Kind | Model Server | Model | Provider | Accelerator |
| --- | --- | --- | --- | --- |
| LeaderWorkerSet | vLLM | llama3.1-405b | GKE | GPU H100 |
Edwinhr716 marked this conversation as resolved.

## Usage

The template can be deployed with the following command:

```
kustomize build core/lws/vllm/llama3-405b/gke | kubectl apply -f -
```
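
The manifests reference a Hugging Face token through a Secret named hf-secret with key hf_api_token, so that Secret must exist before deploying. A rough sketch of creating it and smoke-testing the server through the leader Service (this kustomization renames the Service to llama3-405b-vllm-service; the token value is a placeholder):

```
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=<YOUR_HF_TOKEN>

kubectl port-forward svc/llama3-405b-vllm-service 8080:8080

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-405B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```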
28 changes: 28 additions & 0 deletions serving-catalog/core/lws/vllm/llama3-405b/gke/kustomization.yaml
@@ -0,0 +1,28 @@
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base

components:
- ../../../components/gke/resources/gpu/8-H100

patches:
- path: leaderworkerset.patch.yaml
target:
kind: LeaderWorkerSet
options:
allowNameChange: true
- target:
kind: Service
patch: |-
- op: replace
path: /metadata/name
value: llama3-405b-vllm-service

configMapGenerator:
- name: vllm-multihost-config
behavior: merge
literals:
- model_id="meta-llama/Meta-Llama-3.1-405B-Instruct"
@@ -0,0 +1,17 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: llama3-405b-lws
spec:
leaderWorkerTemplate:
size: 2
leaderTemplate:
metadata:
labels:
ai.gke.io/model: llama3-405b
examples.ai.gke.io/source: blueprints
workerTemplate:
metadata:
labels:
ai.gke.io/model: llama3-405b
examples.ai.gke.io/source: blueprints