[Serving Catalog] Add llama3.1-405b vLLM GKE support with LWS #11
base: main
@@ -0,0 +1,5 @@
# LeaderWorkerSet (lws)

To run the workloads in this directory, you will need to install the lws controller. Instructions can be found here:

https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md
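As a convenience, here is a minimal install sketch; the release tag is a placeholder, so check the install doc linked above for the currently recommended version and method.

```
# Sketch only: substitute a real release tag from the kubernetes-sigs/lws releases page.
LWS_VERSION=<release-tag>
kubectl apply --server-side -f "https://github.com/kubernetes-sigs/lws/releases/download/${LWS_VERSION}/manifests.yaml"
```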
@@ -0,0 +1,6 @@
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- leaderworkerset.yaml
@@ -0,0 +1,22 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: multihost-base
spec:
  leaderWorkerTemplate:
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          app: multihost-inference-server
          role: leader
      spec:
        containers:
        - name: multihost-leader-base
    workerTemplate:
      metadata:
        labels:
          app: multihost-inference-server
      spec:
        containers:
        - name: multihost-worker-base
@@ -0,0 +1,29 @@ | ||
# Kustomize lacks easy support for strategic patch merge for CRDs | ||
# Leader | ||
- op: add | ||
path: /spec/leaderWorkerTemplate/leaderTemplate/spec/nodeSelector | ||
value: | ||
cloud.google.com/gke-accelerator: nvidia-h100-80gb | ||
- op: add | ||
path: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/resources | ||
value: | ||
limits: | ||
nvidia.com/gpu: "8" | ||
memory: 1770Gi | ||
ephemeral-storage: 800Gi | ||
requests: | ||
ephemeral-storage: 800Gi | ||
cpu: 125 | ||
- op: add | ||
path: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/env/- | ||
value: | ||
name: TENSOR_PARALLEL_SIZE | ||
value: "8" | ||
|
||
# Worker | ||
- op: copy | ||
from: /spec/leaderWorkerTemplate/leaderTemplate/spec/nodeSelector | ||
path: /spec/leaderWorkerTemplate/workerTemplate/spec/nodeSelector | ||
- op: copy | ||
from: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/resources | ||
path: /spec/leaderWorkerTemplate/workerTemplate/spec/containers/0/resources |
@@ -0,0 +1,8 @@ | ||
# kustomization.yaml | ||
apiVersion: kustomize.config.k8s.io/v1alpha1 | ||
kind: Component | ||
|
||
patches: | ||
- target: | ||
kind: LeaderWorkerSet | ||
path: h100.patch.yaml |
@@ -0,0 +1,30 @@ | ||
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: vllm-multihost-config | ||
data: | ||
ray_status_check.sh: |- | ||
#!/usr/bin/bash -x | ||
# Verify ray head status | ||
until ray status --address $LWS_LEADER_ADDRESS:6380; do | ||
sleep 5; | ||
done | ||
entrypoint.sh: |- | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jjk-g can we propose to vllm to create containers that embeds this logic? we can create an issue on vllm repo. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 vllm support for creating Ray cluster via their container is preferred There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Issue has been created on the vllm repo vllm-project/vllm#8302 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since the issue hasn't been addressed, would it be better to merge this PR right now, and modify it later once a multihost vllm image is created? |
||
#!/usr/bin/bash -x | ||
# Launch vLLM Inference server | ||
|
||
export PYTHONPATH="/workspace/" | ||
if [[ -n "$1" ]]; then | ||
ray start --head --port=6380 | ||
num_accelerators=`python3 -c 'import ray; ray.init(); print(int(sum([ray.cluster_resources().get("GPU", 0), ray.cluster_resources().get("TPU", 0)])))'` | ||
total_accelerators=$(($TENSOR_PARALLEL_SIZE * $PIPELINE_PARALLEL_SIZE )) | ||
until [ $num_accelerators -eq $total_accelerators ]; do | ||
num_accelerators=`python3 -c 'import ray; ray.init(); print(int(sum([ray.cluster_resources().get("GPU", 0), ray.cluster_resources().get("TPU", 0)])))'` | ||
sleep 5 | ||
done | ||
python3 -m vllm.entrypoints.openai.api_server --port 8080 --model $MODEL_ID --tensor_parallel_size $TENSOR_PARALLEL_SIZE --pipeline_parallel_size $PIPELINE_PARALLEL_SIZE | ||
else | ||
until ray start --address="$LWS_LEADER_ADDRESS":6380 --block; do | ||
sleep 5 | ||
done | ||
fi |
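These scripts drive both roles: the leader is invoked with an argument (`--head` in the LeaderWorkerSet manifest later in this PR), starts the Ray head on port 6380, waits until the Ray cluster reports TENSOR_PARALLEL_SIZE × PIPELINE_PARALLEL_SIZE accelerators, and then launches the OpenAI-compatible server on port 8080; workers simply join the Ray cluster. One hedged way to confirm the Ray cluster has formed is to run `ray status` inside the leader pod (the pod name below is a placeholder):

```
# List the leader pod, then substitute its name below.
kubectl get pods -l role=leader
kubectl exec -it <leader-pod-name> -- ray status --address localhost:6380
```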
@@ -0,0 +1,13 @@ | ||
# kustomization.yaml | ||
apiVersion: kustomize.config.k8s.io/v1beta1 | ||
kind: Kustomization | ||
|
||
resources: | ||
- ../../base | ||
- service.yaml | ||
- configmap.yaml | ||
|
||
patches: | ||
- path: leaderworkerset.patch.yaml | ||
target: | ||
kind: LeaderWorkerSet |
@@ -0,0 +1,89 @@ | ||
apiVersion: leaderworkerset.x-k8s.io/v1 | ||
kind: LeaderWorkerSet | ||
metadata: | ||
name: vllm-multihost-base | ||
spec: | ||
leaderWorkerTemplate: | ||
restartPolicy: RecreateGroupOnPodRestart | ||
leaderTemplate: | ||
spec: | ||
containers: | ||
- name: inference-server-leader | ||
image: vllm/vllm-openai:latest | ||
command: | ||
- /scripts/entrypoint.sh | ||
args: ["--head"] | ||
env: | ||
- name: HUGGING_FACE_HUB_TOKEN | ||
valueFrom: | ||
secretKeyRef: | ||
name: hf-secret | ||
key: hf_api_token | ||
- name: MODEL_ID | ||
valueFrom: | ||
configMapKeyRef: | ||
name: vllm-multihost-config | ||
key: model_id | ||
- name: PIPELINE_PARALLEL_SIZE | ||
value: $(LWS_GROUP_SIZE) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should the user use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if I understand the question correctly. PIPELINE_PARALLEL_SIZE corresponds to the number of nodes that the model is deployed in, so it is the same value as The value of |
||
volumeMounts: | ||
- mountPath: "/scripts" | ||
name: scripts-volume | ||
readOnly: true | ||
- mountPath: /dev/shm | ||
name: dshm | ||
readinessProbe: | ||
tcpSocket: | ||
port: 8080 | ||
initialDelaySeconds: 15 | ||
periodSeconds: 10 | ||
failureThreshold: 60 | ||
ports: | ||
- containerPort: 8080 | ||
name: metrics | ||
volumes: | ||
- name: scripts-volume | ||
configMap: | ||
defaultMode: 0700 | ||
name: vllm-multihost-config | ||
- name: dshm | ||
emptyDir: | ||
medium: Memory | ||
sizeLimit: 30Gi | ||
workerTemplate: | ||
spec: | ||
initContainers: | ||
- name: ray-head-check | ||
image: vllm/vllm-openai:latest | ||
command: | ||
- /scripts/ray_status_check.sh | ||
volumeMounts: | ||
- mountPath: "/scripts" | ||
name: scripts-volume | ||
readOnly: true | ||
containers: | ||
- name: inference-server-worker | ||
image: vllm/vllm-openai:latest | ||
command: | ||
- /scripts/entrypoint.sh | ||
env: | ||
- name: HUGGING_FACE_HUB_TOKEN | ||
valueFrom: | ||
secretKeyRef: | ||
name: hf-secret | ||
key: hf_api_token | ||
volumeMounts: | ||
- mountPath: "/scripts" | ||
name: scripts-volume | ||
readOnly: true | ||
- mountPath: /dev/shm | ||
name: dshm | ||
volumes: | ||
- name: scripts-volume | ||
configMap: | ||
defaultMode: 0700 | ||
name: vllm-multihost-config | ||
- name: dshm | ||
emptyDir: | ||
medium: Memory | ||
sizeLimit: 30Gi |
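Both the leader and worker containers read `HUGGING_FACE_HUB_TOKEN` from a Secret named `hf-secret` (key `hf_api_token`), so that Secret has to exist in the target namespace before the group starts. A minimal sketch, with a placeholder token value:

```
# Placeholder value: use a Hugging Face token with access to the gated Llama 3.1 weights.
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=<your-hugging-face-token>
```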
@@ -0,0 +1,13 @@ | ||
apiVersion: v1 | ||
kind: Service | ||
metadata: | ||
name: vllm-leader | ||
spec: | ||
ports: | ||
- name: http | ||
port: 8080 | ||
protocol: TCP | ||
targetPort: 8080 | ||
selector: | ||
role: leader | ||
type: ClusterIP |
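The Service fronts the leader's OpenAI-compatible endpoint on port 8080 (the GKE overlay further down renames it to `llama3-1-405b-vllm-service`). A hedged smoke test from a workstation, assuming vLLM's standard `/v1/completions` route and the model id configured in the overlay; prompt and max_tokens are arbitrary example values:

```
# Use the renamed Service instead if you deployed the GKE overlay.
kubectl port-forward svc/vllm-leader 8080:8080 &
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-405B-Instruct", "prompt": "San Francisco is a", "max_tokens": 32}'
```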
@@ -0,0 +1,14 @@ | ||
# Llama3.1-405b | ||
|
||
## Configuration | ||
| Kind | Model Server | Model | Provider | Accelerator | | ||
| --- | --- | --- | --- | --- | | ||
| Deployment | vLLM | llama3.1-405b-it | GKE | GPU H100 | | ||
|
||
## Usage | ||
|
||
The template can be deployed with the following commands: | ||
|
||
``` | ||
kustomize build core/lws/vllm/llama3.1-405b-it/gke | kubectl apply -f - | ||
``` |
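Expect the rollout to take a while: with a group size of 2 (see the leaderworkerset patch below), each replica schedules one leader and one worker pod, each requesting 8 H100 GPUs, and the leader only reports Ready once the 405B weights are loaded and the vLLM server is listening on port 8080. A hedged way to watch progress:

```
# Watch the group and pods come up; the leader's readinessProbe gates on port 8080.
kubectl get leaderworkersets
kubectl get pods -w
kubectl logs -f -l role=leader
```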
@@ -0,0 +1,28 @@ | ||
# kustomization.yaml | ||
apiVersion: kustomize.config.k8s.io/v1beta1 | ||
kind: Kustomization | ||
|
||
resources: | ||
- ../../base | ||
|
||
components: | ||
- ../../../components/gke/resources/gpu/8-H100 | ||
|
||
patches: | ||
- path: leaderworkerset.patch.yaml | ||
target: | ||
kind: LeaderWorkerSet | ||
options: | ||
allowNameChange: true | ||
- target: | ||
kind: Service | ||
patch: |- | ||
- op: replace | ||
path: /metadata/name | ||
value: llama3-1-405b-vllm-service | ||
|
||
configMapGenerator: | ||
- name: vllm-multihost-config | ||
behavior: merge | ||
literals: | ||
- model_id="meta-llama/Meta-Llama-3.1-405B-Instruct" |
@@ -0,0 +1,17 @@ | ||
apiVersion: leaderworkerset.x-k8s.io/v1 | ||
kind: LeaderWorkerSet | ||
metadata: | ||
name: llama3-1-405b-it-lws | ||
spec: | ||
leaderWorkerTemplate: | ||
size: 2 | ||
leaderTemplate: | ||
metadata: | ||
labels: | ||
ai.gke.io/model: llama3-1-405b-it | ||
examples.ai.gke.io/source: blueprints | ||
workerTemplate: | ||
metadata: | ||
labels: | ||
ai.gke.io/model: llama3-1-405b-it | ||
examples.ai.gke.io/source: blueprints |
Review thread (on the PR):
- Should this be the only way, or should we use StatefulSets as well so that we provide more options for folks who do not want to deploy LWS?
- Adding StatefulSets as well is outside the scope of this PR. I'll edit the title of the PR to reflect that it only covers an example using lws.