# [Serving Catalog] Add llama3.1-405b vLLM GKE support with LWS #11
LWS base `kustomization.yaml`:

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- leaderworkerset.yaml
```
LWS base `leaderworkerset.yaml`:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: multihost-base
spec:
  leaderWorkerTemplate:
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          app: multihost-inference-server
      spec:
        containers:
        - name: multihost-leader-base
    workerTemplate:
      metadata:
        labels:
          app: multihost-inference-server
      spec:
        containers:
        - name: multihost-worker-base
```
`h100.patch.yaml`:

```yaml
# Kustomize lacks easy support for strategic merge patches on CRDs,
# so JSON6902 operations are used instead.
# Leader
- op: add
  path: /spec/leaderWorkerTemplate/leaderTemplate/spec/nodeSelector
  value:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
- op: add
  path: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/resources
  value:
    limits:
      nvidia.com/gpu: "8"
      memory: 1770Gi
      ephemeral-storage: 800Gi
    requests:
      ephemeral-storage: 800Gi
      cpu: 125
# Worker
- op: copy
  from: /spec/leaderWorkerTemplate/leaderTemplate/spec/nodeSelector
  path: /spec/leaderWorkerTemplate/workerTemplate/spec/nodeSelector
- op: copy
  from: /spec/leaderWorkerTemplate/leaderTemplate/spec/containers/0/resources
  path: /spec/leaderWorkerTemplate/workerTemplate/spec/containers/0/resources
```
8-H100 GPU component `kustomization.yaml`:

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

patches:
- target:
    kind: LeaderWorkerSet
  path: h100.patch.yaml
```
`configmap.yaml` with the leader/worker entrypoint scripts:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-multihost-config
data:
  ray_status_check.sh: |-
    #!/usr/bin/bash -x
    # Verify ray head status
    until ray status --address $LWS_LEADER_ADDRESS:6380; do
      sleep 5;
    done
  entrypoint.sh: |-
    #!/usr/bin/bash -x
    # Launch vLLM Inference server

    export PYTHONPATH="/workspace/"
    if [[ -n "$1" ]]; then
      ray start --head --port=6380
      num_accelerators=`python3 -c 'import ray; ray.init(); print(int(sum([ray.cluster_resources().get("GPU", 0), ray.cluster_resources().get("TPU", 0)])))'`
      total_accelerators=$(($TENSOR_PARALLEL_SIZE * $PIPELINE_PARALLEL_SIZE))
      until [ $num_accelerators -eq $total_accelerators ]; do
        num_accelerators=`python3 -c 'import ray; ray.init(); print(int(sum([ray.cluster_resources().get("GPU", 0), ray.cluster_resources().get("TPU", 0)])))'`
        sleep 5
      done
      python3 -m vllm.entrypoints.openai.api_server --port 8080 --model $MODEL_ID --tensor_parallel_size $TENSOR_PARALLEL_SIZE --pipeline_parallel_size $PIPELINE_PARALLEL_SIZE
    else
      until ray start --address="$LWS_LEADER_ADDRESS":6380 --block; do
        sleep 5
      done
    fi
```
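As a quick sanity check that every host has joined the Ray cluster before the API server comes up, one option is to run `ray status` inside the leader container. A minimal sketch, assuming the default LeaderWorkerSet pod naming and the `llama3-405b-lws` name set in the GKE overlay further down; adjust the pod name to match your deployment:

```bash
# The group-0 leader pod is typically named <lws-name>-0 under LWS naming;
# llama3-405b-lws-0 is an assumption based on the GKE overlay's patch.yaml.
kubectl exec -it llama3-405b-lws-0 -c inference-server-leader -- ray status
```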
vLLM base `kustomization.yaml`:

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base
- service.yaml
- configmap.yaml

patches:
- path: leaderworkerset.yaml
  target:
    kind: LeaderWorkerSet
```
vLLM base `leaderworkerset.yaml` (patch over the LWS base):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-multihost-base
spec:
  leaderWorkerTemplate:
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: inference-server-leader
          image: vllm/vllm-openai:latest
          command:
          - /scripts/entrypoint.sh
          args: ["--head"]
          env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
          - name: MODEL_ID
            valueFrom:
              configMapKeyRef:
                name: vllm-multihost-config
                key: model_id
          - name: TENSOR_PARALLEL_SIZE
            valueFrom:
              configMapKeyRef:
                name: vllm-multihost-config
                key: tensor_parallel_size
          - name: PIPELINE_PARALLEL_SIZE
            valueFrom:
              configMapKeyRef:
                name: vllm-multihost-config
                key: pipeline_parallel_size
          volumeMounts:
          - mountPath: "/scripts"
            name: scripts-volume
            readOnly: true
          - mountPath: /dev/shm
            name: dshm
          ports:
          - containerPort: 8080
        volumes:
        - name: scripts-volume
          configMap:
            defaultMode: 0700
            name: vllm-multihost-config
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
    workerTemplate:
      spec:
        initContainers:
        - name: ray-head-check
          image: vllm/vllm-openai:latest
          command:
          - /scripts/ray_status_check.sh
          volumeMounts:
          - mountPath: "/scripts"
            name: scripts-volume
            readOnly: true
        containers:
        - name: inference-server-worker
          image: vllm/vllm-openai:latest
          command:
          - /scripts/entrypoint.sh
          env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
          - name: MODEL_ID
            valueFrom:
              configMapKeyRef:
                name: vllm-multihost-config
                key: model_id
          - name: TENSOR_PARALLEL_SIZE
            valueFrom:
              configMapKeyRef:
                name: vllm-multihost-config
                key: tensor_parallel_size
          volumeMounts:
          - mountPath: "/scripts"
            name: scripts-volume
            readOnly: true
          - mountPath: /dev/shm
            name: dshm
        volumes:
        - name: scripts-volume
          configMap:
            defaultMode: 0700
            name: vllm-multihost-config
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
```

Review discussion on the ConfigMap-mounted configuration: "Using a ConfigMap is inflating the LWS YAML; assuming we can bake the logic into the container, do we still need it?" Reply: "We don't need it, although in practice I liked defining the environment variables shared across leaders and workers in one place."
`service.yaml`:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    role: leader
  type: ClusterIP
```
Llama3.1-405b template README:

# Llama3.1-405b

## Configuration

| Kind | Model Server | Model | Provider | Accelerator |
| --- | --- | --- | --- | --- |
| LeaderWorkerSet | vLLM | llama3.1-405b | GKE | GPU H100 |

## Usage

The template can be deployed with the following command:

```
kustomize build core/lws/vllm/llama3-405b/gke | kubectl apply -f -
```
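Once the pods are ready, one way to smoke-test the server is to port-forward the leader Service and call vLLM's OpenAI-compatible completions endpoint. A minimal sketch, assuming the `llama3-405b-vllm-service` name and the model ID set in the GKE overlay below; the prompt and token count are only illustrative:

```bash
# Forward the leader Service locally (it exposes port 8080, per service.yaml).
kubectl port-forward svc/llama3-405b-vllm-service 8080:8080 &

# Query the OpenAI-compatible completions endpoint served by vLLM.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 64
      }'
```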
GKE overlay `kustomization.yaml`:

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base

components:
- ../../../components/gke/resources/gpu/8-H100

patches:
- path: patch.yaml
  target:
    kind: LeaderWorkerSet
  options:
    allowNameChange: true
- target:
    kind: Service
  patch: |-
    - op: replace
      path: /metadata/name
      value: llama3-405b-vllm-service

configMapGenerator:
- name: vllm-multihost-config
  behavior: merge
  literals:
  - model_id="meta-llama/Meta-Llama-3.1-405B-Instruct"
  - tensor_parallel_size="8"
  - pipeline_parallel_size="2"
```
GKE overlay `patch.yaml`:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama3-405b-lws
spec:
  leaderWorkerTemplate:
    size: 2
    leaderTemplate:
      metadata:
        labels:
          ai.gke.io/model: llama3-405b
    workerTemplate:
      metadata:
        labels:
          ai.gke.io/model: llama3-405b
```
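For reference on how these settings fit together: the 8-H100 component gives each pod 8 GPUs, the overlay sets `tensor_parallel_size=8` and `pipeline_parallel_size=2`, and `size: 2` yields one leader plus one worker per group, so the entrypoint's readiness loop waits for 8 × 2 = 16 GPUs across the two hosts before launching the API server.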
Review discussion on the entrypoint scripts:

"@jjk-g can we propose that vLLM create containers that embed this logic? We can create an issue on the vLLM repo."

"+1, vLLM support for creating the Ray cluster via their container is preferred."

"An issue has been created on the vLLM repo: vllm-project/vllm#8302."

"Since the issue hasn't been addressed yet, would it be better to merge this PR now and modify it later once a multihost vLLM image is created?"