feature: integrate with kwok to simulate mock GPU/NPU nodes
Signed-off-by: jessestutler <[email protected]>
1 parent 24ca00b, commit 0ccda23
Showing 4 changed files with 195 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# How to use kwok to create fake nodes to simulate scheduling | ||
## Background | ||
- In production, we may hit bugs related to GPU/NPU scheduling and need to reproduce them to locate the cause. Often, however, no GPU/NPU test environment is available. Creating fake nodes lets us simulate scheduling, both to validate features involving GPU/NPU/[other extended resources] scheduling and to reproduce and locate scheduler bugs. | ||
- Simulated scheduling also helps with performance testing of the scheduler, especially for large-scale clusters, because kwok can create many fake nodes with very few resources. | ||
## Usage | ||
There are two scripts in the `hack/kwok` directory: one installs kwok, and the other creates a specified number of fake nodes. You need an existing Kubernetes cluster before running them. | ||
### Install kwok | ||
Run the script `install-kwok.sh` under `hack/kwok` to deploy kwok, its CRDs, and the default stage CRs. | ||
### Create fake nodes | ||
The script `create-fake-node.sh` under `hack/kwok` creates example fake nodes. By default, running `./create-fake-node.sh` creates one fake node in the Ready stage with 32 CPU cores, 256Gi of memory, and capacity for 110 pods. You can pass command-line parameters to set a specific amount of CPU, memory, pods, and even extended resources such as GPU/NPU, or to create multiple fake nodes. For example: | ||
```shell | ||
# Create 10 fake nodes, each with 4 CPUs, 8Gi of memory, and the extended resources volcano.sh/gpu-number=4 and volcano.sh/gpu-memory=20 | ||
./create-fake-node.sh -n 10 -c 4 -m 8Gi -e volcano.sh/gpu-number=4,volcano.sh/gpu-memory=20 | ||
``` | ||
Run `./create-fake-node.sh -h` for details on the command-line parameters: | ||
```shell | ||
Usage: ./create-fake-node.sh [options] | ||
-n NODE_COUNT Number of nodes to create (default: 1) | ||
-b BASE_NODE_NAME Base name for nodes (default: kwok-node) | ||
-c CPU Amount of CPU resources that can be allocated (default: 32) | ||
-m MEMORY Amount of memory resources that can be allocated (default: 256Gi) | ||
-p PODS Number of pods can be allocated (default: 110) | ||
-e EXTENDED_RESOURCES Pairs of amount of extended resources that can be allocated, e.g., 'gpu=1,npu=2' | ||
-h Display this help message | ||
``` | ||
### Deploy fake pods | ||
Under `hack/kwok/examples`, there is an example Deployment YAML that creates a fake pod requesting 2 CPU cores and 4Gi of memory. You can use it as a template when writing your own workloads. |
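Review note: after running `create-fake-node.sh`, it can help to confirm that the expected nodes exist and are Ready. The following is a minimal sketch, assuming the `${BASE_NODE_NAME}-${i}` naming scheme used by the script in this commit. The helper names `expected_node_names` and `check_fake_nodes` are hypothetical and not part of the commit, and the 60-second timeout is an arbitrary choice.

```shell
# Hypothetical helper (not part of the commit): print the node names that
# create-fake-node.sh is expected to create, i.e. <base>-0 .. <base>-(count-1).
expected_node_names() {
  local base="${1:-kwok-node}"
  local count="${2:-1}"
  local i=0
  while [ "$i" -lt "$count" ]; do
    echo "${base}-${i}"
    i=$((i + 1))
  done
}

# Hypothetical helper: wait for each expected node to report Ready.
# Requires kubectl access to the cluster; the timeout is an assumption.
check_fake_nodes() {
  local name
  local failed=0
  for name in $(expected_node_names "$@"); do
    if ! kubectl wait --for=condition=Ready "node/${name}" --timeout=60s >/dev/null 2>&1; then
      echo "node not Ready: ${name}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `check_fake_nodes kwok-node 10` checks `kwok-node-0` through `kwok-node-9`, and `kubectl get nodes -l type=kwok` lists every node carrying the `type: kwok` label set in the manifest.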
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
#!/bin/bash | ||
|
||
# Copyright 2024 The Volcano Authors. | ||
|
||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
|
||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# default parameters | ||
NODE_COUNT=1 | ||
CPU="32" | ||
MEMORY="256Gi" | ||
PODS="110" | ||
EXTENDED_RESOURCES="" | ||
BASE_NODE_NAME="kwok-node" | ||
|
||
# parse command line parameters | ||
while getopts ":n:b:c:m:p:e:h" opt; do | ||
  case $opt in | ||
    n) NODE_COUNT="$OPTARG" ;; | ||
    b) BASE_NODE_NAME="$OPTARG" ;; | ||
    c) CPU="$OPTARG" ;; | ||
    m) MEMORY="$OPTARG" ;; | ||
    p) PODS="$OPTARG" ;; | ||
    e) EXTENDED_RESOURCES="$OPTARG" ;; | ||
    h) | ||
      echo "Usage: $0 [options]" | ||
      echo " -n NODE_COUNT Number of nodes to create (default: 1)" | ||
      echo " -b BASE_NODE_NAME Base name for nodes (default: kwok-node)" | ||
      echo " -c CPU Amount of CPU resources that can be allocated (default: 32)" | ||
      echo " -m MEMORY Amount of memory resources that can be allocated (default: 256Gi)" | ||
      echo " -p PODS Number of pods can be allocated (default: 110)" | ||
      echo " -e EXTENDED_RESOURCES Pairs of amount of extended resources that can be allocated, e.g., 'gpu=1,npu=2'" | ||
      echo " -h Display this help message" | ||
      exit 0 | ||
      ;; | ||
    \?) | ||
      echo "Invalid option: -$OPTARG" >&2 | ||
      exit 1 | ||
      ;; | ||
    # The leading ':' in the optstring makes getopts report a missing argument as ':' | ||
    :) | ||
      echo "Option -$OPTARG requires an argument" >&2 | ||
      exit 1 | ||
      ;; | ||
  esac | ||
done | ||
|
||
# parse extended resources if have | ||
parse_extended_resources(){ | ||
  local resources=$1 | ||
  local result="" | ||
  if [[ -n "$resources" ]]; then | ||
    IFS=',' read -ra PAIRS <<< "$resources" | ||
    for pair in "${PAIRS[@]}"; do | ||
      IFS='=' read -ra KV <<< "$pair" | ||
      key="${KV[0]}" | ||
      value="${KV[1]}" | ||
      result="${result} | ||
    $key: $value" | ||
    done | ||
  fi | ||
  echo "$result" | ||
} | ||
|
||
EXTENDED_RESOURCES_YAML=$(parse_extended_resources "$EXTENDED_RESOURCES") | ||
|
||
# create kwok fake nodes | ||
for ((i=0; i<NODE_COUNT; i++)) | ||
do | ||
  NODE_NAME="${BASE_NODE_NAME}-${i}" | ||
  kubectl apply -f - <<EOF | ||
apiVersion: v1 | ||
kind: Node | ||
metadata: | ||
  annotations: | ||
    node.alpha.kubernetes.io/ttl: "0" | ||
    kwok.x-k8s.io/node: fake | ||
  labels: | ||
    beta.kubernetes.io/arch: amd64 | ||
    beta.kubernetes.io/os: linux | ||
    kubernetes.io/arch: amd64 | ||
    kubernetes.io/hostname: $NODE_NAME | ||
    kubernetes.io/os: linux | ||
    kubernetes.io/role: agent | ||
    node-role.kubernetes.io/agent: "" | ||
    type: kwok | ||
  name: $NODE_NAME | ||
spec: | ||
  taints: | ||
  - effect: NoSchedule | ||
    key: kwok.x-k8s.io/node | ||
    value: fake | ||
status: | ||
  allocatable: | ||
    cpu: $CPU | ||
    memory: $MEMORY | ||
    pods: $PODS$EXTENDED_RESOURCES_YAML | ||
  capacity: | ||
    cpu: $CPU | ||
    memory: $MEMORY | ||
    pods: $PODS$EXTENDED_RESOURCES_YAML | ||
  nodeInfo: | ||
    architecture: amd64 | ||
    bootID: "" | ||
    containerRuntimeVersion: "" | ||
    kernelVersion: "" | ||
    kubeProxyVersion: fake | ||
    kubeletVersion: fake | ||
    machineID: "" | ||
    operatingSystem: linux | ||
    osImage: "" | ||
    systemUUID: "" | ||
  phase: Running | ||
EOF | ||
done |
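Review note: `parse_extended_resources` turns each `key=value` pair into a YAML line without checking its shape, so an argument such as `-e gpu` or `-e gpu=1,,npu=2` would produce a manifest line with an empty key or value. The following is a minimal sketch of a pre-check that could run before the YAML is built. The function name `validate_extended_resources` is hypothetical and not part of the commit, and rejecting pairs that contain more than one `=` is an assumption about the intended input format.

```shell
# Hypothetical pre-check (not part of the commit): succeed only if every
# comma-separated entry has the form key=value with a non-empty key and value.
validate_extended_resources() {
  local rest="$1"
  local pair
  while [ -n "$rest" ]; do
    # Take the text before the first comma, then drop it from the remainder.
    pair="${rest%%,*}"
    if [ "$pair" = "$rest" ]; then
      rest=""
    else
      rest="${rest#*,}"
    fi
    case "$pair" in
      *=*=*)
        # Assumption: more than one '=' in a pair is treated as a typo.
        echo "invalid extended resource pair: '$pair'" >&2
        return 1
        ;;
      ?*=?*)
        ;;
      *)
        echo "invalid extended resource pair: '$pair'" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

In the script this could be called as `validate_extended_resources "$EXTENDED_RESOURCES" || exit 1` right after option parsing. An empty argument is accepted, which matches the script's default of no extended resources.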
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
metadata: | ||
  name: fake-deployment | ||
  namespace: default | ||
spec: | ||
  replicas: 1 | ||
  selector: | ||
    matchLabels: | ||
      app: fake-deployment | ||
  template: | ||
    metadata: | ||
      labels: | ||
        app: fake-deployment | ||
    spec: | ||
      schedulerName: volcano | ||
      tolerations: | ||
        - key: "kwok.x-k8s.io/node" | ||
          operator: "Exists" | ||
          effect: "NoSchedule" | ||
      containers: | ||
        - name: fake-container | ||
          image: fake-image | ||
          resources: | ||
            requests: | ||
              cpu: 2 | ||
              memory: 4Gi |
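Review note: the example Deployment requests only CPU and memory, so it does not exercise the extended resources that `create-fake-node.sh -e` can add. The following is a minimal sketch of a variant that also asks for `volcano.sh/gpu-number`, assuming fake nodes were created with that resource as in the README example. The function name `render_gpu_deployment`, the default Deployment name, and the default GPU count are hypothetical. The extended resource is placed under `limits` because Kubernetes requires a limit for extended resources.

```shell
# Hypothetical helper (not part of the commit): print a Deployment manifest
# based on the example in this commit, with an extended resource limit added.
render_gpu_deployment() {
  local name="${1:-fake-gpu-deployment}"
  local gpus="${2:-1}"
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ${name}
  template:
    metadata:
      labels:
        app: ${name}
    spec:
      schedulerName: volcano
      tolerations:
        - key: "kwok.x-k8s.io/node"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: fake-container
          image: fake-image
          resources:
            requests:
              cpu: 2
              memory: 4Gi
            limits:
              volcano.sh/gpu-number: ${gpus}
EOF
}
```

For example, `render_gpu_deployment fake-gpu-deployment 2 | kubectl apply -f -` submits a pod that should only fit on a fake node advertising at least 2 units of `volcano.sh/gpu-number`.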
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
#!/bin/bash | ||
|
||
# Copyright 2024 The Volcano Authors. | ||
|
||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
|
||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
KWOK_REPO=kubernetes-sigs/kwok | ||
KWOK_LATEST_RELEASE=$(curl "https://api.github.com/repos/${KWOK_REPO}/releases/latest" | jq -r '.tag_name') | ||
# Deploy kwok and set up CRDs | ||
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/kwok.yaml" | ||
# Set up default CRs of stages | ||
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/stage-fast.yaml" |
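Review note: if the GitHub API call fails or is rate limited, `jq -r '.tag_name'` prints `null` and both `kubectl apply` calls receive a URL that does not exist. The following is a minimal sketch of a guard, assuming the same release URL layout as the script. The function name `kwok_manifest_url` is hypothetical and not part of the commit.

```shell
# Hypothetical helper (not part of the commit): build the download URL for a
# kwok release asset, refusing an empty or "null" tag.
kwok_manifest_url() {
  local repo="$1"
  local tag="$2"
  local file="$3"
  if [ -z "$tag" ] || [ "$tag" = "null" ]; then
    echo "could not determine the kwok release tag" >&2
    return 1
  fi
  echo "https://github.com/${repo}/releases/download/${tag}/${file}"
}
```

In the install script this could be used as `url=$(kwok_manifest_url "$KWOK_REPO" "$KWOK_LATEST_RELEASE" kwok.yaml) || exit 1` followed by `kubectl apply -f "$url"`. Passing `-fsSL` to `curl` would also make an HTTP error fail the lookup instead of producing an error body for `jq` to parse.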