feature: integrate with kwok to simulate mock GPU/NPU nodes #3830

Open
wants to merge 1 commit into
base: master
27 changes: 27 additions & 0 deletions hack/kwok/README.md
@@ -0,0 +1,27 @@
# How to use kwok to create fake nodes to simulate scheduling
## Background
- Bugs in GPU/NPU scheduling that surface in production have to be reproduced before they can be located, but a GPU/NPU test environment is often not available. Fake nodes let us simulate scheduling, both to validate features for GPU/NPU/[other extended resources] scheduling and to reproduce and locate scheduler bugs.
- Simulated scheduling also helps with performance testing of the scheduler, especially for large-scale clusters, because kwok can create many fake nodes with very few resources.
## Usage
There are two scripts in the `hack/kwok` directory: one installs kwok and the other creates a specified number of fake nodes. You need an existing Kubernetes cluster before running them.
### Install kwok
Run the `install-kwok.sh` script under `hack/kwok` to deploy kwok and install its CRDs and default stage CRs. The script uses `curl`, `jq` and `kubectl`.
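`install-kwok.sh` always applies the latest kwok release. For a reproducible test environment it may be preferable to pin a release tag. The sketch below only builds and prints the manifest URLs for a given tag; the `kwok_release_url` helper and the `KWOK_VERSION` variable are hypothetical names that are not part of the scripts in this directory, and the default tag is a placeholder to replace with a real release.

```shell
#!/bin/bash
# Hypothetical helper: build the download URL of a kwok release asset.
KWOK_REPO=kubernetes-sigs/kwok

kwok_release_url() {
  local version=$1 file=$2
  echo "https://github.com/${KWOK_REPO}/releases/download/${version}/${file}"
}

# Placeholder tag (assumption): replace with the release you want to pin.
KWOK_VERSION="${KWOK_VERSION:-v0.0.0}"

# Print the commands instead of running them, so they can be reviewed first.
for file in kwok.yaml stage-fast.yaml; do
  echo "kubectl apply -f $(kwok_release_url "$KWOK_VERSION" "$file")"
done
```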
### Create fake nodes
The `create-fake-node.sh` script under `hack/kwok` creates example fake nodes. By default, running `./create-fake-node.sh` creates one fake node in the Ready state with 32 CPU cores, 256Gi of memory and capacity for 110 pods. Command-line parameters let you set a specific amount of CPU, memory, pods and even extended resources such as GPU/NPU, or create multiple fake nodes. For example:
```shell
# create 10 fake nodes, each with 4 CPUs, 8Gi of memory and the extended resources volcano.sh/gpu-number=4 and volcano.sh/gpu-memory=20
./create-fake-node.sh -n 10 -c 4 -m 8Gi -e volcano.sh/gpu-number=4,volcano.sh/gpu-memory=20
```
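Each `key=value` pair passed to `-e` becomes one entry under both `status.allocatable` and `status.capacity` of the generated Node. A minimal sketch of that mapping, using a hypothetical `extended_resources_to_yaml` function rather than the script's own code:

```shell
#!/bin/bash
# Hypothetical helper: turn 'k1=v1,k2=v2' into one 'key: value' line per pair.
extended_resources_to_yaml() {
  local resources=$1 pair
  local -a pairs
  [[ -z "$resources" ]] && return 0
  IFS=',' read -ra pairs <<< "$resources"
  for pair in "${pairs[@]}"; do
    # Split on the first '=' only: key is the left part, value is the rest.
    echo "${pair%%=*}: ${pair#*=}"
  done
}

extended_resources_to_yaml "volcano.sh/gpu-number=4,volcano.sh/gpu-memory=20"
# prints:
# volcano.sh/gpu-number: 4
# volcano.sh/gpu-memory: 20
```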
Run `./create-fake-node.sh -h` for details of the command-line parameters:
```shell
Usage: ./create-fake-node.sh [options]
-n NODE_COUNT Number of nodes to create (default: 1)
-b BASE_NODE_NAME Base name for nodes (default: kwok-node)
-c CPU Amount of CPU resources that can be allocated (default: 32)
-m MEMORY Amount of memory resources that can be allocated (default: 256Gi)
-p PODS Number of pods can be allocated (default: 110)
-e EXTENDED_RESOURCES Pairs of amount of extended resources that can be allocated, e.g., 'gpu=1,npu=2'
-h Display this help message
```
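Node names are built by appending an index, starting at 0, to the base name, so `-n 3 -b kwok-node` yields `kwok-node-0` through `kwok-node-2`. The sketch below lists those names, which can be handy when cleaning up; `fake_node_names` is a hypothetical helper, not part of the script.

```shell
#!/bin/bash
# Hypothetical helper: list the names create-fake-node.sh gives its nodes.
fake_node_names() {
  local count=${1:-1} base=${2:-kwok-node} i
  for ((i = 0; i < count; i++)); do
    echo "${base}-${i}"
  done
}

fake_node_names 3 kwok-node
# prints:
# kwok-node-0
# kwok-node-1
# kwok-node-2
```

The output can be passed to `kubectl delete node` to remove the fake nodes, for example `kubectl delete node $(fake_node_names 3 kwok-node)`.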
### Deploy fake pods
Under `hack/kwok/examples` there is an example Deployment YAML that creates a fake pod requesting 2 CPU cores and 4Gi of memory. Use it as a template when writing your own workloads.
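To exercise GPU/NPU scheduling, the workload also has to request an extended resource advertised by the fake nodes. The sketch below renders such a Deployment to stdout. `render_fake_deployment` and its parameters are hypothetical, the resource name `volcano.sh/gpu-number` is taken from the node example above, and the toleration matches the taint that `create-fake-node.sh` puts on every fake node.

```shell
#!/bin/bash
# Hypothetical helper: print a Deployment that requests GPUs on kwok nodes.
render_fake_deployment() {
  local name=$1 replicas=$2 gpus=$3
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
  namespace: default
spec:
  replicas: ${replicas}
  selector:
    matchLabels:
      app: ${name}
  template:
    metadata:
      labels:
        app: ${name}
    spec:
      schedulerName: volcano
      tolerations:
      - key: "kwok.x-k8s.io/node"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: fake-container
        image: fake-image
        resources:
          requests:
            cpu: 2
            memory: 4Gi
          limits:
            # Extended resources are set in limits; the count is an example.
            volcano.sh/gpu-number: "${gpus}"
EOF
}

render_fake_deployment fake-gpu-deployment 2 1
```

Piping the output to `kubectl apply -f -` creates the Deployment.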
119 changes: 119 additions & 0 deletions hack/kwok/create-fake-node.sh
@@ -0,0 +1,119 @@
#!/bin/bash

# Copyright 2024 The Volcano Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# default parameters
NODE_COUNT=1
CPU="32"
MEMORY="256Gi"
PODS="110"
EXTENDED_RESOURCES=""
BASE_NODE_NAME="kwok-node"

# parse command line parameters
while getopts ":n:b:c:m:p:e:h" opt; do
  case $opt in
    n) NODE_COUNT="$OPTARG" ;;
    b) BASE_NODE_NAME="$OPTARG" ;;
    c) CPU="$OPTARG" ;;
    m) MEMORY="$OPTARG" ;;
    p) PODS="$OPTARG" ;;
    e) EXTENDED_RESOURCES="$OPTARG" ;;
    h)
      echo "Usage: $0 [options]"
      echo " -n NODE_COUNT Number of nodes to create (default: 1)"
      echo " -b BASE_NODE_NAME Base name for nodes (default: kwok-node)"
      echo " -c CPU Amount of CPU resources that can be allocated (default: 32)"
      echo " -m MEMORY Amount of memory resources that can be allocated (default: 256Gi)"
      echo " -p PODS Number of pods can be allocated (default: 110)"
      echo " -e EXTENDED_RESOURCES Pairs of amount of extended resources that can be allocated, e.g., 'gpu=1,npu=2'"
      echo " -h Display this help message"
      exit 0
      ;;
    :)
      # the leading ':' in the optstring enables silent mode, so a missing
      # option argument is reported here instead of being ignored
      echo "Option -$OPTARG requires an argument" >&2
      exit 1
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      exit 1
      ;;
  esac
done

# parse extended resources, if any
parse_extended_resources(){
  local resources=$1
  local result=""
  if [[ -n "$resources" ]]; then
    IFS=',' read -ra PAIRS <<< "$resources"
    for pair in "${PAIRS[@]}"; do
      IFS='=' read -ra KV <<< "$pair"
      key="${KV[0]}"
      value="${KV[1]}"
      # each entry starts on a new line, indented 4 spaces so that it nests
      # under status.allocatable / status.capacity in the manifest below
      result="${result}
    $key: $value"
    done
  fi
  echo "$result"
}

EXTENDED_RESOURCES_YAML=$(parse_extended_resources "$EXTENDED_RESOURCES")

# create kwok fake nodes
for ((i=0; i<NODE_COUNT; i++))
do
  NODE_NAME="${BASE_NODE_NAME}-${i}"
  kubectl apply -f - <<EOF
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    kwok.x-k8s.io/node: fake
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: $NODE_NAME
    kubernetes.io/os: linux
    kubernetes.io/role: agent
    node-role.kubernetes.io/agent: ""
    type: kwok
  name: $NODE_NAME
spec:
  taints:
  - effect: NoSchedule
    key: kwok.x-k8s.io/node
    value: fake
status:
  allocatable:
    cpu: $CPU
    memory: $MEMORY
    pods: $PODS$EXTENDED_RESOURCES_YAML
  capacity:
    cpu: $CPU
    memory: $MEMORY
    pods: $PODS$EXTENDED_RESOURCES_YAML
  nodeInfo:
    architecture: amd64
    bootID: ""
    containerRuntimeVersion: ""
    kernelVersion: ""
    kubeProxyVersion: fake
    kubeletVersion: fake
    machineID: ""
    operatingSystem: linux
    osImage: ""
    systemUUID: ""
  phase: Running
EOF
done
27 changes: 27 additions & 0 deletions hack/kwok/examples/fake-deployment.yaml
@@ -0,0 +1,27 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fake-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fake-deployment
  template:
    metadata:
      labels:
        app: fake-deployment
    spec:
      schedulerName: volcano
      tolerations:
        - key: "kwok.x-k8s.io/node"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: fake-container
          image: fake-image
          resources:
            requests:
              cpu: 2
              memory: 4Gi
22 changes: 22 additions & 0 deletions hack/kwok/install-kwok.sh
@@ -0,0 +1,22 @@
#!/bin/bash

# Copyright 2024 The Volcano Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

KWOK_REPO=kubernetes-sigs/kwok
KWOK_LATEST_RELEASE=$(curl -fsSL "https://api.github.com/repos/${KWOK_REPO}/releases/latest" | jq -r '.tag_name')
# Deploy kwok and set up CRDs
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/kwok.yaml"
# Set up default CRs of stages
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/stage-fast.yaml"