A node's dynamic overcommit ratio rises instead of falling after CPU consumption increases #604
Comments
@WangZzzhe please take a look |
@flpanbin Could you share the relevant node information? |
Node resource information before creating the pod:

apiVersion: v1
kind: Node
metadata:
  annotations:
    katalyst.kubewharf.io/cpu_overcommit_ratio: "2.5"
    katalyst.kubewharf.io/memory_overcommit_ratio: "2.5"
    katalyst.kubewharf.io/original_allocatable_cpu: "16"
    katalyst.kubewharf.io/original_allocatable_memory: 32676068Ki
    katalyst.kubewharf.io/original_capacity_cpu: "16"
    katalyst.kubewharf.io/original_capacity_memory: 32778468Ki
    katalyst.kubewharf.io/overcommit_allocatable_cpu: 27840m
    katalyst.kubewharf.io/overcommit_allocatable_memory: 38479337676800m
    katalyst.kubewharf.io/overcommit_capacity_cpu: 27840m
    katalyst.kubewharf.io/overcommit_capacity_memory: 38599923916800m
    katalyst.kubewharf.io/realtime_cpu_overcommit_ratio: "1.74"
    katalyst.kubewharf.io/realtime_memory_overcommit_ratio: "1.15"
    ...
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    katalyst.kubewharf.io/overcommit_node_pool: overcommit-demo
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: g-master2
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
    ......
  name: g-master2
status:
  addresses:
  - address: g-master2
    type: Hostname
  allocatable:
    cpu: 27840m
    ephemeral-storage: "136351265362"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 38479337676800m
    pods: "180"
  capacity:
    cpu: 27840m
    ephemeral-storage: 144483Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 38599923916800m
    pods: "180"

testpod1.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  namespace: katalyst-system
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - g-master2
  containers:
  - name: testcontainer1
    image: polinux/stress:latest
    command: ["stress"]
    args: ["--cpu", "4", "--timeout", "6000"]
    resources:
      limits:
        cpu: 8
        memory: 8Gi
      requests:
        cpu: 4
        memory: 8Gi
  tolerations:
  - effect: NoSchedule
    key: test
    value: test
    operator: Equal |
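The overcommit figures in the node object above appear to follow directly from the original values and the realtime ratios: 16 cores × 1.74 gives 27840m, and 32676068Ki × 1.15 gives 38479337676800m (Kubernetes milli-unit notation applied to bytes). The sketch below only checks that arithmetic. That Katalyst derives the values exactly this way, and that it applies the realtime ratio rather than the configured 2.5, is an assumption inferred from these numbers; the helper names are hypothetical.

```python
from decimal import Decimal


def overcommitted_cpu_milli(original_cores: str, ratio: str) -> int:
    """Scale a CPU quantity given in cores by a ratio and return millicores."""
    return int(Decimal(original_cores) * Decimal(ratio) * 1000)


def overcommitted_memory_milli(original_ki: int, ratio: str) -> int:
    """Scale a memory quantity given in Ki by a ratio and return milli-bytes."""
    return int(Decimal(original_ki) * 1024 * Decimal(ratio) * 1000)


# Inputs copied from the node annotations posted in this thread.
cpu_m = overcommitted_cpu_milli("16", "1.74")
mem_allocatable_m = overcommitted_memory_milli(32676068, "1.15")
mem_capacity_m = overcommitted_memory_milli(32778468, "1.15")

print(f"{cpu_m}m", f"{mem_allocatable_m}m", f"{mem_capacity_m}m")
```

Decimal arithmetic is used so the comparison with the annotation strings is exact rather than subject to floating-point rounding.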
@flpanbin |
Thanks for the quick reply. I'll keep watching the logs, but I have a few questions about your answer:
|
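@flpanbin With the load unchanged, the requested resources increased and the node's allocatable resources decreased, so the node has to overcommit more resources to reach the target load. |
Thanks a lot, I'll look into it. |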
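The realtime.go log line posted further down in this thread (request: 11964, allocatable: 16000, usage: 0, targetLoad: 0.6, existLoad: 0.4, overcommitRatio: 2.24775) is numerically consistent with the formula sketched below. The formula is inferred from those logged numbers and has not been checked against the Katalyst source, so treat it as an assumption; the function and variable names are hypothetical.

```python
def overcommit_ratio(request_m, allocatable_m, usage_m, target_load, exist_load):
    """Overcommit ratio inferred from the logged values (not from Katalyst source)."""
    # CPU that may still be consumed before the node reaches the target load.
    headroom_m = allocatable_m * target_load - usage_m
    # Additional requests that headroom could absorb if pods consume
    # `exist_load` of what they request.
    extra_request_m = headroom_m / exist_load
    return (extra_request_m + request_m) / allocatable_m


# Values from the realtime.go log line in this thread.
logged = overcommit_ratio(11964, 16000, 0, 0.6, 0.4)

# Hypothetical case: 4000m more is requested while usage is still reported as 0.
more_requests = overcommit_ratio(11964 + 4000, 16000, 0, 0.6, 0.4)

print(round(logged, 5), round(more_requests, 5))
```

Under this reading, when the reported usage stays at 0, every additional millicore of request only enlarges the numerator, so the ratio goes up. That would match the symptom reported in this issue if the agent cannot collect usage metrics.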
@WangZzzhe
I0609 01:45:03.275865 1 realtime.go:335] resource cpu request: 11964, allocatable: 16000, usage: 0, targetLoad: 0.6, existLoad: 0.4, overcommitRatio: 2.24775

overcommit-katalyst-agent logs:
I0609 03:01:06.734172 1 provisioner.go:84] [malachite] heartbeat
E0609 03:01:06.738246 1 provisioner.go:111] [malachite] malachite is unhealthy: invalid http response status code 500, url: http://localhost:9002/api/v1/system/compute
I0609 03:01:06.738555 1 round_trippers.go:553] GET https://10.6.202.113:10250/stats/summary?timeout=10s 403 Forbidden in 3 milliseconds
E0609 03:01:06.739508 1 provisioner.go:65] failed to update stats/summary from kubelet: "failed to get kubelet config for summary api, error: Forbidden (user=system:serviceaccount:katalyst-system:katalyst-agent, verb=get, resource=nodes, subresource=stats)"
I0609 03:01:08.043645 1 realtime.go:155] [overcommitment-aware-realtime] sumUpPodsResources, cpu: 1845m, memory: 3715141632
E0609 03:01:08.043814 1 store_util.go:98] failed to get metric pod prometheus-insight-agent-kube-prometh-prometheus-0, container prometheus, metric cpu.usage.container, err: [MetricStore] empty map
E0609 03:01:08.044067 1 store_util.go:98] failed to get metric pod prometheus-insight-agent-kube-prometh-prometheus-0, container config-reloader, metric cpu.usage.container, err: [MetricStore] empty map

The malachite logs show errors, so it is probably not working properly:
panbin@panbindeMacBook-Pro ~ % kubectl logs malachite-xk8n9 -n malachite-system -f
2024-06-09T02:03:07.481004862+00:00 - [ERROR] server/src/main.rs:187 [Panic] lib/src/cpu/processor.rs:464: called `Result::unwrap()` on an `Err` value: ParseIntError { kind: Empty }
2024-06-09T02:03:07.489192152+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:11.271581881+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:11.271754576+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:16.338537826+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:16.338612068+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:21.407855335+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:21.408025943+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:26.450034224+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:26.451268751+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:31.459491370+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:31.459570543+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:36.486691177+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:36.486756735+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:41.575957128+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:41.589261474+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:46.624823586+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:46.624905589+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:51.695793619+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:51.695892044+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:56.827341960+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:56.827457338+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:04:01.853256781+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:04:01.853372828+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:04:06.899599297+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned |
This may be related to the Linux version. Environment information:
[root@g-master1 ~]# uname -a
Linux g-master1 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@g-master1 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

k8s and containerd versions:
[root@g-master1 ~]# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:48:26Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:42:11Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
[root@g-master1 ~]# containerd -v
containerd github.com/containerd/containerd v1.7.6 091922f03c2762540fd057fba91260237ff86acb |
I set up another environment using kubewharf enhanced kubernetes, and dynamic overcommitment works as expected there. Does this mean there are requirements on the Linux kernel version and the containerd environment?
root@ubuntu:~/katalyst# uname -a
Linux ubuntu 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@ubuntu:~/katalyst# kubectl get nodes
NAME STATUS ROLES AGE VERSION
10.6.202.170 Ready control-plane 26m v1.24.6-kubewharf.8
root@ubuntu:~/katalyst# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
root@ubuntu:~/katalyst# containerd -v
containerd github.com/containerd/containerd v1.4.12 7b11cfaabd73bb80907dd23182b9347b4245eb5d |
@flpanbin malachite depends on eBPF, so a 3.10 kernel probably won't work. 4.19+ should be fine. |
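A quick way to check a node against that guidance is to compare the kernel release string from `uname -r` with 4.19. The sketch below does only that; the 4.19 threshold comes from the comment above (phrased there as "should" work, not as a guarantee), and the function names are hypothetical.

```python
import re
from typing import Tuple

# Threshold taken from the maintainer's comment; an assumption, not a documented requirement.
MIN_KERNEL = (4, 19)


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.4.0-125-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release: str, minimum: Tuple[int, int] = MIN_KERNEL) -> bool:
    """Return True when the kernel release is at least the given (major, minor)."""
    return kernel_major_minor(release) >= minimum


# The two kernels reported in this thread.
for release in ("3.10.0-1160.el7.x86_64", "5.4.0-125-generic"):
    print(release, meets_minimum(release))
```

On a live node, `platform.release()` from the Python standard library returns the same string as `uname -r` and can be passed to `meets_minimum`.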
What happened?
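I tried out dynamic overcommitment by following the dynamic overcommitment documentation, but after creating testpod1 to increase CPU consumption, the CPU overcommit ratio cpu_overcommit_ratio went up instead of down.
With no pods running, the KCNR of g-master2:
After creating testpod1, the KCNR of g-master2 again:
katalyst version: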
What did you expect to happen?
After creating testpod1, the CPU overcommit ratio katalyst.kubewharf.io/cpu_overcommit_ratio of the corresponding node should decrease.
How can we reproduce it (as minimally and precisely as possible)?
Just follow this document: https://gokatalyst.io/docs/user-guide/resource-overcommitment/dynamic-overcommitment/
Software version