diff --git a/HAMi.jpg b/HAMi.jpg deleted file mode 100644 index 53ecf94..0000000 Binary files a/HAMi.jpg and /dev/null differ diff --git a/README.md b/README.md deleted file mode 100644 index 6b58157..0000000 --- a/README.md +++ /dev/null @@ -1,324 +0,0 @@ -English version|[中文版](README_cn.md) - - - -# Heterogeneous AI Computing Virtualization Middleware - -[![build status](https://github.com/Project-HAMi/HAMi/actions/workflows/main.yml/badge.svg)](https://github.com/Project-HAMi/HAMi/actions/workflows/main.yml) -[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-vgpu.svg)](https://hub.docker.com/r/4pdosc/k8s-vgpu) -[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://join.slack.com/t/k8s-device-plugin/shared_invite/zt-oi9zkr5c-LsMzNmNs7UYg6usc0OiWKw) -[![discuss](https://img.shields.io/badge/Discuss-Ask%20Questions-blue)](https://github.com/Project-HAMi/HAMi/discussions) -[![Contact Me](https://img.shields.io/badge/Contact%20Me-blue)](https://github.com/Project-HAMi/HAMi#contact) - -## Supperted devices - -[![nvidia GPU](https://img.shields.io/badge/Nvidia-GPU-blue)](https://github.com/Project-HAMi/HAMi#preparing-your-gpu-nodes) -[![cambricon MLU](https://img.shields.io/badge/Cambricon-Mlu-blue)](docs/cambricon-mlu-support.md) -[![hygon DCU](https://img.shields.io/badge/Hygon-DCU-blue)](docs/hygon-dcu-support.md) -[![iluvatar GPU](https://img.shields.io/badge/Iluvatar-GPU-blue)](docs/iluvatar-gpu-support.md) - -## Introduction - -! - -**Heterogeneous AI Computing Virtualization Middleware (HAMi), formerly known as k8s-vGPU-scheduler, is an "all-in-one" chart designed to manage Heterogeneous AI Computing Devices in a k8s cluster.** It includes everything you would expect, such as: - -***Device sharing***: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks. - -***Device Memory Control***: Devices can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), ensuring it does not exceed the specified boundaries. - -***Device Type Specification***: You can specify the type of device to use or avoid for a particular task by setting annotations, such as "nvidia.com/use-gputype" or "nvidia.com/nouse-gputype". - -***Easy to use***: You don't need to modify your task YAML to use our scheduler. All your jobs will be automatically supported after installation. Additionally, you can specify a resource name other than "nvidia.com/gpu" if you prefer. - -## Major Features - -- Hard Limit on Device Memory. - -A simple demostration for Hard Limit: -A task with the following resources. - -``` - resources: - limits: - nvidia.com/gpu: 1 # requesting 1 vGPU - nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory -``` - -will see 3G device memory inside container - -![img](./imgs/hard_limit.jpg) - -- Allows partial device allocation by specifying device memory. -- Imposes a hard limit on streaming multiprocessors. -- Permits partial device allocation by specifying device core usage. -- Requires zero changes to existing programs. - -## Architect - -! - -HAMi consists of several components, including a unified mutatingwebhook, a unified scheduler extender, different device-plugins and different in-container virtualization technics for each heterogeneous AI devices. - -## Application Scenarios - -1. Device sharing (or device virtualization) on Kubernetes. -2. Scenarios where pods need to be allocated with specific device memory 3. 
usage or device cores. -3. Need to balance GPU usage in a cluster with multiple GPU nodes. -4. Low utilization of device memory and computing units, such as running 10 TensorFlow servings on one GPU. -5. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided for multiple students to use, and cloud platforms that offer small GPU instances. - -## Quick Start - -### Prerequisites - -The list of prerequisites for running the NVIDIA device plugin is described below: - -- NVIDIA drivers >= 440 -- CUDA Version > 10.2 -- nvidia-docker version > 2.0 -- Kubernetes version >= 1.16 -- glibc >= 2.17 & glibc < 2.3.0 -- kernel version >= 3.10 -- helm > 3.0 - -### Preparing your GPU Nodes - -
Configure nvidia-container-toolkit - -Execute the following steps on all your GPU nodes. - -This README assumes pre-installation of NVIDIA drivers and the `nvidia-container-toolkit`. Additionally, it assumes configuration of the `nvidia-container-runtime` as the default low-level runtime. - -Please see: - -#### Example for debian-based systems with `Docker` and `containerd` - -##### Install the `nvidia-container-toolkit` - -```bash -distribution=$(. /etc/os-release;echo $ID$VERSION_ID) -curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - -curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list - -sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit -``` - -##### Configure `Docker` - -When running `Kubernetes` with `Docker`, edit the configuration file, typically located at `/etc/docker/daemon.json`, to set up `nvidia-container-runtime` as the default low-level runtime: - -```json -{ - "default-runtime": "nvidia", - "runtimes": { - "nvidia": { - "path": "/usr/bin/nvidia-container-runtime", - "runtimeArgs": [] - } - } -} -``` - -And then restart `Docker`: - -``` -sudo systemctl daemon-reload && systemctl restart docker -``` - -##### Configure `containerd` - -When running `Kubernetes` with `containerd`, modify the configuration file typically located at `/etc/containerd/config.toml`, to set up -`nvidia-container-runtime` as the default low-level runtime: - -``` -version = 2 -[plugins] - [plugins."io.containerd.grpc.v1.cri"] - [plugins."io.containerd.grpc.v1.cri".containerd] - default_runtime_name = "nvidia" - - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] - privileged_without_host_devices = false - runtime_engine = "" - runtime_root = "" - runtime_type = "io.containerd.runc.v2" - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] - BinaryName = "/usr/bin/nvidia-container-runtime" -``` - -And then restart `containerd`: - -``` -sudo systemctl daemon-reload && systemctl restart containerd -``` - -
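-As a quick sanity check of the runtime configuration above (plain Docker/containerd commands, not HAMi-specific), you can verify that `nvidia` is the default low-level runtime before moving on:
-
-```bash
-# Docker: "Default Runtime" should report nvidia
-docker info | grep -i "default runtime"
-
-# containerd: the CRI plugin should point at the nvidia runtime
-grep default_runtime_name /etc/containerd/config.toml
-```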
- -
Label your nodes - -Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". Without this label, the nodes cannot be managed by our scheduler. - -``` -kubectl label nodes {nodeid} gpu=on -``` - -
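-To confirm the label was applied, list the nodes that carry it (plain `kubectl`, nothing HAMi-specific):
-
-```bash
-kubectl get nodes -l gpu=on
-```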
- -### Install and Uninstall - -
Installation
-
-First, check your Kubernetes version by using the following command:
-
-```
-kubectl version
-```
-
-Then, add our Helm repository:
-
-```
-helm repo add hami-charts https://project-hami.github.io/HAMi/
-```
-
-During installation, set the Kubernetes scheduler image version to match your Kubernetes server version. For instance, if your cluster server version is 1.16.8, use the following command for deployment:
-
-```
-helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system
-```
-
-Customize your installation by adjusting the [configs](docs/config.md).
-
-Verify your installation using the following command:
-
-```
-kubectl get pods -n kube-system
-```
-
-If both the `vgpu-device-plugin` and `vgpu-scheduler` pods are in the *Running* state, your installation is successful.
-
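-If you want to adjust the [configs](docs/config.md) at install time, extra `--set` flags can be appended to the same command. A sketch, assuming the parameter names documented there (shown with their documented defaults) apply to the `hami` chart:
-
-```bash
-helm install hami hami-charts/hami \
-  --set scheduler.kubeScheduler.imageTag=v1.16.8 \
-  --set devicePlugin.deviceSplitCount=10 \
-  --set devicePlugin.deviceMemoryScaling=1 \
-  -n kube-system
-```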
- -
Upgrade
-
-Upgrading HAMi to the latest version is a simple process: update the repository and reinstall the chart:
-
-```
-helm uninstall hami -n kube-system
-helm repo update
-helm install hami hami-charts/hami -n kube-system
-```
-
-> **WARNING:** *If you upgrade HAMi without clearing your submitted tasks, it may result in a segmentation fault.*
-
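-Given the warning above, it can be worth confirming that no Pods are still requesting vGPU resources before upgrading. A rough sketch, assuming the default `nvidia.com/gpu` resource name; an empty result means nothing is currently using vGPUs:
-
-```bash
-kubectl get pods --all-namespaces -o yaml | grep "nvidia.com/gpu:"
-```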
- -
Uninstall - -``` -helm uninstall hami -n kube-system -``` - -> **NOTICE:** *Uninstallation won't kill running tasks.* - -
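-To confirm the removal, check that the HAMi pods mentioned in the installation section are gone (an empty result is expected):
-
-```bash
-kubectl get pods -n kube-system | grep -E "vgpu-device-plugin|vgpu-scheduler"
-```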
- -### Submit Task - -
Task example
-
-Containers can now request NVIDIA vGPUs using the `nvidia.com/gpu` resource type:
-
-```
-apiVersion: v1
-kind: Pod
-metadata:
-  name: gpu-pod
-spec:
-  containers:
-    - name: ubuntu-container
-      image: ubuntu:18.04
-      command: ["bash", "-c", "sleep 86400"]
-      resources:
-        limits:
-          nvidia.com/gpu: 2 # requesting 2 vGPUs
-          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional, Integer)
-          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional, Integer)
-```
-
-Exercise caution: if a task cannot fit into any GPU node (i.e., the requested number of `nvidia.com/gpu` exceeds the available GPUs on every node), the task will remain in a `Pending` state.
-
-You can now execute the `nvidia-smi` command in the container to observe the difference in GPU memory between the vGPU and the physical GPU.
-
-> **WARNING:**
->
-> *1. If you don't request vGPUs when using the device plugin with NVIDIA images, all
-> the vGPUs on the machine will be exposed inside your container.*
->
-> *2. Do not set the "nodeName" field; use "nodeSelector" instead.*
-
-#### More examples
-
-Click [here](docs/examples/nvidia/)
-
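-The introduction also mentions selecting or avoiding particular GPU models through the "nvidia.com/use-gputype" and "nvidia.com/nouse-gputype" annotations. A sketch of how that might be combined with the resources above — the annotation value shown ("V100") is an assumed example, since this README does not spell out the value format:
-
-```
-apiVersion: v1
-kind: Pod
-metadata:
-  name: gpu-pod-typed
-  annotations:
-    nvidia.com/use-gputype: "V100" # assumed value format: a GPU type keyword
-spec:
-  containers:
-    - name: ubuntu-container
-      image: ubuntu:18.04
-      command: ["bash", "-c", "sleep 86400"]
-      resources:
-        limits:
-          nvidia.com/gpu: 1
-          nvidia.com/gpumem: 3000
-```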
- -### Monitor - -
Get cluster overview
-
-Monitoring is automatically enabled after installation. Obtain an overview of cluster information by visiting the following URL:
-
-```
-http://{scheduler ip}:{monitorPort}/metrics
-```
-
-The default monitorPort is 31993; other values can be set using `--set devicePlugin.service.httpPort` during installation.
-
-Grafana dashboard [example](docs/dashboard.md)
-
-> **Note:** The status of a node won't be collected until a task has been submitted to it.
-
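-A quick way to pull the raw metrics from a shell, as a sketch — replace `{scheduler ip}` with a real node address; the metric names shown here are the ones referenced by the bundled Grafana dashboard:
-
-```bash
-curl -s http://{scheduler ip}:31993/metrics | grep -E "HostGPUMemoryUsage|HostCoreUtilization"
-```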
- -## [Benchmarks](docs/benchmark.md) - -## Known Issues - -- Currently, A100 MIG can be supported in only "none" and "mixed" modes. -- Tasks with the "nodeName" field cannot be scheduled at the moment; please use "nodeSelector" instead. -- Only computing tasks are currently supported; video codec processing is not supported. - -## Roadmap - -Heterogeneous AI Computing device to support - -| Production | manufactor | MemoryIsolation | CoreIsolation | MultiCard support | -|-------------|------------|-----------------|---------------|-------------------| -| GPU | NVIDIA | ✅ | ✅ | ✅ | -| MLU | Cambricon | ✅ | ❌ | ❌ | -| DCU | Hygon | ✅ | ✅ | ❌ | -| Ascend | Huawei | In progress | In progress | ❌ | -| GPU | iluvatar | In progress | In progress | ❌ | -| DPU | Teco | In progress | In progress | ❌ | - -- Support video codec processing -- Support Multi-Instance GPUs (MIG) - -## Issues and Contributing - -- Report bugs, ask questions, or suggest modifications by [filing a new issue](https://github.com/Project-HAMi/HAMi/issues/new) -- For more information or to share your ideas, you can participate in the [Discussions](https://github.com/Project-HAMi/HAMi/discussions) and the [slack](https://join.slack.com/t/k8s-device-plugin/shared_invite/zt-oi9zkr5c-LsMzNmNs7UYg6usc0OiWKw) exchanges - -## Contact - -Owner & Maintainer: Limengxuan - -Feel free to reach me by - -``` -email: -phone: +86 18810644493 -WeChat: xuanzong4493 -``` diff --git a/README_cn.md b/README_cn.md deleted file mode 100644 index d8da2bd..0000000 --- a/README_cn.md +++ /dev/null @@ -1,286 +0,0 @@ - - -# HAMi--异构算力虚拟化中间件 - -[![build status](https://github.com/Project-HAMi/HAMi/actions/workflows/main.yml/badge.svg)](https://github.com/Project-HAMi/HAMi/actions/workflows/build.yml) -[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-vgpu.svg)](https://hub.docker.com/r/4pdosc/k8s-vgpu) -[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://join.slack.com/t/k8s-device-plugin/shared_invite/zt-oi9zkr5c-LsMzNmNs7UYg6usc0OiWKw) -[![discuss](https://img.shields.io/badge/Discuss-Ask%20Questions-blue)](https://github.com/Project-HAMi/HAMi/discussions) -[![Contact Me](https://img.shields.io/badge/Contact%20Me-blue)](https://github.com/Project-HAMi/HAMi#contact) - -## 支持设备: - -[![英伟达 GPU](https://img.shields.io/badge/Nvidia-GPU-blue)](https://github.com/Project-HAMi/HAMi#preparing-your-gpu-nodes) -[![寒武纪 MLU](https://img.shields.io/badge/寒武纪-Mlu-blue)](docs/cambricon-mlu-support_cn.md) -[![海光 DCU](https://img.shields.io/badge/海光-DCU-blue)](docs/hygon-dcu-support.md) -[![天数智芯 GPU](https://img.shields.io/badge/天数智芯-GPU-blue)](docs/iluvatar-gpu-support_cn.md) - -## 简介 - -! - -异构算力虚拟化中间件HAMi满足了所有你对于管理异构算力集群所需要的能力,包括: - -***设备复用***: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡 - -***可限制分配的显存大小***: 你现在可以用显存值(例如3000M)或者显存比例(例如50%)来分配GPU,vGPU调度器会确保任务使用的显存不会超过分配数值 - -***指定设备型号***:当前任务可以通过设置annotation的方式,来选择使用或者不使用某些具体型号的设备 - -***无侵入***: vGPU调度器兼容nvidia官方插件的显卡分配方式,所以安装完毕后,你不需要修改原有的任务文件就可以使用vGPU的功能。当然,你也可以自定义的资源名称 - -## 使用场景 - -1. 云原生场景下需要复用算力设备的场合 -2. 需要定制异构算力申请的场合,如申请特定显存大小的虚拟GPU,每个虚拟GPU使用特定比例的算力。 -3. 在多个异构算力节点组成的集群中,任务需要根据自身的显卡需求分配到合适的节点执行。 -4. 显存、计算单元利用率低的情况,如在一张GPU卡上运行10个tf-serving。 -5. 需要大量小显卡的情况,如教学场景把一张GPU提供给多个学生使用、云平台提供小GPU实例。 - -## 产品设计 - -! 
- -HAMi 包含以下几个组件,一个统一的mutatingwebhook,一个统一的调度器,以及针对各种不同的异构算力设备对应的设备插件和容器内的控制组件,整体的架构特性如上图所示。 - -## 产品特性 - -- 显存资源的硬隔离 - -一个硬隔离的简单展示: -一个使用以下方式定义的任务提交后 -``` - resources: - limits: - nvidia.com/gpu: 1 # requesting 1 vGPU - nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory -``` -会只有3G可见显存 - -![img](./imgs/hard_limit.jpg) - -- 允许通过指定显存来申请算力设备 -- 算力资源的硬隔离 -- 允许通过指定算力使用比例来申请算力设备 -- 对已有程序零改动 - -## 安装要求 - -* NVIDIA drivers >= 440 -* nvidia-docker version > 2.0 -* docker已配置nvidia作为默认runtime -* Kubernetes version >= 1.16 -* glibc >= 2.17 & glibc < 2.3.0 -* kernel version >= 3.10 -* helm > 3.0 - -## 快速入门 - -### 准备节点 - -
配置 nvidia-container-toolkit - -### GPU节点准备 - -以下步骤要在所有GPU节点执行,这份README文档假定GPU节点已经安装NVIDIA驱动。它还假设您已经安装docker或container并且需要将nvidia-container-runtime配置为要使用的默认低级运行时。 - -安装步骤举例: - -#### -``` -# 加入套件仓库 -distribution=$(. /etc/os-release;echo $ID$VERSION_ID) -curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - -curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list - -sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit -``` - -##### 配置docker -你需要在节点上将nvidia runtime做为你的docker runtime预设值。我们将编辑docker daemon的配置文件,此文件通常在`/etc/docker/daemon.json`路径: - -``` -{ - "default-runtime": "nvidia", - "runtimes": { - "nvidia": { - "path": "/usr/bin/nvidia-container-runtime", - "runtimeArgs": [] - } - } -} -``` -``` -systemctl daemon-reload && systemctl restart docker -``` -##### 配置containerd -你需要在节点上将nvidia runtime做为你的containerd runtime预设值。我们将编辑containerd daemon的配置文件,此文件通常在`/etc/containerd/config.toml`路径 -``` -version = 2 -[plugins] - [plugins."io.containerd.grpc.v1.cri"] - [plugins."io.containerd.grpc.v1.cri".containerd] - default_runtime_name = "nvidia" - - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] - privileged_without_host_devices = false - runtime_engine = "" - runtime_root = "" - runtime_type = "io.containerd.runc.v2" - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] - BinaryName = "/usr/bin/nvidia-container-runtime" -``` -``` -systemctl daemon-reload && systemctl restart containerd -``` - -
- -
为GPU节点打上标签 - -最后,你需要将所有要使用到的GPU节点打上gpu=on标签,否则该节点不会被调度到 - -``` -$ kubectl label nodes {nodeid} gpu=on -``` - -
- -### 安装,更新与卸载 - -
安装 - -首先使用helm添加我们的 repo - -``` -helm repo add hami-charts https://project-hami.github.io/HAMi/ -``` - -随后,使用下列指令获取集群服务端版本 - -``` -kubectl version -``` - -在安装过程中须根据集群服务端版本(上一条指令的结果)指定调度器镜像版本,例如集群服务端版本为1.16.8,则可以使用如下指令进行安装 - -``` -$ helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system -``` - -你可以修改这里的[配置](docs/config_cn.md)来定制安装 - -通过kubectl get pods指令看到 `vgpu-device-plugin` 与 `vgpu-scheduler` 两个pod 状态为*Running* 即为安装成功 - -``` -$ kubectl get pods -n kube-system -``` - -
- -
更新 - -只需要更新helm repo,并重新启动整个Chart即可自动完成更新,最新的镜像会被自动下载 - -``` -$ helm uninstall hami -n kube-system -$ helm repo update -$ helm install hami hami-charts/hami -n kube-system -``` - -> **注意:** *如果你没有清理完任务就进行热更新的话,正在运行的任务可能会出现段错误等报错.* - -
- -
卸载 - -``` -$ helm uninstall hami -n kube-system -``` - -> **注意:** *卸载组件并不会使正在运行的任务失败.* - -
- -### 提交任务 - -
任务样例 - -NVIDIA vGPUs 现在能透过资源类型`nvidia.com/gpu`被容器请求: - -``` -apiVersion: v1 -kind: Pod -metadata: - name: gpu-pod -spec: - containers: - - name: ubuntu-container - image: ubuntu:18.04 - command: ["bash", "-c", "sleep 86400"] - resources: - limits: - nvidia.com/gpu: 2 # 请求2个vGPUs - nvidia.com/gpumem: 3000 # 每个vGPU申请3000m显存 (可选,整数类型) - nvidia.com/gpucores: 30 # 每个vGPU的算力为30%实际显卡的算力 (可选,整数类型) -``` - -如果你的任务无法运行在任何一个节点上(例如任务的`nvidia.com/gpu`大于集群中任意一个GPU节点的实际GPU数量),那么任务会卡在`pending`状态 - -现在你可以在容器执行`nvidia-smi`命令,然后比较vGPU和实际GPU显存大小的不同。 - -> **注意:** *1. 如果你使用privileged字段的话,本任务将不会被调度,因为它可见所有的GPU,会对其它任务造成影响.* -> -> *2. 不要设置nodeName字段,类似需求请使用nodeSelector.* - -
- -### 监控: - -
访问集群算力视图 - -调度器部署成功后,监控默认自动开启,你可以通过 - -``` -http://{nodeip}:{monitorPort}/metrics -``` - -来获取监控数据,其中monitorPort可以在Values中进行配置,默认为31992 - -grafana dashboard [示例](docs/dashboard_cn.md) - -> **注意** 节点上的vGPU状态只有在其使用vGPU后才会被统计 - -
- -## [性能测试](docs/benchmark_cn.md) - -## 已知问题 - -- 目前仅支持计算任务,不支持视频编解码处理。 -- 暂时仅支持MIG的"none"和"mixed"模式,暂时不支持single模式 -- 当任务有字段“nodeName“时会出现无法调度的情况,有类似需求的请使用"nodeSelector"代替 - -## 开发计划 - -- 目前支持的异构算力设备及其对应的复用特性如下表所示 - -| 产品 | 制造商 | 显存隔离 | 算力隔离 | 多卡支持 | -|-------------|------------|-----------------|---------------|-------------------| -| GPU | NVIDIA | ✅ | ✅ | ✅ | -| MLU | 寒武纪 | ✅ | ❌ | ❌ | -| DCU | 海光 | ✅ | ✅ | ❌ | -| Ascend | 华为 | 开发中 | 开发中 | ❌ | -| GPU | 天数智芯 | 开发中 | 开发中 | ❌ | -| DPU | 太初 | 开发中 | 开发中 | ❌ | -- 支持视频编解码处理 -- 支持Multi-Instance GPUs (MIG) - - -## 反馈和参与 - -* bug、疑惑、修改欢迎提在 [Github Issues](https://github.com/Project-HAMi/HAMi/issues/new) -* 想了解更多或者有想法可以参与到[Discussions](https://github.com/Project-HAMi/HAMi/discussions)和[slack](https://join.slack.com/t/k8s-device-plugin/shared_invite/zt-oi9zkr5c-LsMzNmNs7UYg6usc0OiWKw)交流 - - diff --git a/docs/benchmark.md b/docs/benchmark.md deleted file mode 100644 index 91611d2..0000000 --- a/docs/benchmark.md +++ /dev/null @@ -1,49 +0,0 @@ -## Benchmarks - -Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance as follows - -| Test Environment | description | -| ---------------- | :------------------------------------------------------: | -| Kubernetes version | v1.12.9 | -| Docker version | 18.09.1 | -| GPU Type | Tesla V100 | -| GPU Num | 2 | - -| Test instance | description | -| ------------- | :---------------------------------------------------------: | -| nvidia-device-plugin | k8s + nvidia k8s-device-plugin | -| vGPU-device-plugin | k8s + VGPU k8s-device-plugin,without virtual device memory | -| vGPU-device-plugin(virtual device memory) | k8s + VGPU k8s-device-plugin,with virtual device memory | - -Test Cases: - -| test id | case | type | params | -| ------- | :-----------: | :-------: | :---------------------: | -| 1.1 | Resnet-V2-50 | inference | batch=50,size=346*346 | -| 1.2 | Resnet-V2-50 | training | batch=20,size=346*346 | -| 2.1 | Resnet-V2-152 | inference | batch=10,size=256*256 | -| 2.2 | Resnet-V2-152 | training | batch=10,size=256*256 | -| 3.1 | VGG-16 | inference | batch=20,size=224*224 | -| 3.2 | VGG-16 | training | batch=2,size=224*224 | -| 4.1 | DeepLab | inference | batch=2,size=512*512 | -| 4.2 | DeepLab | training | batch=1,size=384*384 | -| 5.1 | LSTM | inference | batch=100,size=1024*300 | -| 5.2 | LSTM | training | batch=10,size=1024*300 | - -Test Result: ![img](../imgs/benchmark_inf.png) - -![img](../imgs/benchmark_train.png) - -To reproduce: - -1. install k8s-vGPU-scheduler,and configure properly -2. run benchmark job - -``` -$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml -``` - -3. 
View the result by using kubctl logs - -``` -$ kubectl logs [pod id] \ No newline at end of file diff --git a/docs/benchmark_cn.md b/docs/benchmark_cn.md deleted file mode 100644 index c1f5f1f..0000000 --- a/docs/benchmark_cn.md +++ /dev/null @@ -1,50 +0,0 @@ -## 性能测试 - -在测试报告中,我们一共在下面五种场景都执行了ai-benchmark 测试脚本,并汇总最终结果: - -| 测试环境 | 环境描述 | -| ---------------- | :------------------------------------------------------: | -| Kubernetes version | v1.12.9 | -| Docker version | 18.09.1 | -| GPU Type | Tesla V100 | -| GPU Num | 2 | - -| 测试名称 | 测试用例 | -| -------- | :------------------------------------------------: | -| Nvidia-device-plugin | k8s + nvidia官方k8s-device-plugin | -| vGPU-device-plugin | k8s + VGPU k8s-device-plugin,无虚拟显存 | -| vGPU-device-plugin(virtual device memory) | k8s + VGPU k8s-device-plugin,高负载,开启虚拟显存 | - -测试内容 - -| test id | 名称 | 类型 | 参数 | -| ------- | :-----------: | :-------: | :---------------------: | -| 1.1 | Resnet-V2-50 | inference | batch=50,size=346*346 | -| 1.2 | Resnet-V2-50 | training | batch=20,size=346*346 | -| 2.1 | Resnet-V2-152 | inference | batch=10,size=256*256 | -| 2.2 | Resnet-V2-152 | training | batch=10,size=256*256 | -| 3.1 | VGG-16 | inference | batch=20,size=224*224 | -| 3.2 | VGG-16 | training | batch=2,size=224*224 | -| 4.1 | DeepLab | inference | batch=2,size=512*512 | -| 4.2 | DeepLab | training | batch=1,size=384*384 | -| 5.1 | LSTM | inference | batch=100,size=1024*300 | -| 5.2 | LSTM | training | batch=10,size=1024*300 | - -测试结果: ![img](../imgs/benchmark_inf.png) - -![img](../imgs/benchmark_train.png) - -测试步骤: - -1. 安装nvidia-device-plugin,并配置相应的参数 -2. 运行benchmark任务 - -``` -$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml -``` - -3. 通过kubctl logs 查看结果 - -``` -$ kubectl logs [pod id] -``` \ No newline at end of file diff --git a/docs/cambricon-mlu-support.md b/docs/cambricon-mlu-support.md deleted file mode 100644 index 63a2a29..0000000 --- a/docs/cambricon-mlu-support.md +++ /dev/null @@ -1,62 +0,0 @@ -## Introduction - -**We now support cambricon.com/mlu by implementing most device-sharing features as nvidia-GPU**, including: - -***MLU sharing***: Each task can allocate a portion of MLU instead of a whole MLU card, thus MLU can be shared among multiple tasks. - -***Device Memory Control***: MLUs can be allocated with certain device memory size on certain type(i.e 370) and have made it that it does not exceed the boundary. - -***MLU Type Specification***: You can specify which type of MLU to use or to avoid for a certain task, by setting "cambricon.com/use-mlutype" or "cambricon.com/nouse-mlutype" annotations. - -***Very Easy to use***: You don't need to modify your task yaml to use our scheduler. All your MLU jobs will be automatically supported after installation. The only thing you need to do is tag the MLU node. 
- -## Prerequisites - -* neuware-mlu370-driver > 4.15.10 -* cntoolkit > 2.5.3 - -## Enabling MLU-sharing Support - -* Install the chart using helm, See 'enabling vGPU support in kubernetes' section [here](https://github.com/Project-HAMi/HAMi#enabling-vgpu-support-in-kubernetes) - -* Tag MLU node with the following command -``` -kubectl label node {mlu-node} mlu=on -``` - -## Running MLU jobs - -Cambricon MMLUs can now be requested by a container -using the `cambricon.com/mlunum` and `cambricon.com/mlumem` resource type: - -``` -apiVersion: v1 -kind: Pod -metadata: - name: gpu-pod -spec: - containers: - - name: ubuntu-container - image: ubuntu:18.04 - command: ["bash", "-c", "sleep 86400"] - resources: - limits: - cambricon.com/mlunum: 1 # requesting 1 MLU - cambricon.com/mlumem: 10240 # requesting 10G MLU device memory - - name: ubuntu-container1 - image: ubuntu:18.04 - command: ["bash", "-c", "sleep 86400"] - resources: - limits: - cambricon.com/mlunum: 1 # requesting 1 MLU - cambricon.com/mlumem: 10240 # requesting 10G MLU device memory -``` - -## Notes - -1. Mlu-sharing in init container is not supported, pods with "combricon.com/mlumem" in init container will never be scheduled. - -2. Mlu-sharing with containerd is not supported, the container may not start successfully. - -3. Mlu-sharing can only be applied on MLU-370 - \ No newline at end of file diff --git a/docs/cambricon-mlu-support_cn.md b/docs/cambricon-mlu-support_cn.md deleted file mode 100644 index 72878af..0000000 --- a/docs/cambricon-mlu-support_cn.md +++ /dev/null @@ -1,59 +0,0 @@ -## 简介 - -本组件支持复用寒武纪MLU设备,并为此提供以下几种与vGPU类似的复用功能,包括: - -***MLU 共享***: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡 - -***可限制分配的显存大小***: 你现在可以用显存值(例如3000M)来分配MLU,本组件会确保任务使用的显存不会超过分配数值,注意只有MLU-370型号的MLU支持可配显存 - -***指定MLU型号***:当前任务可以通过设置annotation("cambricon.com/use-mlutype","cambricon.com/nouse-mlutype")的方式,来选择使用或者不使用某些具体型号的MLU - -***方便易用***: 部署本组件后,你只需要给MLU节点打上tag即可使用MLU复用功能 - - -## 节点需求 - -* neuware-mlu370-driver > 4.15.10 -* cntoolkit > 2.5.3 - -## 开启MLU复用 - -* 通过helm部署本组件, 参照[主文档中的开启vgpu支持章节](https://github.com/Project-HAMi/HAMi/blob/master/README_cn.md#kubernetes开启vgpu支持) - -* 使用以下指令,为MLU节点打上label -``` -kubectl label node {mlu-node} mlu=on -``` - -## 运行MLU任务 - -``` -apiVersion: v1 -kind: Pod -metadata: - name: gpu-pod -spec: - containers: - - name: ubuntu-container - image: ubuntu:18.04 - command: ["bash", "-c", "sleep 86400"] - resources: - limits: - cambricon.com/mlunum: 1 # requesting 1 MLU - cambricon.com/mlumem: 10240 # requesting 10G MLU device memory - - name: ubuntu-container1 - image: ubuntu:18.04 - command: ["bash", "-c", "sleep 86400"] - resources: - limits: - cambricon.com/mlunum: 1 # requesting 1 MLU - cambricon.com/mlumem: 10240 # requesting 10G MLU device memory -``` - -## 注意事项 - -1. 在init container中无法使用MLU复用功能,否则该任务不会被调度 - -2. MLU复用功能目前不支持containerd,在containerd中使用会导致任务失败 - -3. 只有MLU-370可以使用MLU复用功能 diff --git a/docs/config.md b/docs/config.md deleted file mode 100644 index 8da64b9..0000000 --- a/docs/config.md +++ /dev/null @@ -1,49 +0,0 @@ -# Global Config - -you can customize your vGPU support by setting the following parameters using `-set`, for example - -``` -helm install vgpu-charts/vgpu vgpu --set devicePlugin.deviceMemoryScaling=5 ... -``` - -* `devicePlugin.service.schedulerPort:` - Integer type, by default: 31998, scheduler webhook service nodePort. -* `devicePlugin.deviceMemoryScaling:` - Float type, by default: 1. The ratio for NVIDIA device memory scaling, can be greater than 1 (enable virtual device memory, experimental feature). 
For NVIDIA GPU with *M* memory, if we set `devicePlugin.deviceMemoryScaling` argument to *S*, vGPUs splitted by this GPU will totally get `S * M` memory in Kubernetes with our device plugin. -* `devicePlugin.deviceSplitCount:` - Integer type, by default: equals 10. Maximum tasks assigned to a simple GPU device. -* `devicePlugin.migstrategy:` - String type, "none" for ignoring MIG features or "mixed" for allocating MIG device by seperate resources. Default "none" -* `devicePlugin.disablecorelimit:` - String type, "true" for disable core limit, "false" for enable core limit, default: false -* `scheduler.defaultMem:` - Integer type, by default: 5000. The default device memory of the current task, in MB -* `scheduler.defaultCores:` - Integer type, by default: equals 0. Percentage of GPU cores reserved for the current task. If assigned to 0, it may fit in any GPU with enough device memory. If assigned to 100, it will use an entire GPU card exclusively. -* `scheduler.defaultGPUNum:` - Integer type, by default: equals 1, if configuration value is 0, then the configuration value will not take effect and will be filtered. when a user does not set nvidia.com/gpu this key in pod resource, webhook should check nvidia.com/gpumem、resource-mem-percentage、nvidia.com/gpucores this three key, anyone a key having value, webhook should add nvidia.com/gpu key and this default value to resources limits map. -* `resourceName:` - String type, vgpu number resource name, default: "nvidia.com/gpu" -* `resourceMem:` - String type, vgpu memory size resource name, default: "nvidia.com/gpumem" -* `resourceMemPercentage:` - String type, vgpu memory fraction resource name, default: "nvidia.com/gpumem-percentage" -* `resourceCores:` - String type, vgpu cores resource name, default: "nvidia.com/cores" -* `resourcePriority:` - String type, vgpu task priority name, default: "nvidia.com/priority" - -# Container config envs - -* `GPU_CORE_UTILIZATION_POLICY:` - String type, "default", "force", "disable" - "default" means the dafault utilization policy - "force" means the container will always limit the core utilization below "nvidia.com/gpucores" - "disable" means the container will ignore the utilization limitation set by "nvidia.com/gpucores" during task execution - -* `ACTIVE_OOM_KILLER:` - String type, "true","false" - "true" means the task may be killed if exceeds the limitation set by "nvidia.com/gpumem" or "nvidia.com/gpumemory" - "false" means the task will not be killed even it exceeds the limitation. - - diff --git a/docs/config_cn.md b/docs/config_cn.md deleted file mode 100644 index d475d9f..0000000 --- a/docs/config_cn.md +++ /dev/null @@ -1,41 +0,0 @@ -# 全局配置 - -你可以在安装过程中,通过`-set`来修改以下的客制化参数,例如: - -``` -helm install vgpu vgpu-charts/vgpu --set devicePlugin.deviceMemoryScaling=5 ... 
-``` - -* `devicePlugin.deviceSplitCount:` - 整数类型,预设值是10。GPU的分割数,每一张GPU都不能分配超过其配置数目的任务。若其配置为N的话,每个GPU上最多可以同时存在N个任务。 -* `devicePlugin.deviceMemoryScaling:` - 浮点数类型,预设值是1。NVIDIA装置显存使用比例,可以大于1(启用虚拟显存,实验功能)。对于有*M*显存大小的NVIDIA GPU,如果我们配置`devicePlugin.deviceMemoryScaling`参数为*S*,在部署了我们装置插件的Kubenetes集群中,这张GPU分出的vGPU将总共包含 `S * M` 显存。 -* `devicePlugin.migStrategy:` - 字符串类型,目前支持"none“与“mixed“两种工作方式,前者忽略MIG设备,后者使用专门的资源名称指定MIG设备,使用详情请参考mix_example.yaml,默认为"none" -* `devicePlugin.disablecorelimit:` - 字符串类型,"true"为关闭算力限制,"false"为启动算力限制,默认为"false" -* `scheduler.defaultMem:` - 整数类型,预设值为5000,表示不配置显存时使用的默认显存大小,单位为MB -* `scheduler.defaultCores:` - 整数类型(0-100),默认为0,表示默认为每个任务预留的百分比算力。若设置为0,则代表任务可能会被分配到任一满足显存需求的GPU中,若设置为100,代表该任务独享整张显卡 -* `scheduler.defaultGPUNum:` - 整数类型,默认为1,如果配置为0,则配置不会生效。当用户在 pod 资源中没有设置 nvidia.com/gpu 这个 key 时,webhook 会检查 nvidia.com/gpumem、resource-mem-percentage、nvidia.com/gpucores 这三个 key 中的任何一个 key 有值,webhook 都会添加 nvidia.com/gpu 键和此默认值到 resources limit中。 -* `resourceName:` - 字符串类型, 申请vgpu个数的资源名, 默认: "nvidia.com/gpu" -* `resourceMem:` - 字符串类型, 申请vgpu显存大小资源名, 默认: "nvidia.com/gpumem" -* `resourceMemPercentage:` - 字符串类型,申请vgpu显存比例资源名,默认: "nvidia.com/gpumem-percentage" -* `resourceCores:` - 字符串类型, 申请vgpu算力资源名, 默认: "nvidia.com/cores" -* `resourcePriority:` - 字符串类型,表示申请任务的任务优先级,默认: "nvidia.com/priority" - -# 容器配置(在容器的环境变量中指定) - -* `GPU_CORE_UTILIZATION_POLICY:` - 字符串类型,"default", "force", "disable" - 代表容器算力限制策略, "default"为默认,"force"为强制限制算力,一般用于测试算力限制的功能,"disable"为忽略算力限制 -* `ACTIVE_OOM_KILLER:` - 字符串类型,"true", "false" - 代表容器是否会因为超用显存而被终止执行,"true"为会,"false"为不会 \ No newline at end of file diff --git a/docs/dashboard.md b/docs/dashboard.md deleted file mode 100644 index 878d70b..0000000 --- a/docs/dashboard.md +++ /dev/null @@ -1,54 +0,0 @@ -## Grafana Dashboard - -- You can load this dashboard json file [gpu-dashboard.json](./gpu-dashboard.json) - -- This dashboard also includes some NVIDIA DCGM metrics: - - [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) deploy:`kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml` - -- use this prometheus custom metric configure: - -```yaml -- job_name: 'kubernetes-vgpu-exporter' - kubernetes_sd_configs: - - role: endpoints - relabel_configs: - - source_labels: [__meta_kubernetes_endpoints_name] - regex: vgpu-device-plugin-monitor - replacement: $1 - action: keep - - source_labels: [__meta_kubernetes_pod_node_name] - regex: (.*) - target_label: node_name - replacement: ${1} - action: replace - - source_labels: [__meta_kubernetes_pod_host_ip] - regex: (.*) - target_label: ip - replacement: $1 - action: replace -- job_name: 'kubernetes-dcgm-exporter' - kubernetes_sd_configs: - - role: endpoints - relabel_configs: - - source_labels: [__meta_kubernetes_endpoints_name] - regex: dcgm-exporter - replacement: $1 - action: keep - - source_labels: [__meta_kubernetes_pod_node_name] - regex: (.*) - target_label: node_name - replacement: ${1} - action: replace - - source_labels: [__meta_kubernetes_pod_host_ip] - regex: (.*) - target_label: ip - replacement: $1 - action: replace -``` - -- reload promethues: - -```bash -curl -XPOST http://{promethuesServer}:{port}/-/reload -``` diff --git a/docs/dashboard_cn.md b/docs/dashboard_cn.md deleted file mode 100644 index 20ee3ec..0000000 --- a/docs/dashboard_cn.md +++ /dev/null @@ -1,53 +0,0 @@ -## Grafana Dashboard - -- 你可以在 grafana 中导入此 [gpu-dashboard.json](./gpu-dashboard.json) -- 此 dashboard 还包括一部分 NVIDIA DCGM 监控指标: - - 
[dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter)部署:`kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml` - -- 添加 prometheus 自定义的监控项: - -```yaml -- job_name: 'kubernetes-vgpu-exporter' - kubernetes_sd_configs: - - role: endpoints - relabel_configs: - - source_labels: [__meta_kubernetes_endpoints_name] - regex: vgpu-device-plugin-monitor - replacement: $1 - action: keep - - source_labels: [__meta_kubernetes_pod_node_name] - regex: (.*) - target_label: node_name - replacement: ${1} - action: replace - - source_labels: [__meta_kubernetes_pod_host_ip] - regex: (.*) - target_label: ip - replacement: $1 - action: replace -- job_name: 'kubernetes-dcgm-exporter' - kubernetes_sd_configs: - - role: endpoints - relabel_configs: - - source_labels: [__meta_kubernetes_endpoints_name] - regex: dcgm-exporter - replacement: $1 - action: keep - - source_labels: [__meta_kubernetes_pod_node_name] - regex: (.*) - target_label: node_name - replacement: ${1} - action: replace - - source_labels: [__meta_kubernetes_pod_host_ip] - regex: (.*) - target_label: ip - replacement: $1 - action: replace -``` - -- 加载 promethues 配置: - -```bash -curl -XPOST http://{promethuesServer}:{port}/-/reload -``` diff --git a/docs/develop/design.md b/docs/develop/design.md deleted file mode 100644 index 02b5800..0000000 --- a/docs/develop/design.md +++ /dev/null @@ -1,28 +0,0 @@ -# Design - - - -The architect of HAMi is shown in the figure above, It is organized in the form of "chart". - -- MutatingWebhook - -The MutatingWebhook checks the validity of each task, and set the "schedulerName" to "HAMi scheduler" if the resource requests have been recognized by HAMi -If Not, the MutatingWebhook does nothing and pass this task to default-scheduler. - -- Scheduler - -HAMi support default kube-scheduler and volcano-scheduler, it implements an extender and register 'Filter' and 'Score' methods to deal with sharable devices. -When a pod with sharable device request arrives, 'Filter' searches the cluster and returns a list of 'available' nodes. 'Score' scores each node 'Filter' returned, and pick the highest one to host the pod. It patches the schedule decision on corresponding pod annotations, for the detailed protocol, see protocol.md - -- DevicePlugin - -When the schedule decision is made, scheduler calls devicePlugin on that node to generate environment variables and mounts according to pod annotations. -Please note that, the DP used here is a customized version, you need to install according to README document with that device. Most officaial DP will not fit in HAMi, and will result in unexpected behaviour - -- InContainer Control - -The implementation of in-container hard limit is different for diffent devices. For example, HAMi-Core is responsible for NVIDIA devices. libnvidia-control.so is responsible for iluvatar devices, etc. HAMi needs to pass the correct environment variables in order for it to operate. - - - -In summary, The flowchart of pod is descirbed as the figure above. 
diff --git a/docs/develop/imgs/flowchart.jpeg b/docs/develop/imgs/flowchart.jpeg deleted file mode 100644 index 1cbe0a5..0000000 Binary files a/docs/develop/imgs/flowchart.jpeg and /dev/null differ diff --git a/docs/develop/imgs/offline_validation.png b/docs/develop/imgs/offline_validation.png deleted file mode 100644 index 8dec962..0000000 Binary files a/docs/develop/imgs/offline_validation.png and /dev/null differ diff --git a/docs/develop/imgs/protocol_pod.png b/docs/develop/imgs/protocol_pod.png deleted file mode 100644 index 0fff3c6..0000000 Binary files a/docs/develop/imgs/protocol_pod.png and /dev/null differ diff --git a/docs/develop/imgs/protocol_register.png b/docs/develop/imgs/protocol_register.png deleted file mode 100644 index 94c2529..0000000 Binary files a/docs/develop/imgs/protocol_register.png and /dev/null differ diff --git a/docs/develop/protocol.md b/docs/develop/protocol.md deleted file mode 100644 index 0473b98..0000000 --- a/docs/develop/protocol.md +++ /dev/null @@ -1,67 +0,0 @@ -# Protocol - -## Device Register - - - -HAMi needs to know the spec of each AI devices in the cluster in order to schedule properly. During device registration, device-plugin needs to keep patching the spec of each device into node annotations every 30 seconds, in the format of the following: - -``` -hami.io/node-handshake-{device-type}: Reported_{device_node_current_timestamp} -hami.io/node-register-{deivce-type}: {Device 1}:{Device2}:...:{Device N} -``` - -The definiation of each device is in the following format: -``` -{Device UUID},{device split count},{device memory limit},{device core limit},{device type},{device numa},{healthy} -``` - -An example is shown below: -``` -hami.io/node-handshake-nvidia: Reported 2024-01-23 04:30:04.434037031 +0000 UTC m=+1104711.777756895 -hami.io/node-handshake-mlu: Requesting_2024.01.10 04:06:57 -hami.io/node-mlu-register: MLU-45013011-2257-0000-0000-000000000000,10,23308,0,MLU-MLU370-X4,0,false:MLU-54043011-2257-0000-0000-000000000000,10,23308,0, -hami.io/node-nvidia-register: GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec,10,32768,100,NVIDIA-Tesla V100-PCIE-32GB,0,true:GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448,10,32768,100,NVIDIA-Tesla V100-PCIE-32GB,0,true: - -``` -In this example, this node has two different AI devices, 2 Nvidia-V100 GPUs, and 2 Cambircon 370-X4 MLUs - -Note that a device node may become unavailable due to hardware or network failure, if a node hasn't registered in last 5 minutes, scheduler will mark that node as 'unavailable'. - -Since system clock on scheduler node and 'device' node may not align properly, scheduler node will patch the following device node annotations every 30s - -``` -hami.io/node-handshake-{device-type}: Requesting_{scheduler_node_current_timestamp} -``` - -If hami.io/node-handshake annotations remains in "Requesting_xxxx" and {scheduler current timestamp} > 5 mins + {scheduler timestamp in annotations}, then this device on that node will be marked "unavailable" in scheduler. 
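-As a small illustration of this registration protocol, the annotation for a given device type can be read straight off the node object. A sketch for an NVIDIA node, using the `hami.io/node-nvidia-register` key shown above; each colon-separated entry in the output should follow the per-device format described earlier:
-
-```bash
-kubectl get node {nodeid} -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'
-```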
- - -## Schedule Decision - - - -HAMi scheduler needs to patch schedule decisions into pod annotations, in the format of the following: - -``` -hami.io/devices-to-allocate:{ctr1 request}:{ctr2 request}:...{Last ctr request}: -hami.io/device-node: {schedule decision node} -hami.io/device-schedule-time: {timestamp} -``` - -each container request is in the following format: - -``` -{device UUID},{device type keywork},{device memory request}:{device core request} -``` - -for example: - -A pod with 2 containers, first container requests 1 GPU with 3G device Memory, second container requests 1 GPU with 5G device Memory, then the patched annotations will be like the - -``` -hami.io/devices-to-allocate: GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448,NVIDIA,3000,0:GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448,NVIDIA,5000,0: -hami.io/vgpu-node: node67-4v100 -hami.io/vgpu-time: 1705054796 -``` - diff --git a/docs/develop/roadmap.md b/docs/develop/roadmap.md deleted file mode 100644 index 70fd2b3..0000000 --- a/docs/develop/roadmap.md +++ /dev/null @@ -1,10 +0,0 @@ -# roadmap - -| feature | description | release | Example | Example expected behaviour | -|--------------------|----------------------------------------------------------------------------------------------------------------------------------------|---------------|--------------|------------| -| Kubernetes schedule layer | Support Resource Quota for vgpu-memory | v3.2.0 | "requests.nvidia.com/gpu-memory: 30000" in ResourceQuota | Pods in this namespace can allocate up to 30G device memory in this namespace | -| | Support Best-fit, idle-first, Numa-first Schedule Policy | v3.2.0 | add "scheduler policy configmap" | execute schedule policy according to configMap | -| | Support k8s 1.28 version with compatable to v1.16 | v3.1.0 | | | -| Add more Heterogeneous AI computing device | HuaWei Ascend Support | v3.1.0 | | | -| | Iluvatar GPU support | v3.1.0 | | | -| |Teco DPU Support | v3.2.0 | | | diff --git a/docs/develop/tasklist.md b/docs/develop/tasklist.md deleted file mode 100644 index 873366f..0000000 --- a/docs/develop/tasklist.md +++ /dev/null @@ -1,118 +0,0 @@ -# Tasks - -## Support Moore threads MTT S4000 - -``` -resources: -requests: - mthreads.com/gpu: ${num} - mthreads.com/vcuda-core: ${core} - mthreads.com/vcuda-memory: ${mem} -limits: - mthreads.com/gpu: ${num} - mthreads.com/vcuda-core: ${core} - mthreads.com/vcuda-memory: ${mem} -``` - -## Support Birentech Model 110 - -``` -resources: -requests: - birentech.com/gpu: ${num} - birentech.com/vcuda-core: ${core} - birentech.com/vcuda-memory: ${mem} -limits: - birentech.com/gpu: ${num} - birentech.com/vcuda-core: ${core} - birentech.com/vcuda-memory: ${mem} -``` - -## Support iluvatar MR-V100 - -``` -resources: -requests: - iluvatar.ai/gpu: ${num} - iluvatar.ai/vcuda-core: ${core} - iluvatar.ai/vcuda-memory: ${mem} -limits: - iluvatar.ai/gpu: ${num} - iluvatar.ai/vcuda-core: ${core} - iluvatar.ai/vcuda-memory: ${mem} -``` - -## Support HuaWei Ascend 910B device - -``` -resources: - requests: - ascend.com/npu: ${num} - ascend.com/npu-core: ${core} - ascend.com/npu-mem: ${mem} - limits: - ascend.com/npu: ${num} - ascend.com/npu-core: ${core} - ascend.com/npu-mem: ${mem} -``` - -## Support resourceQuota for Kubernetes - -Description: ResourceQuota is frequently used in kubernetes namespace. Since the number of virtual devices doesn't mean anything, we need to support the limitation in deviceMemory. 
- -For example, the following resourceQuota -``` -cat < compute-resources.yaml -apiVersion: v1 -kind: ResourceQuota -metadata: - name: compute-resources -spec: - hard: - requests.cpu: "1" - requests.memory: 1Gi - limits.cpu: "2" - limits.memory: 2Gi - requests.nvidia.com/gpu-memory: 30000 -EOF -``` - -with the following command -``` -kubectl create -f ./compute-resources.yaml--namespace=myspace -``` - -will limit the maxinum device memory allocated to namespace 'myspace' to 30G - -## Support multiple schedule policies - -Description: HAMi needs to support multiple schedule policies, to provide meets the need in complex senarios, a pod can select a schedule policy in annotations field. - -The effect of each schedule policy is shown in the table below - -| Schedule Policy | Effect | -| -------- | ------- | -| best-fit | the fewer device memory remains, the higher score | -| idle-first | idle GPU has higher score | -| numa-first | for multiple GPU allocations, GPUs on the same numa have higher score | - - -For example, if a pod want to select a 'best-fit' schedule policy, it can specify .metadata.annotations as the code below: - -``` -apiVersion: v1 -kind: Pod -metadata: - name: gpu-pod - annotations: - nvidia.com/schedule-policy: "best-fit" -spec: - containers: - - name: ubuntu-container - image: ubuntu:18.04 - command:["bash","-c","sleep 86400"] - resources: - limits: - nvidia.com/gpu: 2 # requesting 2 VGPUs -``` - diff --git a/docs/gpu-dashboard.json b/docs/gpu-dashboard.json deleted file mode 100644 index d361708..0000000 --- a/docs/gpu-dashboard.json +++ /dev/null @@ -1,1054 +0,0 @@ -{ - "annotations": { - "list": [ - { - "$$hashKey": "object:192", - "builtIn": 1, - "datasource": "-- Grafana --", - "enable": true, - "hide": true, - "iconColor": "rgba(0, 211, 255, 1)", - "name": "Annotations & Alerts", - "type": "dashboard" - } - ] - }, - "description": "This dashboard is gpu metrics dashboard base on NVIDIA DCGM Exporter and 4paradigm/k8s-vgpu-scheduler", - "editable": true, - "gnetId": 12239, - "graphTooltip": 0, - "id": 46, - "iteration": 1694498903162, - "links": [], - "panels": [ - { - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "max": 100, - "min": 0, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "#EAB839", - "value": 83 - }, - { - "color": "red", - "value": 87 - } - ] - }, - "unit": "celsius" - }, - "overrides": [] - }, - "gridPos": { - "h": 10, - "w": 4, - "x": 0, - "y": 0 - }, - "id": 14, - "options": { - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "mean" - ], - "fields": "", - "values": false - }, - "showThresholdLabels": false, - "showThresholdMarkers": true, - "text": {} - }, - "pluginVersion": "7.5.17", - "targets": [ - { - "expr": "avg(DCGM_FI_DEV_GPU_TEMP{node_name=~\"${node_name}\", gpu=~\"${gpu}\"})", - "interval": "", - "legendFormat": "", - "refId": "A" - } - ], - "timeFrom": null, - "timeShift": null, - "title": "GPU平均温度", - "type": "gauge" - }, - { - "cacheTimeout": null, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "max": 2400, - "min": 0, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "#EAB839", - "value": 1800 - }, - { - "color": "red", - "value": 2200 - } - ] - }, - "unit": "watt" - }, - "overrides": [] - }, - "gridPos": { - "h": 10, - "w": 4, - "x": 4, - "y": 0 - }, - "id": 
16, - "links": [], - "options": { - "orientation": "horizontal", - "reduceOptions": { - "calcs": [ - "sum" - ], - "fields": "", - "values": false - }, - "showThresholdLabels": false, - "showThresholdMarkers": true, - "text": {} - }, - "pluginVersion": "7.5.17", - "targets": [ - { - "expr": "sum(DCGM_FI_DEV_POWER_USAGE{node_name=~\"${node_name}\", gpu=~\"${gpu}\"})", - "interval": "", - "legendFormat": "", - "refId": "A" - } - ], - "timeFrom": null, - "timeShift": null, - "title": "GPU总功率", - "type": "gauge" - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 10, - "w": 8, - "x": 8, - "y": 0 - }, - "hiddenSeries": false, - "id": 12, - "legend": { - "alignAsTable": true, - "avg": false, - "current": false, - "max": false, - "min": false, - "rightSide": false, - "show": false, - "sort": "current", - "sortDesc": false, - "total": false, - "values": false - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "DCGM_FI_DEV_GPU_TEMP{node_name=~\"${node_name}\", gpu=~\"${gpu}\"}", - "instant": false, - "interval": "", - "legendFormat": "{{node_name}} gpu{{gpu}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "GPU温度", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": "individual" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - "$$hashKey": "object:97", - "format": "celsius", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - }, - { - "$$hashKey": "object:98", - "format": "short", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 10, - "w": 8, - "x": 16, - "y": 0 - }, - "hiddenSeries": false, - "id": 2, - "interval": "", - "legend": { - "alignAsTable": true, - "avg": true, - "current": true, - "max": true, - "min": false, - "rightSide": true, - "show": false, - "sideWidth": null, - "total": false, - "values": true - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "DCGM_FI_DEV_SM_CLOCK{node_name=~\"${node_name}\", gpu=~\"${gpu}\"} * 1000000", - "format": "time_series", - "interval": "", - "intervalFactor": 1, - "legendFormat": "{{node_name}} gpu{{gpu}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "GPU SM时钟频率(DCGM_FI_DEV_SM_CLOCK)", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": 
"individual" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - "$$hashKey": "object:462", - "decimals": null, - "format": "hertz", - "label": "", - "logBase": 1, - "max": null, - "min": null, - "show": true - }, - { - "$$hashKey": "object:463", - "format": "short", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 10, - "w": 12, - "x": 0, - "y": 10 - }, - "hiddenSeries": false, - "id": 18, - "legend": { - "avg": true, - "current": false, - "max": true, - "min": false, - "rightSide": false, - "show": true, - "total": false, - "values": true - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "DCGM_FI_DEV_FB_USED{node_name=~\"${node_name}\", gpu=~\"${gpu}\"}", - "interval": "", - "legendFormat": "{{node_name}} gpu{{gpu}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "GPU帧缓存(显存)使用量(DCGM_FI_DEV_FB_USED)", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": "individual" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - "$$hashKey": "object:618", - "format": "decmbytes", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - }, - { - "$$hashKey": "object:619", - "format": "short", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 10, - "w": 12, - "x": 12, - "y": 10 - }, - "hiddenSeries": false, - "id": 10, - "legend": { - "alignAsTable": false, - "avg": false, - "current": false, - "max": true, - "min": true, - "rightSide": false, - "show": true, - "total": false, - "values": true - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "DCGM_FI_DEV_POWER_USAGE{node_name=~\"${node_name}\", gpu=~\"${gpu}\"}", - "interval": "", - "legendFormat": "{{node_name}} gpu{{gpu}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "GPU功率消耗(DCGM_FI_DEV_POWER_USAGE)", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": "individual" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - 
"$$hashKey": "object:214", - "format": "watt", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - }, - { - "$$hashKey": "object:215", - "format": "short", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 10, - "w": 12, - "x": 0, - "y": 20 - }, - "hiddenSeries": false, - "id": 6, - "legend": { - "alignAsTable": false, - "avg": false, - "current": false, - "max": true, - "min": true, - "rightSide": false, - "show": true, - "total": false, - "values": true - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "DCGM_FI_DEV_GPU_UTIL{node_name=~\"${node_name}\", gpu=~\"${gpu}\"}", - "interval": "", - "legendFormat": "{{node_name}} gpu{{gpu}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "GPU利用率(DCGM_FI_DEV_GPU_UTIL)", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": "cumulative" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - "$$hashKey": "object:699", - "format": "percent", - "label": null, - "logBase": 1, - "max": "100", - "min": "0", - "show": true - }, - { - "$$hashKey": "object:700", - "format": "short", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 10, - "w": 12, - "x": 12, - "y": 20 - }, - "hiddenSeries": false, - "id": 24, - "legend": { - "alignAsTable": false, - "avg": true, - "current": false, - "max": true, - "min": false, - "rightSide": false, - "show": true, - "total": false, - "values": true - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "Device_memory_desc_of_container{node_name=~\"${node_name}\"}", - "interval": "", - "legendFormat": "{{podname}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "4paradigm-pod显存使用量(byte)", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": "individual" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - "$$hashKey": "object:779", - "format": "bytes", - "label": null, - "logBase": 1, - "show": true - }, - { - "$$hashKey": "object:780", - "format": "short", - "label": null, - "logBase": 1, - "max": 
null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 11, - "w": 12, - "x": 0, - "y": 30 - }, - "hiddenSeries": false, - "id": 22, - "legend": { - "alignAsTable": false, - "avg": false, - "current": false, - "max": true, - "min": true, - "rightSide": false, - "show": true, - "total": false, - "values": true - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "HostGPUMemoryUsage{node_name=~\"${node_name}\"}", - "interval": "", - "legendFormat": "{{node_name}} gpu {{deviceid}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "4paradigm-节点GPU显存使用量", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": "individual" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - "$$hashKey": "object:1087", - "format": "mbytes", - "label": null, - "logBase": 1, - "show": true - }, - { - "$$hashKey": "object:1088", - "format": "short", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - }, - { - "aliasColors": {}, - "bars": false, - "dashLength": 10, - "dashes": false, - "datasource": "ALL", - "fieldConfig": { - "defaults": { - "links": [] - }, - "overrides": [] - }, - "fill": 1, - "fillGradient": 0, - "gridPos": { - "h": 11, - "w": 12, - "x": 12, - "y": 30 - }, - "hiddenSeries": false, - "id": 20, - "legend": { - "alignAsTable": false, - "avg": false, - "current": false, - "max": true, - "min": true, - "rightSide": false, - "show": true, - "total": false, - "values": true - }, - "lines": true, - "linewidth": 2, - "nullPointMode": "null", - "options": { - "alertThreshold": true - }, - "percentage": false, - "pluginVersion": "7.5.17", - "pointradius": 2, - "points": false, - "renderer": "flot", - "seriesOverrides": [], - "spaceLength": 10, - "stack": false, - "steppedLine": false, - "targets": [ - { - "exemplar": true, - "expr": "HostCoreUtilization{node_name=~\"${node_name}\"}", - "interval": "", - "legendFormat": "{{node_name}} gpu {{deviceid}}", - "refId": "A" - } - ], - "thresholds": [], - "timeFrom": null, - "timeRegions": [], - "timeShift": null, - "title": "4paradigm-节点GPU算力使用率", - "tooltip": { - "shared": true, - "sort": 0, - "value_type": "individual" - }, - "type": "graph", - "xaxis": { - "buckets": null, - "mode": "time", - "name": null, - "show": true, - "values": [] - }, - "yaxes": [ - { - "$$hashKey": "object:1243", - "format": "percent", - "label": null, - "logBase": 1, - "max": "100", - "min": "0", - "show": true - }, - { - "$$hashKey": "object:1244", - "format": "short", - "label": null, - "logBase": 1, - "max": null, - "min": null, - "show": true - } - ], - "yaxis": { - "align": false, - "alignLevel": null - } - } - ], - "refresh": false, - "schemaVersion": 27, - "style": "dark", - "tags": [], - "templating": { - "list": [ - { - "allValue": 
null, - "current": { - "selected": false - }, - "datasource": "ALL", - "definition": "label_values({__name__=~\"DCGM_FI_DEV_FB_FREE|vGPU_device_memory_limit_in_bytes\"}, node_name)", - "description": null, - "error": null, - "hide": 0, - "includeAll": false, - "label": null, - "multi": true, - "name": "node_name", - "options": [], - "query": { - "query": "label_values({__name__=~\"DCGM_FI_DEV_FB_FREE|vGPU_device_memory_limit_in_bytes\"}, node_name)", - "refId": "StandardVariableQuery" - }, - "refresh": 1, - "regex": "", - "skipUrlSync": false, - "sort": 0, - "tagValuesQuery": "", - "tags": [], - "tagsQuery": "", - "type": "query", - "useTags": false - }, - { - "allValue": null, - "current": { - "selected": false, - "text": [ - "All" - ], - "value": [ - "$__all" - ] - }, - "datasource": "ALL", - "definition": "label_values(DCGM_FI_DEV_FB_FREE{node_name=\"$node_name\"},gpu)", - "description": null, - "error": null, - "hide": 0, - "includeAll": true, - "label": null, - "multi": true, - "name": "gpu", - "options": [], - "query": { - "query": "label_values(DCGM_FI_DEV_FB_FREE{node_name=\"$node_name\"},gpu)", - "refId": "ALL-gpu-Variable-Query" - }, - "refresh": 1, - "regex": "", - "skipUrlSync": false, - "sort": 1, - "tagValuesQuery": "", - "tags": [], - "tagsQuery": "", - "type": "query", - "useTags": false - } - ] - }, - "time": { - "from": "now-12h", - "to": "now" - }, - "timepicker": { - "refresh_intervals": [ - "5s", - "10s", - "30s", - "1m", - "5m", - "15m", - "30m", - "1h", - "2h", - "1d" - ] - }, - "timezone": "", - "title": "k8s-vgpu-scheduler Dashboard", - "uid": "Oxed_c6Wz1", - "version": 3 -} \ No newline at end of file diff --git a/docs/hygon-dcu-support.md b/docs/hygon-dcu-support.md deleted file mode 100644 index 44151bd..0000000 --- a/docs/hygon-dcu-support.md +++ /dev/null @@ -1,94 +0,0 @@ -## Introduction - -**We now support hygon.com/dcu by implementing most device-sharing features as nvidia-GPU**, including: - -***DCU sharing***: Each task can allocate a portion of DCU instead of a whole DCU card, thus DCU can be shared among multiple tasks. - -***Device Memory Control***: DCUs can be allocated with certain device memory size on certain type(i.e Z100) and have made it that it does not exceed the boundary. - -***Device compute core limitation***: DCUs can be allocated with certain percentage of device core(i.e hygon.com/dcucores:60 indicate this container uses 60% compute cores of this device) - -***DCU Type Specification***: You can specify which type of DCU to use or to avoid for a certain task, by setting "hygon.com/use-dcutype" or "hygon.com/nouse-dcutype" annotations. - -## Prerequisites - -* dtk driver with virtualization enabled(i.e dtk-22.10.1-vdcu), try the following command to see if your driver has virtualization ability - -``` -hdmcli -show-device-info -``` - -If this command can't be found, then you should contact your device provider to aquire a vdcu version of dtk driver. 
- -* The absolute path of the dtk driver must be the same on every DCU node (e.g. placed in /root/dtk-driver) - -## Enabling DCU-sharing Support - -* Install the chart using helm. See the 'enabling vGPU support in kubernetes' section [here](https://github.com/Project-HAMi/HAMi#enabling-vgpu-support-in-kubernetes). Please note that you should set your dtk driver directory using --set devicePlugin.hygondriver={your dtk driver path on each node}, for example: - -``` -helm install vgpu vgpu-charts/vgpu --set devicePlugin.hygondriver="/root/dcu-driver/dtk-22.10.1-vdcu" --set scheduler.kubeScheduler.imageTag={your k8s server version} -n kube-system -``` - -* Label the DCU nodes with the following command -``` -kubectl label node {dcu-node} dcu=on -``` - -## Running DCU jobs - -Hygon DCUs can now be requested by a container using the `hygon.com/dcunum`, `hygon.com/dcumem` and `hygon.com/dcucores` resource types: - -``` -apiVersion: v1 -kind: Pod -metadata: - name: alexnet-tf-gpu-pod-mem - labels: - purpose: demo-tf-amdgpu -spec: - containers: - - name: alexnet-tf-gpu-container - image: pytorch:resnet50 - workingDir: /root - command: ["sleep","infinity"] - resources: - limits: - hygon.com/dcunum: 1 # requesting 1 DCU - hygon.com/dcumem: 2000 # each DCU requires 2000 MiB device memory - hygon.com/dcucores: 60 # each DCU uses 60% of the total compute cores - -``` - -## Enable vDCU inside the container - -You need to enable vDCU inside the container in order to use it. -``` -source /opt/hygondriver/env.sh -``` - -Check whether you have successfully enabled vDCU by using the following command: - -``` -hdmcli -show-device-info -``` - -If you see output like the following, you have successfully enabled vDCU inside the container. - -``` -Device 0: - Actual Device: 0 - Compute units: 60 - Global memory: 2097152000 bytes -``` - -Launch your DCU tasks as you usually would. - -## Notes - -1. DCU-sharing in init containers is not supported; pods with "hygon.com/dcumem" in an init container will never be scheduled. - -2. Only one vDCU can be acquired per container. 
If you want to mount multiple dcu devices, then you shouldn't set `hygon.com/dcumem` or `hygon.com/dcucores` - - \ No newline at end of file diff --git a/docs/hygon-dcu-support_cn.md b/docs/hygon-dcu-support_cn.md deleted file mode 100644 index b99956e..0000000 --- a/docs/hygon-dcu-support_cn.md +++ /dev/null @@ -1,83 +0,0 @@ -## 简介 - -本组件支持复用海光DCU设备,并为此提供以下几种与vGPU类似的复用功能,包括: - -***DCU 共享***: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡 - -***可限制分配的显存大小***: 你现在可以用显存值(例如3000M)来分配DCU,本组件会确保任务使用的显存不会超过分配数值 - -***可限制计算单元数量***: 你现在可以指定任务使用的算力比例(例如60即代表使用60%算力)来分配DCU,本组件会确保任务使用的算力不会超过分配数值 - -***指定DCU型号***:当前任务可以通过设置annotation("hygon.com/use-dcutype","hygon.com/nouse-dcutype")的方式,来选择使用或者不使用某些具体型号的DCU - -## 节点需求 - -* 带有虚拟化功能的dtk驱动(例如dtk-22.10.1-vdcu),相关组件可以在海光开发者社区获取,或联系您的设备提供商 - -* 在宿主机上执行hdmcli -show-device-info获取设备信息,若能成功获取,则代表配置成功。若找不到指令,说明您安装的驱动不带有虚拟化功能,请联系厂商获取代虚拟化功能的dtk驱动 - -* 需要将各个DCU节点上的dtk驱动路径放置在统一的绝对路径上,例如均放置在/root/dtk-driver - -## 开启DCU复用 - -* 通过helm部署本组件, 参照[主文档中的开启vgpu支持章节](https://github.com/Project-HAMi/HAMi/blob/master/README_cn.md#kubernetes开启vgpu支持),需要注意的是,必须使用--set devicePlugin.hygondriver="/root/dcu-driver/dtk-22.10.1-vdcu" 手动指定dtk驱动的绝对路径 - -``` -helm install vgpu vgpu-charts/vgpu --set devicePlugin.hygondriver="/root/dcu-driver/dtk-22.10.1-vdcu" --set scheduler.kubeScheduler.imageTag={your k8s server version} -n kube-system -``` - -* 使用以下指令,为DCU节点打上label -``` -kubectl label node {dcu-node} dcu=on -``` - -## 运行DCU任务 - -``` -apiVersion: v1 -kind: Pod -metadata: - name: alexnet-tf-gpu-pod-mem - labels: - purpose: demo-tf-amdgpu -spec: - containers: - - name: alexnet-tf-gpu-container - image: pytorch:resnet50 - workingDir: /root - command: ["sleep","infinity"] - resources: - limits: - hygon.com/dcunum: 1 # requesting a GPU - hygon.com/dcumem: 2000 # each dcu require 2000 MiB device memory - hygon.com/dcucores: 60 # each dcu use 60% of total compute cores - -``` - -## 容器内开启虚拟DCU功能 - -使用vDCU首先需要激活虚拟环境 -``` -source /opt/hygondriver/env.sh -``` - -随后,使用hdmcli指令查看虚拟设备是否已经激活 -``` -hdmcli -show-device-info -``` - -若输出如下,则代表虚拟设备已经成功激活 -``` -Device 0: - Actual Device: 0 - Compute units: 60 - Global memory: 2097152000 bytes -``` - -接下来正常启动DCU任务即可 - -## 注意事项 - -1. 在init container中无法使用DCU复用功能,否则该任务不会被调度 - -2. 每个容器最多只能使用一个虚拟DCU设备, 如果您希望在容器中挂载多个DCU设备,则不能使用`hygon.com/dcumem`和`hygon.com/dcucores`字段 diff --git a/docs/iluvatar-gpu-support.md b/docs/iluvatar-gpu-support.md deleted file mode 100644 index 77815ed..0000000 --- a/docs/iluvatar-gpu-support.md +++ /dev/null @@ -1,86 +0,0 @@ -## Introduction - -**We now support iluvatar.ai/gpu by implementing most device-sharing features as nvidia-GPU**, including: - -***GPU sharing***: Each task can allocate a portion of GPU instead of a whole GPU card, thus GPU can be shared among multiple tasks. - -***Device Memory Control***: GPUs can be allocated with certain device memory size on certain type(i.e m100) and have made it that it does not exceed the boundary. - -***Device Core Control***: GPUs can be allocated with limited compute cores on certain type(i.e m100) and have made it that it does not exceed the boundary. - -***Very Easy to use***: You don't need to modify your task yaml to use our scheduler. All your GPU jobs will be automatically supported after installation. 
- -## Prerequisites - -* Iluvatar gpu-manager (please consult your device provider) -* driver version > 3.1.0 - -## Enabling GPU-sharing Support - -* Deploy gpu-manager on the Iluvatar nodes (please consult your device provider to acquire its package and documentation) - -> **NOTICE:** *Install only gpu-manager, don't install the gpu-admission package.* - -* Identify the resource names used for core and memory usage (e.g. 'iluvatar.ai/vcuda-core', 'iluvatar.ai/vcuda-memory') - -* Set the 'iluvatarResourceMem' and 'iluvatarResourceCore' parameters when installing HAMi - -``` -helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set iluvatarResourceMem=iluvatar.ai/vcuda-memory --set iluvatarResourceCore=iluvatar.ai/vcuda-core -n kube-system -``` - -## Running Iluvatar jobs - -Iluvatar GPUs can now be requested by a container using the `iluvatar.ai/vgpu`, `iluvatar.ai/vcuda-memory` and `iluvatar.ai/vcuda-core` resource types: - -``` -apiVersion: v1 -kind: Pod -metadata: - name: poddemo -spec: - restartPolicy: Never - containers: - - name: poddemo - image: harbor.4pd.io/vgpu/corex_transformers@sha256:36a01ec452e6ee63c7aa08bfa1fa16d469ad19cc1e6000cf120ada83e4ceec1e - command: - - bash - args: - - -c - - | - set -ex - echo "export LD_LIBRARY_PATH=/usr/local/corex/lib64:$LD_LIBRARY_PATH">> /root/.bashrc - cp -f /usr/local/iluvatar/lib64/libcuda.* /usr/local/corex/lib64/ - cp -f /usr/local/iluvatar/lib64/libixml.* /usr/local/corex/lib64/ - source /root/.bashrc - sleep 360000 - resources: - requests: - iluvatar.ai/vgpu: 1 - iluvatar.ai/vcuda-core: 50 - iluvatar.ai/vcuda-memory: 64 - limits: - iluvatar.ai/vgpu: 1 - iluvatar.ai/vcuda-core: 50 - iluvatar.ai/vcuda-memory: 64 -``` - -> **NOTICE1:** *Each unit of vcuda-memory represents 256M of device memory* - -> **NOTICE2:** *You can find more examples in the [examples/iluvatar folder](../examples/iluvatar/)* - -## Notes - -1. You need to run the following prestart commands in order for device sharing to work properly -``` - set -ex - echo "export LD_LIBRARY_PATH=/usr/local/corex/lib64:$LD_LIBRARY_PATH">> /root/.bashrc - cp -f /usr/local/iluvatar/lib64/libcuda.* /usr/local/corex/lib64/ - cp -f /usr/local/iluvatar/lib64/libixml.* /usr/local/corex/lib64/ - source /root/.bashrc -``` - -2. 
Virtualization takes effect only for containers that apply for one GPU(i.e iluvatar.ai/vgpu=1 ) - - \ No newline at end of file diff --git a/docs/iluvatar-gpu-support_cn.md b/docs/iluvatar-gpu-support_cn.md deleted file mode 100644 index ef17980..0000000 --- a/docs/iluvatar-gpu-support_cn.md +++ /dev/null @@ -1,84 +0,0 @@ -## 简介 - -本组件支持复用天数智芯GPU设备,并为此提供以下几种与vGPU类似的复用功能,包括: - -***GPU 共享***: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡 - -***可限制分配的显存大小***: 你现在可以用显存值(例如3000M)来分配MLU,本组件会确保任务使用的显存不会超过分配数值,注意只有M100型号的M150支持可配显存 - -***可限制分配的算力核组比例***: 你现在可以用算力比例(例如60%)来分配GPU,本组件会确保任务使用的显存不会超过分配数值,注意只有M100型号的M150支持可配算力比例 - -***方便易用***: 部署本组件后,只需要部署厂家提供的gpu-manager即可使用 - - -## 节点需求 - -* Iluvatar gpu-manager (please consult your device provider) -* driver version > 3.1.0 - -## 开启GPU复用 - -* 部署'gpu-manager',天数智芯的GPU共享需要配合厂家提供的'gpu-manager'一起使用,请联系设备提供方获取 - -> **注意:** *只需要安装gpu-manager,不要安装gpu-admission.* - -* 部署'gpu-manager'之后,你需要确认显存和核组对应的资源名称(例如 'iluvatar.ai/vcuda-core', 'iluvatar.ai/vcuda-memory') - -* 在安装HAMi时配置'iluvatarResourceMem'和'iluvatarResourceCore'参数 - -``` -helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set iluvatarResourceMem=iluvatar.ai/vcuda-memory --set iluvatarResourceCore=iluvatar.ai/vcuda-core -n kube-system -``` - -## 运行GPU任务 - -``` -apiVersion: v1 -kind: Pod -metadata: - name: poddemo -spec: - restartPolicy: Never - containers: - - name: poddemo - image: harbor.4pd.io/vgpu/corex_transformers@sha256:36a01ec452e6ee63c7aa08bfa1fa16d469ad19cc1e6000cf120ada83e4ceec1e - command: - - bash - args: - - -c - - | - set -ex - echo "export LD_LIBRARY_PATH=/usr/local/corex/lib64:$LD_LIBRARY_PATH">> /root/.bashrc - cp -f /usr/local/iluvatar/lib64/libcuda.* /usr/local/corex/lib64/ - cp -f /usr/local/iluvatar/lib64/libixml.* /usr/local/corex/lib64/ - source /root/.bashrc - sleep 360000 - resources: - requests: - iluvatar.ai/vgpu: 1 - iluvatar.ai/vcuda-core: 50 - iluvatar.ai/vcuda-memory: 64 - limits: - iluvatar.ai/vgpu: 1 - iluvatar.ai/vcuda-core: 50 - iluvatar.ai/vcuda-memory: 64 -``` - -> **注意1:** *每一单位的vcuda-memory代表256M的显存.* - -> **注意2:** *查看更多的[用例](../examples/iluvatar/).* - -## 注意事项 - -1. 你需要在容器中进行如下的设置才能正常的使用共享功能 -``` - set -ex - echo "export LD_LIBRARY_PATH=/usr/local/corex/lib64:$LD_LIBRARY_PATH">> /root/.bashrc - cp -f /usr/local/iluvatar/lib64/libcuda.* /usr/local/corex/lib64/ - cp -f /usr/local/iluvatar/lib64/libixml.* /usr/local/corex/lib64/ - source /root/.bashrc -``` - -2. 共享模式只对申请一张GPU的容器生效(iluvatar.ai/vgpu=1) - - diff --git a/docs/offline-install.md b/docs/offline-install.md deleted file mode 100644 index 0a53ee8..0000000 --- a/docs/offline-install.md +++ /dev/null @@ -1,57 +0,0 @@ -# Offline-install Maunal - -For some cluster that don't have external web access, you can install HAMi by the following step: - -1. Refer to [README.md](../README.md) until step 'Install and Uninstall' - -2. copy the source of project into the master node in your cluster, placed in a path like "/root/HAMi" - -3. pull the following images and save them into a '.tar' file, then move it into the master node in your cluster - -Image list: -``` -4pdosc/k8s-vdevice:{HAMi version} -docker.io/jettech/kube-webhook-certgen:v1.5.2 -liangjw/kube-webhook-certgen:v1.1.1 -registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:{your kubernetes version} -``` - -``` -docker pull {iamge} && docker save {image_name} -o {image_name}.tar -``` - -4. 
Load these images using docker load, tag them with your registry, and push them to your registry - -``` -docker load -i {image_name}.tar -docker tag 4pdosc/k8s-vdevice:{HAMi version} {registry}/k8s-vdevice:{HAMi version} -docker push {registry}/k8s-vdevice:{HAMi version} -``` - -5. Edit the following fields in /root/HAMi/chart/vgpu/values.yaml to point to the images you pushed (see the values.yaml sketch at the end of this diff) - -``` -scheduler.kubeScheduler.image -scheduler.extender.image -scheduler.patch.image -scheduler.patch.imageNew -scheduler.devicePlugin.image -scheduler.devicePlugin.monitorimage -``` - -6. Execute the following command in your /root/HAMi/chart folder - -``` -helm install vgpu vgpu --set scheduler.kubeScheduler.imageTag={your k8s server version} -n kube-system -``` - -7. Verify your installation by executing the following command - -``` -kubectl get pods -n kube-system -``` - -If you can see both the 'device-plugin' and 'scheduler' pods running, then HAMi has been installed successfully, as shown in the figure below: - - diff --git a/imgs/arch.png b/imgs/arch.png deleted file mode 100644 index 1eafd88..0000000 Binary files a/imgs/arch.png and /dev/null differ diff --git a/imgs/benchmark.png b/imgs/benchmark.png deleted file mode 100644 index 3de68fd..0000000 Binary files a/imgs/benchmark.png and /dev/null differ diff --git a/imgs/benchmark_inf.png b/imgs/benchmark_inf.png deleted file mode 100644 index ec52cb5..0000000 Binary files a/imgs/benchmark_inf.png and /dev/null differ diff --git a/imgs/benchmark_train.png b/imgs/benchmark_train.png deleted file mode 100644 index 78eaa92..0000000 Binary files a/imgs/benchmark_train.png and /dev/null differ diff --git a/imgs/example.png b/imgs/example.png deleted file mode 100644 index 0f407f4..0000000 Binary files a/imgs/example.png and /dev/null differ diff --git a/imgs/hard_limit.jpg b/imgs/hard_limit.jpg deleted file mode 100644 index 554bfbb..0000000 Binary files a/imgs/hard_limit.jpg and /dev/null differ
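Step 5 of the deleted offline-install guide lists only the dotted key paths to override in chart/vgpu/values.yaml, not what the edited file looks like. The following is a minimal sketch, assuming a private registry reachable at registry.example.com and taking the key layout exactly as the dotted paths are listed in step 5; which upstream image from step 3 maps to which key is an assumption that should be checked against the actual values.yaml shipped with your HAMi version.

```
# Hypothetical excerpt of chart/vgpu/values.yaml after retagging the images
# from step 3 for a private registry (registry.example.com is a placeholder).
# The nesting mirrors the dotted paths listed in step 5; the image-to-key
# mapping shown here is illustrative, not authoritative.
scheduler:
  kubeScheduler:
    image: "registry.example.com/kube-scheduler"          # was registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler
  extender:
    image: "registry.example.com/k8s-vdevice"             # was 4pdosc/k8s-vdevice
  patch:
    image: "registry.example.com/kube-webhook-certgen"    # was docker.io/jettech/kube-webhook-certgen:v1.5.2
    imageNew: "registry.example.com/kube-webhook-certgen" # was liangjw/kube-webhook-certgen:v1.1.1
  devicePlugin:
    image: "registry.example.com/k8s-vdevice"
    monitorimage: "registry.example.com/k8s-vdevice"
```

After this edit, the `helm install` command from step 6 is run unchanged; only the image references inside values.yaml point to the private registry.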