[Feature]: Support detection, allocation and resetting of GPU partitions in CDNA cards #54

lohbe · 2024-03-15T11:11:39Z

Suggestion Description

This is more of a question at this point. CDNA3 MI300x supports up to 8 x partitions per card via SR-IOV. Can k8s-device-plugin

detect the partitions?
allocate the partitions? with strategies (like NVIDIA MIG)?
reset the partitioning strategy?

Operating System

No response

GPU

CDNA, MI300x

ROCm Component

k8s-device-plugin

The text was updated successfully, but these errors were encountered:

boniek83 · 2024-09-04T22:32:14Z

We have a couple of AS -8125GS-TNMR2 machines with mi300x and suffer greatly due to this as well.
Here's NVidia's documentation of this topic https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html
It would be great to have similar functionality (especially including allocation strategies) available with AMD hardware under kubernetes.

The only major thing lacking in NVidia's implementation is allocation of MIG instances on demand - they are all statically allocated which is a serious PITA and not elastic at all. They should be created when requested (eg nvidia.com/mig-1g.5gb: 1) and destroyed when pod is done and when nvidia.com/gpu: 1 is requested full gpu should be attached to a pod and this should be possible all at the same time (of course nvidia.com/mig-1g.5gb and nvidia.com/gpu should be completely different physical gpus if requested at the same time). This would/might create scheduling issues (fragmentation) but nevertheless should be available as an option as this has potential to better utilize available resources and doesn't require administrator to be omniscient when statically allocating MIGs.

boniek83 · 2024-09-18T16:52:24Z

Very interesting deep dive about what is and what is going to be possible on NVidia hardware with kubernetes:
https://www.youtube.com/watch?app=desktop&v=qDfFL78QcnQ
It seems like CDI is the key to the future.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature]: Support detection, allocation and resetting of GPU partitions in CDNA cards #54

[Feature]: Support detection, allocation and resetting of GPU partitions in CDNA cards #54

lohbe commented Mar 15, 2024

boniek83 commented Sep 4, 2024 •

edited

Loading

boniek83 commented Sep 18, 2024

[Feature]: Support detection, allocation and resetting of GPU partitions in CDNA cards #54

[Feature]: Support detection, allocation and resetting of GPU partitions in CDNA cards #54

Comments

lohbe commented Mar 15, 2024

Suggestion Description

Operating System

GPU

ROCm Component

boniek83 commented Sep 4, 2024 • edited Loading

boniek83 commented Sep 18, 2024

boniek83 commented Sep 4, 2024 •

edited

Loading