I deployed nos with the nebuly-nvidia device plugin in MPS partitioning mode.
Whenever I apply a Deployment/Pod that requires the GPU partitioner to change the GPU partitioning, the nebuly-nvidia device plugin crashes.
I tried to follow what is happening, and this is my guess (see the commands below to check it):

1. A new Deployment is applied; the GPU partitioner checks the pending pods to see whether the partitioning needs to change.
2. The GPU partitioner computes the new partitioning, writes it to a config, and references the new config in the node label nvidia.com/device-plugin.config.
3. At the same time, the nebuly device plugin is triggered by the label change and tries to read the new config referenced by the label.
4. The referenced config does not exist (yet?). Is this a timing issue, i.e. does the config take a moment to become available?
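To check this hypothesis, one can compare the label value with the configs that actually exist at that moment. A rough sketch, using the node name vm125 and namespace nebuly-nvidia from my setup (adjust as needed):

```console
# Current value of the label the device-plugin sidecar watches
# (dots in the label key are escaped in the jsonpath expression)
kubectl get node vm125 -o jsonpath='{.metadata.labels.nvidia\.com/device-plugin\.config}'

# ConfigMaps in the device-plugin namespace around the same time; the config
# named in the label should be present when the sidecar tries to load it
kubectl get configmaps -n nebuly-nvidia
```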
The missing config causes the nebuly device plugin to crash. Because this happens every time a new partitioning is necessary, after a while we run into the Kubernetes CrashLoopBackOff, which means the restart of the nebuly device plugin takes 5 minutes. After those 5 minutes and the restart, the new partitioning becomes active and the pending pods start quickly with access to their configured MPS GPU fractions.
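As a stop-gap, deleting the crashed device-plugin pod skips the remaining back-off delay (the kubelet's restart back-off is capped at 5 minutes) and lets the DaemonSet recreate it immediately. The pod name below is just the one from my logs:

```console
# Recreate the device-plugin pod right away instead of waiting out CrashLoopBackOff
kubectl delete pod nvidia-device-plugin-1722514861-rrdhz -n nebuly-nvidia
```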
Here is the log output of the nebuly-nvidia device plugin. You can see that at 13:05 I deployed a Deployment with a pod requesting nvidia.com/gpu-2gb, which triggered a new partitioning and caused the crash:
```console
kubectl logs pod/nvidia-device-plugin-1722514861-rrdhz -n nebuly-nvidia --follow
Defaulted container "nvidia-device-plugin-sidecar" out of: nvidia-device-plugin-sidecar, nvidia-mps-server, nvidia-device-plugin-ctr, set-compute-mode (init), set-nvidia-mps-volume-permissions (init), nvidia-device-plugin-init (init)
W0801 13:02:37.159120 270 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:02:37Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Updating to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Successfully updated to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Sending signal 'hangup' to 'nvidia-device-plugin'"
time="2024-08-01T13:02:37Z" level=info msg="Successfully sent signal"
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:05:02Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517497"
time="2024-08-01T13:05:02Z" level=info msg="Error: specified config vm125-1722517497 does not exist"
```
It does still work, but this way it always takes 5 minutes for my pods to start whenever the partitioning changes :(