Nebuly k8s-device-plugin not starting on GKE #36
After a deeper dive, I've discovered that this is due to the device plugin being marked as …
Hi @lmyslinski, thank you for raising the issue! The lack of … For L4 GPUs, GKE requires version 1.22.17-gke.5400 or later and NVIDIA driver version 525 or later (here you can find the full requirements). I've seen you're using v1.24.11-gke.1000, so my guess is that the problem is related to the NVIDIA drivers. Could you please try either changing the GPU model or upgrading your GKE version?
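For reference, a rough sketch of how the upgrade could be done with gcloud, assuming a zonal cluster named my-cluster with a GPU node pool named gpu-pool (all names and the target version are placeholders):

```bash
# Placeholders: my-cluster, gpu-pool, us-central1-a, and the target version.
# Upgrade the control plane first, then the GPU node pool.
gcloud container clusters upgrade my-cluster \
  --master --cluster-version=1.27 --zone=us-central1-a
gcloud container clusters upgrade my-cluster \
  --node-pool=gpu-pool --zone=us-central1-a
```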
Hello, I believe I'm running into a similar issue. The whole system works well and can utilize the GPUs using a standard gpu-operator / device plugin install:
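For context, a typical install along those lines looks roughly like this (release and namespace names are just examples):

```bash
# Example only: standard GPU Operator install with its bundled device plugin enabled.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace
```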
Now, I'm trying to enable MPS using your fork of the nvidia-device-plugin. I uninstall gpu-operator. Then, I reinstall it without the standard device plugin using:
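Something along these lines, using the GPU Operator chart's devicePlugin.enabled value to disable its bundled plugin (release and namespace names are examples):

```bash
# Example only: reinstall the GPU Operator without its own device plugin,
# so a separate device plugin can manage the GPUs instead.
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set devicePlugin.enabled=false
```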
Then, I install your device plugin:
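Something like the following, based on the install command from the Nebuly docs as I remember it (the chart location and version may have changed, so please double-check):

```bash
# Example only: install the Nebuly fork of the NVIDIA device plugin.
# Chart URL and version are taken from the Nebuly docs and may be out of date.
helm install oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin \
  --version 0.13.0 \
  --generate-name \
  -n nebuly-nvidia --create-namespace
```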
It installs successfully, but the …
Cluster data:
I have the same problem as @willcray.
@santurini I don't remember anything regarding this setup since I've long moved on, but I've created a detailed post covering all of the GPU Operator stuff: https://lmyslinski.com/posts/gpu-operator-guide/. At a quick glance, if nvidia-smi is not found, that means the driver is not working or not installed.
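A quick way to check that on GKE, assuming the standard driver-installer DaemonSet in kube-system (the label below is the usual one, but verify it on your cluster):

```bash
# Check whether the GKE nvidia-driver-installer pods exist and are healthy.
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer -o wide
# Inspect the installer logs on a given pod (pod name is a placeholder).
kubectl logs -n kube-system <driver-installer-pod> --all-containers --tail=100
```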
Have you found a different solution for enabling MPS in a Kubernetes cluster?
Hi, I'm trying to set up MPS partitioning on GKE, but I can't get the k8s-device-plugin to work. The plugin gets installed correctly, but it never starts any driver pods.
Cluster data:
The node only has the following taints:
It's also properly labeled as …
The regular NVIDIA device plugin worked just fine before I pushed it out with nodeSelectors on the default DaemonSet injected by GKE.
The Nebuly plugin, however, is stuck at 0 pods:
Your documentation mentions that, in order to avoid duplicate drivers on the nodes, we can configure affinity on the pre-existing NVIDIA driver so that both aren't scheduled onto the same nodes (a sketch of the kind of affinity rule I mean is at the end of this post). I've done that for the GKE driver DaemonSet, but that results in a container that's always stuck in creating. Not a big deal, but I just want to confirm that this is expected. Here's which pods I currently have on the GPU node:
Is there anything I'm doing incorrectly here? AFAIK it's not possible to remove the default NVIDIA driver from the cluster, as it's automatically injected by GKE. Please let me know if there's anything I could do to solve this; I'd love to start using your stuff. Thanks a lot for your time.
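For reference, this is roughly the kind of affinity rule I mean, keeping the pre-existing DaemonSet off the nodes meant for MPS. The nos.nebuly.com/gpu-partitioning=mps label is just an example of how the MPS nodes might be labeled; adjust it to whatever your setup actually uses:

```yaml
# Example only: nodeAffinity added to the pre-existing DaemonSet's pod spec so it
# skips nodes labeled for MPS partitioning. The label key/value are placeholders.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nos.nebuly.com/gpu-partitioning
              operator: NotIn
              values:
                - mps
```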