I think there is a bug in your upgrade process.
Looking at your code, the old Kueue CRDs don't appear to be deleted prior to applying the new ones, as some of your own prescriptions call for.
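For illustration, a minimal sketch of that ordering (the manifest URLs and versions are assumptions based on Kueue's release artifacts, not something taken from xpk's code):

```shell
# Hypothetical upgrade ordering: remove the old release (including its CRDs)
# first, then apply the new one. URLs/versions below are illustrative.
kubectl delete -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/v0.8.1/manifests.yaml \
  --ignore-not-found
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/v0.9.1/manifests.yaml
```

Note that deleting the CRDs also deletes every existing Kueue custom resource, so this ordering is disruptive by design and would need to be handled deliberately.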
And you wait for the kueue deployment to become available, https://github.com/AI-Hypercomputer/xpk/blob/main/src/xpk/core/kueue.py#L201-L202, which is fine as far as it goes, but it turns out that when you force-apply a new deployment of the controller-manager, it does this:
It literally just force-overwrites the existing object. The Deployment now seems to own two ReplicaSets when it only really wants one.
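You can observe this state directly; a possible inspection command (the `control-plane=controller-manager` label and `kueue-system` namespace are assumptions based on Kueue's kubebuilder-style defaults):

```shell
# List the ReplicaSets the kueue-controller-manager Deployment owns,
# alongside the image each one templates for its pods.
kubectl get replicasets -n kueue-system \
  -l control-plane=controller-manager \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image
```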
The ReplicaSets are now bifurcated with their own respective versions, and they are owned by the new Deployment, which has been overwritten due to `--force-conflicts`:

Replicaset One:
Replicaset Two:
Note that the owner references just point at the Deployment, which doesn't seem to have an ID other than the uniqueness of the name:
```
Controlled By: Deployment/kueue-controller-manager
```
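For what it's worth, `kubectl describe` only renders the kind/name here; the full ownerReferences stored on the ReplicaSet also carry the owner's `uid`, which you can dump directly (the ReplicaSet name below is a placeholder, and the namespace is an assumption):

```shell
# Dump the raw ownerReferences, including the owning Deployment's uid.
kubectl get replicaset <replicaset-name> -n kueue-system \
  -o jsonpath='{.metadata.ownerReferences}'
```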
Due to this, the Deployment now owns two ReplicaSets, each with its own image version for its pods:
```
Image: registry.k8s.io/kueue/kueue:v0.9.1
Image: registry.k8s.io/kueue/kueue:v0.8.1
```
and then you end up with these pods:
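A pod listing like the one above can be reproduced with something along these lines (label and namespace again assumed from Kueue's defaults):

```shell
# Show the controller-manager pods with their image and phase,
# which makes the version split visible at the pod level.
kubectl get pods -n kueue-system -l control-plane=controller-manager \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,STATUS:.status.phase
```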
We're quite fortunate that the upgrade did fail (despite xpk thinking it succeeded), because jobs continued to be processed by the old controller-managers while the new ones failed.
But if you look at the configuration of the controller-managers, you can see that they are indeed leader-elected, and I'm not sure that leader election would work across different versions of the CRD? My guess is that it's tied to the ReplicaSet.
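The election state itself can be inspected via the coordination Lease that controller-runtime-based managers hold; something like this would show which pod currently has the lock (namespace assumed, and the lease name varies per project so it isn't hard-coded here):

```shell
# List leader-election Leases; holderIdentity identifies the pod
# that currently holds the leader lock.
kubectl get leases -n kueue-system \
  -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity
```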