upgrade-1-28-latest and upgrade-addons-before-controlplane-1-28-latest failed #2927
@neolit123 I suspect this is related to kubernetes/kubernetes#119156. The etcd manifest is regenerated but identical. The manifest-comparison logic here differs from the one in the kubelet. I may sync that here. Let me check. |
could be. but isn't v1beta4 unused in our CI yet? these jobs have to be run locally to test. looking at the testgrids, however, these jobs have green runs on the 6th and 7th of Sept., which is confusing. |
Let me test locally and give it a try. |
FTR, i was able to reproduce it locally:
i think the problem is that the constructed k8s client cannot obtain the updated etcd pod from etcd.yaml. the next step is to try to understand why the kubelet is not starting the pod, or, if it started it, to see if its logs have errors. |
dumping some logs here. while the CP1 upgrade was running i tried some commands to see what is going on. not seeing any problems. error:
kubelet log:
old manifest:
manifest after upgrade:
the pod from the upgrade manifest is running:
etcd pod logs:
|
hmm, so the etcd pod actually changed:

```diff
@@ -41,11 +41,10 @@ spec:
         path: /health?exclude=NOSPACE&serializable=true
         port: 2381
         scheme: HTTP
       initialDelaySeconds: 10
       periodSeconds: 10
-      successThreshold: 1
       timeoutSeconds: 15
     name: etcd
     resources:
       requests:
         cpu: 100m
@@ -57,30 +56,22 @@ spec:
         path: /health?serializable=false
         port: 2381
         scheme: HTTP
       initialDelaySeconds: 10
       periodSeconds: 10
-      successThreshold: 1
       timeoutSeconds: 15
-    terminationMessagePath: /dev/termination-log
-    terminationMessagePolicy: File
     volumeMounts:
     - mountPath: /var/lib/etcd
       name: etcd-data
     - mountPath: /etc/kubernetes/pki/etcd
       name: etcd-certs
-  dnsPolicy: ClusterFirst
-  enableServiceLinks: true
   hostNetwork: true
   priority: 2000001000
   priorityClassName: system-node-critical
-  restartPolicy: Always
-  schedulerName: default-scheduler
   securityContext:
     seccompProfile:
       type: RuntimeDefault
-  terminationGracePeriodSeconds: 30
   volumes:
   - hostPath:
       path: /etc/kubernetes/pki/etcd
       type: DirectoryOrCreate
     name: etcd-certs
```

did sig node stop applying some defaults? |
but i think the hash-change check is failing because of the etcd static pod annotation: https://github.com/kubernetes/kubernetes/blob/d7aeb7f853e822071041bdbada7fcbe09d15062d/cmd/kubeadm/app/util/apiclient/wait.go#L230. i think we should look at recent kubelet changes. the stranger part is that the kube-apiserver, scheduler and kcm also use the same "check hash" during kubeadm upgrade, but it works for them. |
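For context on that check: the upgrade logic polls the mirror pod through the API server and watches the kubelet-set `kubernetes.io/config.hash` annotation. A simplified sketch (the function name, namespace handling, and timeouts are illustrative assumptions, not kubeadm's exact code):

```go
package staticpod

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForStaticPodHashChange polls the mirror pod of a static pod and
// returns once the kubelet-reported manifest hash differs from oldHash.
// If the kubelet never restarts the static pod (e.g. because it considers
// the manifest unchanged), this times out.
func waitForStaticPodHashChange(client kubernetes.Interface, podName, oldHash string) error {
	return wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods("kube-system").Get(ctx, podName, metav1.GetOptions{})
			if err != nil {
				// the mirror pod may briefly disappear during the restart; keep polling
				return false, nil
			}
			hash := pod.Annotations["kubernetes.io/config.hash"]
			return hash != "" && hash != oldHash, nil
		})
}
```

If the kubelet decides the new manifest hashes the same as the old one, the pod is never restarted and a wait like this times out, which matches the failure described here.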
We may need to confirm whether this is a kubelet behavior change or a pod YAML generation behavior change with default values added. I had some problems running the e2e locally yesterday. |
I checked with the latest kubelet: if the manifest changes (like adding a default spec attribute), the container will be recreated. But the hash |
if the etcd version, flags, and these defaults don't change, the hash will remain the same. i am trying to understand why the logic is even trying to apply this upgrade. technically we shouldn't reach this point. |
I prefer to add logic in kubeadm to add those default values. The diff is something like
|
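For illustration, hardcoding could look roughly like the hypothetical helper below (the field set is taken from the defaulted values in the diff above; this is a sketch, not the actual proposed patch):

```go
package staticpod

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

// applyStaticPodDefaults (hypothetical) pins the fields that the API
// defaulters would otherwise fill in, so the manifest written to disk is
// stable regardless of whether defaulting runs during decode.
func applyStaticPodDefaults(pod *v1.Pod) {
	if pod.Spec.DNSPolicy == "" {
		pod.Spec.DNSPolicy = v1.DNSClusterFirst
	}
	if pod.Spec.RestartPolicy == "" {
		pod.Spec.RestartPolicy = v1.RestartPolicyAlways
	}
	if pod.Spec.SchedulerName == "" {
		pod.Spec.SchedulerName = v1.DefaultSchedulerName
	}
	if pod.Spec.TerminationGracePeriodSeconds == nil {
		pod.Spec.TerminationGracePeriodSeconds = ptr.To[int64](30)
	}
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if c.TerminationMessagePath == "" {
			c.TerminationMessagePath = v1.TerminationMessagePathDefault
		}
		if c.TerminationMessagePolicy == "" {
			c.TerminationMessagePolicy = v1.TerminationMessageReadFile
		}
	}
}
```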
we could, if we want to hardcode certain values and be opinionated about certain behavior.
you mean this PR: confirms that our logic in kubeadm detects a diff, and that's why it proceeds with the etcd.yaml upgrade?
we don't apply a DNS policy to static pods anywhere in kubeadm:

```console
~/go/src/k8s.io/kubernetes/cmd/kubeadm$ grep DNSPolicy * -rnI
~/go/src/k8s.io/kubernetes/cmd/kubeadm$ grep dnsPolicy * -rnI
app/phases/addons/dns/dns_test.go:747:      dnsPolicy: Default
app/phases/addons/dns/dns_test.go:1013:      dnsPolicy: Default
app/phases/addons/dns/manifests.go:146:      dnsPolicy: Default
~/go/src/k8s.io/kubernetes/cmd/kubeadm$ grep ClusterFirst * -rnI
~/go/src/k8s.io/kubernetes/cmd/kubeadm$
```

EDITED |
never mind, looks like it has been ClusterFirst for a long time: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/v1/defaults.go#L221-L226
i'm confused where this defaulting comes from in 1.29-pre... kubeadm does not consume the private types and their defaulting; it uses https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3369-L3376, which has no defaulting. so where are these default v1.Pod values coming from? |
To clarify the problem:
I prefer to think the kubelet works as expected here. The second also works as expected, since kubeadm never applied those default values to the manifest YAML before. So it comes down to the first: why are those default values added to the etcd manifests? |
understood,
did you find a client-side reason? technically, to check if two static pod manifests are different we read them both from disk: i.e. we don't fetch the Pod from the API server (as a mirror pod of the static pod). so if the new static pod has the defaulting, kubeadm must be doing it somehow... which is strange. |
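To make that concrete, a minimal sketch of such a disk-based comparison (the helper name and the use of sigs.k8s.io/yaml are illustrative assumptions; notably, sigs.k8s.io/yaml does plain JSON-based decoding with no scheme involved, so no API defaulting can happen in this variant):

```go
package staticpod

import (
	"os"

	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/yaml"
)

// staticPodsEqual reads two static pod manifests from disk and compares
// their specs semantically. A defaulting decoder changes the outcome only
// when one file has the defaults written out and the other does not.
func staticPodsEqual(oldPath, newPath string) (bool, error) {
	var oldPod, newPod v1.Pod
	oldBytes, err := os.ReadFile(oldPath)
	if err != nil {
		return false, err
	}
	if err := yaml.Unmarshal(oldBytes, &oldPod); err != nil {
		return false, err
	}
	newBytes, err := os.ReadFile(newPath)
	if err != nil {
		return false, err
	}
	if err := yaml.Unmarshal(newBytes, &newPod); err != nil {
		return false, err
	}
	return apiequality.Semantic.DeepEqual(oldPod.Spec, newPod.Spec), nil
}
```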
we do use the apimachinery codecs for v1.Pod* though: read the yaml file -> pass it to a codec -> decode into a known object (go struct), e.g. v1.Pod: https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/marshal.go#L58-L78. so if the decoder has now decided to apply defaults, we have a problem. |
I posted this to the #api-machinery slack channel https://kubernetes.slack.com/archives/C0EG7JC6T/p1694264129815039. Let me do some generation tests locally. |
reproduced it minimally, will post it on slack:

```go
package main

import (
	"testing"

	"github.com/pkg/errors"
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/runtime/serializer"
	clientsetscheme "k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/yaml"
)

var input = []byte(`
apiVersion: v1
kind: Pod
metadata:
  name: foo
`)

func unmarshalFromYamlForCodecs(buffer []byte, gv schema.GroupVersion, codecs serializer.CodecFactory) (runtime.Object, error) {
	const mediaType = runtime.ContentTypeYAML
	info, ok := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), mediaType)
	if !ok {
		return nil, errors.Errorf("unsupported media type %q", mediaType)
	}
	decoder := codecs.DecoderToVersion(info.Serializer, gv)
	obj, err := runtime.Decode(decoder, buffer)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to decode %s into runtime.Object", buffer)
	}
	return obj, nil
}

func TestFunc(t *testing.T) {
	t.Logf("\ninput:\n%s\n", input)
	obj, err := unmarshalFromYamlForCodecs(input, v1.SchemeGroupVersion, clientsetscheme.Codecs)
	if err != nil {
		t.Fatalf("error: %v\n", err)
	}
	pod := obj.(*v1.Pod)
	// t.Logf("%+v\n", pod)
	output, err := yaml.Marshal(pod)
	if err != nil {
		t.Fatalf("error: %v\n", err)
	}
	t.Logf("\noutput:\n\n%s\n", output)
}
```
|
so above i have a mistake. it changed after 1.27, not after 1.28. EDIT: it happened between tags, but there is something else going on... i have no idea why our hash function is returning a matching manifest in those cases. |
1 and 2 merged. the e2e tests are still failing: this is before the diff tool update, but we are still getting the strange defaulter diff.
|
so weird😓 |
We can verify and cherry-pick patches to 1.28. |
The latest diff result (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-1-28-latest/1701751770478284800/build-log.txt) shows that the default values are no longer injected, which is what we expect.
Now we should fix the v1.28 version, and then the CI will be green. I think it will be fixed after we remove the reference that pulls in:

```go
func init() {
	// We only register manually written functions here. The registration of the
	// generated functions takes place in the generated files. The separation
	// makes the code compile even when the generated files are missing.
	localSchemeBuilder.Register(addDefaultingFuncs, addConversionFuncs)
}
```

That's why the default values are injected... |
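To see why that registration matters: apimachinery's CodecFactory wraps its decoders in a defaulting codec, so any defaulting functions registered on the scheme run during decode. A self-contained sketch (the hand-rolled defaulting func below is a stand-in for the generated k/k defaulters that the init() above registers):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"
)

func main() {
	scheme := runtime.NewScheme()
	if err := v1.AddToScheme(scheme); err != nil {
		panic(err)
	}
	// Stand-in for what the k/k internal core/v1 package registers in init().
	scheme.AddTypeDefaultingFunc(&v1.Pod{}, func(obj interface{}) {
		pod := obj.(*v1.Pod)
		if pod.Spec.DNSPolicy == "" {
			pod.Spec.DNSPolicy = v1.DNSClusterFirst
		}
	})

	codecs := serializer.NewCodecFactory(scheme)
	info, ok := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeYAML)
	if !ok {
		panic("yaml media type not supported")
	}
	decoder := codecs.DecoderToVersion(info.Serializer, v1.SchemeGroupVersion)

	obj, err := runtime.Decode(decoder, []byte("apiVersion: v1\nkind: Pod\nmetadata:\n  name: foo\n"))
	if err != nil {
		panic(err)
	}
	// With the defaulting func registered, this prints "ClusterFirst";
	// without it, the field stays empty.
	fmt.Println(obj.(*v1.Pod).Spec.DNSPolicy)
}
```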
kubernetes/kubernetes#120605 EDIT: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1694575171227399 |
but if the 1.28 kubeadm binary is currently producing a different manifest during upgrade due to the internal defaulters, wouldn't that mean the 1.27->1.28 upgrade must be failing as well? instead, it's currently green. edit: the etcd version is the same: edit2: but an etcd upgrade is performed / successful:
|
some testing from me confirms this is really go-version related.
some testing in kinder shows the new manifest will not have the defaults generated, but the old manifest for each of the pods has the defaults. |
just tested with kinder this workflow and there is a diff:

```diff
--- etcd.27.yaml	2023-09-13 18:03:01.839883253 +0300
+++ etcd.28.yaml	2023-09-13 17:55:56.203458105 +0300
@@ -32,7 +32,7 @@
     - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
     - --snapshot-count=10000
     - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
-    image: registry.k8s.io/etcd:3.5.7-0
+    image: registry.k8s.io/etcd:3.5.9-0
     imagePullPolicy: IfNotPresent
     livenessProbe:
       failureThreshold: 8
```

at the time of testing kubeadm 1.27 still has the 3.5.7 etcd: there is a pending backport for 3.5.9: but both the .27 and .28 etcd manifests do not have the defaults!

```console
~/go/src/k8s.io/kubeadm/kinder$ cat etcd.28.yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubeadm.kubernetes.io/etcd.advertise-client-urls: https://172.17.0.2:2379
creationTimestamp: null
labels:
component: etcd
tier: control-plane
name: etcd
namespace: kube-system
spec:
containers:
- command:
- etcd
- --advertise-client-urls=https://172.17.0.2:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --experimental-initial-corrupt-check=true
- --experimental-watch-progress-notify-interval=5s
- --initial-advertise-peer-urls=https://172.17.0.2:2380
- --initial-cluster=kinder-upgrade-control-plane-1=https://172.17.0.2:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://172.17.0.2:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://172.17.0.2:2380
- --name=kinder-upgrade-control-plane-1
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
image: registry.k8s.io/etcd:3.5.9-0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /health?exclude=NOSPACE&serializable=true
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: etcd
resources:
requests:
cpu: 100m
memory: 100Mi
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /health?serializable=false
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /var/lib/etcd
name: etcd-data
- mountPath: /etc/kubernetes/pki/etcd
name: etcd-certs
hostNetwork: true
priority: 2000001000
priorityClassName: system-node-critical
securityContext:
seccompProfile:
type: RuntimeDefault
volumes:
- hostPath:
path: /etc/kubernetes/pki/etcd
type: DirectoryOrCreate
name: etcd-certs
- hostPath:
path: /var/lib/etcd
type: DirectoryOrCreate
name: etcd-data
status: {}
```

unclear to me if kubernetes/kubernetes#120605 will fix anything. |
i saw a go version diff as well at some point. go 1.21 was OK, go 1.20 generated defaults for the TestFunc in this ticket. but kubeadm on the 1.28 branch is still built with go 1.20:
and it's not generating defaults, as i showed above. |
kubernetes/release#3076 is in progress. |
The default values will only be injected when |
hmm, how so? |
in Kinder? IIRC, the defaults will not be injected when I init my cluster on CentOS directly. |
You can try the old v1.28 version:

```console
wget https://storage.googleapis.com/k8s-release-dev/ci/v1.28.2-1+a68748c7cd04f2/bin/linux/amd64/kubeadm
chmod +x kubeadm
kubeadm init ...
```

BTW, kubernetes/kubernetes#120605 is merged. |
/close https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-1-28-latest/1702296617098416128 |
@pacoxu: Closing this issue. |
i don't see defaults after @chendave's cherry-pick merged: it's
but i did not see defaults with the old 1.28 binary either:
|
@neolit123 @pacoxu You can try to reproduce it by:

```bash
docker pull kindest/base:v20221102-76f15095
kinder build node-image-variant --base-image=kindest/base:v20221102-76f15095 --image=kindest/node:test --with-init-artifacts=v1.28.2-1+a68748c7cd04f2 --with-upgrade-artifacts=v1.29.0-alpha.0.802+a68093a3ffb552 --loglevel=debug
kinder create cluster --name=kinder-upgrade --image=kindest/node:test --control-plane-nodes=1 --worker-nodes=1 --loglevel=debug
kinder do kubeadm-init --name=kinder-upgrade --copy-certs=auto --loglevel=debug --kubeadm-verbosity=6
echo "---------------------------------------------"
echo "----Old v1.28 kubeadm generated etcd.yaml----"
echo "---------------------------------------------"
docker exec kinder-upgrade-control-plane-1 cat /etc/kubernetes/manifests/etcd.yaml
kinder delete cluster --name=kinder-upgrade
kinder build node-image-variant --base-image=kindest/base:v20221102-76f15095 --image=kindest/node:test --with-init-artifacts=v1.28.2-7+728862ded5e9d8 --with-upgrade-artifacts=v1.29.0-alpha.0.806+4abf29c5c86349 --loglevel=debug
kinder create cluster --name=kinder-upgrade --image=kindest/node:test --control-plane-nodes=1 --worker-nodes=1 --loglevel=debug
kinder do kubeadm-init --name=kinder-upgrade --copy-certs=auto --loglevel=debug --kubeadm-verbosity=6
echo "---------------------------------------------"
echo "----New v1.28 kubeadm generated etcd.yaml----"
echo "---------------------------------------------"
docker exec kinder-upgrade-control-plane-1 cat /etc/kubernetes/manifests/etcd.yaml
kinder delete cluster --name=kinder-upgrade
```

The output is as follows:

```
...
---------------------------------------------
----Old v1.28 kubeadm generated etcd.yaml----
---------------------------------------------
apiVersion: v1
kind: Pod
metadata:
annotations:
kubeadm.kubernetes.io/etcd.advertise-client-urls: https://172.17.0.3:2379/
creationTimestamp: null
labels:
component: etcd
tier: control-plane
name: etcd
namespace: kube-system
spec:
containers:
- command:
- etcd
- --advertise-client-urls=https://172.17.0.3:2379/
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --experimental-initial-corrupt-check=true
- --experimental-watch-progress-notify-interval=5s
- --initial-advertise-peer-urls=https://172.17.0.3:2380/
- --initial-cluster=kinder-upgrade-control-plane-1=https://172.17.0.3:2380/
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379/,https://172.17.0.3:2379/
- --listen-metrics-urls=http://127.0.0.1:2381/
- --listen-peer-urls=https://172.17.0.3:2380/
- --name=kinder-upgrade-control-plane-1
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
image: registry.k8s.io/etcd:3.5.9-0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /health?exclude=NOSPACE&serializable=true
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 15
name: etcd
resources:
requests:
cpu: 100m
memory: 100Mi
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /health?serializable=false
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 15
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/etcd
name: etcd-data
- mountPath: /etc/kubernetes/pki/etcd
name: etcd-certs
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
priority: 2000001000
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
seccompProfile:
type: RuntimeDefault
terminationGracePeriodSeconds: 30
volumes:
- hostPath:
path: /etc/kubernetes/pki/etcd
type: DirectoryOrCreate
name: etcd-certs
- hostPath:
path: /var/lib/etcd
type: DirectoryOrCreate
name: etcd-data
status: {}
...
---------------------------------------------
----New v1.28 kubeadm generated etcd.yaml----
---------------------------------------------
apiVersion: v1
kind: Pod
metadata:
annotations:
kubeadm.kubernetes.io/etcd.advertise-client-urls: https://172.17.0.4:2379/
creationTimestamp: null
labels:
component: etcd
tier: control-plane
name: etcd
namespace: kube-system
spec:
containers:
- command:
- etcd
- --advertise-client-urls=https://172.17.0.4:2379/
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --experimental-initial-corrupt-check=true
- --experimental-watch-progress-notify-interval=5s
- --initial-advertise-peer-urls=https://172.17.0.4:2380/
- --initial-cluster=kinder-upgrade-control-plane-1=https://172.17.0.4:2380/
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379/,https://172.17.0.4:2379/
- --listen-metrics-urls=http://127.0.0.1:2381/
- --listen-peer-urls=https://172.17.0.4:2380/
- --name=kinder-upgrade-control-plane-1
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
image: registry.k8s.io/etcd:3.5.9-0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /health?exclude=NOSPACE&serializable=true
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: etcd
resources:
requests:
cpu: 100m
memory: 100Mi
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /health?serializable=false
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /var/lib/etcd
name: etcd-data
- mountPath: /etc/kubernetes/pki/etcd
name: etcd-certs
hostNetwork: true
priority: 2000001000
priorityClassName: system-node-critical
securityContext:
seccompProfile:
type: RuntimeDefault
volumes:
- hostPath:
path: /etc/kubernetes/pki/etcd
type: DirectoryOrCreate
name: etcd-certs
- hostPath:
path: /var/lib/etcd
type: DirectoryOrCreate
name: etcd-data
status: {}
```
|
Therefore, if a user initialized the cluster with an old v1.28 kubeadm (<1.28.2), they may encounter problems when upgrading to v1.29. |
strange, in my test here: i ran against the older v1.28.2-1+a68748c7cd04f2 and did not get defaults.
you are running the same "old" version but getting a different etcd.yaml. |
@neolit123 Emm... I don't really understand.
But the kubeadm version is:

```console
# ./kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"28+", GitVersion:"v1.28.2-1+a68748c7cd04f2", GitCommit:"a68748c7cd04f2462352afb05ba31f06fc799595", GitTreeState:"clean", BuildDate:"2023-09-13T09:54:55Z", GoVersion:"go1.20.8", Compiler:"gc", Platform:"linux/amd64"}
root@kinder-upgrade-control-plane-1:/#
```

Perhaps it is related to the execution path? |
no idea. init is technically calling the same phase. |
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-1-28-latest
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-addons-before-controlplane-1-28-latest
keep failing after kubernetes/release#3254.
/assign
ref #2925
See #2927 (comment) for the conclusion.
Option 1: if we have to revert (the kubernetes/pkg import is not allowed in kubeadm!):
- Revert #118867 for the default unmarshal behavior change: kubernetes#120554 in v1.29 (master)
- cherry-pick to v1.28 if needed