Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exclude image-field from notebook controller reconciliation for all containers present in annotation #72

Closed

Conversation

shalberd
Copy link

@shalberd shalberd commented Mar 17, 2023

fixes kubeflow/notebooks#98

superseded by PR against v1.7-branch in PR-133
Looking for feedback before thinking of merging. AFIK, this should be pretty close to being mergeable, though.

seems to build alright locally, except unclear to me how to update the common/components, where reconcilehelper lies.
Got an error that is probably easy to resolve and dependent on the build chain:

kubeflow/kubeflow#13 51.49 controllers/notebook_controller.go:213:58: too many arguments in call to "github.com/kubeflow/kubeflow/components/common/reconcilehelper".CopyStatefulSetFields                                                                                                     
kubeflow/kubeflow#13 51.49 	have (*"k8s.io/api/apps/v1".StatefulSet, *"k8s.io/api/apps/v1".StatefulSet, bool, []string, logr.Logger)
kubeflow/kubeflow#13 51.49 	want (*"k8s.io/api/apps/v1".StatefulSet, *"k8s.io/api/apps/v1".StatefulSet)

that Problem was resolved by referencing our version of the reconcilehelper utilities-module under /components/common in notebook controlller go.mod:

// use our version of kubeflow/components/common module for reconcilehelper utility differences related to image-field reconciliation and affected containers exclude
replace (
	github.com/kubeflow/kubeflow/components/common => ../common
)

Key idea:

Updating notebook controller logic and reconcilehelper util logic to take into account optional Openshift image policy admission plug-in image-change trigger annotation with one or more containers and their related imagestreams (json array) that leads to container image-field update and to avoid overwriting the updated image-field value by the notebook controller.

Discussed problem with @bparees , received input from @VaishnaviHire, member of Open Data Hub (a RedHat project for Openshift) team. Also discussed how to resolve the issue in Notebook reconciler with Ben.

Context: Notebook yaml contains the (optional) metadata annotation

 annotations:
    image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"s2i-generic-data-science-notebook:v0.0.5", "namespace":"odhsven"},"fieldPath":"spec.template.spec.containers[?(@.name==\"jupyter-nb-sven\")].image"}]'

referencing imagestream name and tag, imagestream namespace, and fieldPath reference to the image-value to be replaced by the image policy admission plug-in for a given container name. This annotation can reference one or more containers, in the form of a json array. This image-name resolve by the image policy admission plug-in happens in basic Kubernetes objects, not in the Notebook yaml itself, but in the StatefulSet derived from the Notebook yaml. The resolved image-name is set briefly after application of the StatefulSet to the server.

Or, referencing the openshift documentation:

When one of the core Kubernetes resources contains both a pod template and this annotation, OpenShift Container Platform attempts to update the object by using the image currently associated with the image stream tag that is referenced by trigger. The update is performed against the fieldPath specified.

To ensure that image-fields referenced via this annotation in the StatefulSet are not overwritten by the kubeflow notebook controller reconcile process, we set image-fields-values before change and after change to be equal, thus leading to DeepEqual returning false even when the image-field value changes.

Because the annotation is optional, checks are introduced in the code to ensure any custom logic is only done when the annotation is present.

Background:

making image-field value independent of internal openshift registry and supporting ImageContentSourcePolicy for Openshift environments with a third-party Docker repo, like VMWare Harbor. This is archieved by using the imagestream abstraction. Referencing imagestream name and tag instead of image-digest-values directly. Imagestreams are kind of an abstraction in openshift.

See what happens to the image-field cause of the image policy admission plug-in doing its work:

  • Without open shift registry:

<image stream name>:<image stream tag> references in image-field or annotation of a Kubernetes object (Deployment, StatefulSet, Pod etc.) always resolve to external image reference from imagestream spec.tags[tagname].from.name regardless of whether spec.tags[tagname].referencePolicy.type is set to Local or Source in image stream--> perfect for air gapped and ImageContentSourcePolicy use or if one just wants to pull directly from the external location when there is no internal openshift registry.

containers:
    - name: testissource
      image: >-
         quay.io/thoth-station/s2i-generic-data-science notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58
  • With openshift registry:

if spec.tags[tagname].referencePolicy.type of the imagestream is set to Source then <image stream name>:<image stream tag> references in image-field or annotation of a Kubernetes object (Deployment, StatefulSet, Pod etc.) resolves to the remote repo location:

containers:
    - name: testissource
      image: >-
         quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58

if spec.tags[tagname].referencePolicy.type of the imagestream is set to Local then <image stream name>:<image stream tag> references in image-field or annotation of a Kubernetes object (Deployment, StatefulSet, Pod etc.) resolves to the namespace-specific location of the image in the open shift-internal registry:

containers
    - name: testisreflocal
      image: >-
        image-registry.openshift-image-registry.svc:5000/testis/jupyter-tensorflow-notebook@sha256:fc52e4fbc8c1c70dfa22dbfe6b0353f5165c507c125df4438fca6a3f31fe976e

@openshift-ci
Copy link

openshift-ci bot commented Mar 17, 2023

Hi @shalberd. Thanks for your PR.

I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from LaVLaS and VaishnaviHire March 17, 2023 12:46
@openshift-ci
Copy link

openshift-ci bot commented Mar 17, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vaishnavihire for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shalberd
Copy link
Author

shalberd commented Mar 17, 2023

For functional testing, assuming the ci tests run through alright, I will take the notebook controller image from this PR, quay.io/opendatahub/kubeflow-notebook-controller:pr-72], use it on our OCP 4.10 cluster in the manifests, manually apply a Notebook CR to the cluster as proposed in odh-dashboard PR opendatahub-io/odh-dashboard#800 and see if the image-field gets updated accordingly. I will document what I did here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase The PR needs a rebase or there are conflicts label Mar 17, 2023
…tatefulset if present in notebook, to exclude image-fields in compare, to set image-field to new value in reconcile
@shalberd shalberd force-pushed the v1.6-branch-unmodified branch from 3c5297c to 2497bf2 Compare March 17, 2023 16:20
@openshift-merge-robot openshift-merge-robot removed the needs-rebase The PR needs a rebase or there are conflicts label Mar 18, 2023
@VaishnaviHire
Copy link
Member

/ok-to-test

@VaishnaviHire
Copy link
Member

@shalberd Since an existing function is updated here, we would need to update the unit tests. Also can we add unit tests for this change ?

@shalberd
Copy link
Author

regarding tests, absolutely, yes. Just give me a pointer where I need to go to, please.

…per utility differences related to image-field reconciliation and affected containers exclude
…ents passed as key-value-pairs for logging and controller restarting by correctly using go-logr (msg, keysAndValues...)
@shalberd
Copy link
Author

shalberd commented Mar 21, 2023

It worked. Tested on an OCP 4.10.43 Cluster without internal Openshift Registry. Used Manifests and Operator Version 1.4.2.
Modified notebook controller image-tag in manifests to test https://github.com/opendatahub-io/odh-manifests/blob/master/odh-notebook-controller/kf-notebook-controller/overlays/openshift/kustomization.yaml#L14

to

quay.io/opendatahub/kubeflow-notebook-controller:pr-72

Modified a Notebook CR assembled by odh-dashboard to include the image change trigger annotation.

Bildschirmfoto 2023-03-21 um 21 57 07

Image-field-value of the container in the notebook spec can be anything, it does not matter, could even be empty.

Bildschirmfoto 2023-03-21 um 22 24 24

The generated StatefulSet has the same annotation now and the image-field-value is resolved correctly by the image policy admission plug-in from the imagestream name and tag in the annotation to imagestream.tag.from.name, as I do not have an internal openshift registry.

Bildschirmfoto 2023-03-21 um 20 20 40

Bildschirmfoto 2023-03-21 um 21 59 43

And, most important, the StatefulSet container image field value stays that way :-)

Bildschirmfoto 2023-03-21 um 21 28 40

@shalberd
Copy link
Author

shalberd commented Mar 21, 2023

Now thinking of a good reconciliation unit test at https://github.com/opendatahub-io/kubeflow/blob/v1.6-branch/components/notebook-controller/controllers/notebook_controller_test.go .... not with regards to notebook status, but to notebook-derived statefulset image-field pre-reconcile vs. post-reconcile.

@VaishnaviHire I don't really see any detailed testing logic regarding notebook controller reconciliation utils. How critical is it having them? Tested the built kubeflow-notebook-controller in conjunction with odh dashboard and the new notebook Cr assembly logic, working well. That is, the original placeholder-value of the container image field is not kept as the source of truth and the updated container image field value is set as the new desired value, leading to reconciliation not overwriting the image-value that the openshift image policy plug-in set based on the annotation.

@LaVLaS
Copy link

LaVLaS commented Mar 29, 2023

@shalberd Was the original issue caused by the application of image.openshift.io/triggers on the StatefulSet by k8s and, if so, why does the same annotation need to be applied manually to the Notebook CR also

@shalberd
Copy link
Author

shalberd commented Mar 29, 2023

@shalberd Was the original issue caused by the application of image.openshift.io/triggers on the StatefulSet by k8s and, if so, why does the same annotation need to be applied manually to the Notebook CR also

no, the original issue was not directly caused by the application of image.openshift.io.triggers on the StatefulSet by K8S.
It was caused by the image-field of the container in the StatefulSet PodSpec (based on Notebook PodSpec) being overwitten to its original non-resolved value because the kubernetes-based notebook controller does not know about the openshift ImagePolicy admission plug-in.

The image.openshift.io/triggers annotation with the help of the openshift ImagePolicy admission plug-in fills in the image-field value of the container correctly after the StatefulSet is applied, but then, the notebook controller overwrites it to the original value on reconciling StatefulSet new values and original values. This makes sense, though, because Kubernetes does not know the concept of image trigger admission controllers and imagestreams. Background discussion with Ben Parees, Senior Principal Software Engineer at RedHat

The annotation needs to be applied manually to the Notebook CR also because we do not build StatefulSet CRs, we assemble and submit Notebook CRs, that the contoller reacts to and assembles a StatefulSet from.

Now, here is the historical flow of events leading to this solution:

  • ODH Dashboard uses imagestreams to get all the information it needs to assemble a Notebook CR yaml. The kubeflow notebook controller assembles a StatefulSet based on the Notebook CR it finds.
  • ODH Dashboard currently puts into the the image-field-value the value of Imagestream status.dockerImageRepository for the image name, while taking the imagestream tag for the image tag value. Since in environments without internal openshift registry, imagestream status.dockerImageRepository is empty (there is no internal repository), the image-content of the Notebook pod def container is half-empty, missing the image name
  • we came to the conclusion that referencing the imagestream name and tag directly via the image.openshift.io.triggers annotation is the way for being independent of the internal openshift registry. We let the openshift image policy admission controller decide whether to reference the internal openshift registry location or the external image location when filling in the container-image-field with the correct value. That triggers annotation works on StatefulSets, Pods, Deployments, DeploymentConfigs, Pods ... any basic Kubernetes object. It is Openshift-specific, though, as imagestreams are :-)
  • there are basically two parts of this story:
  1. ODH Dashboard does not reference the openshift-internal image location directly anymore when assembling the notebook CR. It uses the image trigger annotation. See stories [Bug]: NotebookUtils assembleNotebook image: dependent on ImageStream dockerImageRepo / internal Docker Repo odh-dashboard#792 and [Bug]: Error: InvalidImageName Failed to apply default image tag ":py3.8-v1": couldn't parse image reference ":py3.8-v1": invalid reference format odh-dashboard#911 for initial problem and linked PR 800 of ODH dashboard.
  2. Notebook Controller still takes all elements and values of the PodSpec when assembling a StatefulSet. However, it disregards overwriting the image-field to its pre-filled-out-value for all containers mentioned in the image-openshift.io/trigers annotation and completed and resolved by the openshift image policy admission plugin.

@shalberd
Copy link
Author

End-to-end test here:

opendatahub-io/odh-dashboard#800 (comment)

@VaishnaviHire
Copy link
Member

/hold

Holding this PR until we test a scenario for long running notebooks, and how this change will affect Notebook restarts

@openshift-ci openshift-ci bot added the do-not-merge/hold Do not merge this PR label Apr 4, 2023
@shalberd
Copy link
Author

shalberd commented Apr 5, 2023

Holding this PR until we test a scenario for long running notebooks, and how this change will affect Notebook restarts

@VaishnaviHire @lucferbux @andrewballantyne @LaVLaS
Some thoughts regarding the current situation and how you wish to avoid container restarts during dynamic development / weekly images changes per a given imagestream tag:

First of all, we assume imagestream spec.lookupPolicy.local: true, as it is in all our imagestreams.

To your question on what happens if the tag (of the image underlying the imagestream tag) moves, that is, if the SHA256 digest of an image tag changes:

It depends on which tag.referencePolicy.type you use in the imagestream. You can set that per-tag.

If it is set to "Source", then the container is created from the external tag url, so the new image behind the tag is always used. Pod parameter imagePullPolicy: Always is important in this case otherwise the old image may be cached on the node by the container runtime.

If set to "Local", then the ImageStream tag points to the sha256 ID of the image, so the container is created from the same image even if the external tag moves. In contrast, using --scheduled=true or manually refreshing the ImageStream (e.g. via oc apply of the imagestream) will update the sha256 ID it refers to, can be seen under imagestream status section for all tags. ImageStreams can periodically (15 mins) monitor if the external tag changes by setting tag[x]. importPolicy: {scheduled: true}

By default, imagestream.tag importpolicy is set to importPolicy: {} non-scheduled.

I think most of your imagestreams, except Elyra in the overlay, use tag.referencePolicy.type: Local.
You don't use importPolicy: {scheduled: true} on the tags, so the status-info in the imagestream for a given imagestream tag should always point to the same image digest, even when the image behind an external image tag changes.

Bildschirmfoto 2023-03-27 um 15 35 21

That is the current situation. About triggering updates on imagestream changes (the new annotation way):

When one of the core Kubernetes resources (e.g StatefulSet) contains both a pod template and this annotation, OpenShift Container Platform attempts to update the object by using the image currently associated with the image stream tag that is referenced by trigger. Meaning imagestream.status.tag[x].items.image.

So, currently, unless you have tag.referencePolicy.Type: Source or you re-apply an imagestream yaml as a whole or you set tag[x]. importPolicy: {scheduled: true}, the status-section of the imagestream and the tag[x].items.image sha256 digest should always stay the same when the image behind the -weekly build tag changes. That holds true for the new image change trigger annotation, too.

As a whole, it really depends on what you want to achieve.

Currently, your -weekly images in the imagestream tags for CICD have referencePolicy: type: Local.

Infos compiled from

https://itnext.io/variations-on-imagestreams-in-openshift-4-f8ee5e8be633

https://docs.openshift.com/container-platform/4.8/openshift_images/triggering-updates-on-imagestream-changes.html#images-triggering-updates-imagestream-changes-kubernetes-about_triggering-updates-on-imagestream-changes

Essentially, on any given openshift cluster, with the way you use the dynamic -weekly imagestream tags right now, your containers will stay stable and refer to the old digest pre-update of the -weekly external image.
As soon as you apply or update anything on the imagestream itself, though, the digests will change. Like for example on a newly-spun-up cluster. This is what happened in our tests when we had the call together, where I changed a from.name image-reference, then saved the imagestream yaml. In that case, of course the digest changed.

Let me know how your running notebooks behave once an image changes in -weekly (digest and last-update time changes in quai.io). They should stay stable and not restart, provided everything else on the cluster is unchanged, including the imagestream yaml.

@VaishnaviHire
Copy link
Member

@shalberd I might be missing some steps, I tested this with quay.io/opendatahub/kubeflow-notebook-controller:pr-72 image. The changes you added worked. When I add annotation to Notebook instance, it gets added to Statefulset

kind: StatefulSet
apiVersion: apps/v1
metadata:
  annotations:
    image.openshift.io/triggers: >-
      [{"from":{"kind":"ImageStreamTag","name":"jupyter-minimal-notebook:py3.8-v1",
      "namespace”:”opendatahub”},”fieldPath":"spec.template.spec.containers[?(@.name==\"jupyter-nb-kube-3aadmin\")].image"}]

However, the image is not updated

Podspec

      image: >-
        image-registry.openshift-image-registry.svc:5000/opendatahub/jupyter-minimal-notebook:py3.9-v2

@shalberd
Copy link
Author

shalberd commented Apr 28, 2023

@VaishnaviHire when you add the annotation manually to a Notebook CR instance, it is important to check what the name of that first container is.
fieldPath":"spec.template.spec.containers[?(@.name==\"jupyter-nb-kube-3aadmin\"

is the name of the notebook container, the one where you then checked the image-field in the created statefulset, really jupyter-nb-kube-3aadmin? I mean, that name is variable, could be it is jupyter-nb-vaishnavi or something`. In ODH Dashboard PR code, the dashboard makes sure that name is exactly as in the containers part of the notebook yaml.

In any case, it has to be the same as the container name field value. fieldPath in the annotation is kind of a lookup pointing to what image-field to update (of which container, by container name).

@VaishnaviHire
Copy link
Member

@VaishnaviHire when you add the annotation manually to a Notebook CR instance, it is important to check what the name of that first container is. fieldPath":"spec.template.spec.containers[?(@.name==\"jupyter-nb-kube-3aadmin\"

is the name of the notebook container, the one where you then checked the image-field in the created statefulset, really jupyter-nb-kube-3aadmin? I mean, that name is variable, could be it is jupyter-nb-vaishnavi or something`. In ODH Dashboard PR code, the dashboard makes sure that name is exactly as in the containers part of the notebook yaml.

In any case, it has to be the same as the container name field value. fieldPath in the annotation is kind of a lookup pointing to what image-field to update (of which container, by container name).

Yes it matches with my container name. I also do not see the log message defined here.
Re-triggering image build CI job.

@VaishnaviHire
Copy link
Member

/test kf-notebook-controller-pr-image-mirror

@shalberd
Copy link
Author

shalberd commented May 1, 2023

Also, what about the namespace in the annotation? Is it really the namespace of the main namespace where e.g. odh dashboard and notebook controller and imagestreams live in? In my case, I was using a namespace called opendatahub, but I know sometimes your default is e.g. odh.

There are also wrongly-formatted quotation marks surrounding opendatahub and before fieldPath in the snippet you pasted. Sorry I did not see that before in your snippet you pasted:

" vs

"namespace”:”opendatahub”},”fieldPath"

If you test this without odh-dashboard:pr-800, then it is also important you take a notebook def not managed by odh dahsboard, I think. Just make a copy and name the notebook CR sligthly differently. But, regarding this point, I do not think it is that critical. You mentioned you are manually testing on own Notebook CRs anyways.

Try out the quotation marks being correct " everywhere in the annotation.

@shalberd
Copy link
Author

/retest

@shalberd
Copy link
Author

shalberd commented Jul 3, 2023

/retest

@shalberd
Copy link
Author

shalberd commented Jul 3, 2023

@harshad16 this change lets notebook controller reconcile all elements of a notebook / statefulset pod spec when no image change trigger annotation is present. That is, old notebooks assembled the old way will still work as expected.

For newly created notebooks in ODH Dashboard PR 800 way, the image change triggger annotation admission plugin of openshift will handle setting the value of container image field. When there is no internal openshift registry, the value will always be the sha256 digest location and path from the external registry, e.. quay.io. When there is an internal openshift registry and tag referencePolicy is set to Local, the internal registry location will be used. When there is an internal openshift registry and tag referencePolicy is set to Source, the sha256 digest location and path from the external registry, e.. quay.io, will be used.

The changes in this PR here make sure that the original placeholder value for the pod container image field from Notebook CR does not come back to Statefulset and therefore pod-container image field after image change trigger admission plugin did its work resolving the image-location.

@VaishnaviHire thank you, too, for all the help and having looked around on this, especially with regards to long-running notebooks.

@shalberd
Copy link
Author

shalberd commented Aug 3, 2023

@shalberd shalberd closed this Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants