
istiod not found #549

Closed
natalytvinova opened this issue Sep 5, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@natalytvinova

Bug Description

Hi team,
I'm seeing the istio-pilot unit fail with hook failed: "ingress-relation-changed".

Bundle with overlays:
storage-overlay.yaml.txt
mlflow-overlay.yaml.txt
auth-overlay.yaml.txt
kubeflow.yaml.txt

To Reproduce

  1. juju deploy --overlay auth-overlay.yaml --overlay mlflow-overlay.yaml --overlay storage-overlay.yaml ./kubeflow.yaml --trust

Environment

  1. juju 3.5.3
  2. Kubernetes: AKS version 1.29

Relevant Log Output

2024-09-05 12:38:05,119 DEBUG    ops 2.15.0 up and running.
2024-09-05 12:38:05,132 DEBUG    load_ssl_context verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' cert=None trust_env=True http2=False
2024-09-05 12:38:05,136 DEBUG    load_verify_locations cafile='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
2024-09-05 12:38:05,141 DEBUG    connect_tcp.started host='30.0.0.1' port=443 local_address=None timeout=None socket_options=None
2024-09-05 12:38:05,151 DEBUG    connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f722624f580>
2024-09-05 12:38:05,155 DEBUG    start_tls.started ssl_context=<ssl.SSLContext object at 0x7f722627da40> server_hostname='30.0.0.1' timeout=None
2024-09-05 12:38:05,174 DEBUG    start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f722624f370>
2024-09-05 12:38:05,178 DEBUG    send_request_headers.started request=<Request [b'GET']>
2024-09-05 12:38:05,182 DEBUG    send_request_headers.complete
2024-09-05 12:38:05,185 DEBUG    send_request_body.started request=<Request [b'GET']>
2024-09-05 12:38:05,189 DEBUG    send_request_body.complete
2024-09-05 12:38:05,193 DEBUG    receive_response_headers.started request=<Request [b'GET']>
2024-09-05 12:38:05,197 DEBUG    receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Audit-Id', b'bbcc2695-0f63-432a-9191-e448474a09c2'), (b'Cache-Control', b'no-cache, private'), (b'Content-Type', b'application/json'), (b'X-Kubernetes-Pf-Flowschema-Uid', b'84c227cf-9203-4880-b5bb-f6fe52b4c249'), (b'X-Kubernetes-Pf-Prioritylevel-Uid', b'605c7632-f2f9-478b-ba8d-e09792c9847c'), (b'Date', b'Thu, 05 Sep 2024 12:38:05 GMT'), (b'Content-Length', b'1976')])
2024-09-05 12:38:05,202 INFO     HTTP Request: GET https://30.0.0.1/api/v1/namespaces/kubeflow/services/istio-ingressgateway-workload "HTTP/1.1 200 OK"
2024-09-05 12:38:05,205 DEBUG    receive_response_body.started request=<Request [b'GET']>
2024-09-05 12:38:05,209 DEBUG    receive_response_body.complete
2024-09-05 12:38:05,213 DEBUG    response_closed.started
2024-09-05 12:38:05,217 DEBUG    response_closed.complete
2024-09-05 12:38:05,223 WARNING  0 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-09-05 12:38:05,227 WARNING  Invalid Grafana dashboards folder at /var/lib/juju/agents/unit-istio-pilot-0/charm/src/grafana_dashboards: directory does not exist
2024-09-05 12:38:05,280 DEBUG    Emitting Juju event ingress_relation_changed.
2024-09-05 12:38:05,395 INFO     Rendering manifests
2024-09-05 12:38:05,399 DEBUG    Rendering with context: {'auth_service_name': 'oidc-gatekeeper', 'auth_service_namespace': 'kubeflow', 'app_name': 'istio-pilot', 'envoyfilter_name': 'istio-pilot-authn-filter', 'envoyfilter_namespace': 'kubeflow', 'gateway_ports': [8080, 8443], 'port': 8080, 'request_headers': ['cookie', 'X-Auth-Token'], 'response_headers': ['kubeflow-userid']}
2024-09-05 12:38:05,403 DEBUG    Rendering manifest for src/manifests/auth_filter.yaml.j2
2024-09-05 12:38:05,410 DEBUG    Rendered manifest:
---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: istio-pilot-authn-filter
  namespace: kubeflow
spec:
  configPatches:
  
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          portNumber: 8080
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        # For some reason, INSERT_FIRST doesn't work
        operation: INSERT_BEFORE
        value:
          # See: https://www.envoyproxy.io/docs/envoy/v1.17.0/configuration/http/http_filters/ext_authz_filter#config-http-filters-ext-authz
          name: "envoy.filters.http.ext_authz"
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
            http_service:
              server_uri:
                uri: http://oidc-gatekeeper.kubeflow.svc.cluster.local:8080
                cluster: outbound|8080||oidc-gatekeeper.kubeflow.svc.cluster.local
                timeout: 10s
              authorization_request:
                allowed_headers:
                  patterns:
                    
                    - exact: cookie
                    
                    - exact: X-Auth-Token
                    
              authorization_response:
                allowed_upstream_headers:
                  patterns:
                      
                      - exact: kubeflow-userid
                      
  
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          portNumber: 8443
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        # For some reason, INSERT_FIRST doesn't work
        operation: INSERT_BEFORE
        value:
          # See: https://www.envoyproxy.io/docs/envoy/v1.17.0/configuration/http/http_filters/ext_authz_filter#config-http-filters-ext-authz
          name: "envoy.filters.http.ext_authz"
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
            http_service:
              server_uri:
                uri: http://oidc-gatekeeper.kubeflow.svc.cluster.local:8080
                cluster: outbound|8080||oidc-gatekeeper.kubeflow.svc.cluster.local
                timeout: 10s
              authorization_request:
                allowed_headers:
                  patterns:
                    
                    - exact: cookie
                    
                    - exact: X-Auth-Token
                    
              authorization_response:
                allowed_upstream_headers:
                  patterns:
                      
                      - exact: kubeflow-userid
                      
  

  workloadSelector:
    labels:
      istio: ingressgateway
2024-09-05 12:38:05,420 DEBUG    Applying 1 resources
2024-09-05 12:38:05,424 DEBUG    load_ssl_context verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' cert=None trust_env=True http2=False
2024-09-05 12:38:05,428 DEBUG    load_verify_locations cafile='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
2024-09-05 12:38:05,433 DEBUG    Creating <class 'lightkube.generic_resource.EnvoyFilter'> istio-pilot-authn-filter...
2024-09-05 12:38:05,437 DEBUG    connect_tcp.started host='30.0.0.1' port=443 local_address=None timeout=None socket_options=None
2024-09-05 12:38:05,443 DEBUG    connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f722620c9d0>
2024-09-05 12:38:05,446 DEBUG    start_tls.started ssl_context=<ssl.SSLContext object at 0x7f72261b8940> server_hostname='30.0.0.1' timeout=None
2024-09-05 12:38:05,467 DEBUG    start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f722620ce20>
2024-09-05 12:38:05,471 DEBUG    send_request_headers.started request=<Request [b'PATCH']>
2024-09-05 12:38:05,476 DEBUG    send_request_headers.complete
2024-09-05 12:38:05,479 DEBUG    send_request_body.started request=<Request [b'PATCH']>
2024-09-05 12:38:05,483 DEBUG    send_request_body.complete
2024-09-05 12:38:05,487 DEBUG    receive_response_headers.started request=<Request [b'PATCH']>
2024-09-05 12:38:05,495 DEBUG    receive_response_headers.complete return_value=(b'HTTP/1.1', 500, b'Internal Server Error', [(b'Audit-Id', b'0739aa27-4702-41dd-b711-00d649ab8852'), (b'Cache-Control', b'no-cache, private'), (b'Content-Type', b'application/json'), (b'X-Kubernetes-Pf-Flowschema-Uid', b'84c227cf-9203-4880-b5bb-f6fe52b4c249'), (b'X-Kubernetes-Pf-Prioritylevel-Uid', b'605c7632-f2f9-478b-ba8d-e09792c9847c'), (b'Date', b'Thu, 05 Sep 2024 12:38:05 GMT'), (b'Content-Length', b'515')])
2024-09-05 12:38:05,499 INFO     HTTP Request: PATCH https://30.0.0.1/apis/networking.istio.io/v1alpha3/namespaces/kubeflow/envoyfilters/istio-pilot-authn-filter?force=true&fieldManager=lightkube "HTTP/1.1 500 Internal Server Error"
2024-09-05 12:38:05,502 DEBUG    receive_response_body.started request=<Request [b'PATCH']>
2024-09-05 12:38:05,506 DEBUG    receive_response_body.complete
2024-09-05 12:38:05,511 DEBUG    response_closed.started
2024-09-05 12:38:05,514 DEBUG    response_closed.complete
2024-09-05 12:38:05,808 ERROR    Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/httpx/_models.py", line 761, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'https://30.0.0.1/apis/networking.istio.io/v1alpha3/namespaces/kubeflow/envoyfilters/istio-pilot-authn-filter?force=true&fieldManager=lightkube'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 1209, in <module>
    main(Operator)
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/ops/main.py", line 551, in main
    manager.run()
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/ops/main.py", line 530, in run
    self._emit()
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/ops/main.py", line 519, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 321, in reconcile
    self._reconcile_ingress_auth(ingress_auth_data)
  File "./src/charm.py", line 747, in _reconcile_ingress_auth
    krh.apply()
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 363, in apply
    raise e
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 336, in apply
    apply_many(
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/charmed_kubeflow_chisme/lightkube/batch/_many.py", line 72, in apply_many
    returns[i] = client.apply(
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/lightkube/core/client.py", line 457, in apply
    return self.patch(type(obj), name, obj, namespace=namespace,
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/lightkube/core/client.py", line 325, in patch
    return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/lightkube/core/generic_client.py", line 245, in request
    return self.handle_response(method, resp, br)
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
    raise transform_exception(e)
lightkube.core.exceptions.ApiError: Internal error occurred: failed calling webhook "validation.istio.io": failed to call webhook: Post "https://istiod.kubeflow-dev.svc:443/validate?timeout=10s": service "istiod" not found


Juju status:

ubuntu@jumphost:~/cpe-deployments/config/kubeflow$ juju status
Model     Controller                  Cloud/Region                Version  SLA          Timestamp
kubeflow  aks-cos-germanywestcentral  aks-dev/germanywestcentral  3.5.3    unsupported  12:50:21Z

App                      Version                  Status       Scale  Charm                    Channel          Rev  Address       Exposed  Message
admission-webhook                                 active           1  admission-webhook        1.9/stable       344  30.0.155.84   no       
argo-controller                                   active           1  argo-controller          3.4/stable       545  30.0.245.19   no       
dex-auth                                          active           1  dex-auth                 2.39/stable      548  30.0.242.68   no       
envoy                                             active           1  envoy                    2.2/stable       263  30.0.5.162    no       
istio-ingressgateway                              active           1  istio-gateway            1.22/stable     1218  30.0.39.6     no       
istio-pilot                                       error            1  istio-pilot              1.22/stable     1169  30.0.6.84     no       hook failed: "ingress-relation-changed"
jupyter-controller                                active           1  jupyter-controller       1.9/stable      1038  30.0.66.139   no       
jupyter-ui                                        active           1  jupyter-ui               1.9/stable       961  30.0.148.215  no       
katib-controller                                  active           1  katib-controller         0.17/stable      750  30.0.239.111  no       
katib-db                 8.0.37-0ubuntu0.22.04.3  active           3  mysql-k8s                8.0/stable       180  30.0.49.190   no       
katib-db-manager                                  active           1  katib-db-manager         0.17/stable      713  30.0.201.150  no       
katib-ui                                          active           1  katib-ui                 0.17/stable      713  30.0.124.216  no       
kfp-api                                           active           1  kfp-api                  2.2/stable      1552  30.0.170.202  no       
kfp-db                   8.0.37-0ubuntu0.22.04.3  active           3  mysql-k8s                8.0/stable       180  30.0.40.56    no       
kfp-metadata-writer                               active           1  kfp-metadata-writer      2.2/stable       617  30.0.132.227  no       
kfp-persistence                                   maintenance      1  kfp-persistence          2.2/stable      1560  30.0.115.149  no       Reconciling charm: executing component container:persistenceagent
kfp-profile-controller                            active           1  kfp-profile-controller   2.2/stable      1518  30.0.69.43    no       
kfp-schedwf                                       active           1  kfp-schedwf              2.2/stable      1571  30.0.246.25   no       
kfp-ui                                            active           1  kfp-ui                   2.2/stable      1555  30.0.53.32    no       
kfp-viewer                                        active           1  kfp-viewer               2.2/stable      1586  30.0.79.39    no       
kfp-viz                                           active           1  kfp-viz                  2.2/stable      1504  30.0.234.228  no       
knative-eventing                                  active           1  knative-eventing         1.12/stable      459  30.0.198.6    no       
knative-operator                                  active           1  knative-operator         1.12/stable      433  30.0.50.229   no       
knative-serving                                   active           1  knative-serving          1.12/stable      487  30.0.133.159  no       
kserve-controller                                 active           1  kserve-controller        0.13/stable      626  30.0.120.62   no       
kubeflow-dashboard                                active           1  kubeflow-dashboard       1.9/stable       659  30.0.21.30    no       
kubeflow-profiles                                 active           1  kubeflow-profiles        1.9/stable       419  30.0.218.171  no       
kubeflow-roles                                    active           1  kubeflow-roles           1.9/stable       240  30.0.149.11   no       
kubeflow-volumes                                  active           1  kubeflow-volumes         1.9/stable       348  30.0.52.239   no       
metacontroller-operator                           active           1  metacontroller-operator  3.0/stable       311  30.0.203.6    no       
minio                    res:oci-image@5102166    active           1  minio                    ckf-1.9/stable   347  30.0.124.129  no       
mlflow-mysql             8.0.37-0ubuntu0.22.04.3  active           3  mysql-k8s                8.0/stable       180  30.0.38.25    no       
mlflow-server                                     active           1  mlflow-server            2.15/stable      638  30.0.61.221   no       
mlmd                                              active           1  mlmd                     ckf-1.9/stable   213  30.0.218.132  no       
oidc-gatekeeper                                   active           1  oidc-gatekeeper          ckf-1.9/stable   423  30.0.77.1     no       
pvcviewer-operator                                active           1  pvcviewer-operator       1.9/stable       157  30.0.114.158  no       
resource-dispatcher                               active           1  resource-dispatcher      2.0/stable       182  30.0.148.189  no       
tensorboard-controller                            active           1  tensorboard-controller   1.9/stable       333  30.0.184.86   no       
tensorboards-web-app                              active           1  tensorboards-web-app     1.9/stable       321  30.0.95.90    no       
training-operator                                 active           1  training-operator        1.8/stable       503  30.0.3.170    no       

Unit                        Workload     Agent  Address       Ports          Message
admission-webhook/0*        active       idle   10.244.9.150                 
argo-controller/0*          active       idle   10.244.9.151                 
dex-auth/0*                 active       idle   10.244.9.152                 
envoy/0*                    active       idle   10.244.9.154                 
istio-ingressgateway/0*     active       idle   10.244.9.153                 
istio-pilot/0*              error        idle   10.244.9.155                 hook failed: "ingress-relation-changed"
jupyter-controller/0*       active       idle   10.244.9.157                 
jupyter-ui/0*               active       idle   10.244.9.158                 
katib-controller/0*         active       idle   10.244.9.160                 
katib-db-manager/0*         active       idle   10.244.6.44                  
katib-db/0*                 active       idle   10.244.9.163                 Primary
katib-db/1                  active       idle   10.244.5.13                  
katib-db/2                  active       idle   10.244.7.14                  
katib-ui/0*                 active       idle   10.244.9.161                 
kfp-api/0*                  active       idle   10.244.9.162                 
kfp-db/0                    active       idle   10.244.9.168                 
kfp-db/1                    active       idle   10.244.7.15                  
kfp-db/2*                   active       idle   10.244.5.14                  Primary
kfp-metadata-writer/0*      active       idle   10.244.6.46                  
kfp-persistence/0*          maintenance  idle   10.244.9.164                 Reconciling charm: executing component container:persistenceagent
kfp-profile-controller/0*   active       idle   10.244.9.165                 
kfp-schedwf/0*              active       idle   10.244.9.167                 
kfp-ui/0*                   active       idle   10.244.9.169                 
kfp-viewer/0*               active       idle   10.244.9.170                 
kfp-viz/0*                  active       idle   10.244.9.171                 
knative-eventing/0*         active       idle   10.244.6.45                  
knative-operator/0*         active       idle   10.244.9.173                 
knative-serving/0*          active       idle   10.244.9.166                 
kserve-controller/0*        active       idle   10.244.9.176                 
kubeflow-dashboard/0*       active       idle   10.244.9.172                 
kubeflow-profiles/0*        active       idle   10.244.6.50                  
kubeflow-roles/0*           active       idle   10.244.6.48                  
kubeflow-volumes/0*         active       idle   10.244.9.175                 
metacontroller-operator/0*  active       idle   10.244.9.174                 
minio/0*                    active       idle   10.244.6.47   9000-9001/TCP  
mlflow-mysql/0              active       idle   10.244.9.187                 
mlflow-mysql/1*             active       idle   10.244.7.16                  Primary
mlflow-mysql/2              active       idle   10.244.9.185                 
mlflow-server/0*            active       idle   10.244.9.186                 
mlmd/0*                     active       idle   10.244.9.188                 
oidc-gatekeeper/0*          active       idle   10.244.9.177                 
pvcviewer-operator/0*       active       idle   10.244.9.183                 
resource-dispatcher/0*      active       idle   10.244.9.179                 
tensorboard-controller/0*   active       idle   10.244.6.51                  
tensorboards-web-app/0*     active       idle   10.244.6.52                  
training-operator/0*        active       idle   10.244.6.49

Additional Context

No response

@natalytvinova natalytvinova added the bug Something isn't working label Sep 5, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6204.

This message was autogenerated

@natalytvinova
Author

natalytvinova commented Sep 5, 2024

The storage overlay above was not correct; attaching the corrected one:

applications:
  katib-db:
    storage:
      database: 100G
  kfp-db:
    storage:
      database: 300G
  minio:
    options:
      access-key: include-file://./../../secrets/s3-key.secret
      secret-key: include-file://./../../secrets/s3-secret.secret
    storage:
      minio-data: 100G
  mlflow-mysql:
    storage:
      database: 300G
  mlmd:
    storage:
      mlmd-data: 150G
     

@kimwnasptd
Contributor

At a quick look, the following error lines stand out:

httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'https://30.0.0.1/apis/networking.istio.io/v1alpha3/namespaces/kubeflow/envoyfilters/istio-pilot-authn-filter?force=true&fieldManager=lightkube'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
...
lightkube.core.exceptions.ApiError: Internal error occurred: failed calling webhook "validation.istio.io": failed to call webhook: Post "https://istiod.kubeflow-dev.svc:443/validate?timeout=10s": service "istiod" not found

It looks like:

  1. The charm is trying to apply some resources.
  2. K8s needs to talk to Istio's webhook to validate the resources.
  3. The webhook is backed by the istiod charm above.
  4. The K8s service is not yet created (a quick check for this is sketched below).
  5. K8s complains.

But the weird thing is that the log complains that it can't find a service istiod.kubeflow-dev. Why is it kubeflow-dev?

@natalytvinova do you have any extra Istio resources on your side?
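
A quick way to confirm step 4 directly (the namespace is taken from the error message):

kubectl get svc istiod -n kubeflow-dev
# expect a NotFound error here if the service (or the whole namespace) is gone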

@natalytvinova
Author

Hm @kimwnasptd, that was the name I gave the model the first time, but then I remembered that the model can only be named "kubeflow". I did destroy that model, though - so it seems the resources were not cleaned up?

@kimwnasptd
Contributor

Hmm, looks like it. We'll want to confirm this on our side as well.

The webhook is a cluster-scoped resource, so there's a chance the charms are not cleaning up global resources, but we'll need to confirm this.

@natalytvinova if you list all validating webhooks in the cluster, do you see any Istio one that redirects traffic to a service in the kubeflow-dev namespace?

@natalytvinova
Author

@kimwnasptd sorry, how do I list them?

@kimwnasptd
Contributor

kubectl get validatingwebhookconfigurations should do the trick
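
To also see which namespace each webhook's service points at, something like this should work (the column spec is illustrative):

kubectl get validatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,SERVICE-NS:.webhooks[*].clientConfig.service.namespace'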

@natalytvinova
Author

Thanks! There it is:

istio-validator-kubeflow                          1          23h
istio-validator-kubeflow-dev                      1          24h

@kimwnasptd
Contributor

We'll need to delete this extra istio-validator-kubeflow-dev in this case

So let's confirm that istiod is up and running after:

  1. Deleting the above validating webhook
  2. Deleting the istiod pod in the kubeflow namespace (a command sketch follows at the end of this comment)

Our team will then look in parallel at why this resource (or what other resources) got left behind. How did you remove the kubeflow-dev model and its charms?
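
A minimal sketch of those two steps, assuming istiod carries the upstream app=istiod label (adjust names to your cluster):

# 1. Delete the stale webhook left over from the old model
kubectl delete validatingwebhookconfiguration istio-validator-kubeflow-dev

# 2. Restart istiod so it re-registers its webhook configuration
kubectl delete pod -n kubeflow -l app=istiod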

@natalytvinova
Author

I think I might have used --force on the istio-pilot application when it got stuck while I was removing the model with juju destroy-model kubeflow-dev --destroy-storage.

@natalytvinova
Author

Hi @kimwnasptd, I cleaned the webhooks and deleted the pod - that didn't help. I redeployed Kubeflow, but it's still trying to access kubeflow-dev. Do you have any ideas about non-namespaced resources that might reference it?

@kimwnasptd
Contributor

@natalytvinova could you give me some more information about the error?

Is it the same issue on the Istio charm?

@natalytvinova
Author

@kimwnasptd the error is still the same as I initially described:

Internal error occurred: failed calling webhook "validation.istio.io": failed to call webhook: Post "https://istiod.kubeflow-dev.svc:443/validate?timeout=10s": service "istiod" not found

I even went ahead and, after destroying the model, cleaned up all the leftover CRDs - that didn't help either.

@kimwnasptd
Contributor

@natalytvinova could you provide a detailed list of steps that you did? I understand it's the following:

  1. Initially deployed Kubeflow in the kubeflow-dev model
  2. Then created a new Kubeflow in the kubeflow model
    a. Was it the same cluster?
    b. How did you clean up the previous kubeflow-dev model?
  3. You manually deleted the istio-validator-kubeflow-dev ValidatingWebhookConfiguration
  4. Deleted the istiod (charm) pod

If the above are not accurate please let me know.

And then you still see the same error? In that case, is there still an istio-validator-kubeflow-dev validating webhook (which would mean it got re-created)?

@natalytvinova
Author

natalytvinova commented Sep 6, 2024

@kimwnasptd Yes, you are correct - exactly those steps, plus:

  5. Destroyed the model again - still seeing the error
  6. Removed all CRDs related to Kubeflow
  7. Redeployed the model - still seeing the error

  • 2.a: same cluster
  • 2.b: I cleaned up with juju destroy-model kubeflow-dev --destroy-storage. But that didn't work on the istio-pilot application, so I had to do juju remove-application istio-pilot --force

There is no istio-validator-kubeflow-dev anymore. This is what I have:

$ kubectl get validatingwebhookconfigurations
NAME                                              WEBHOOKS   AGE
aks-node-validating-webhook                       1          11d
config.webhook.eventing.knative.dev               1          26m
inferencegraph.serving.kserve.io                  1          29h
inferenceservice.serving.kserve.io                1          29h
istio-validator-kubeflow                          1          28h
istiod-default-validator                          1          29h
katib.kubeflow.org                                1          28m
pvcviewer-validating-webhook-configuration        1          26m
trainedmodel.serving.kserve.io                    1          29h
validation.inmemorychannel.eventing.knative.dev   1          26m
validation.webhook.eventing.knative.dev           1          26m
validator.training-operator.kubeflow.org          5          27m

@natalytvinova
Author

In the end we were able to determine that istiod-default-validator contained the old configuration, using: kubectl get validatingwebhookconfigurations -o yaml | grep kubeflow-dev. After deleting the istiod-default-validator entry and restarting the pod, the new configuration kicked in.
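
For anyone hitting the same thing, the full sequence as a sketch (the app=istiod pod label is an assumption; adjust to your cluster):

# Find which webhook configuration still references the old model's namespace
kubectl get validatingwebhookconfigurations -o yaml | grep -n kubeflow-dev

# Delete the stale configuration and restart istiod so a fresh one gets registered
kubectl delete validatingwebhookconfiguration istiod-default-validator
kubectl delete pod -n kubeflow -l app=istiod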

@kimwnasptd
Contributor

Ah, I thought this was taken care of after #549 (comment)

Nice to see us getting to the bottom of this. @natalytvinova I've created a follow-up issue, #551, to track the bug that the operator didn't update the webhook to the expected value.

Is there something else to cover in this issue or should we close it?

@natalytvinova
Author

@kimwnasptd we can close it, thank you for your help!
