feat: upgrade kubernetes client from 11.0.0 -> 24.2.0, implement List+Watch in KubeWatcher #32

Merged: 9 commits into main on Jan 8, 2024

bodom0015 (Member) commented Dec 18, 2023

Problem

Our Kubernetes Python client is badly outdated: we are on 11.0.0, and this PR upgrades it to 24.2.0.

We see intermittent HTTP 500 errors when connecting to the Kube API. Once the error first occurs, it repeats endlessly.

Approach

How to Test

This image has been deployed to job-manager-staging

CLEAN

  1. Navigate to https://clean.frontend.staging.mmli1.ncsa.illinois.edu/configuration
  2. Submit a new CLEAN job
  3. Wait for job to complete

MOLLI

  1. Navigate to https://molli.frontend.staging.mmli1.ncsa.illinois.edu/configuration
  2. Submit a new MOLLI job
  3. Wait for job to complete

Error Handling

With no way to reliably reproduce the error, all we can do is wait for a few days and watch the logs to see if the error surfaces again 😔

  1. Switch to mmli1 cluster: kubectl config use-context mmli1
  2. Check logs for job-manager-staging: kubectl logs -f deploy/job-manager-staging -n staging
  3. If you see the following error (it loops endlessly), then the issue has not been fixed:
2023-12-17 14:35:22,708 [global_vars ] INFO     KubeWatcher is connecting...
2023-12-17 14:35:22,708 [global_vars ] INFO     KubeWatcher connected!
2023-12-17 14:35:22,716 [global_vars ] ERROR    HTTPError encountered - KubeWatcher reconnecting to Kube API: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '7f96ffff-4d2f-4354-ac60-dce5d41c931a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '15dd202d-9244-45ab-a864-3f1580216460', 'X-Kubernetes-Pf-Prioritylevel-Uid': '504fafa2-5644-4478-8b81-9948f41552e2', 'Date': 'Sun, 17 Dec 2023 14:35:22 GMT', 'Content-Length': '186'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"resourceVersion: Invalid value: \\"None\\": strconv.ParseUint: parsing \\"None\\": invalid syntax","code":500}\n'
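
Since the only available check is watching the logs, a small script could flag the failure signature automatically. The following is a minimal sketch and not part of this PR: the function name `count_watcher_errors` and the two marker strings are assumptions, chosen from the log excerpt above, and would need adjusting if the log format changes.

```python
from typing import Dict, Iterable

# Substrings copied from the log excerpt above (assumed to be stable).
RECONNECT_MARKER = "KubeWatcher reconnecting to Kube API: (500)"
BAD_VERSION_MARKER = "resourceVersion: Invalid value"


def count_watcher_errors(lines: Iterable[str]) -> Dict[str, int]:
    """Count how often each failure marker appears in the given log lines."""
    counts = {"reconnect_500": 0, "invalid_resource_version": 0}
    for line in lines:
        if RECONNECT_MARKER in line:
            counts["reconnect_500"] += 1
        if BAD_VERSION_MARKER in line:
            counts["invalid_resource_version"] += 1
    return counts
```

One way to use it would be to pipe the output of the `kubectl logs` command from step 2 into a script that calls `count_watcher_errors(sys.stdin)`. A steadily growing `invalid_resource_version` count would suggest the loop is still occurring.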

Comment on lines +29 to +33
#configuration = client.Configuration()
#api_batch_v1 = client.BatchV1Api(client.ApiClient(configuration))
#api_v1 = client.CoreV1Api(client.ApiClient(configuration))
api_batch_v1 = client.BatchV1Api()
api_v1 = client.CoreV1Api()

The way configuration is overridden has changed between client versions, but we do not need an override for how we use the K8S API client.

@bodom0015 changed the title from "feat: upgrade kubernetes client from 11.0.0 -> 24.2.0" to "feat: upgrade kubernetes client from 11.0.0 -> 24.2.0, implement List+Watch in KubeWatcher" on Dec 19, 2023
Comment on lines -73 to +84
# List all jobs in the watched namespace to get the current resource_version
namespaced_jobs: V1JobList = kubejob.api_batch_v1.list_namespaced_job(namespace=kubejob.get_namespace())
# Keep the previous resource_version if the list response does not carry one
resource_version = namespaced_jobs.metadata.resource_version if namespaced_jobs.metadata.resource_version else resource_version

# Then, watch for new events using the most recent resource_version
# Resource version is used to keep track of stream progress (in case of resume/retry)
k8s_event_stream = w.stream(func=kubejob.api_batch_v1.list_namespaced_job,
                            namespace=kubejob.get_namespace(),
                            resource_version=resource_version,
                            timeout_seconds=timeout_seconds)

Attempt to implement List+Watch pattern, as described here:
kubernetes-client/python#843 (comment)
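
To illustrate the control flow of List+Watch independently of the Kubernetes client, here is a minimal sketch under stated assumptions. The names `list_then_watch`, `list_fn`, `stream_fn`, and `max_rounds` are hypothetical. In this PR, `list_fn` would roughly correspond to calling `list_namespaced_job` and reading `metadata.resource_version`, and `stream_fn` to `w.stream(...)`. The sketch also omits `resource_version` entirely when none is known, so that a literal "None" is never sent to the API server.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


def list_then_watch(
    list_fn: Callable[[], Optional[str]],
    stream_fn: Callable[..., Iterable[Dict[str, Any]]],
    max_rounds: int = 1,
) -> Iterator[Dict[str, Any]]:
    """Yield events from repeated list-then-watch rounds.

    list_fn   -- hypothetical callable returning the current resource version
                 (or None if the list response carries none)
    stream_fn -- hypothetical callable returning an iterable of events; it
                 receives resource_version as a keyword only when one is known
    """
    resource_version: Optional[str] = None
    for _ in range(max_rounds):
        # LIST: refresh the resource version, keeping the old one as a fallback
        listed_version = list_fn()
        if listed_version:
            resource_version = listed_version

        # WATCH: pass resource_version only if we actually have one
        kwargs: Dict[str, Any] = {}
        if resource_version is not None:
            kwargs["resource_version"] = resource_version

        for event in stream_fn(**kwargs):
            yield event
```

Re-listing before each watch means a stale or missing resource version from a previous round is replaced before the stream is reopened, which appears to be the intent of the change above.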

@bodom0015 merged commit def1b47 into main on Jan 8, 2024
1 check passed