plugin site backend OOM killed #3774

Closed
smerle33 opened this issue Oct 5, 2023 · 6 comments

Comments

@smerle33
Contributor

smerle33 commented Oct 5, 2023

Service(s)

plugins.jenkins.io

Summary

While checking deployments on our publick8s Kubernetes cluster, I noticed that the plugin-site backend pod had been restarted repeatedly after being OOMKilled:

│ NAMESPACE↑                  NAME                                                                 PF     READY         RESTARTS STATUS           CPU      MEM     %CPU/R     %CPU/L     %MEM/R     %MEM/L IP                NODE                                     AGE          │
│ plugin-site                 plugin-site-backend-7fcb4c77c8-z294l                                 ●      1/1                277 Running            9     1940          1          0        189         94 10.100.13.24      aks-x86medium-20522204-vmss00000i        8d           │
│ plugin-site                 plugin-site-frontend-59bfb957c4-289cl                                ●      1/1                  0 Running            2       12          2          2         38         38 10.100.4.30       aks-x86medium-20522204-vmss000005        12d          │
│ plugin-site                 plugin-site-frontend-59bfb957c4-l6gl2                                ●      1/1                  0 Running            2       26          2          2         82         82 10.100.13.17      aks-x86medium-20522204-vmss00000i        8d           │
│ plugin-site                 plugin-site-issues-54f968bd64-7cszz                                  ●      1/1                  0 Running           20      123        n/a        n/a        n/a        n/a 10.100.13.18      aks-x86medium-20522204-vmss00000i        8d   

277 restarts as of now.
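
As a rough sanity check on the pod listing above, the memory request and limit can be backed out from the usage and percentage columns. This is only a sketch: it assumes the MEM column is in MiB and that the %MEM/R and %MEM/L values are rounded percentages of the request and limit. The helper name `implied_quota_mib` is hypothetical.

```python
def implied_quota_mib(usage_mib: float, percent_of_quota: float) -> float:
    """Back out a request or limit (MiB) from current usage and the displayed percentage."""
    return usage_mib / (percent_of_quota / 100.0)


# Values copied from the plugin-site-backend row of the listing.
usage = 1940                              # MEM column (assumed MiB)
request = implied_quota_mib(usage, 189)   # %MEM/R column
limit = implied_quota_mib(usage, 94)      # %MEM/L column

print(f"request ~ {request:.0f} MiB, limit ~ {limit:.0f} MiB")
```

Under those assumptions the pod sits at almost twice its request and just under its limit, which would be consistent with a request near 1 GiB and a limit near 2 GiB. The actual values set in the chart are not shown in this thread.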

Reproduction steps

No response

@smerle33 smerle33 added the triage Incoming issues that need review label Oct 5, 2023
@github-actions

github-actions bot commented Oct 5, 2023

Take a look at these similar issues to see if there isn't already a response to your problem:

  1. 72% Plugin not found on plugin site #3696

@dduportal dduportal added this to the infra-team-sync-2023-10-10 milestone Oct 5, 2023
@dduportal dduportal removed the triage Incoming issues that need review label Oct 5, 2023
@smerle33
Contributor Author

smerle33 commented Oct 6, 2023

TL;DR: the application image is incompatible with cgroup v2, and the problem started after the cluster was upgraded to Kubernetes 1.25.


The first step is to check Datadog: going back four months, we can see that the memory issues and pod restarts for plugin-site-backend began on a specific date in early June (around June 8th):

[Screenshot 2023-10-06 at 08:55:06] [Screenshot 2023-10-06 at 08:55:23]

To confirm, we can check in Azure: in the publick8s cluster's Resource health, open "Diagnose and solve problems", then "Node health":

[Screenshot 2023-10-06 at 09:06:21] [Screenshot 2023-10-06 at 09:06:33]

The solution is explained here:

[Screenshot 2023-10-06 at 09:06:41]

So we need to work on the image underlying plugin-site-backend (https://github.com/jenkins-infra/plugin-site-api) to make sure it uses a patched version of the JDK that is compliant with cgroup v2 (https://github.com/jenkins-infra/plugin-site-api/blob/416279518ff3444904c28b1ef3aa56ca3ff7d38b/Dockerfile#L1), with a parent image such as jetty:9-jdk8 or, more specifically, jetty:9.4.52-jdk8-eclipse-temurin.
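
For context, a runtime that is not cgroup v2 aware does not find the container memory limit where it expects it, so it may size itself from the node's memory instead. The sketch below is a minimal illustration of how the two cgroup versions expose that limit, not something taken from the plugin-site-api image. It assumes the standard `/sys/fs/cgroup` mount point, and the function name `read_cgroup_memory_limit` is hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_cgroup_memory_limit(root: str = "/sys/fs/cgroup") -> Tuple[str, Optional[int]]:
    """Return (cgroup version, memory limit in bytes, or None if unset or unreadable)."""
    base = Path(root)
    # cgroup v2 (unified hierarchy) exposes cgroup.controllers at the mount root.
    if (base / "cgroup.controllers").exists():
        version = "v2"
        limit_file = base / "memory.max"
    else:
        # cgroup v1 keeps one directory per controller.
        version = "v1"
        limit_file = base / "memory" / "memory.limit_in_bytes"
    if not limit_file.exists():
        return version, None
    raw = limit_file.read_text().strip()
    # v2 writes the literal string "max" when no limit is set.
    # v1 reports a very large integer instead, which is returned as-is.
    if raw == "max":
        return version, None
    return version, int(raw)


if __name__ == "__main__":
    version, limit = read_cgroup_memory_limit()
    print(f"cgroup {version}, memory limit: {limit}")
```

Running this inside the pod (for example through `kubectl exec`, if Python is available in the image) would show which hierarchy the container sees. A JDK that only knows the v1 path would miss the `memory.max` file used by v2.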

As a side task, we may also want to move the build of this image from trusted.ci to infra.ci.

smerle33 added a commit to jenkins-infra/plugin-site-api that referenced this issue Oct 6, 2023
as per jenkins-infra/helpdesk#3774
need to update the JDK to a version compatible with `cgroup v2`
@smerle33
Contributor Author

smerle33 commented Oct 9, 2023

Blocked by #3778

@dduportal
Contributor

#3778 is fixed. The plugin-site-api container image currently deployed works as expected but keeps being OOM-killed: we need to deploy the latest version, which includes the changes from jenkins-infra/plugin-site-api#119

@dduportal
Contributor

Helm chart update, with a successful test showing that the new memory limit is finally enforced: jenkins-infra/helm-charts#855

@dduportal
Contributor

jenkins-infra/kubernetes-management#4527 deployed the new image to production. No service outage and the new pod seems to use the expected amount of memory:

[Screenshot 2023-10-11 at 17:23:11]
