plugin site backend OOM killed #3774

smerle33 · 2023-10-05T12:30:13Z

Service(s)

plugins.jenkins.io

Summary

While checking deployments on our publick8s Kubernetes cluster, I noticed that the plugin-site backend pod had been restarted after being OOMKilled:

│ NAMESPACE↑                  NAME                                                                 PF     READY         RESTARTS STATUS           CPU      MEM     %CPU/R     %CPU/L     %MEM/R     %MEM/L IP                NODE                                     AGE          │
│ plugin-site                 plugin-site-backend-7fcb4c77c8-z294l                                 ●      1/1                277 Running            9     1940          1          0        189         94 10.100.13.24      aks-x86medium-20522204-vmss00000i        8d           │
│ plugin-site                 plugin-site-frontend-59bfb957c4-289cl                                ●      1/1                  0 Running            2       12          2          2         38         38 10.100.4.30       aks-x86medium-20522204-vmss000005        12d          │
│ plugin-site                 plugin-site-frontend-59bfb957c4-l6gl2                                ●      1/1                  0 Running            2       26          2          2         82         82 10.100.13.17      aks-x86medium-20522204-vmss00000i        8d           │
│ plugin-site                 plugin-site-issues-54f968bd64-7cszz                                  ●      1/1                  0 Running           20      123        n/a        n/a        n/a        n/a 10.100.13.18      aks-x86medium-20522204-vmss00000i        8d

277 restarts as of now.
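For readers who want to interpret the numbers in that listing, here is a rough back-of-the-envelope sketch. It assumes the MEM column is in MiB, that %MEM/R and %MEM/L are usage as a percentage of the memory request and limit, and that the 8d age is a rounded value; the helper name `implied_quota` is made up for illustration.

```python
def implied_quota(usage: float, percent: float) -> float:
    """Back out a request or limit from a usage figure and a usage/quota percentage."""
    return usage * 100.0 / percent


# Values read from the plugin-site-backend row (units are an assumption: MiB).
mem_usage_mib = 1940
request_mib = implied_quota(mem_usage_mib, 189)  # %MEM/R column
limit_mib = implied_quota(mem_usage_mib, 94)     # %MEM/L column

# 277 restarts over a pod age of about 8 days.
restarts = 277
age_days = 8
minutes_between_restarts = age_days * 24 * 60 / restarts

print(f"implied memory request: ~{request_mib:.0f} MiB")
print(f"implied memory limit:   ~{limit_mib:.0f} MiB")
print(f"average time between restarts: ~{minutes_between_restarts:.0f} minutes")
```

Under those assumptions the pod sits at roughly twice its memory request and just under its limit, and is restarted on average about every 40 minutes. The percentages shown are whole numbers, so the results are only approximate.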

Reproduction steps

No response


github-actions · 2023-10-05T12:30:37Z

Take a look at these similar issues to see if there isn't already a response to your problem:

72% Plugin not found on plugin site #3696

smerle33 · 2023-10-06T07:14:50Z

TL;DR: the application image is incompatible with cgroup v2, and the problem started after the Kubernetes upgrade: Cluster was recently upgraded to Kubernetes 1.25

The first step is to check Datadog: going back four months, we can see that the memory issues and pod restarts for plugin-site-backend started on a specific date in early June (around June 8th):

To confirm, we can check in Azure, under Resource health for the publick8s cluster, in Diagnose and solve problems and then Node health:

and the solution is explained here

So we need to work on the image underlying plugin-site-backend: https://github.com/jenkins-infra/plugin-site-api
to make sure it uses a patched version of the JDK that is compatible with cgroup v2 (https://github.com/jenkins-infra/plugin-site-api/blob/416279518ff3444904c28b1ef3aa56ca3ff7d38b/Dockerfile#L1), with a parent image like jetty:9-jdk8 or a more specific one like jetty:9.4.52-jdk8-eclipse-temurin
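To illustrate why the cgroup version matters here, the following is a minimal sketch (not taken from the plugin-site-api repository) of how a process can find its container memory limit. It assumes the usual mount point `/sys/fs/cgroup`, with `memory.max` on cgroup v2 and `memory/memory.limit_in_bytes` on cgroup v1; the function name `read_memory_limit` is hypothetical. As we understand it, a runtime that only knows the v1 location finds nothing on a v2 node and falls back to sizing itself from the host memory.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_memory_limit(
    cgroup_root: Path = Path("/sys/fs/cgroup"),
) -> Tuple[str, Optional[int]]:
    """Return (cgroup_version, limit_in_bytes).

    The limit is None when the file is missing or when cgroup v2 reports "max".
    """
    # The unified (v2) hierarchy exposes cgroup.controllers at its root.
    if (cgroup_root / "cgroup.controllers").exists():
        version = "v2"
        limit_file = cgroup_root / "memory.max"
    else:
        version = "v1"
        limit_file = cgroup_root / "memory" / "memory.limit_in_bytes"

    try:
        raw = limit_file.read_text().strip()
    except OSError:
        return version, None

    # cgroup v2 writes the literal string "max" when no limit is set.
    # cgroup v1 writes a very large integer instead, which is returned as-is.
    if raw == "max":
        return version, None
    return version, int(raw)


if __name__ == "__main__":
    version, limit = read_memory_limit()
    shown = limit if limit is not None else "unlimited or not found"
    print(f"cgroup {version}, memory limit in bytes: {shown}")
```

Running this inside the pod (for example through `kubectl exec`, if a Python interpreter is available in the image) would show which hierarchy the node uses and whether the memory limit is visible to the container.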

As a side task, we may also want to move the build of this image from trusted.ci to infra.ci.

as per jenkins-infra/helpdesk#3774 need to update the jdk to a compatible with `cgroup v2`

smerle33 · 2023-10-09T06:56:32Z

Blocked by #3778

dduportal · 2023-10-11T13:15:26Z

#3778 is fixed. The newly deployed plugin-site-api container image works as expected but is still being OOM-killed: we need to deploy the latest version with the changes from jenkins-infra/plugin-site-api#119

dduportal · 2023-10-11T13:35:04Z

Helm chart update, with a successful test of the new memory limit finally being enforced: jenkins-infra/helm-charts#855

dduportal · 2023-10-11T15:24:00Z

jenkins-infra/kubernetes-management#4527 deployed the new image to production. No service outage and the new pod seems to use the expected amount of memory:

smerle33 added the triage Incoming issues that need review label Oct 5, 2023

jenkins-infra-helpdesk-app bot added the plugins.jenkins.io label Oct 5, 2023

dduportal added this to the infra-team-sync-2023-10-10 milestone Oct 5, 2023

dduportal assigned smerle33 Oct 5, 2023

dduportal removed the triage Incoming issues that need review label Oct 5, 2023

smerle33 added a commit to jenkins-infra/plugin-site-api that referenced this issue Oct 6, 2023

Bump jdk version of the parent image

1597d91

as per jenkins-infra/helpdesk#3774 need to update the jdk to a compatible with `cgroup v2`

This was referenced Oct 6, 2023

feat: Switch JDK version of the parent image to latest Temurin JDK8 jenkins-infra/plugin-site-api#119

Merged

Plugin Site API is not building since february #3778

Closed

dduportal modified the milestones: infra-team-sync-2023-10-10, infra-team-sync-2023-10-17 Oct 10, 2023

dduportal mentioned this issue Oct 11, 2023

fix(plugin-site) bump Docker image to switch JDK8 to Temurin 1.8.0_382 (and support of cgroups v2) jenkins-infra/helm-charts#855

Merged

dduportal closed this as completed Oct 11, 2023

dduportal mentioned this issue Oct 20, 2023

plugin-site build commonly fails on infra.ci when accessing https://plugins.jenkins.io resulting in a 502 #3697

Closed
