AKS outages (publick8s, privatek8s, ci.jio-agents-1 and infra.ci.jio-agents-1) - `OverconstrainedZonalAllocationRequest` error when upgrading #4459

dduportal · 2024-12-18T13:51:48Z

Service(s)

Accounts, Artifact-caching-proxy, Azure, ci.jenkins.io, infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io, contributors.jenkins.io, docs.jenkins.io, get.jenkins.io, Incrementals, jenkins.io, LDAP, mirrors.jenkins.io, pkg.jenkins.io, plugins.jenkins.io, stats.jenkins.io, Update center

Summary

While working on #4454, we decided to upgrade the clusters to the latest 1.29.x patch (from 1.29.7 to 1.29.10), along with upgrading to the latest Azure Linux OS node image.

Using the Azure Portal UI, we started to receive auto-scaling errors (OverconstrainedZonalAllocationRequest) on all the clusters, for each arm64 node pool, such as:

Failed to scale node pool 'systempool1' in Kubernetes service 'infracijenkinsio-agents-1'. Error: Allocation failed. VM(s) with the following constraints cannot be allocated, because the condition is too restrictive. Please remove some constraints and try again. Constraints applied are:
  - Availability Zone
  - Differencing (Ephemeral) Disks
  - Networking Constraints (such as Accelerated Networking or IPv6)
  - VM Size

Following the instructions in https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/error-code-zonalallocationfailed-allocationfailed and MicrosoftDocs/azure-docs#41402, we tried different things:

- As it looks like a capacity issue for Azure in US East 2, zone 1, we tried to spin up v6 instances instead of v5, but these are not supported yet: "[BUG] Dpds_v6 not working with Ephemeral OS disk" (Azure/AKS#4676). US East 2 runs v20241006 as of 18 Dec. 2024 while v20241025 is required to support the v6 line, which leads to errors when creating node pools (unable to use ephemeral disk).
- Learning that arm64 is not only available in US East 2 - Zone 1, we decided to move the system pools to a different zone than their fellow user node pools, to decrease pressure on v5 arm64 instances.
- We also decreased the node sizing as we don't use all these resources (despite recommendations from Azure Advisor).
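When picking another zone or instance size as described above, it can help to list which zones a VM size is offered in. The following is a minimal sketch, not what was run during the incident: the function name `list_zonal_skus` is hypothetical, and the output only shows where a size is offered and whether the subscription is restricted, not whether Azure currently has capacity there.

```shell
#!/bin/bash
# Sketch: list VM sizes matching a (partial) name in a region, restricted to
# sizes that support availability zones.
# Assumption: the caller is already logged in with `az login`.
list_zonal_skus() {
    local location="$1" size_filter="$2"

    az vm list-skus \
        --location "$location" \
        --size "$size_filter" \
        --resource-type virtualMachines \
        --zone \
        --output table
}
```

For example, `list_zonal_skus eastus2 Standard_D2p` would list the matching sizes in US East 2; the `Standard_D2p` prefix is only an illustrative guess at the arm64 size family and should be replaced with the size actually used by the node pools.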

Reproduction steps

No response


dduportal · 2024-12-18T13:57:32Z

Update:

- Announced the incident in jenkins-infra/status@9f616cd (late), and the operation in https://matrix.to/#/!JLUOInpEYmxJIYXlzs:matrix.org/$t_7Fm8wYiUykdY3SqcEIOR0t8fcnqWnEH1l4Y5BFCSw?via=g4v.dev&via=gitter.im&via=matrix.org
- Triggered the cluster upgrade in jenkins-infra/azure#904 (bump AKS from 1.29.7 to 1.29.10)
- Hot-fixed the datadog release in infra.ci to ensure the cluster agent component can run in the temporary "System Pools" created/deleted by Terraform when changing the characteristics of the AKS default node pools: jenkins-infra/kubernetes-management@c2bf9fd

…on arm64 vCPUs - jenkins-infra/helpdesk#4459 - Decrease system pool nodes size from D4 to D2 - Decrease the (Ephemeral) OS disk from 150 to 75 Gb - Move system pools to a distinct zone than the other node pools Signed-off-by: Damien Duportal <[email protected]>

dduportal · 2024-12-18T14:04:12Z

Update:

Applied jenkins-infra/azure@13b386c manually (in parts) to remove pressure on the arm64 vCPUs in US East 2 - Zone 1 with:
- Smaller system pool instances (D2 instead of D4, 75 GB ephemeral disk instead of 150)
- A distinct zone for the system pool

Re-applied the 1.29.10 upgrade to ensure clusters are out of the Failed provisioning state with:

#!/bin/bash

az account set --subscription 1311c09f-aee0-4d6c-99a4-392c2b543204

# RG_NAME="infra-ci-jenkins-io-kubernetes-agents"
# AKS_NAME="infracijenkinsio-agents-1"

RG_NAME="ci-jenkins-io-kubernetes-agents"
AKS_NAME="cijenkinsio-agents-1"

AKS_STATE="$(az aks show -g "$RG_NAME" -n "$AKS_NAME" --query provisioningState -o tsv)"

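# Only reconcile clusters stuck in the "Failed" provisioning state
if [ "$AKS_STATE" = "Failed" ]; then

    AKS_CURRENT_VER="$(az aks show -g "$RG_NAME" -n "$AKS_NAME" --query kubernetesVersion -o tsv)"

    # Re-applying the current version makes AKS reconcile the cluster
    az aks upgrade \
        --resource-group "$RG_NAME" \
        --name "$AKS_NAME" \
        --kubernetes-version "$AKS_CURRENT_VER" \
        --yes \
        --output none
fi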
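
The script above handles one cluster at a time by commenting variables in and out. As a sketch only, the same logic could be wrapped in a function so that several clusters are handled in one run; the function name `reconcile_if_failed` is hypothetical, and it assumes `az account set` has already selected the right subscription.

```shell
#!/bin/bash
# Sketch: reconcile an AKS cluster stuck in the "Failed" provisioning state by
# re-running the upgrade to the Kubernetes version it already reports.
reconcile_if_failed() {
    local rg_name="$1" aks_name="$2" state current_version

    state="$(az aks show -g "$rg_name" -n "$aks_name" --query provisioningState -o tsv)" || return 1
    if [ "$state" != "Failed" ]; then
        echo "$aks_name: provisioning state is '$state', nothing to do"
        return 0
    fi

    current_version="$(az aks show -g "$rg_name" -n "$aks_name" --query kubernetesVersion -o tsv)" || return 1
    echo "$aks_name: re-applying Kubernetes $current_version"
    az aks upgrade \
        --resource-group "$rg_name" \
        --name "$aks_name" \
        --kubernetes-version "$current_version" \
        --yes \
        --output none
}
```

It would then be called once per cluster, for instance `reconcile_if_failed "ci-jenkins-io-kubernetes-agents" "cijenkinsio-agents-1"` followed by `reconcile_if_failed "infra-ci-jenkins-io-kubernetes-agents" "infracijenkinsio-agents-1"`.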

Signed-off-by: Damien Duportal <[email protected]>

close AKS outage - jenkins-infra/helpdesk#4459

dduportal · 2024-12-18T14:14:41Z

Outage looks finished:

- All AKS clusters and their node pools are in the Succeeded provisioning state and running Kubernetes 1.29.10
- Status incident closed in jenkins-infra/status#572 (close AKS outage - https://github.com/jenkins-infra/helpdesk/issues/4459)
- All Terraform/Kubernetes jobs are back to normal
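
The first check above (every node pool in the Succeeded state on 1.29.10) could be scripted roughly as follows. This is a sketch under assumptions, not the command used at the time: the function names are hypothetical, and `orchestratorVersion` reflects the version requested for the pool, which is assumed here to be a full patch version.

```shell
#!/bin/bash
# Sketch: report node pools that are not "Succeeded" or not on the expected
# Kubernetes version. Reads "name<TAB>state<TAB>version" lines on stdin and
# returns 1 if at least one pool is reported.
report_unhealthy_pools() {
    local expected_version="$1" name state version rc=0

    while IFS=$'\t' read -r name state version; do
        [ -n "$name" ] || continue
        if [ "$state" != "Succeeded" ] || [ "$version" != "$expected_version" ]; then
            echo "$name: state=$state version=$version"
            rc=1
        fi
    done
    return "$rc"
}

# Fetch the node pools of one cluster and feed them to the check above.
check_cluster_pools() {
    local rg_name="$1" aks_name="$2" expected_version="$3"

    az aks nodepool list \
        --resource-group "$rg_name" \
        --cluster-name "$aks_name" \
        --query "[].[name, provisioningState, orchestratorVersion]" \
        --output tsv | report_unhealthy_pools "$expected_version"
}
```

For example, `check_cluster_pools "ci-jenkins-io-kubernetes-agents" "cijenkinsio-agents-1" "1.29.10"` prints nothing and returns 0 when every pool is healthy.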

dduportal added azure publick8s privatek8s infra.ci.jenkins.io-agents-1 ci.jenkins.io-agents-1 labels Dec 18, 2024

dduportal added this to the infra-team-sync-2025-01-07 milestone Dec 18, 2024

dduportal mentioned this issue Dec 18, 2024

bump AKS from 1.29.7 to 1.29.10 jenkins-infra/azure#904

Merged

dduportal added a commit to dduportal/status that referenced this issue Dec 18, 2024

close AKS outage - jenkins-infra/helpdesk#4459

7370c07

Signed-off-by: Damien Duportal <[email protected]>

dduportal added a commit to jenkins-infra/status that referenced this issue Dec 18, 2024

Merge pull request #572 from dduportal/close/aks-outage

948a5f3

close AKS outage - jenkins-infra/helpdesk#4459

dduportal mentioned this issue Dec 18, 2024

close AKS outage - https://github.com/jenkins-infra/helpdesk/issues/4459 jenkins-infra/status#572

Merged

dduportal closed this as completed Dec 18, 2024
