AKS outages (publick8s, privatek8s, ci.jio-agents-1 and infra.ci.jio-agents-1) - OverconstrainedZonalAllocationRequest error when upgrading #4459

Closed
dduportal opened this issue Dec 18, 2024 · 3 comments

Comments

@dduportal
Contributor

dduportal commented Dec 18, 2024

Service(s)

Accounts, Artifact-caching-proxy, Azure, ci.jenkins.io, infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io, contributors.jenkins.io, docs.jenkins.io, get.jenkins.io, Incrementals, jenkins.io, LDAP, mirrors.jenkins.io, pkg.jenkins.io, plugins.jenkins.io, stats.jenkins.io, Update center

Summary

While working on #4454, we decided to upgrade the clusters to the latest 1.29.x patch version (from 1.29.7 to 1.29.10), along with upgrading to the latest Azure Linux OS node image.

Using the Azure Portal UI, we started to receive auto-scaling errors (OverconstrainedZonalAllocationRequest) on all the clusters, for each arm64 node pool, such as:

Failed to scale node pool 'systempool1' in Kubernetes service 'infracijenkinsio-agents-1'. Error: Allocation failed. VM(s) with the following constraints cannot be allocated, because the condition is too restrictive. Please remove some constraints and try again. Constraints applied are:
  - Availability Zone
  - Differencing (Ephemeral) Disks
  - Networking Constraints (such as Accelerated Networking or IPv6)
  - VM Size

Following the instructions in https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/error-code-zonalallocationfailed-allocationfailed and MicrosoftDocs/azure-docs#41402, we tried several things:

  • As it looks like a capacity issue for Azure in US East 2, zone 1, we tried to spin up v6 instances instead of v5, but these are not supported yet: see [BUG] Dpds_v6 not working with Ephemeral OS disk Azure/AKS#4676. US East 2 runs v20241006 as of 18 Dec. 2024, while v20241025 is required to support the v6 line, which leads to errors when creating node pools (unable to use ephemeral disks).
  • Having learned that arm64 instances are not only available in US East 2 - Zone 1, we decided to move the system pools to a different zone from the user node pools, to decrease pressure on v5 arm64 instances.
    • We also decreased the node sizing, as we don't use all these resources (despite recommendations from Azure Advisor).
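Before moving a pool to another zone, it can help to check where a given VM size is offered. The following is a minimal sketch using `az vm list-skus`; the region name `eastus2` and the size `Standard_D2pds_v5` are assumptions for illustration and are not stated in this issue. Note that this command reports where a size is offered and whether the subscription is restricted, not live capacity, so an allocation can still fail in a listed zone.

```shell
#!/bin/bash
# Sketch: list the zones in which a VM size is offered to the current subscription.
# Assumptions (not from this issue): region "eastus2", size "Standard_D2pds_v5".

# list_sku_zones LOCATION VM_SIZE
# Prints, for each matching SKU, its name, its zones, and any restriction reason codes.
list_sku_zones() {
    local location="$1" vm_size="$2"
    az vm list-skus \
        --location "$location" \
        --size "$vm_size" \
        --resource-type virtualMachines \
        --query "[].{name:name, zones:locationInfo[0].zones, restrictions:restrictions[].reasonCode}" \
        --output json
}

# Example call (commented out so that sourcing this file has no side effects):
# list_sku_zones eastus2 Standard_D2pds_v5
```

An empty `restrictions` list for a zone only means the subscription is allowed to request that size there.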

Reproduction steps

No response

@dduportal
Contributor Author

Update:

dduportal added a commit to jenkins-infra/azure that referenced this issue Dec 18, 2024
…on arm64 vCPUs - jenkins-infra/helpdesk#4459

- Decrease system pool nodes size from D4 to D2
- Decrease the (Ephemeral) OS disk from 150 to 75 Gb
- Move system pools to a distinct zone than the other node pools

Signed-off-by: Damien Duportal <[email protected]>
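For readers who want to try the same three changes outside of the jenkins-infra/azure repository, here is a rough Azure CLI equivalent. It is only a sketch and not the content of the commit: the pool name, VM size, zone number and node count used in the example call are hypothetical, and only the 75 GB ephemeral OS disk comes from the commit message above.

```shell
#!/bin/bash
# Sketch: create a smaller system node pool in a chosen zone with the Azure CLI.
# This is NOT the commit's implementation, just a CLI approximation of it.

# add_system_pool RESOURCE_GROUP CLUSTER_NAME POOL_NAME VM_SIZE ZONE
add_system_pool() {
    local rg="$1" cluster="$2" pool="$3" vm_size="$4" zone="$5"
    az aks nodepool add \
        --resource-group "$rg" \
        --cluster-name "$cluster" \
        --name "$pool" \
        --mode System \
        --node-vm-size "$vm_size" \
        --node-osdisk-type Ephemeral \
        --node-osdisk-size 75 \
        --zones "$zone" \
        --node-count 1   # assumption: a single node, adjust as needed
}

# Example call with hypothetical pool name, size and zone (commented out on purpose):
# add_system_pool ci-jenkins-io-kubernetes-agents cijenkinsio-agents-1 systempool2 Standard_D2pds_v5 2
```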
@dduportal
Contributor Author

dduportal commented Dec 18, 2024

Update:

  • Applied jenkins-infra/azure@13b386c manually (in parts) to remove pressure on the arm64 vCPUs in US East 2 - Zone 1 with:
    • Smaller system pool instances (D2 instead of D4, 75 GB Ephemeral disk instead of 150 GB)
    • A distinct zone for the system pool
  • Re-applied the 1.29.10 upgrade to ensure the clusters are out of the Failed state with:
#!/bin/bash

az account set --subscription 1311c09f-aee0-4d6c-99a4-392c2b543204

# RG_NAME="infra-ci-jenkins-io-kubernetes-agents"
# AKS_NAME="infracijenkinsio-agents-1"

RG_NAME="ci-jenkins-io-kubernetes-agents"
AKS_NAME="cijenkinsio-agents-1"

# Read the cluster provisioning state (for example "Succeeded" or "Failed")
AKS_STATE="$(az aks show -g "$RG_NAME" -n "$AKS_NAME" --query provisioningState -o tsv)"

if [ "$AKS_STATE" = "Failed" ]; then

    # Re-applying the current version reconciles the cluster out of the Failed state
    AKS_CURRENT_VER="$(az aks show -g "$RG_NAME" -n "$AKS_NAME" --query kubernetesVersion -o tsv)"

    az aks upgrade \
        --resource-group "$RG_NAME" \
        --name "$AKS_NAME" \
        --kubernetes-version "$AKS_CURRENT_VER" \
        --yes \
        --output none
fi
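
The script above handles one cluster at a time by commenting variables in and out. A possible variant, sketched below, wraps the same logic in a function and loops over `resource-group:cluster` pairs. The function names `reconcile_if_failed` and `reconcile_all` are made up for this example; the `az` calls are the same ones used in the script.

```shell
#!/bin/bash
# Sketch: apply the "re-upgrade to the current version if Failed" logic to several clusters.

# reconcile_if_failed RESOURCE_GROUP CLUSTER_NAME
reconcile_if_failed() {
    local rg="$1" name="$2" state version
    state="$(az aks show -g "$rg" -n "$name" --query provisioningState -o tsv)" || return 1
    if [ "$state" != "Failed" ]; then
        echo "$name: provisioningState=$state, nothing to do"
        return 0
    fi
    version="$(az aks show -g "$rg" -n "$name" --query kubernetesVersion -o tsv)" || return 1
    echo "$name: re-applying Kubernetes $version"
    az aks upgrade \
        --resource-group "$rg" \
        --name "$name" \
        --kubernetes-version "$version" \
        --yes \
        --output none
}

# reconcile_all "RESOURCE_GROUP:CLUSTER_NAME"...
reconcile_all() {
    local pair
    for pair in "$@"; do
        # Split each argument on the first ":" into resource group and cluster name
        reconcile_if_failed "${pair%%:*}" "${pair#*:}"
    done
}

# Example call (commented out on purpose; run "az account set" first, as in the script above):
# reconcile_all "infra-ci-jenkins-io-kubernetes-agents:infracijenkinsio-agents-1" \
#               "ci-jenkins-io-kubernetes-agents:cijenkinsio-agents-1"
```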

dduportal added a commit to dduportal/status that referenced this issue Dec 18, 2024
dduportal added a commit to jenkins-infra/status that referenced this issue Dec 18, 2024
@dduportal
Contributor Author

Outage looks finished:
