AKS outages (publick8s, privatek8s, ci.jenkins.io-agents-1 and infra.ci.jenkins.io-agents-1): `OverconstrainedZonalAllocationRequest` error when upgrading
#4459
Labels
accounts
artifact-caching-proxy
azure
ci.jenkins.io
ci.jenkins.io-agents-1
contributors.jenkins.io
docs.jenkins.io
get.jenkins.io
incrementals
infra.ci.jenkins.io
infra.ci.jenkins.io-agents-1
jenkins.io
ldap
mirrors.jenkins.io
pkg.jenkins.io
plugins.jenkins.io
privatek8s
publick8s
release.ci.jenkins.io
stats.jenkins.io
updateCenter
weekly.ci.jenkins.io
Service(s)
Accounts, Artifact-caching-proxy, Azure, ci.jenkins.io, infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io, contributors.jenkins.io, docs.jenkins.io, get.jenkins.io, Incrementals, jenkins.io, LDAP, mirrors.jenkins.io, pkg.jenkins.io, plugins.jenkins.io, stats.jenkins.io, Update center
Summary
While working on #4454, we decided to upgrade the cluster versions to the latest 1.29.x patch (from `1.29.7` to `1.29.10`), along with upgrading to the latest Azure Linux OS node image.

Using the Azure Portal UI, we started to receive auto-scaling errors (`OverconstrainedZonalAllocationRequest`) on all the clusters, for each `arm64` node pool, such as:

Following the instructions in https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/error-code-zonalallocationfailed-allocationfailed and MicrosoftDocs/azure-docs#41402, we tried different things:
- In `US East 2`, zone 1, we tried to spin up `v6` instances instead of `v5`, but these are not supported yet: [BUG] Dpds_v6 not working with Ephemeral OS disk Azure/AKS#4676 (the node image is `v20241006` on US East 2 as of 18 Dec. 2024, while `v20241025` is required to support the `v6` line), which led to errors when creating node pools (unable to use an ephemeral OS disk).
- As `arm64` is not only available in `US East 2 - Zone 1`, we decided to move the system node pools to a distinct zone from their fellow user node pools, to decrease the pressure on `v5` `arm64` instances.

Reproduction steps
No response
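For reference, one way to check which zones of a region still offer a given VM size (and to spot the capacity restrictions behind `OverconstrainedZonalAllocationRequest`-style errors) is the `az vm list-skus` command; the size below is only an example, not necessarily the exact SKU used by these clusters:

```shell
# List availability and restrictions for a Dpds v5 size in East US 2.
# The "Restrictions" column shows zones (or subscriptions) where the
# size cannot currently be allocated.
az vm list-skus \
  --location eastus2 \
  --size Standard_D4pds_v5 \
  --all \
  --output table
```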
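As a sketch of the zone move described in the summary: AKS does not allow changing the zones of an existing node pool in place, so a replacement system pool is created in the target zone first, then the old one is deleted. Resource group, cluster and pool names below are hypothetical:

```shell
# Create a replacement system pool pinned to zone 2 only,
# away from the zone-1 user pools competing for v5 arm64 capacity
# (resource group, cluster and pool names are hypothetical).
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name publick8s \
  --name syspool2 \
  --mode System \
  --node-vm-size Standard_D4pds_v5 \
  --zones 2 \
  --node-count 3

# Once the new pool is Ready, remove the old system pool.
az aks nodepool delete \
  --resource-group my-rg \
  --cluster-name publick8s \
  --name syspool1
```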