Enable high availability (HA) configuration for ASO #4445

bingikarthik · 2024-11-13T16:50:01Z

What this PR does

Closes #4215
As we adopt Azure Service Operator (ASO), it's essential to run multiple replicas for production workloads, especially during cluster upgrade operations. This PR enables high availability (HA) for ASO by increasing the replica count and implementing a Pod Disruption Budget (PDB) to maintain consistent uptime and resilience during potential disruptions.
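For readers unfamiliar with the two pieces named above, here is a minimal sketch of a multi-replica Deployment paired with a PodDisruptionBudget. It is not this PR's diff: the names, namespace, `control-plane: controller-manager` label, placeholder image, replica count of 2, and `minAvailable: 1` are all illustrative assumptions.

```yaml
# Illustrative sketch only; every name, label, and count here is an assumption.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller-manager      # hypothetical name
  namespace: example-system             # hypothetical namespace
spec:
  replicas: 2                           # more than one replica, so one can be evicted safely
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      containers:
        - name: manager
          image: example.invalid/operator:latest   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb                     # hypothetical name
  namespace: example-system
spec:
  minAvailable: 1                       # voluntary evictions may not drop below one ready pod
  selector:
    matchLabels:
      control-plane: controller-manager # must match the Deployment's pod labels
```

A PodDisruptionBudget only limits voluntary disruptions such as node drains during a cluster upgrade. It does not protect against node failures.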

Checklist

this PR contains documentation
this PR contains tests
this PR contains YAML Samples

config/manager/manager.yaml

config/default/manager_pod_disruption_budget.yaml

v2/config/default/manager_pod_disruption_budget.yaml

theunrepentantgeek

How will things behave when upgrading ASO to the next version?

I'm worried about the scenario where version N+1 of ASO deploys new CRD versions that the still-running version N doesn't understand.

In this scenario, since we use the newest resource version as the storage (hub) version of the resource, any resource that has been touched by the version N+1 will be unintelligible to version N, resulting in a panic/crash.

...zure-service-operator/templates/policy_v1_poddisruptionbudget_azureserviceoperator-pdb.yaml

bingikarthik · 2024-11-18T08:43:31Z

@matthchr I've made the requested changes. When you have a moment, could you kindly review the PR?

matthchr · 2024-11-18T19:05:12Z

@bingikarthik - thanks! I think we need to make a deployment rollout change to make this safe (as @theunrepentantgeek called out). I'll send a separate PR for that and link it here, and then once that merges we can make sure this change is safe w/ it, and assuming it is, merge this too.

Thanks for your patience
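The concern in this thread is that pods from version N and version N+1 could run side by side during an upgrade. One option Kubernetes offers for avoiding that overlap is the Deployment `Recreate` strategy, which terminates all existing pods before creating new ones. The sketch below shows that option only. This thread does not say what the follow-up rollout change actually contains, and the names, label, and image are hypothetical.

```yaml
# Sketch of one possible rollout setting; not confirmed as the change made in the follow-up PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller-manager      # hypothetical name
spec:
  replicas: 2
  strategy:
    type: Recreate                      # stop all old pods before starting any new ones
  selector:
    matchLabels:
      control-plane: controller-manager # assumed label
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      containers:
        - name: manager
          image: example.invalid/operator:latest   # placeholder image
```

The trade-off is a short window with no running replicas during each upgrade, in exchange for never having old and new versions reconcile at the same time.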

nishant221 · 2024-11-20T06:43:19Z

Does ASO support a "leader" kind of approach? If multiple replicas are running (and reconciling independently), can the same request be processed by multiple instances of ASO, resulting in duplicate calls to Azure?

bingikarthik · 2024-11-20T08:10:45Z

> Does ASO support a "leader" kind of approach? If multiple replicas are running (and reconciling independently), can the same request be processed by multiple instances of ASO, resulting in duplicate calls to Azure?

Yes, it does. Please check:
https://github.com/Azure/azure-service-operator/blob/main/v2/charts/azure-service-operator/templates/apps_v1_deployment_azureserviceoperator-controller-manager.yaml#L59
https://github.com/Azure/azure-service-operator/blob/main/main.go#L63
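The first link above points into the container spec of the Helm chart's Deployment template. The fragment below sketches how leader election is typically switched on through a container argument. The flag name `--enable-leader-election` is an assumption about what the linked line contains and should be checked against the chart. The container name and image are placeholders.

```yaml
# Fragment of a pod template; flag name is assumed, verify against the linked chart line.
spec:
  template:
    spec:
      containers:
        - name: manager                             # placeholder container name
          image: example.invalid/operator:latest    # placeholder image
          args:
            - --enable-leader-election              # assumed flag: only the elected leader reconciles
```

With leader election enabled, replicas compete for a lease and only the current holder runs the reconcile loops. The others stay idle until the lease is released or expires, which is what prevents duplicate calls.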

matthchr · 2024-11-20T00:03:36Z

v2/config/manager/manager_pod_disruption_budget.yaml

+metadata:
+  name: controller-manager
+  namespace: system
+  labels:


This needs the app.kubernetes.io/name: azure-service-operator label too, to match Helm.

matthchr · 2024-11-20T00:10:22Z

v2/config/manager/manager_pod_disruption_budget.yaml

+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: controller-manager


The name should be pdb, not controller-manager.
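Taken together, the two review comments above suggest metadata along the following lines. Only `name: pdb` and the `app.kubernetes.io/name: azure-service-operator` label come from the comments, and `namespace: system` is copied from the quoted diff. The `spec` is omitted because it is not shown in this thread.

```yaml
# Metadata fragment reflecting the two review suggestions; spec intentionally omitted.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb                                          # was controller-manager
  namespace: system
  labels:
    app.kubernetes.io/name: azure-service-operator   # added to match the Helm chart
    # any other labels from the original diff are not visible in this thread
```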

matthchr · 2024-11-27T20:50:10Z

had a few comments - once they're fixed I'll kick off CI and we can merge this.

bingikarthik requested review from davefellows, theunrepentantgeek, matthchr, babbageclunk and super-harsh as code owners November 13, 2024 16:50

matthchr reviewed Nov 13, 2024

config/manager/manager.yaml (outdated, resolved)

config/default/manager_pod_disruption_budget.yaml (outdated, resolved)

config/default/manager_pod_disruption_budget.yaml (outdated, resolved)

matthchr reviewed Nov 13, 2024

v2/config/default/manager_pod_disruption_budget.yaml (outdated, resolved)

v2/config/default/manager_pod_disruption_budget.yaml (outdated, resolved)

theunrepentantgeek reviewed Nov 13, 2024

...zure-service-operator/templates/policy_v1_poddisruptionbudget_azureserviceoperator-pdb.yaml (outdated, resolved)

matthchr mentioned this pull request Nov 23, 2024

Support multiple replicas of ASO pod #4466

Merged

3 tasks

Bingi Narasimha Karthik added 5 commits November 27, 2024 11:45

Enable high availability (HA) configuration for ASO

b21be9a

Update pdb selector and move refs under v2/

700afae

Move manager_pod_disruption_budget.yaml and update selector for PDB

8f4c14d

Remove version label from pdb

866c24c

Add enable condition for PDB

3640dd0

bingikarthik force-pushed the ASO_HA branch from ff13230 to 3640dd0 Compare November 27, 2024 06:17

matthchr approved these changes Nov 27, 2024
