Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: dynamic cluster distribution issue 20965, update the shard… #21042

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ivan-cai
Copy link

@ivan-cai ivan-cai commented Dec 3, 2024

… number by readiness health check

bugfix for #20965, Dynamic Cluster Distribution

When the shard configmap argocd-app-controller-shard-cm exists and the number of replicas is greater than 1, the shard number of the app controller pod restarted due to image update or pod eviction will be -1. Then, the app controller shard will not process applications, and applications stuck in refreshing.

We try to update the shard number of the application controller(clusterSharding.shard and liveStateCache.clusterSharding.shard) in the readiness health check if the shard number have changed, and the resync all applications.

@ivan-cai ivan-cai requested a review from a team as a code owner December 3, 2024 11:06
Copy link

bunnyshell bot commented Dec 3, 2024

🔴 Preview Environment stopped on Bunnyshell

See: Environment Details | Pipeline Logs

Available commands (reply to this comment):

  • 🔵 /bns:start to start the environment
  • 🚀 /bns:deploy to redeploy the environment
  • /bns:delete to remove the environment

@ivan-cai ivan-cai changed the title bugfix for dynamic cluster distribution issue 20965, update the shard… fix: dynamic cluster distribution issue 20965, update the shard… Dec 3, 2024
@ivan-cai ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch 6 times, most recently from 6698dc3 to c93c7b3 Compare December 3, 2024 12:14
Copy link

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 0% with 46 lines in your changes missing coverage. Please review.

Project coverage is 55.16%. Comparing base (f429352) to head (66121a8).

Files with missing lines Patch % Lines
controller/sharding/cache.go 0.00% 22 Missing ⚠️
controller/appcontroller.go 0.00% 19 Missing ⚠️
controller/sharding/sharding.go 0.00% 2 Missing and 1 partial ⚠️
controller/cache/cache.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21042      +/-   ##
==========================================
- Coverage   55.19%   55.16%   -0.04%     
==========================================
  Files         337      337              
  Lines       57055    57099      +44     
==========================================
+ Hits        31492    31497       +5     
- Misses      22870    22913      +43     
+ Partials     2693     2689       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

key, err := cache.MetaNamespaceKeyFunc(app)
if err == nil {
ctrl.appRefreshQueue.AddRateLimited(key)
ctrl.appOperationQueue.AddRateLimited(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add this, it would be added by refresh processing code after refresh is completed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This resync(enqueue) code is necessary. Otherwise, even if the shard number is changed, the application controller will not process the applications immediately, but will wait for the default resync time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would refresh not be triggered immediately?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this resync code, it will work, otherwise it will not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code has changes in 2.13, where refresh enqueues app operation after being done. This prevents some race conditions and makes sure the sync is done with a fresh version of data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically we need two resyncs, one initially and one after the refresh?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the application controller will not process the applications immediately, but will wait for the default resync time.

Why would that happen?

Copy link
Author

@ivan-cai ivan-cai Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically we need two resyncs, one initially and one after the refresh?

Yes, I have seen the changes, I will delete this line ctrl.appOperationQueue.AddRateLimited(key).
Thank you for your review!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay. Good to clarify. We have one case where we may enqueue app operation immediately and another after some delay.

}
for _, app := range apps {
if !ctrl.canProcessApp(app) {
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to continue instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to continue instead?

It is the same as the AddFunc code, only enqueue the applications that the shard is responsible for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I mean you'd return from the function in the middle of the loop going through apps, so some apps may be left behind even if ctrl can process them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

@ivan-cai ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch from c93c7b3 to 98f58c1 Compare December 4, 2024 02:16
Comment on lines 291 to 298
if !ctrl.canProcessApp(app) {
continue
}
key, err := cache.MetaNamespaceKeyFunc(app)
if err == nil {
ctrl.appRefreshQueue.AddRateLimited(key)
}
if err == nil {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrii-korotkov-verkada I have fixed the two problems you mentioned.

}
if err == nil {
ctrl.clusterSharding.AddApp(app)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine the two blocks, since they check for the same condition. Or, do continue if err != nil. We also probably wanna log the error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combined!

@ivan-cai ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch from 98f58c1 to a31de81 Compare December 4, 2024 02:24
@andrii-korotkov-verkada
Copy link
Contributor

Approved. Btw, consider using new commits instead of force pushing, that can help to see which changes you are making and tell some story. They are squashed when merged.

@ivan-cai
Copy link
Author

ivan-cai commented Dec 4, 2024

/retest

@ivan-cai ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch from e32bc65 to 66121a8 Compare December 24, 2024 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Ready for final review
Development

Successfully merging this pull request may close these issues.

3 participants