Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Reduce reconcilements #1359

Merged
merged 1 commit into from
Jun 24, 2024
Merged

🌱 Reduce reconcilements #1359

merged 1 commit into from
Jun 24, 2024

Conversation

janiskemper
Copy link
Contributor

What this PR does / why we need it:

  • Filter HCloudMachine events so that status updates of the resource don't trigger another reconcilement of HCloudMachine
  • Filter HetznerCluster events so that status updates of the resource don't trigger another reconcilement of HetznerCluster
  • Filter HetznerCluster events so that some status updates that are not important don't trigger reconcilement of HCloudMachine
  • Filter Machine events so that status updates of the resource don't trigger another reconcilement of HCloudMachine a
  • Filter Cluster events so that status updates of the resource don't trigger another reconcilement of HetznerCluster
  • Increase the time period to requeue if a server is starting
  • Increase the time period to requeue if a server has not status yet
  • Increase the time period to requeue if the kubeapi server is not yet reachable before adding it as target to the load balancer
  • Don't trigger multiple shutdown calls for one server and increase the timeout to requeue after a shutdown is triggered

TODOs:

  • squash commits
  • include documentation
  • add unit tests

Side note:
I will add an issue for unit testing the functions I have added after this is getting merged.

batistein
batistein previously approved these changes Jun 21, 2024
@janiskemper janiskemper force-pushed the reduce-reconcilements branch from 3cc9b66 to 28c641c Compare June 21, 2024 13:49
@janiskemper janiskemper marked this pull request as ready for review June 21, 2024 13:58
@syself-bot syself-bot bot added the area/code Changes made in the code directory label Jun 21, 2024
Copy link
Contributor

@apricote apricote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two notes about the predicates, otherwise this looks good and a step in the right direction. Would love to have some better options for finding/debugging this from controller-runtime.

controllers/hcloudmachine_controller.go Show resolved Hide resolved
controllers/hcloudmachine_controller.go Show resolved Hide resolved
@janiskemper janiskemper force-pushed the reduce-reconcilements branch 2 times, most recently from cc156c7 to 0cd4ab0 Compare June 21, 2024 14:15
@syself-bot syself-bot bot added the size/L Denotes a PR that changes 200-800 lines, ignoring generated files. label Jun 21, 2024
- Filter HCloudMachine events so that status updates of the resource
  don't trigger another reconcilement of HCloudMachine
- Filter HetznerCluster events so that status updates of the resource
  don't trigger another reconcilement of HetznerCluster
- Filter HetznerCluster events so that some status updates that are not
  important don't trigger reconcilement of HCloudMachine
- Filter Machine events so that status updates of the resource don't
  trigger another reconcilement of HCloudMachine a
- Filter Cluster events so that status updates of the resource don't
  trigger another reconcilement of HetznerCluster
- Increase the time period to requeue if a server is starting
- Increase the time period to requeue if a server has not status yet
- Increase the time period to requeue if the kubeapi server is not yet
  reachable before adding it as target to the load balancer
- Don't trigger multiple shutdown calls for one server and increase the
  timeout to requeue after a shutdown is triggered
@janiskemper janiskemper requested a review from batistein June 24, 2024 06:58
@janiskemper janiskemper merged commit 76f8b41 into main Jun 24, 2024
9 checks passed
@janiskemper janiskemper deleted the reduce-reconcilements branch June 24, 2024 08:01
@guettli
Copy link
Collaborator

guettli commented Jul 9, 2024

Two notes about the predicates, otherwise this looks good and a step in the right direction. Would love to have some better options for finding/debugging this from controller-runtime.

@apricote what kind of debugging do you mean here? BTW, you can get the total reconcile counts per CRD:

k port-forward caph-controller-manager-XXX 8080:8080
curl localhost:8080/metrics | grep reconcile_total

controller_runtime_reconcile_total{controller="hcloudmachine",result="error"} 0
controller_runtime_reconcile_total{controller="hcloudmachine",result="requeue"} 0
controller_runtime_reconcile_total{controller="hcloudmachine",result="requeue_after"} 162
controller_runtime_reconcile_total{controller="hcloudmachine",result="success"} 59
...

@apricote
Copy link
Contributor

apricote commented Jul 9, 2024

I would love to have a flag in controller-runtime --debug-controller-reconciles that show why a reconcile was triggered:

  • Was it from an event? Was it because of the reconcile interval?
  • If event, which event (type, kind, namespacedname)?
  • Which predicates allowed this event? Which denied?

Some transparent insights into handler.EnqueueRequestsFromMapFunc(), builder.WithPredicates(), WithEventFilter() and friends.

Not something that you want running in our prod clusters, but something that can be enabled locally (or through some diagnostics API) to figure out why it reconciles more often then expected.

@apricote
Copy link
Contributor

apricote commented Jul 9, 2024

One alternative would be distributed tracing for the Kubernetes API that includes any actions taken by controllers because of watches. But I guess that would be way harder and spans many more components.

@guettli
Copy link
Collaborator

guettli commented Jul 9, 2024

@apricote good questions. I have no answer to them, so I asked here to get more insights: https://kubernetes.slack.com/archives/C02MRBMN00Z/p1720524329698439

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/code Changes made in the code directory size/L Denotes a PR that changes 200-800 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants