Clean up auto-generated resources in leader and member clusters #5351

luolanzone · 2023-08-03T07:39:47Z

1. Split the existing stale controller into leader and member folders.
2. The new stale controller in the leader cluster will do the following:
   * Check MemberClusterAnnounce periodically in the leader cluster and delete the stale CR if its last timestamp annotation `touch-ts` is over 24 hours old.
   * Clean up all corresponding ResourceExports when a MemberClusterAnnounce is deleted.
   * Clean up stale ResourceExports if there is no existing MemberClusterAnnounce when the controller is started.
   * Clean up all MemberClusterAnnounces and corresponding ResourceExports when the controller is started with no ClusterSet CR.
3. The ClusterSet controller in the leader will remove all remaining ResourceExports and MemberClusterAnnounces when the ClusterSet CR is deleted in the leader cluster.
4. The new stale controller in the member cluster will do the following:
   * Clean up any stale resources when the ClusterSet is ready.
   * Clean up any stale resources when the controller is restarted.
   * Delete all local auto-generated resources if there is no ClusterSet CR when the controller starts.
5. The ClusterSet controller will be responsible for removing all imported and exported resources for the member cluster when the ClusterSet CR is deleted.

luolanzone · 2023-08-03T08:50:24Z

@jianjuns This PR is for cleaning up a member cluster's stale resources on a leader cluster when it has been disconnected for over 5 minutes (or longer?). The unit test is WIP.

Regarding the stale resources cleanup when the whole ClusterSet is removed, I assume users will delete a member or leader cluster via 'kubectl delete *.yaml'. There will be a few stale resources like AntreaClusterNetworkPolicy CR or multi-cluster Service CR left over. I am thinking we may provide a doc with a few sample scripts to guide users on how to remove them.
Or we could use a PreStop hook. The disadvantage of PreStop is that the corresponding resources will be cleaned up every time the controller is restarted, which may impact users' Service connections and NP enforcement if the controller is not being removed permanently. But we could let users apply the PreStop container only when they actually plan to delete a ClusterSet.

Let me know your thoughts. Thanks.

jianjuns · 2023-08-03T16:45:36Z

No, we need not auto-delete user-created resources, but we should delete resources auto-created by the MC Controller.

And we should let a member delete MemberClusterAnnounce in leader, and other resources it creates in leader. The leader may auto-delete member resources after MemberClusterAnnounce is deleted too.
But I am not sure we should auto-delete leader cluster resources after member disconnects. If we really do that, we should use a long timeout, like 1 day. We may just add an antctl command or document how to delete MemberClusterAnnounce in a leader.

luolanzone · 2023-08-04T02:41:09Z

Yeah, I agree with you that we should delete MC Controller auto-created resources only. The stale AntreaClusterNetworkPolicy/multi-cluster Service CRs I referred to in my previous comment are auto-created by the member antrea-mc-controller. They will be left over if users run kubectl delete -f antrea-multicluster-*.yaml to clean up the ClusterSet.

The AntreaClusterNetworkPolicy with annotation multicluster.antrea.io/imported-acnp: "true" will be auto-created if users use the MCNP replication feature. A multi-cluster Service like antrea-mc-nginx will also be left over since it's a Service CR.

For now, we already have a stale controller to clean up the member's MemberClusterAnnounce in the leader cluster if the last update timestamp is over 5 minutes (but only one time during antrea-mc-controller launch). A member controller will also delete the MemberClusterAnnounce when a ClusterSet CR is deleted but not other resources. If we also delete other resources when a ClusterSet CR is deleted, a problem is the existing exported resources won't be added back automatically once a ClusterSet CR is created again.

I will add the logic to auto-delete member resources after MemberClusterAnnounce is deleted in the leader. I will double check if there is a more proper way to handle resource cleanup. Thanks for the suggestions.

luolanzone · 2023-08-04T03:28:13Z

@jianjuns I updated the replied comment above, Thanks.

jianjuns · 2023-08-04T04:40:00Z

@luolanzone : we should definitely delete all auto-created resources in a local cluster when a ClusterSet is deleted; and when MemberClusterAnnounce is deleted we should delete all resources created for a member too. And I am not sure we should auto-delete MemberClusterAnnounce when a member disconnects, at least not after just 5 mins (if we want to keep that logic, I would change to 1 day).

If we also delete other resources when a ClusterSet CR is deleted, a problem is the existing exported resources won't be added back automatically once a ClusterSet CR is created again.

I do not understand this. What resources do you mean here? In leader or member? In any case, we should fix that and recreate the resources when the ClusterSet is created (or say ClusterSet can be created after resources to export).

luolanzone · 2023-08-23T02:39:41Z

/test-multicluster-e2e

jianjuns

When ClusterSet is deleted in either member or leader, we should auto-delete all generated resources in the local cluster. Maybe you want a separate PR for that?

luolanzone · 2023-08-25T01:33:03Z

When ClusterSet is deleted in either member or leader, we should auto-delete all generated resources in the local cluster. Maybe you want a separate PR for that?

I am working on a PR #5438 to clean up all resources in the member and leader with a guide and antctl. But it's more about cleaning up Antrea Multi-cluster in the leader and member. Do you mean to clean up all generated resources when the ClusterSet CR is deleted? I will check this part.

jianjuns · 2023-08-25T04:23:35Z

I am working on a PR to clean up all resources in the member and leader with a guide and antctl. But it's more about cleaning up Antrea Multi-cluster in the leader and member. Do you mean to clean up all generated resources when the ClusterSet CR is deleted? I will check this part.

Yes, I feel MC Controller should auto-delete all CRs it created in the local cluster when the ClusterSet CR is deleted, and MC Controller in the member should also delete MemberClusterAnnounce in the leader.

jianjuns · 2023-08-25T16:00:18Z

Just to add: I am not against the antctl command approach. I just feel we can still have MC Controller auto-cleanup, and antctl can be a backup solution.

1. Split existing stale controller into leader and member folders.
2. The new stale controller in the leader cluster will do following things:
   * Check MemberClusterAnnounce periodically in the leader cluster and delete the stale CR if its last timestamp annotation `touch-ts` is over 24 hours.
   * Clean up all corresponding ResourceExports when a MemberClusterAnnounce is deleted.
   * Clean up stale ResourceExports if there is no existing MemberClusterAnnounce when the controller is started.
   * Clean up all MemberClusterAnnounces and corresponding ResourceExports when the controller is started with no ClusterSet CR.
3. The ClusterSet controller in leader will remove all remaining ResourceExports and MemberClusterAnnounces when the ClusterSet CR is deleted in the leader cluster.
4. The new stale controller in the member cluster will do following things:
   * Clean up any stale resources when the ClusterSet is ready.
   * Clean up any stale resources when the controller is restarted.
5. The ClusterSet controller will be responsible to remove all imported and exported resources for the member cluster when the ClusterSet CR is deleted.

Signed-off-by: Lan Luo <[email protected]>

luolanzone · 2023-10-26T08:46:16Z

/test-multicluster-e2e

jianjuns · 2023-10-26T18:15:05Z

multicluster/controllers/multicluster/member/clusterset_controller.go

+		// Handle create or update
+
+		newLeader := clusterSet.Spec.Leaders[0]
+		if r.installedLeader.clusterID == newLeader.ClusterID && r.installedLeader.serverUrl == newLeader.Server &&


Just found we missed a case where the ClusterSet changed (ClusterID and/or ClusterSet ID changed) but the leader does not change, in which case we should still update r.clusterID and clusterSetID.

For now, how about let us go this way:

If either clusterID or clusterSetID changes, it means ClusterSet is recreated, and we should do cleanup with cleanUpResources, and also createRemoteCommonArea.

If just leader attributes change, we do not do cleanup, but just stop and recreate CommonArea with createRemoteCommonArea.

Later we can revisit if we should disallow leader changes or maybe also do cleanup in some cases (like leader ClusterID or serverUrl change).

I do not see you change this.

I think we should do something like:

    clusterSetCreated = r.clusterID != common.ClusterID(clusterSet.Spec.ClusterID) || r.clusterSetID != common.ClusterSetID(clusterSet.Name)
    leaderChanged = r.installedLeader.clusterID != newLeader.ClusterID || r.installedLeader.serverUrl != newLeader.Server || r.installedLeader.secretName != newLeader.Secret
    if !clusterSetCreated && !leaderChanged {
        return nil
    }
    if clusterSetCreated {
        // line 138 - 161
    }
    return r.createRemoteCommonArea(clusterSet)

Got it, this is clearer. Refined, thanks.
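
The three outcomes discussed in this thread (no change, leader-only change, ClusterSet recreated) can be written as a small decision function. This is a self-contained sketch under assumptions, not the controller's code: `state`, `leader`, `action`, and `decide` are hypothetical stand-ins for the reconciler's fields and control flow.

```go
package main

import "fmt"

// leader holds the leader attributes compared in the review thread.
type leader struct {
	clusterID, serverURL, secretName string
}

// state is a hypothetical stand-in for the reconciler's cached fields.
type state struct {
	clusterID, clusterSetID string
	installedLeader         leader
}

// action describes what the reconciler would do for an update.
type action int

const (
	noop               action = iota // nothing relevant changed
	recreateCommonArea               // only leader attributes changed
	cleanupAndRecreate               // ClusterSet was recreated
)

// decide applies the rule from the thread: a changed ClusterID or
// ClusterSet ID means the ClusterSet was recreated, so clean up and
// recreate the CommonArea; a leader-only change just recreates it.
func decide(s state, newClusterID, newClusterSetID string, newLeader leader) action {
	clusterSetCreated := s.clusterID != newClusterID || s.clusterSetID != newClusterSetID
	leaderChanged := s.installedLeader != newLeader
	switch {
	case clusterSetCreated:
		return cleanupAndRecreate
	case leaderChanged:
		return recreateCommonArea
	default:
		return noop
	}
}

func main() {
	l := leader{"leader-1", "https://10.0.0.1:6443", "leader-token"}
	s := state{clusterID: "member-a", clusterSetID: "test-clusterset", installedLeader: l}

	fmt.Println(decide(s, "member-a", "test-clusterset", l) == noop)
	fmt.Println(decide(s, "member-b", "test-clusterset", l) == cleanupAndRecreate)
	l.secretName = "rotated-token"
	fmt.Println(decide(s, "member-a", "test-clusterset", l) == recreateCommonArea)
}
```

Checking `clusterSetCreated` first matters: when both the ClusterSet and the leader change, the cleanup path still runs before the CommonArea is recreated.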

jianjuns · 2023-10-26T18:37:21Z

multicluster/controllers/multicluster/member/stale_controller.go

+
+	go func() {
+		for range c.commonAreaCreationCh {
+			retry.OnError(common.CleanUpRetry, func(err error) bool { return true },


I guess one issue with retry.OnError is that if a ClusterSet is recreated before the previous retry is done, we can have two retries ongoing at the same time? But probably let us look at a better solution in a follow-up PR.

luolanzone · 2023-10-27T01:28:53Z

/test-multicluster-e2e

jianjuns · 2023-10-27T04:21:22Z

multicluster/controllers/multicluster/member/clusterset_controller.go

 			}
 		}
+
+		r.clusterSetConfig = clusterSet.DeepCopy()


From what I saw, it seems only clusterSet.Generation is really worth saving. But then I got another question: probably we should save the generation before if !clusterSetCreated && !leaderChanged { return nil }?

Yeah, moved it.

jianjuns · 2023-10-27T04:33:41Z

multicluster/controllers/multicluster/member/clusterset_controller.go

+		clusterSetCreated = r.clusterID != common.ClusterID(clusterSet.Spec.ClusterID) || r.clusterSetID != common.ClusterSetID(clusterSet.Name)
+		leaderChanged := r.installedLeader.clusterID != newLeader.ClusterID || r.installedLeader.serverUrl != newLeader.Server ||
+			r.installedLeader.secretName != newLeader.Secret
+		r.clusterSetConfig = clusterSet.DeepCopy()


Later we may refactor the code to just save Generation, or simply get the ClusterSet generation in updateStatus

Sure, I will go through this part and refine in next release.

luolanzone · 2023-10-27T04:44:02Z

/test-multicluster-e2e

luolanzone · 2023-10-27T04:51:31Z

There is a doc markdown lint failure, but it's because of docs/windows.md. I didn't find the issue mentioned in the error logs in this doc. We may ignore it at the moment since it's not related to this PR.

jianjuns · 2023-10-27T04:54:11Z

There is a doc markdown lint failure, but it's because of docs/windows.md. I didn't find the issue mentioned in the error logs in this doc. We may ignore it at the moment since it's not related to this PR.

Yes, I saw the failure in other PRs too. Let us find the root cause and fix it.

tnqn · 2023-10-27T04:29:57Z

multicluster/cmd/multicluster-controller/leader.go

+	if err = staleController.SetupWithManager(mgr); err != nil {
+		return fmt.Errorf("error creating StaleResCleanupController: %v", err)
+	}
+	go staleController.RunPeriodically(stopCh)


Out of curiosity, Run is a common name for such long-running routines. Why add a suffix for this case alone?

Let me change it back. There was another function named RunOnce(), which is why I changed this name before.

tnqn · 2023-10-27T04:45:05Z

multicluster/controllers/multicluster/common/helper.go

+	Steps:    15,
+	Duration: 500 * time.Millisecond,
+	Factor:   2.0,


The maximum wait duration could become 0.5s*2^15 = 4h+, is it expected?

Yeah, it's too long. Let me change it, since retrying for 4 hours is probably not meaningful.

Changed to 12 which is 30+mins.

tnqn · 2023-10-27T04:51:51Z

multicluster/controllers/multicluster/leader/stale_controller.go

+	for _, resExport := range resourceExports.Items {
+		// The AntreaClusterNetworkPolicy kind of ResourceExport is created in the leader directly
+		// without a ClusterID info. It's not owned by any member cluster.
+		if resExport.Spec.Kind != constants.AntreaClusterNetworkPolicyKind && !existingMemberClusterIDs.Has(resExport.Spec.ClusterID) {


Not familiar with the code, but should the first expression use ==? Otherwise non AntreaClusterNetworkPolicy will be collected by staleResExports.

No, != is intended here. We use it to skip AntreaClusterNetworkPolicy ResourceExports, because this kind of ResourceExport is created by the user and we don't want to clean it up automatically.
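
To make the effect of the `!=` condition concrete, here is a minimal sketch of the filter with plain Go types. The `resourceExport` struct, `staleExports`, and the string value of `acnpKind` are assumptions for illustration and do not mirror the real API types.

```go
package main

import "fmt"

// acnpKind is the assumed value of constants.AntreaClusterNetworkPolicyKind.
const acnpKind = "AntreaClusterNetworkPolicy"

// resourceExport is a simplified, hypothetical stand-in for the CR.
type resourceExport struct {
	name, kind, clusterID string
}

// staleExports returns exports whose owning member is no longer known.
// AntreaClusterNetworkPolicy exports are always skipped because they are
// created in the leader directly and carry no ClusterID.
func staleExports(exports []resourceExport, memberIDs map[string]bool) []resourceExport {
	var stale []resourceExport
	for _, re := range exports {
		if re.kind != acnpKind && !memberIDs[re.clusterID] {
			stale = append(stale, re)
		}
	}
	return stale
}

func main() {
	members := map[string]bool{"cluster-a": true}
	exports := []resourceExport{
		{"cluster-a-svc", "Service", "cluster-a"}, // owner still present
		{"cluster-b-svc", "Service", "cluster-b"}, // owner gone: stale
		{"user-acnp", acnpKind, ""},               // user-created: kept
	}
	for _, re := range staleExports(exports, members) {
		fmt.Println(re.name)
	}
}
```

With `==` instead of `!=`, the filter would select only AntreaClusterNetworkPolicy exports, which is the opposite of the intent described in the reply.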

luolanzone · 2023-10-27T05:19:03Z

/test-multicluster-e2e

Signed-off-by: Lan Luo <[email protected]>

tnqn · 2023-10-27T06:30:44Z

/test-multicluster-e2e
/skip-all

jianjuns · 2023-10-27T17:49:30Z

multicluster/controllers/multicluster/member/clusterset_controller.go

-				// MemberClusterAnnounce could be kept in the leader cluster, if antrea-mc-controller crashes after the failure.
-				// Leader cluster will delete the stale MemberClusterAnnounce with a garbage collection mechanism in this case.
-				return ctrl.Result{}, fmt.Errorf("failed to delete MemberClusterAnnounce in the leader cluster: %v", err)
+		clusterSetNotFound = true


Here we should check if the ClusterSet matches r.clusterSetID?

luolanzone added the area/multi-cluster Issues or PRs related to multi cluster. label Aug 3, 2023

luolanzone added this to the Antrea v1.14 release milestone Aug 15, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch 4 times, most recently from 5df789e to 02e95d3 Compare August 21, 2023 03:48

luolanzone changed the title ~~[WIP]Add a member resources clean up controller~~ Add a member resources clean up controller Aug 21, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch from 02e95d3 to acf1846 Compare August 21, 2023 04:03

luolanzone mentioned this pull request Aug 21, 2023

Recreate resources when a member cluster rejoins the ClusterSet #5410

Merged

jianjuns reviewed Aug 21, 2023

jianjuns reviewed Aug 21, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch from acf1846 to b202ce8 Compare August 23, 2023 02:38

jianjuns reviewed Aug 23, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch 3 times, most recently from 582452f to c380278 Compare September 4, 2023 09:48

jianjuns changed the title ~~Add a member resources clean up controller~~ Add a member resources cleanup controller Sep 5, 2023

jianjuns reviewed Sep 5, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch 2 times, most recently from 2779c52 to 48e39f5 Compare September 12, 2023 09:13

luolanzone changed the title ~~Add a member resources cleanup controller~~ Clean up auto-generated resources in leader and member clusters Sep 12, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch 2 times, most recently from cd0f72d to b711084 Compare October 25, 2023 07:24

jianjuns reviewed Oct 25, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch from b711084 to 866b156 Compare October 26, 2023 02:08

jianjuns reviewed Oct 26, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch from 866b156 to 5791036 Compare October 26, 2023 06:33

luolanzone force-pushed the mc-leader-resource-cleanup branch from 5791036 to bc26130 Compare October 26, 2023 08:44

jianjuns reviewed Oct 26, 2023

jianjuns reviewed Oct 27, 2023

luolanzone force-pushed the mc-leader-resource-cleanup branch from 79604a0 to 57448e3 Compare October 27, 2023 04:30

jianjuns previously approved these changes Oct 27, 2023

tnqn reviewed Oct 27, 2023

luolanzone dismissed jianjuns’s stale review via 870f345 October 27, 2023 05:18

luolanzone force-pushed the mc-leader-resource-cleanup branch from 57448e3 to 870f345 Compare October 27, 2023 05:18

tnqn reviewed Oct 27, 2023

Address comments

94de96c

Signed-off-by: Lan Luo <[email protected]>

luolanzone force-pushed the mc-leader-resource-cleanup branch from 870f345 to 94de96c Compare October 27, 2023 06:02

tnqn approved these changes Oct 27, 2023

tnqn merged commit c8d8ffd into antrea-io:main Oct 27, 2023

luolanzone deleted the mc-leader-resource-cleanup branch October 27, 2023 08:05

jianjuns reviewed Oct 27, 2023
