Clean up auto-generated resources in leader and member clusters #5351
Conversation
@jianjuns This PR is for cleaning up a member cluster's stale resources on a leader cluster when the member has been disconnected for more than 5 minutes (or longer?). The unit test is WIP. Regarding the stale-resource cleanup when the whole ClusterSet is removed, I assume users will delete a member or leader cluster via `kubectl delete *.yaml`. A few stale resources like AntreaClusterNetworkPolicy CRs or multi-cluster Service CRs will be left over. I am thinking we may provide a doc with a few sample scripts to guide users on how to remove them. Let me know your thoughts. Thanks.
No, we need not auto-delete user-created resources, but we should delete MC Controller auto-created resources. And we should let a member delete its MemberClusterAnnounce in the leader, along with the other resources it creates in the leader. The leader may auto-delete member resources after the MemberClusterAnnounce is deleted too.
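For illustration, a minimal sketch of the member-side deletion described here, assuming a controller-runtime client scoped to the leader's Namespace (`commonAreaClient`, `leaderNamespace`, and the announce naming scheme are assumptions, not code from this PR):

```go
// Sketch: when the member's ClusterSet CR is deleted, remove the
// MemberClusterAnnounce this member created in the leader cluster.
memberAnnounce := &mcv1alpha1.MemberClusterAnnounce{
	ObjectMeta: metav1.ObjectMeta{
		// Assumed naming scheme for the member's announce CR.
		Name:      "member-announce-from-" + string(r.clusterID),
		Namespace: leaderNamespace,
	},
}
if err := commonAreaClient.Delete(ctx, memberAnnounce); err != nil && !apierrors.IsNotFound(err) {
	return fmt.Errorf("failed to delete MemberClusterAnnounce in the leader cluster: %v", err)
}
```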
Yeah, I agree with you that we should delete only the MC Controller auto-created resources. The stale AntreaClusterNetworkPolicy/multi-cluster Service CRs I referred to in the previous comment are auto-created by the member antrea-mc-controller; they will be left over if users simply delete the cluster manifests (the auto-created AntreaClusterNetworkPolicy CRs carry an identifying annotation). For now, we already have a stale controller that cleans up a member's MemberClusterAnnounce in the leader cluster if its last update timestamp is more than 5 minutes old (but only once, when antrea-mc-controller launches). A member controller will also delete the MemberClusterAnnounce when a ClusterSet CR is deleted, but not other resources. If we also delete other resources when a ClusterSet CR is deleted, one problem is that the existing exported resources won't be added back automatically once a ClusterSet CR is created again. I will add the logic to auto-delete member resources after the MemberClusterAnnounce is deleted in the leader, and double-check whether there is a more proper way to handle resource cleanup. Thanks for the suggestions.
@jianjuns I updated the comment above in reply. Thanks.
@luolanzone: we should definitely delete all auto-created resources in a local cluster when a ClusterSet is deleted; and when a MemberClusterAnnounce is deleted we should delete all resources created for that member too. And I am not sure we should auto-delete a MemberClusterAnnounce when a member disconnects, at least not after just 5 minutes (if we want to keep that logic, I would change it to 1 day).
I do not understand this. What resources do you mean here? In the leader or the member? In any case, we should fix that and recreate the resources when the ClusterSet is created (or say, a ClusterSet can be created after the resources to export).
/test-multicluster-e2e
When a ClusterSet is deleted in either the member or the leader, we should auto-delete all generated resources in the local cluster. Maybe you want a separate PR for that?
I am working on PR #5438 to clean up all resources in the member and leader with a guide and antctl, but that is more about cleaning up Antrea Multi-cluster itself in the leader and member. Do you mean to clean up all generated resources when the ClusterSet CR is deleted? I will check this part.
Yes, I feel the MC Controller should auto-delete all CRs it created in the local cluster when the ClusterSet CR is deleted, and the MC Controller in the member should also delete the MemberClusterAnnounce in the leader.
Just to add: I am not against the antctl command approach. I just feel we can still have MC Controller auto-cleanup, with antctl as a backup solution.
1. Split the existing stale controller into leader and member folders.
2. The new stale controller in the leader cluster will do the following:
   * Check MemberClusterAnnounces periodically in the leader cluster and delete a stale CR if its last timestamp annotation `touch-ts` is over 24 hours old.
   * Clean up all corresponding ResourceExports when a MemberClusterAnnounce is deleted.
   * Clean up stale ResourceExports if there is no existing MemberClusterAnnounce when the controller is started.
   * Clean up all MemberClusterAnnounces and corresponding ResourceExports when the controller is started with no ClusterSet CR.
3. The ClusterSet controller in the leader will remove all remaining ResourceExports and MemberClusterAnnounces when the ClusterSet CR is deleted in the leader cluster.
4. The new stale controller in the member cluster will do the following:
   * Clean up any stale resources when the ClusterSet is ready.
   * Clean up any stale resources when the controller is restarted.
5. The ClusterSet controller will be responsible for removing all imported and exported resources for the member cluster when the ClusterSet CR is deleted.

Signed-off-by: Lan Luo <[email protected]>
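A minimal sketch of the 24-hour staleness test in item 2, assuming `touch-ts` holds an RFC 3339 timestamp (the bare annotation key and the helper name are illustrative, not necessarily this PR's code):

```go
// Sketch: report whether a MemberClusterAnnounce is stale based on its
// last-touch timestamp annotation.
func isStaleMemberClusterAnnounce(announce *mcv1alpha1.MemberClusterAnnounce, now time.Time) bool {
	ts, ok := announce.Annotations["touch-ts"]
	if !ok {
		return false // no timestamp yet; leave it for the next periodic check
	}
	lastTouch, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		return false // never delete on unparsable data
	}
	return now.Sub(lastTouch) > 24*time.Hour
}
```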
/test-multicluster-e2e
```go
// Handle create or update
newLeader := clusterSet.Spec.Leaders[0]
if r.installedLeader.clusterID == newLeader.ClusterID && r.installedLeader.serverUrl == newLeader.Server &&
```
Just found we missed a case: the ClusterSet changed (ClusterID and/or ClusterSet ID changed) but the leader does not change, in which case we should still update `r.clusterID` and `clusterSetID`.
For now, how about we go this way:
- If either clusterID or clusterSetID changes, it means the ClusterSet is recreated, and we should do cleanup with `cleanUpResources`, and also `createRemoteCommonArea`.
- If just the leader attributes change, we do not do cleanup, but just stop and recreate the CommonArea with `createRemoteCommonArea`.

Later we can revisit whether we should disallow leader changes, or maybe also do cleanup in some cases (like leader ClusterID or serverUrl changes).
Updated.
I do not see you change this.
I think we should do something like:

```go
clusterSetCreated := r.clusterID != common.ClusterID(clusterSet.Spec.ClusterID) || r.clusterSetID != common.ClusterSetID(clusterSet.Name)
leaderChanged := r.installedLeader.clusterID != newLeader.ClusterID || r.installedLeader.serverUrl != newLeader.Server || r.installedLeader.secretName != newLeader.Secret
if !clusterSetCreated && !leaderChanged {
	return nil
}
if clusterSetCreated {
	// line 138 - 161
}
return r.createRemoteCommonArea(clusterSet)
```
Got it, this is clearer. Refined, thanks.
```go
go func() {
	for range c.commonAreaCreationCh {
		retry.OnError(common.CleanUpRetry, func(err error) bool { return true },
```
I guess one issue with `retry.OnError` is that if a ClusterSet is recreated before the previous retry is done, we can have two retries ongoing at the same time? But probably let us look at a better solution in a follow-up PR.
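One possible shape for that follow-up, sketched under assumed names (`cleanUpStaleResourcesOnMember` is illustrative, not this PR's code): keep the single worker, but drain queued signals before each attempt so repeated ClusterSet recreations coalesce into at most one pending retry.

```go
go func() {
	for range c.commonAreaCreationCh {
		// Coalesce any signals queued while the previous attempt was running,
		// so a ClusterSet recreated mid-retry triggers one cleanup, not two.
		// (Assumes the channel is never closed.)
	drain:
		for {
			select {
			case <-c.commonAreaCreationCh:
			default:
				break drain
			}
		}
		_ = retry.OnError(common.CleanUpRetry,
			func(error) bool { return true }, // retry on any error
			c.cleanUpStaleResourcesOnMember)
	}
}()
```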
/test-multicluster-e2e
```go
	}
}

r.clusterSetConfig = clusterSet.DeepCopy()
```
From what I saw, it seems only `clusterSet.Generation` is really useful to save. But then I got another question: probably we should save the generation before `if !clusterSetCreated && !leaderChanged { return nil }`?
Yeah, moved it.
```go
clusterSetCreated := r.clusterID != common.ClusterID(clusterSet.Spec.ClusterID) || r.clusterSetID != common.ClusterSetID(clusterSet.Name)
leaderChanged := r.installedLeader.clusterID != newLeader.ClusterID || r.installedLeader.serverUrl != newLeader.Server ||
	r.installedLeader.secretName != newLeader.Secret
r.clusterSetConfig = clusterSet.DeepCopy()
```
Later we may refactor the code to just save `Generation`, or simply get the ClusterSet generation in `updateStatus`.
Sure, I will go through this part and refine it in the next release.
/test-multicluster-e2e
There is a doc markdown lint failure, but it's caused by something unrelated to this PR.
Yes, I saw the failure in other PRs too. Let us root-cause and fix it.
```go
if err = staleController.SetupWithManager(mgr); err != nil {
	return fmt.Errorf("error creating StaleResCleanupController: %v", err)
}
go staleController.RunPeriodically(stopCh)
```
Out of curiosity: `Run` is a common name for such long-running routines; why add a suffix in this case alone?
Let me change it back; there was another function named `RunOnce()`, so I changed this name before.
Renamed
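For reference, the conventional shape after the rename might look like this (a sketch; `cleanUpStaleResources` and `cleanUpInterval` are assumed names, not necessarily this PR's code):

```go
// Sketch: run the cleanup once immediately, then periodically until stopCh
// is closed; wait.Until provides exactly this behavior.
func (c *StaleResCleanupController) Run(stopCh <-chan struct{}) {
	defer klog.InfoS("Shutting down StaleResCleanupController")
	wait.Until(c.cleanUpStaleResources, cleanUpInterval, stopCh)
}
```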
```go
Steps:    15,
Duration: 500 * time.Millisecond,
Factor:   2.0,
```
The maximum wait duration could become 0.5s * 2^15 ≈ 4.5h. Is that expected?
Yeah, it's too long; let me change it, considering it might be meaningless to retry for 4+ hours.
Changed `Steps` to 12, which is 30+ minutes in total.
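To ground those numbers: with `Factor: 2.0` the delays form a geometric series, so the total sleep is `Duration * (2^Steps - 1)`. A quick check (not code from this PR):

```go
package main

import (
	"fmt"
	"time"
)

// totalBackoff returns the total sleep of an exponential backoff with the
// given base duration, factor 2.0, and step count: d * (2^steps - 1).
func totalBackoff(d time.Duration, steps uint) time.Duration {
	return d * time.Duration((1<<steps)-1)
}

func main() {
	fmt.Println(totalBackoff(500*time.Millisecond, 15)) // 4h33m3.5s: too long
	fmt.Println(totalBackoff(500*time.Millisecond, 12)) // 34m7.5s: the chosen value
}
```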
```go
for _, resExport := range resourceExports.Items {
	// The AntreaClusterNetworkPolicy kind of ResourceExport is created in the leader directly
	// without a ClusterID info. It's not owned by any member cluster.
	if resExport.Spec.Kind != constants.AntreaClusterNetworkPolicyKind && !existingMemberClusterIDs.Has(resExport.Spec.ClusterID) {
```
Not familiar with the code, but should the first expression use `==`? Otherwise non-AntreaClusterNetworkPolicy ResourceExports will be collected into staleResExports.
Yes, we did this to skip AntreaClusterNetworkPolicy ResourceExports: this kind of ResourceExport is created by users, and we don't want to clean them up automatically.
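A self-contained sketch of that filter, assuming the field names shown in the diff and a string set type (`sets.Set[string]` from apimachinery; the function name is illustrative, not a verbatim copy of this PR):

```go
// Sketch: collect ResourceExports owned by member clusters that no longer
// exist, while never touching user-created AntreaClusterNetworkPolicy exports.
func staleResourceExports(items []mcv1alpha1.ResourceExport, existingMemberClusterIDs sets.Set[string]) []mcv1alpha1.ResourceExport {
	var stale []mcv1alpha1.ResourceExport
	for _, re := range items {
		// AntreaClusterNetworkPolicy ResourceExports are created by users in the
		// leader directly, without a ClusterID, and are skipped on purpose.
		if re.Spec.Kind == constants.AntreaClusterNetworkPolicyKind {
			continue
		}
		if !existingMemberClusterIDs.Has(re.Spec.ClusterID) {
			stale = append(stale, re)
		}
	}
	return stale
}
```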
/test-multicluster-e2e
/test-multicluster-e2e
```go
// MemberClusterAnnounce could be kept in the leader cluster, if antrea-mc-controller crashes after the failure.
// Leader cluster will delete the stale MemberClusterAnnounce with a garbage collection mechanism in this case.
return ctrl.Result{}, fmt.Errorf("failed to delete MemberClusterAnnounce in the leader cluster: %v", err)

clusterSetNotFound = true
```
Here we should check whether the ClusterSet matches `r.clusterSetID`?