Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: update condition before calling delete and decrementing counter #1074

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

MatousJobanek
Copy link
Contributor

fix for flakiness in space_autocompletion_test.go an example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/codeready-toolchain_toolchain-e2e/1033/pull-ci-codeready-toolchain-toolchain-e2e-master-e2e/1823343090958602240

what happened:

  1. Space was deleted as part of the post-test cleanup
  2. after calling delete on NSTemplateSet (and decrementing the Spacen counter), the controller wanted to update Space's status, but this failed for a conflict
  3. new reconciliation was triggered, but the cache wasn't updated so it didn't see NSTemplateSet as already being deleted
  4. the Space controller tried to delete it one more time and decremented the counter again
  5. As a result, the number of provisioned Spaces was 0, but the e2e test expects that at least one Space should be present. It sets the max number of Spaces (for member1) to the same value as is in the ToolchainStatus (in this case to 0)
  6. Since 0 means no limit, then the Space controller provisioned the next Space to member1, but member1 was supposed to be full, and the Space was expected to be provisoned in member2.

Thus, to avoid similar problems in the future, I updated the controller so it updates the status before calling delete on NSTemplateSet and decrementing the counter.

Copy link
Contributor

@xcoulon xcoulon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! Just left a suggestion

@@ -474,6 +478,10 @@ func (r *Reconciler) ensureSpaceDeletion(ctx context.Context, space *toolchainv1
logger := log.FromContext(ctx)

logger.Info("terminating Space")
if err := r.setStatusTerminating(ctx, space); err != nil {
logger.Error(err, "error updating status")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about wrapping the underlying error and returning the new err? Otherwise, we'll get 2 lines in logs, which is harder to track :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but I don't follow.
What do you propose to do differently compared to what we currently have in the controller? Do you want to drop the line 482 (and wrap the error with the message instead)? If yes, then this change doesn't affect only this line, but all other lines in the whole controller and most likely in all other controllers as well. I only kept the consistency with the current code.
TBH, I'm fine with removing the duplication in the logs, but this is much bigger effort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, let's address that in a separate PR later, then

Copy link
Contributor

@mfrancisc mfrancisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch and great investigation!

Thanks for describing the scenario. I have only few minor comments.

Also I have some concerns about the 0 value meaning infinity , should we change that and use something like -1 when we want to configure a member cluster with no space limits ?

controllers/space/space_controller_test.go Outdated Show resolved Hide resolved
controllers/space/space_controller_test.go Outdated Show resolved Hide resolved
@MatousJobanek
Copy link
Contributor Author

/retest
infra

Copy link

openshift-ci bot commented Aug 16, 2024

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot removed the lgtm label Aug 16, 2024
@mfrancisc
Copy link
Contributor

interestingly is still failing on the same test apparently 🤔

@mfrancisc
Copy link
Contributor

Also I'm not sure I understand why the first cluster has the . in the suffix while the second one has no dot.

        expected:
        ----
        member-ci-op-cipyilkv-1de72.devsandboxci.devcluster.openshift2----
        actual:
        ----
        member-ci-op-cipyilkv-1de72.devsandboxci.devcluster.openshift.1----

Copy link
Contributor

@rajivnathan rajivnathan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great investigation and explanation 👍

controllers/space/space_controller_test.go Outdated Show resolved Hide resolved
controllers/space/space_controller_test.go Outdated Show resolved Hide resolved
Copy link

openshift-ci bot commented Aug 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MatousJobanek, mfrancisc, rajivnathan, xcoulon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [MatousJobanek,mfrancisc,rajivnathan,xcoulon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MatousJobanek
Copy link
Contributor Author

there is apparently still a problem with the cache not being updated with the NSTemplateSet CR that are already being deleted (the controller called "delete" on them before)

{"level":"info","ts":"2024-08-19T10:03:19.126Z","msg":"reconciling Space","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"0830505d-f40b-469f-89c9-c4665c689331"}
{"level":"info","ts":"2024-08-19T10:03:19.126Z","msg":"terminating Space","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"0830505d-f40b-469f-89c9-c4665c689331"}
{"level":"info","ts":"2024-08-19T10:03:19.133Z","msg":"deleting the NSTemplateSet resource","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"0830505d-f40b-469f-89c9-c4665c689331"}
{"level":"info","ts":"2024-08-19T10:03:19.140Z","msg":"decremented Spaces count","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"0830505d-f40b-469f-89c9-c4665c689331","clusterName":"member-ci-op-d1ypgdqs-1de72.devsandboxci.devcluster.openshift.1","value":1}
{"level":"info","ts":"2024-08-19T10:03:19.140Z","msg":"deleted the NSTemplateSet resource","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"0830505d-f40b-469f-89c9-c4665c689331"}
{"level":"info","ts":"2024-08-19T10:03:19.140Z","msg":"reconciling Space","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"ffdb3e52-0f2c-4f57-8287-30bc6903167f"}
{"level":"info","ts":"2024-08-19T10:03:19.140Z","msg":"terminating Space","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"ffdb3e52-0f2c-4f57-8287-30bc6903167f"}
{"level":"info","ts":"2024-08-19T10:03:19.140Z","msg":"deleting the NSTemplateSet resource","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"ffdb3e52-0f2c-4f57-8287-30bc6903167f"}
{"level":"info","ts":"2024-08-19T10:03:19.151Z","msg":"decremented Spaces count","controller":"space","controllerGroup":"toolchain.dev.openshift.com","controllerKind":"Space","Space":{"name":"testautomaticclusterassignmentsetlowcapaxf9vz","namespace":"toolchain-host-19093033"},"namespace":"toolchain-host-19093033","name":"testautomaticclusterassignmentsetlowcapaxf9vz","reconcileID":"ffdb3e52-0f2c-4f57-8287-30bc6903167f","clusterName":"member-ci-op-d1ypgdqs-1de72.devsandboxci.devcluster.openshift.1","value":0}

Copy link

sonarcloud bot commented Sep 17, 2024

Copy link

codecov bot commented Sep 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.17%. Comparing base (a5cde1a) to head (16cd3f2).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1074      +/-   ##
==========================================
+ Coverage   85.10%   85.17%   +0.06%     
==========================================
  Files          55       55              
  Lines        5041     5044       +3     
==========================================
+ Hits         4290     4296       +6     
+ Misses        580      578       -2     
+ Partials      171      170       -1     
Files with missing lines Coverage Δ
controllers/space/space_controller.go 87.63% <100.00%> (+0.74%) ⬆️

Copy link

sonarcloud bot commented Nov 5, 2024

Copy link

openshift-ci bot commented Nov 5, 2024

@MatousJobanek: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e aaabbba link true /test e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants