do not allow negative requeue times #7589
Conversation
/assign @vdemeester @khrm FYI
pkg/reconciler/taskrun/taskrun.go (Outdated)
	timeout = 6 * time.Hour
case elapsed < 24*time.Hour:
	timeout = 24 * time.Hour
case elapsed < 48*time.Hour:
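For orientation, here is a self-contained sketch of the staggered approach these fragments come from. Only the 6h/24h/48h cases are visible above; the function name, the remaining buckets, and the surrounding wiring are assumptions for illustration, not the verbatim controller code.

```go
package main

import (
	"fmt"
	"time"
)

// staggeredRequeueWait maps how long a run has been executing to a requeue
// wait, using coarse elapsed-time buckets. Only the 6h/24h/48h cases appear
// in the diff above; the rest of this function is an assumption.
func staggeredRequeueWait(elapsed time.Duration) time.Duration {
	var timeout time.Duration
	switch {
	case elapsed < 6*time.Hour:
		timeout = 6 * time.Hour
	case elapsed < 24*time.Hour:
		timeout = 24 * time.Hour
	case elapsed < 48*time.Hour:
		timeout = 48 * time.Hour
	default:
		timeout = elapsed + time.Hour
	}
	// Wait for the remainder of the current bucket before reconciling again.
	return timeout - elapsed
}

func main() {
	fmt.Println(staggeredRequeueWait(3 * time.Hour))  // 3h0m0s
	fmt.Println(staggeredRequeueWait(30 * time.Hour)) // 18h0m0s
}
```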
We need to document this behavior.
I tend to agree with @khrm, we need to document these times somehow.
Any suggestions as to where to document this @khrm @vdemeester? The timeout feature is certainly discussed in the pipelinerun and taskrun API markdown docs, but I'm not sure we should clutter API-level discussion with a lower-level controller implementation detail.
There is also no current "controller implementation details" markdown, and I'm not sure this warrants creating one.
And do we drop some hint of the pros and cons of disabling timeouts?
Pending responses from you all, in the interim I'll work on a doc update to the pipelinerun/taskrun API docs where timeout is discussed. For the moment at least, I'll refrain from any warnings about disabling timeouts; perhaps such warnings necessitate broader community agreement.
Thanks.
Another option here @khrm @vdemeester is to just use the configured global default as the wait time, or some fraction of it, regardless of the elapsed time, when the user chooses to disable the timeout. That is probably sufficient here, and it would let the existing controls serve as a means for the user to tweak things.
I'm leaning toward switching to that, but I'll wait to hear from you all before making the change here.
This lends itself to a cleaner explanation in the existing timeout sections of pipelineruns.md and taskruns.md, as I realized when I started tinkering with a doc change.
Also, if we go this path, I now realize we would probably want to update the release note to advertise this, right?
Thanks.
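As a rough sketch of what the alternative described above could look like (the function name, the fallback value, and the wiring are hypothetical, not the actual change):

```go
package main

import (
	"fmt"
	"time"
)

// waitForDisabledTimeout picks the requeue wait to use when the user has set
// the run timeout to 0 (disabled): rather than staggering by elapsed time, it
// simply reuses the configured global default timeout, so the existing
// default-timeout-minutes setting controls how often the run is revisited.
func waitForDisabledTimeout(configuredDefault time.Duration) time.Duration {
	if configuredDefault <= 0 {
		// Hypothetical fallback if the default itself is unset.
		return time.Hour
	}
	return configuredDefault
}

func main() {
	// With the usual 1h default, a timeout-disabled run is requeued hourly.
	fmt.Println(waitForDisabledTimeout(60 * time.Minute)) // 1h0m0s
}
```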
I believe I need to expand the 0 timeout checks to the other finally-related timeouts, to make sure there are no negative wait times for those as well.
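A minimal sketch of that extension, assuming a shared helper that both the main timeout and the finally timeout paths could go through (the helper name and the fallback behavior are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// remainingWait returns how long to wait before requeueing for a given
// timeout, never going negative when the timeout is disabled (0) or has
// already been exceeded. The same clamp would apply to the finally timeout.
func remainingWait(timeout, elapsed, fallback time.Duration) time.Duration {
	if timeout <= 0 {
		// Timeout disabled: timeout-elapsed would be negative, so use the
		// fallback (e.g. the configured default timeout) instead.
		return fallback
	}
	if wait := timeout - elapsed; wait > 0 {
		return wait
	}
	return 0
}

func main() {
	defaultTimeout := time.Hour
	// Main timeout still running: 20 minutes left.
	fmt.Println(remainingWait(30*time.Minute, 10*time.Minute, defaultTimeout)) // 20m0s
	// Finally timeout disabled: use the default rather than a negative wait.
	fmt.Println(remainingWait(0, 2*time.Hour, defaultTimeout)) // 1h0m0s
}
```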
@gabemontero If I understand correctly, since this is an internal / implementation detail, very similar to something like the controller's "global resync period" (which is not documented anywhere), it might not make sense to document it, at least in the API / user docs. I agree with that, so maybe just some code comments would make sense initially 🙃. I wish we had more "deep technical docs", but as we don't have them today, I don't want to block the PR on this.
I think you have captured it correctly @vdemeester
That said, any thoughts from anyone on my simplification for what wait time to use, which I noted in #7589 (comment)?
To some degree, these staggered wait times, which I essentially just grabbed from the air, don't feel ideal.
Also, with these requeues, if the requeued item gets updated, it gets reconciled on the update anyway, correct? So just using a constant value is perhaps simpler / better, no?
If my explanation is not clear, I can submit the alternative as a separate commit so folks can compare.
/lgtm
cc @JeromeJu @chitrangpatel @afrittoli
@vdemeester: once the present PR merges, I will cherry-pick it on top of release-v0.56.x in a new PR and assign it to you.
Oh, it picks just one cherry-pick per comment?
/cherry-pick release-v0.50.x
/cherry-pick release-v0.47.x
@vdemeester @khrm I have pushed a separate commit that replaces the staggered (but arbitrary) set of wait times based on elapsed time and just uses the configured timeout default as the wait time. If this approach seems more amenable to everyone, I can clean up, add a blurb in the docs if desired, and squash appropriately. If the original form is preferred, I'll just drop the commit. Or are there any other suggestions for the wait time? Or perhaps we do not requeue with a wait time at all in the case where the timeout has been disabled, and just return nil and wait for updates to pods/taskruns/pipelineruns to result in new reconciliations?
Makes sense to me on the new commit; it makes things simpler and "in the hands of the user". We can always refine this later if need be.
Cool, thanks @vdemeester. I've gone ahead and squashed the commits and cleaned up the commented-out code. Also, with this simpler approach, the doc update that you and @khrm mentioned earlier seemed easier and less clunky, so I've gone ahead and included that as well. PTAL and see if it is appropriate for that section of the API doc. Thanks.
This implementation is better and simpler. I think we can proceed with that. Docs seem to be in the correct place.
The beta-integration test failure seems like an unrelated flake to me, as best as I can tell.
/retest
CI is green; ready for follow-up reviews / possible acceptance of the change @JeromeJu @chitrangpatel @afrittoli @vdemeester @khrm. Thanks.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: JeromeJu, vdemeester. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
/retest
@vdemeester: new pull request created: #7638
@vdemeester: #7589 failed to apply on top of branch "release-v0.53.x":
@vdemeester: #7589 failed to apply on top of branch "release-v0.50.x":
@vdemeester: #7589 failed to apply on top of branch "release-v0.47.x":
Fixes #7588
/kind bug
Changes
Use of the value of 0 for the taskrun/pipelinerun timeout, which per https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#configuring-a-failure-timeout for example means the timeout is disabled, results in the waitTime passed to the Requeue event being negative. This had the observed behavior of requeueing immediately, and intense cycles of many reconciliations per second were observed if the TaskRun's/PipelineRun's state did not in fact change. This artificially constrained the performance of the pipeline controller.
This change makes sure the wait time passed to the Requeue is not negative.
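To make the failure mode concrete, here is a small standalone illustration (assumed arithmetic only, not the reconciler code itself) of how a disabled timeout turns into a negative wait, and the kind of guard this change introduces:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timeout disabled (0) and the run started two hours ago: the naive
	// computation yields a negative wait, so the key is requeued
	// immediately and the controller reconciles it in a tight loop.
	var timeout time.Duration // 0 means the user disabled the timeout
	elapsed := 2 * time.Hour

	naiveWait := timeout - elapsed
	fmt.Println(naiveWait) // -2h0m0s

	// The fix, in spirit: never hand a negative wait to the requeue; fall
	// back to the configured default timeout (1h here is an assumption).
	wait := naiveWait
	if wait < 0 {
		wait = time.Hour
	}
	fmt.Println(wait) // 1h0m0s
}
```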
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep.

Release Notes