wait_until_pods_running should be able to rule out error pods that are not related to the final state of the owners #1611
Comments
Please clarify what the specific requirements are for undoubtedly ruling a collection of pods as "running". If the answer is "the requirements depend on what we're waiting to be running", then this function should be removed from
Alternatively, we can delete those failed pods after the retry has started; we don't need to make the change here.
We are going to reimplement this function in Go and won't make any incremental changes to it. /remove-kind good-first-issue
This issue is stale because it has been open for 90 days with no activity.
/reopen
Seems to me like this entire function could just be:
This doesn't properly check jobs, but what we have today is fairly hit or miss.
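For reference, a minimal version of that idea (a sketch only, assuming kubectl wait is acceptable for the test setup; this is not the snippet from the comment above) might be:

```bash
# Wait for every pod in the namespace to become Ready. Jobs would still need
# their own check, e.g. kubectl wait --for=condition=Complete job --all.
kubectl wait pod --all --namespace "${NAMESPACE}" --for=condition=Ready --timeout=10m
```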
In test-infra/scripts/library.sh (line 133 at 97901db), wait_until_pods_running will only succeed if all pods in the given namespace are in Running or Completed state.

But since k8s has some retry logic (e.g. a K8s Job can create a new pod if there is an error), one error pod does not necessarily mean the Job fails - https://prow.knative.dev/view/gcs/knative-prow/pr-logs/pull/knative_serving/6440/pull-knative-serving-integration-tests/1214219978266382337 is an example. In such a scenario wait_until_pods_running will return an error that is not necessarily true.

This function should be general enough to consider and rule out error pods that are not related to the final state of their owners, e.g.
...
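For illustration, a rough sketch of what such an owner-aware check could look like in bash (the function name, and the use of jq, are assumptions for this sketch, not the actual library.sh code):

```bash
# Sketch only: an owner-aware variant of the check. Assumes kubectl and jq
# are available; the function name and structure are illustrative.
function wait_until_pods_running_owner_aware() {
  local ns="$1"

  # Pods that are neither Running nor Succeeded (shown as Completed by kubectl).
  local bad_pods
  bad_pods="$(kubectl get pods -n "${ns}" -o json \
    | jq -r '.items[]
        | select(.status.phase != "Running" and .status.phase != "Succeeded")
        | .metadata.name')"

  local pod
  for pod in ${bad_pods}; do
    # Look up the pod's owning Job, if it has one.
    local owner_job
    owner_job="$(kubectl get pod "${pod}" -n "${ns}" -o json \
      | jq -r '.metadata.ownerReferences[]? | select(.kind == "Job") | .name')"
    if [[ -n "${owner_job}" ]]; then
      # If the owning Job already succeeded, this failed pod was a retried
      # attempt and should not count against the namespace.
      local succeeded
      succeeded="$(kubectl get job "${owner_job}" -n "${ns}" -o jsonpath='{.status.succeeded}' 2>/dev/null)"
      if [[ "${succeeded:-0}" -ge 1 ]]; then
        continue
      fi
    fi
    echo "ERROR: pod ${pod} in namespace ${ns} is not Running or Completed"
    return 1
  done
  return 0
}
```

The idea is that a Failed pod owned by a Job that eventually succeeded was just a retried attempt, so it should not cause the wait to report an error.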
FYI @mattmoor