
jobs stuck in "running" state #478

Closed

volodymyrss opened this issue Feb 10, 2021 · 10 comments

Comments

@volodymyrss

Hello!

We have been using REANA more actively recently (in some relation to the example I provided before), and it is very useful! We got a new Kubernetes cluster, so many things are better.

But I repeatedly run into one issue: occasionally, after some network access problems, jobs get stuck and remain in the "running" state in REANA, while the "reana-run-job-" pods no longer exist and the "reana-run-batch-" pods are in NotReady state.

If, in this case, I delete the Kubernetes Jobs for both "reana-run-job-" and "reana-run-batch-", the workflow controller crashes. But after a restart of the message broker everything is back to normal, except that the REANA jobs remain in the "running" state and cannot be deleted.

Surely this is a bit of an unconventional recovery, but I did not find any other way; at least it frees up the cluster.

Is there a better way to recover, or otherwise to get rid of the "running" jobs which are not actually running?

Thanks!

Volodymyr

@tiborsimko
Member

Hi @volodymyrss, interesting observations! Is the network issue related to pulling of images (i.e. in user jobs), or even just accessing the broker (i.e. even between REANA infrastructure pods themselves)?

In order to hard-delete jobs from the administrator's point of view, if you want to get rid of everything that is running, e.g. in a personal deployment, I usually do kubectl delete jobs --all. This should remove all job pods as well as all batch workflow pods.
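
If you prefer to be more selective, e.g. on a shared deployment, something along these lines should also work; it is only a sketch relying on the reana-run- name prefix, not on any REANA-specific labels:

$ kubectl get jobs -o name | grep reana-run- | xargs kubectl delete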

@volodymyrss
Author

Hi @tiborsimko

> Hi @volodymyrss, interesting observations! Is the network issue related to pulling of images (i.e. in user jobs), or even just accessing the broker (i.e. even between REANA infrastructure pods themselves)?

It's due to access to reana-shared-persistent-volume, mounted from an NFS provisioner as ReadWriteMany. The difference with the new cluster is that, while it is still at UNIGE, it is further away, and network issues happen along the way. I guess it might help to host the NFS server in the same cluster.

> In order to hard-delete jobs from the administrator's point of view, if you want to get rid of everything that is running, e.g. in a personal deployment, I usually do kubectl delete jobs --all. This should remove all job pods as well as all batch workflow pods.

Yes, that's what I did, sorry if I was unclear. Just deleting the pods leads to them being recreated from the Jobs.

But if I delete the Jobs (kubectl delete jobs ...), reana-workflow-controller crashes (continuously, until back-off), since it gets messages from reana-message-broker about jobs it cannot find in the Kubernetes cluster.

If I also delete the reana-message-broker pod (and it is recreated), the messages are apparently forgotten and everything works well.

Except that I now have some 1k jobs shown, e.g. in reana-client, in the "running" state, and I cannot delete them, since they are "running".
Surely I can find a way to purge them from the database, but I wonder if there is some "force delete" option.

@tiborsimko
Member

Thanks for the detailed description, we'll look into improving the resilience of components in these cases.

Regarding deleting workflows stuck in a false "running" status, there is unfortunately no --force option for this. The best approach is to flip the status of these workflows in the database to something deletable, and then to use reana-client on them.

For example, to flip all run numbers of the "myanalysis" workflow from the false "running" status to the "stopped" status, you can do:

$ kubectl exec -i -t deployment/reana-db -- psql -U reana
psql> SET search_path to __reana, public; 
psql> UPDATE workflow SET status='stopped' WHERE name='myanalysis' AND status='running';
psql> \q

If you need just some run numbers, you can add ... AND run_number=... to the SQL query.
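
For instance, a non-interactive one-liner for a single run (an illustrative sketch; it assumes the workflow table lives in the __reana schema shown above and that run_number is a plain integer):

$ kubectl exec -i deployment/reana-db -- psql -U reana -c "UPDATE __reana.workflow SET status='stopped' WHERE name='myanalysis' AND run_number=1 AND status='running';"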

After flipping, the regular reana-client delete command will work:

$ reana-client delete -w myanalysis --include-all-runs --include-workspace --include-records

This will wipe them out completely, from the database as well as from the workspace.

Will this do? Otherwise we could perhaps introduce a --force flag.

@volodymyrss
Author

That helps, thanks a lot!

Although it takes a while to delete, a few seconds per run. Is that normal?

I will follow future updates more closely, in case there is some improvement on resilience.

But it is not blocking for me, as long as I can detect the situation and clean up after it.

@diegodelemos
Member

Hi @volodymyrss, it would be useful to have a look at the logs of the failed job-status-consumer container inside the workflow controller:

$ kubectl logs reana-workflow-controller-xxxxx-yyy job-status-consumer --previous

Unless the workflow controller deployment was recreated, the logs should be there.

@volodymyrss
Author

Hi @diegodelemos, thanks for the advice!

Yeah, I looked there; it said something along the lines of "pod does not exist" while processing some message, and that's how I deduced that:

> But if I delete the Jobs (kubectl delete jobs ...), reana-workflow-controller crashes (continuously, until back-off), since it gets messages from reana-message-broker about jobs it cannot find in the Kubernetes cluster.

And that's why I decided to recreate both reana-message-broker and reana-workflow-controller. Maybe recreating just reana-message-broker would have been enough; I am not sure.

But since I actually recreated both to address the situation, and these logs are not aggregated anywhere, I do not have them anymore, sorry!

I will provide the exact log message when it happens again, if you want to keep this issue open?

@tiborsimko
Member

In another deployment with similar NotReady issues, I saw:

   File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
      1     raise ApiException(http_resp=r)
      2 kubernetes.client.rest.ApiException: (500)
      3 Reason: Internal Server Error
      4 HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'REDACTED', 'Content-Length': '345'})
      5 HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: Authorization error (user=kubernetes, verb=get, resource=nodes, subresource=proxy)","reason":"InternalError","details":{"causes":[{"message":"Authorization error (user=kubernetes,                ↳ verb=get, resource=nodes, subresource=proxy)"}]},"code":500}

and, while handling this exception:

 File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 116, in _update_workflow_status
      8     _delete_workflow_engine_pod(workflow)
      7   File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 250, in _delete_workflow_engine_pod
      6     raise REANAWorkflowControllerError(
      5 reana_workflow_controller.errors.REANAWorkflowControllerError: Workflow engine pod cound not be deleted (500)
      4 Reason: Internal Server Error
      3 HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'REDACTED', 'Content-Length': '345'})
      2 HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: Authorization error (user=kubernetes, verb=get, resource=nodes, subresource=proxy)","reason":"InternalError","details":{"causes":[{"message":"Authorization error (user=kubernetes,                ↳ verb=get, resource=nodes, subresource=proxy)"}]},"code":500}

Not exactly the same, but may point to similar exception handling situations...

diegodelemos pushed a commit to diegodelemos/reana-workflow-controller that referenced this issue Feb 12, 2021
* Currently if an exception would be raised during workflow status
  update, the changes would not be saved in DB, causing for example
  inconsistencies if calls to Kubernetes fail for some reason,
  leaving workflows in running state forever
  (addresses reanahub/reana#478).
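
In rough terms, the fix described in that commit message amounts to committing the workflow status change to the database even when the Kubernetes clean-up call fails. A minimal sketch of that pattern, not the actual reana-workflow-controller code (session, workflow and delete_engine_pod are placeholders for the real objects):

import logging

from kubernetes.client.rest import ApiException

log = logging.getLogger(__name__)


def update_workflow_status(session, workflow, new_status, delete_engine_pod):
    """Persist the new status even if the engine pod clean-up fails (sketch)."""
    workflow.status = new_status
    try:
        # Best-effort clean-up: a Kubernetes API error here should not
        # prevent the status change from reaching the database.
        delete_engine_pod(workflow)
    except ApiException as exc:
        log.error("Could not delete workflow engine pod: %s", exc)
    finally:
        session.commit()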
@tiborsimko
Member

@diegodelemos Saw one more situation where the run batch pod was in a NotReady status and where r-w-controller's job-consumer saw an exception:

  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 237, in GET
    return self.request("GET", url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Thu, 11 Feb 2021 14:48:42 GMT', 'Content-Length': '253'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"workflow-engine\" in pod \"reana-run-batch-705dcf83-1bcb-4b65-972a-783f29c02ce9-tpjln\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}

and:

  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 116, in _update_workflow_status
    _delete_workflow_engine_pod(workflow)
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 250, in _delete_workflow_engine_pod
    raise REANAWorkflowControllerError(
reana_workflow_controller.errors.REANAWorkflowControllerError: Workflow engine pod cound not be deleted (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Thu, 11 Feb 2021 14:48:42 GMT', 'Content-Length': '253'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"workflow-engine\" in pod \"reana-run-batch-705dcf83-1bcb-4b65-972a-783f29c02ce9-tpjln\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}

diegodelemos pushed a commit to diegodelemos/reana-workflow-controller that referenced this issue Feb 17, 2021
@diegodelemos added this to the 2021-02-24 milestone Feb 22, 2021
diegodelemos pushed a commit to diegodelemos/reana-workflow-controller that referenced this issue Feb 22, 2021
@diegodelemos
Member

Closing the issue, as the core of the problem has been tackled and the fix will soon be released in REANA 0.7.3. A remaining corner case has been documented in reanahub/reana-workflow-controller#363 and will be tackled in the future.

@volodymyrss
Author

thanks!
