
jobs stuck in "running" state #478

Closed

volodymyrss opened this issue Feb 10, 2021 · 10 comments

Comments

@volodymyrss

Hello!

We have been using REANA more actively recently (in some relation to the example I provided before), and it is very useful! We got a new Kubernetes cluster, so many things are better.

But I repeatedly run into one issue: occasionally, after some network access problems, jobs get stuck and remain in the "running" state in REANA, while the "reana-run-job-" pods no longer exist and the "reana-run-batch-" pods are in NotReady state.

If, in this case, I delete the Kubernetes Jobs for both "reana-run-job-" and "reana-run-batch-", the workflow controller crashes. But after a restart of the message broker everything is back to normal, except that the REANA jobs remain in the "running" state and cannot be deleted.

Surely this is a bit of an unconventional recovery, but I did not find any other way; at least it frees up the cluster.

Is there a better way to recover, or otherwise to get rid of the "running" jobs which are not actually running?

Thanks!

Volodymyr

@tiborsimko
Member

Hi @volodymyrss, interesting observations! Is the network issue related to pulling of images (i.e. in user jobs), or even just accessing the broker (i.e. even between REANA infrastructure pods themselves)?

In order to hard-delete jobs from the administrator's point of view, if you want to get rid of everything that is running, e.g. in a personal deployment, I usually do kubectl delete jobs --all. This should remove all job pods as well as all batch workflow pods.
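
If you prefer to be more selective, e.g. on a shared deployment, something along these lines should also work; it is only a sketch relying on the reana-run- name prefix, not on any REANA-specific labels:

$ kubectl get jobs -o name | grep reana-run- | xargs kubectl delete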

@volodymyrss
Author

Hi @tiborsimko

> Hi @volodymyrss, interesting observations! Is the network issue related to pulling of images (i.e. in user jobs), or even just accessing the broker (i.e. even between REANA infrastructure pods themselves)?

It's due to access to reana-shared-persistent-volume, mounted from an NFS provisioner as ReadWriteMany. The difference with the new cluster is that, while it is still at UNIGE, it is further away, and network issues happen along the way. I guess it might help to host the NFS server in the same cluster.

> In order to hard-delete jobs from the administrator's point of view, if you want to get rid of everything that is running, e.g. in a personal deployment, I usually do kubectl delete jobs --all. This should remove all job pods as well as all batch workflow pods.

Yes, that's what I did, sorry if I was unclear. Just deleting the pods leads to them being recreated from the Jobs.

But if I delete the Jobs (kubectl delete jobs ...), reana-workflow-controller crashes (continuously, until back-off), since it gets messages from reana-message-broker about jobs it cannot find in the Kubernetes cluster.

If I also delete the reana-message-broker pod (and it is recreated), the messages are apparently forgotten and everything works well.

Except that I now have some 1k jobs shown, e.g. in reana-client, in the "running" state, and I cannot delete them, since they are "running".
Surely I can find a way to purge them from the database, but I wonder if there is some "force delete" option.

@tiborsimko
Member

Thanks for the detailed description, we'll look into improving the resilience of components in these cases.

Regarding deleting workflows stuck in a false "running" status, there is unfortunately no --force option for this. The best approach is to flip the status of these workflows in the database to something deletable, and then to use reana-client on them.

For example, to flip all run numbers of the "myanalysis" workflow from the false "running" status to the "stopped" status, you can do:

$ kubectl exec -i -t deployment/reana-db -- psql -U reana
psql> SET search_path to __reana, public; 
psql> UPDATE workflow SET status='stopped' WHERE name='myanalysis' AND status='running';
psql> \q

If you need just some run numbers, you can add ... AND run_number=... to the SQL query.
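
For instance, a non-interactive one-liner for a single run (an illustrative sketch; it assumes the workflow table lives in the __reana schema shown above and that run_number is a plain integer):

$ kubectl exec -i deployment/reana-db -- psql -U reana -c "UPDATE __reana.workflow SET status='stopped' WHERE name='myanalysis' AND run_number=1 AND status='running';"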

After flipping, the regular reana-client delete command will work:

$ reana-client delete -w myanalysis --include-all-runs --include-workspace --include-records

This will wipe them out completely, from the database as well as from the workspace.

Will this do? Otherwise we could perhaps introduce a --force flag.

@volodymyrss
Author

That helps, thanks a lot!

Although it takes a while to delete, a few seconds per run. Is that normal?

I will follow future updates more closely, in case there is some improvement on resilience.

But it is not blocking for me, as long as I can detect the situation and clean up after it.

@diegodelemos
Member

Hi @volodymyrss, it would be useful to have a look at the logs of the failed job-status-consumer container inside the workflow controller:

$ kubectl logs reana-workflow-controller-xxxxx-yyy job-status-consumer --previous

Unless the workflow controller deployment was recreated, the logs should be there.

@volodymyrss
Author

Hi @diegodelemos, thanks for the advice!

Yeah, I looked there; it said something along the lines of "pod does not exist" while processing some message, and that's how I deduced that:

> But if I delete the Jobs (kubectl delete jobs ...), reana-workflow-controller crashes (continuously, until back-off), since it gets messages from reana-message-broker about jobs it cannot find in the Kubernetes cluster.

And that's why I decided to recreate both reana-message-broker and reana-workflow-controller. Maybe recreating just reana-message-broker would have been enough; I am not sure.

But since I actually recreated both to address the situation, and these logs are not aggregated anywhere, I do not have them anymore, sorry!

I will provide the exact log message when it happens again, if you want to keep this issue open?

@tiborsimko
Member

In another deployment with similar NotReady issues, I saw:

   File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
      1     raise ApiException(http_resp=r)
      2 kubernetes.client.rest.ApiException: (500)
      3 Reason: Internal Server Error
      4 HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'REDACTED', 'Content-Length': '345'})
      5 HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: Authorization error (user=kubernetes, verb=get, resource=nodes, subresource=proxy)","reason":"InternalError","details":{"causes":[{"message":"Authorization error (user=kubernetes,                ↳ verb=get, resource=nodes, subresource=proxy)"}]},"code":500}

and, while handling this exception:

 File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 116, in _update_workflow_status
      8     _delete_workflow_engine_pod(workflow)
      7   File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 250, in _delete_workflow_engine_pod
      6     raise REANAWorkflowControllerError(
      5 reana_workflow_controller.errors.REANAWorkflowControllerError: Workflow engine pod cound not be deleted (500)
      4 Reason: Internal Server Error
      3 HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'REDACTED', 'Content-Length': '345'})
      2 HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: Authorization error (user=kubernetes, verb=get, resource=nodes, subresource=proxy)","reason":"InternalError","details":{"causes":[{"message":"Authorization error (user=kubernetes,                ↳ verb=get, resource=nodes, subresource=proxy)"}]},"code":500}

Not exactly the same, but may point to similar exception handling situations...

diegodelemos pushed a commit to diegodelemos/reana-workflow-controller that referenced this issue Feb 12, 2021
* Currently if an exception would be raised during workflow status
  update, the changes would not be saved in DB, causing for example
  inconsistencies if calls to Kubernetes fail for some reason,
  leaving workflows in running state forever
  (addresses reanahub/reana#478).
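
In rough terms, the fix described in that commit message amounts to committing the workflow status change to the database even when the Kubernetes clean-up call fails. A minimal sketch of that pattern, not the actual reana-workflow-controller code (session, workflow and delete_engine_pod are placeholders for the real objects):

import logging

from kubernetes.client.rest import ApiException

log = logging.getLogger(__name__)


def update_workflow_status(session, workflow, new_status, delete_engine_pod):
    """Persist the new status even if the engine pod clean-up fails (sketch)."""
    workflow.status = new_status
    try:
        # Best-effort clean-up: a Kubernetes API error here should not
        # prevent the status change from reaching the database.
        delete_engine_pod(workflow)
    except ApiException as exc:
        log.error("Could not delete workflow engine pod: %s", exc)
    finally:
        session.commit()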
@tiborsimko
Member

@diegodelemos Saw one more situation where the run batch pod was in a NotReady status and where r-w-controller's job-consumer saw an exception:

  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 237, in GET
    return self.request("GET", url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Thu, 11 Feb 2021 14:48:42 GMT', 'Content-Length': '253'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"workflow-engine\" in pod \"reana-run-batch-705dcf83-1bcb-4b65-972a-783f29c02ce9-tpjln\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}

and:

  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 116, in _update_workflow_status
    _delete_workflow_engine_pod(workflow)
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 250, in _delete_workflow_engine_pod
    raise REANAWorkflowControllerError(
reana_workflow_controller.errors.REANAWorkflowControllerError: Workflow engine pod cound not be deleted (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Thu, 11 Feb 2021 14:48:42 GMT', 'Content-Length': '253'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"workflow-engine\" in pod \"reana-run-batch-705dcf83-1bcb-4b65-972a-783f29c02ce9-tpjln\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}

diegodelemos pushed a commit to diegodelemos/reana-workflow-controller that referenced this issue Feb 17, 2021
@diegodelemos added this to the 2021-02-24 milestone Feb 22, 2021
diegodelemos pushed a commit to diegodelemos/reana-workflow-controller that referenced this issue Feb 22, 2021
@diegodelemos
Member

Closing the issue, as the core of the problem has been tackled and the fix will soon be released in REANA 0.7.3. A remaining corner case has been documented in reanahub/reana-workflow-controller#363 and will be tackled in the future.

@volodymyrss
Author

thanks!
