jobs stuck in "running" state #478
Hi @volodymyrss, interesting observations! Is the network issue related to pulling images (i.e. in user jobs), or even just to accessing the broker (i.e. even between the REANA infrastructure pods themselves)? In order to hard-delete jobs from the administrator's point of view, if you want to get rid of everything that is running, e.g. in a personal deployment, I am usually doing it directly with kubectl.
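As a hedged sketch (not necessarily the exact command meant above), assuming the REANA run Jobs carry the reana-run- name prefix mentioned elsewhere in this thread, such a hard delete could look like:

$ kubectl get jobs -o name | grep reana-run- | xargs kubectl delete
# deleting the Jobs also removes the pods they own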
Hi @tiborsimko
It's due to access to
Yes, that's what I did, sorry if I was unclear. Just deleting the pods leads to them being recreated from the Jobs. But if I delete the Jobs (kubectl delete jobs ...), reana-workflow-controller crashes (continuously, until back-off), since it receives messages from reana-message-broker about jobs it cannot find in the Kubernetes cluster. If I also delete the reana-message-broker pod (so that it is recreated), the messages are apparently forgotten and everything works well again, except that I now have some 1k of runs still shown as "running", e.g. in reana-client.
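A sketch of the recovery sequence described above, assuming the broker and the controller run under the standard REANA component names (an assumption; names may differ per deployment):

# recreate the message broker pod so that queued messages about the deleted jobs are dropped
$ kubectl get pods -o name | grep reana-message-broker | xargs kubectl delete

# if the workflow controller is still in a crash back-off, restart it as well
$ kubectl rollout restart deployment/reana-workflow-controller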
Thanks for the detailed description, we'll look into improving the resilience of the components in these cases. Regarding deleting workflows stuck in a false "running" status, there is no dedicated command yet, but you can flip the status directly in the database. For example, to flip all run numbers of the "myanalysis" workflow from the false "running" status to the "stopped" status, you can do:

$ kubectl exec -i -t deployment/reana-db -- psql -U reana
psql> SET search_path to __reana, public;
psql> UPDATE workflow SET status='stopped' WHERE name='myanalysis' AND status='running';
psql> \q

If you need just some run numbers, you can add a corresponding run_number condition to the WHERE clause. After flipping the status, the regular reana-client delete command will work; this will wipe them all out even from the database and the workspace. Will this do? Otherwise we could perhaps introduce a dedicated command for this.
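For illustration only (a sketch, not the exact commands from the original comment; the run_number column and the reana-client options should be checked against your REANA version), restricting the flip to particular run numbers and then deleting could look like:

psql> UPDATE workflow SET status='stopped' WHERE name='myanalysis' AND run_number IN (3, 4) AND status='running';

$ reana-client delete -w myanalysis
# see `reana-client delete --help` for the options that also remove the workspace and the database records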
That helps, thanks a lot! Although it takes a while to delete, a few seconds per run. Is that normal? I will follow future updates more closely, in case there are improvements on resilience. But it is not blocking for me, as long as I can detect the situation and clean up after it.
Hi @volodymyrss, it would be useful to have a look at the logs of the failed reana-workflow-controller. Unless the workflow controller deployment was recreated, the logs should be there.
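A sketch of how those logs could be retrieved, assuming the controller runs as a Deployment named reana-workflow-controller:

$ kubectl logs deployment/reana-workflow-controller
# for a crash-looping pod, the previous container's logs are often the interesting ones:
$ kubectl logs <reana-workflow-controller-pod-name> --previous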
Hi @diegodelemos, thanks for the advice! Yeah, I looked there, and it said something along the lines of "pod does not exist" while processing some message; that's how I deduced what I described above.
And that's why I decided to recreate both reana-message-broker and reana-workflow-controller. Maybe recreating just reana-message-broker would have done the same, not sure. But since I actually recreated both to address the situation, and these logs happen not to be aggregated, I do not have them anymore, sorry! I will provide the exact log message when it happens again, if you want to keep this issue open?
In another deployment with similar NotReady issues, I saw one exception and then another one raised while handling it. Not exactly the same, but it may point to similar exception-handling situations...
* Currently, if an exception is raised during a workflow status update, the changes are not saved in the DB, causing inconsistencies for example when calls to Kubernetes fail and leaving workflows in the running state forever (addresses reanahub/reana#478).
@diegodelemos Saw one more situation where the run batch pod was in a ContainerCreating state and the Kubernetes API call for it failed with:

  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 237, in GET
    return self.request("GET", url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Thu, 11 Feb 2021 14:48:42 GMT', 'Content-Length': '253'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"workflow-engine\" in pod \"reana-run-batch-705dcf83-1bcb-4b65-972a-783f29c02ce9-tpjln\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}

and:
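As an illustration of how such 400 responses can be guarded against (a sketch only, not the actual REANA code; the pod name, namespace and container name are placeholders), one could check the container state before asking the Kubernetes API for logs and treat the 400 as "not ready yet":

from kubernetes import client, config
from kubernetes.client.rest import ApiException

# Load cluster credentials; inside a pod this would be config.load_incluster_config().
config.load_kube_config()

v1 = client.CoreV1Api()


def try_get_logs(pod_name, namespace="default", container="workflow-engine"):
    """Return the pod logs, or None while the container is not ready to serve them."""
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for status in pod.status.container_statuses or []:
        # A container still "waiting" (e.g. ContainerCreating) has no logs yet.
        if status.name == container and status.state.waiting is not None:
            return None
    try:
        return v1.read_namespaced_pod_log(
            name=pod_name, namespace=namespace, container=container
        )
    except ApiException as exc:
        # The API returns 400 while the container is still being created.
        if exc.status == 400:
            return None
        raise


# Placeholder pod name for illustration; real reana-run-batch pod names differ.
logs = try_get_logs("reana-run-batch-xxxxxxxx-yyyyy")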
Closing the issue as the core of the problem has been tackled and it will soon be released in REANA.
thanks!
Hello!
We have been using REANA more recently (in some relation to the example I provided before), and it is very useful! We got a new Kubernetes cluster, so many things are better.
But I repeatedly run into one issue: occasionally, after some network access problems, jobs get stuck and remain in the "running" state in REANA,
while the "reana-run-job-" pods do not exist and the "reana-run-batch-" pods are in a NotReady state.
If, in this case, I delete the Jobs for both "reana-run-job-" and "reana-run-batch-", the workflow controller crashes. But after a restart of the message broker, everything is back to normal, except that the REANA jobs stay in the "running" state and cannot be deleted.
Surely this is a bit of an unconventional recovery, but I did not find any other way; at least it frees up the cluster.
Is there a better way to recover, or otherwise get rid of the "running" jobs which are not actually running?
Thanks!
Volodymyr
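For reference, a sketch of how the stuck state described above could be spotted from the cluster side (pod and Job name prefixes taken from this report; exact names may differ per deployment):

# run batch pods stuck in a NotReady state
$ kubectl get pods | grep reana-run-batch-

# run job pods that should exist but no longer do
$ kubectl get pods | grep reana-run-job-

# Kubernetes Jobs that are still around even though their pods are gone
$ kubectl get jobs | grep reana-run-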