Allow daemons to exit immediately when the connection terminates? #87
The more I reflect on leftover daemons, the more trouble I anticipate for a typical user. To ensure they get the correct data, users need to manually clean up dangling daemons before starting new ones. Few people know how to do this, and although I am trying to help via wlandau/crew.aws.batch#2, none of my own workarounds can be fully automated.
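For local daemons, part of that manual cleanup can be scripted. A minimal sketch, assuming the daemon PIDs were recorded at launch (the PIDs below are placeholders), using base R's `tools::pskill()`:

```r
# Hypothetical: PIDs recorded when the daemons were launched.
daemon_pids <- c(12345L, 12346L)

# Ask politely first, then force-kill anything still alive after a grace period.
for (pid in daemon_pids) tools::pskill(pid, tools::SIGTERM)
Sys.sleep(5)
for (pid in daemon_pids) tools::pskill(pid, tools::SIGKILL)
```

Nothing analogous is available in general for daemons on remote machines or cloud services, which is part of why these workarounds cannot be fully automated.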
It might be possible to allow this through an argument supplied when the daemon is launched.
That would be such a huge help as far as safety is concerned!
Are you referring to the clause from https://cran.r-project.org/doc/manuals/R-exts.html#Writing-portable-packages?
I think the concern in that clause applies less to this situation. Also, when they say “the user’s R process”, I think they envision a single interactive local R session with a backlog of unsaved work from multiple projects. By contrast, a daemon is external and limited in scope.
Yes, I suppose not every block of C code can check for user interrupts.
Thanks for the link, that’s really quite interesting! It led me to do more thinking.

First, I think you are familiar with how pipe events work, but just to re-cap: we are not actually sending a signal to tell the daemons to exit (this would not be universally reliable for any type of remote connection); rather, we set things up so that the daemon raises the signal on itself when the connection is broken.

With this in mind, and with the open architecture design of `mirai`, at the end of the day we are talking about the length of an individual task.
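For illustration, here is a minimal daemon-side sketch of that kind of setup, written against `nanonext` directly rather than showing `mirai`'s actual internals. The address and socket protocol are placeholders, and it assumes a `nanonext` version in which `pipe_notify()` accepts a signal constant for its `flag` argument:

```r
library(nanonext)

# Daemon side: dial back to the host process (placeholder address).
sock <- socket("rep", dial = "tcp://127.0.0.1:5555")

# Condition variable to receive pipe events for this socket.
cv <- cv()

# Watch for pipe removal (the connection breaking) and raise SIGINT on
# *this* process when it happens -- no signal is sent over the network.
pipe_notify(sock, cv = cv, remove = TRUE, flag = tools::SIGINT)
```

The key point is in the last comment: the daemon arms itself, so the mechanism behaves the same for local and remote connections.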
I see what you mean. SIGINT allows cleanup to happen and ensures that output storage is not left in a corrupted or unusable state. In that sense, SIGINT is a more graceful and robust solution (see the sketch at the end of this comment). On the other hand, I do not think an abrupt SIGKILL is really such a disaster. If a data file is corrupted, all one needs to do is rerun the original code that produces it.

In contrast, runaway processes are truly catastrophic. On the cloud, they could rapidly burn through tens of thousands of dollars and tank the cloud budgets of entire departments (mine included). The probability of disaster is not small: poorly-written C/C++ code is prevalent in statistical modeling packages, and most of my colleagues use those packages every day.

I work with hundreds of clinical statisticians, and for almost everyone, their skill and awareness do not extend beyond the local R interpreter and the RStudio environment. On top of that, we are trying to move to a cloud platform with workers on AWS Batch. Last Friday I implemented job management utilities for those workers, and even that kind of manual tooling only goes so far.

I saw your work at shikokuchuo/nanonext#25, and I am super grateful. SIGINT does get us most of the way there. But given the differences in our use cases and priorities, I wonder if it would be possible to let the user choose the signal type and make SIGINT the default.
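Coming back to the cleanup point at the top of this comment, a small sketch of the difference (a hypothetical task, not anything from `targets`): cleanup registered with `on.exit()` still runs when a task is interrupted by SIGINT, but never when the process is killed outright with SIGKILL.

```r
task <- function(log_path = tempfile(fileext = ".rds")) {
  # Runs on normal completion and on interrupt (SIGINT), but not on SIGKILL.
  on.exit(saveRDS(Sys.time(), log_path), add = TRUE)
  Sys.sleep(60)  # stand-in for a long computation
  log_path
}
```

Interrupting `task()` with Ctrl-C still writes the log file; SIGKILL leaves nothing behind.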
Note that in my last reply, I said that this works for the intended purpose.
To make it absolutely clear, the intended purpose here is to stop mirai evaluation, which addresses your initial ask.
If you are concerned about expenses on cloud servers or HPC resources not being released, then the primary tool for addressing failure must be an external monitoring solution, using a method that is officially sanctioned via the cluster manager, an API call, etc. Anything implemented here may help to minimise the chances of such situations as you identify, but it cannot and should not be thought of as a failsafe, as there are many other reasons that processes may hang.

Having said that, I think your last idea is a good one. SIGINT isn't the only legitimate signal that can be sent. A sophisticated consumer may install custom handlers to be able to differentiate pipe additions or removals from a user interrupt (to pause actions until a reconnect, for example). In such a case, it would be up to the user to supply a signal to send, and there would not be a default. I need time to think on the implementation, but leave this with me.
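For flavour, a rough consumer-side sketch (not the eventual implementation) of what acting on an interrupt with a custom policy could look like in plain R; the pause-until-reconnect behaviour is only hinted at with a message and is entirely hypothetical:

```r
run_with_policy <- function(expr) {
  tryCatch(
    expr,
    interrupt = function(cond) {
      # Hypothetical policy: do not exit; log and wait for a reconnect instead.
      message("interrupt received; pausing until reconnect")
      invisible(NULL)
    }
  )
}

# Interrupting this call triggers the handler instead of ending the session.
run_with_policy(Sys.sleep(60))
```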
This capability has now been implemented.
@shikokuchuo, I can't thank you enough! Using only the tools available on the host platform, it is almost impossible to automatically control runaway processes in the general case, especially on the cloud. Every bit of help from your packages counts. I tested this functionality and built it into the development version of `crew`.
You're welcome!
An update: as I was testing just now, I found out that…
It appears…
If a daemon is running a long task and `daemons(n = 0L)` is called, the daemon continues to run the task and does not exit until the task is finished. A reprex is below. As you explained in #86, this behavior is desirable in many situations: e.g. if a task has intermediate checkpoints and needs a chance to log them. In `targets` pipelines, however, the extra execution time is not useful because the metadata cannot be logged if the central R process is interrupted. In addition, I worry about mounting costs on expensive cloud services. So I think it would be useful to be able to configure the daemons to exit as soon as the TCP connections are broken. cf. wlandau/crew#141
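A minimal sketch of the behaviour described above, assuming a single local daemon and using a 60-second sleep as a stand-in for the long task:

```r
library(mirai)

daemons(n = 1L)             # launch one local daemon
m <- mirai(Sys.sleep(60))   # hand the daemon a long-running task
daemons(n = 0L)             # reset while the task is still running

# The daemon process typically keeps running until the 60-second task
# finishes, rather than exiting as soon as the connection is dropped.
```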