dispynode cannot shutdown #181
Also in dispynode.py, dispynode version: 4.10.5 |
In dispynode.py, dispynode version: 4.10.5 |
I fixed the shutdown by changing dispynode.py, line 730 to: |
Thanks! I will take a look at it over the weekend and commit your fix. |
Giridhar,
I have been code reviewing the dispy package to see if it is suitable for
our planned usage, and I am impressed! The package is extraordinary.
Did you code it yourself? If so, kudos!
I will soon be asking a few questions regarding the best way to discover
and map volatile dispynodes to particular dispyschedulers.
Thank you for such an extensive package.
Well done!
…-Moray King
On Wed, Mar 20, 2019 at 8:05 PM Giridhar Pemmasani ***@***.***> wrote:
Thanks! I will take a look at it over the weekend and commit your fix.
|
In my experiments, I get the following error in my code, dispynode2.py (with the shutdown fix): I noticed the function _dispy_job_func is unprepared for 'reply_Q' to be absent. |
Thanks for your comments!
Yes.
Sure. |
That shouldn't happen (i.e., reply_Q should be in job globals). Can you attach a sample (client) program that fails? |
'shutdown' should use 'addrinfo' attribute only if it is initialized. Fix for issue #181.
Giridhar,
I have not found a reliable trigger to manifest the bug. My experiments use
OKD Kubernetes pods, which act like separate machines with separate IP
addresses. I have one client pod running the sample compute, one
DispyScheduler pod, and one DispyNode pod. They use the default ports. I
start the dispynode with the following:
python3.6 dispynode2.py -d --cpus 10 --name Monode --scheduler_node=cqscheduler --clean --ping_interval=10 --ipv4_udp_multicast
The file dispynode2.py is your dispynode.py with the correction (as
discussed) on line 730. Typically, everything runs successfully.
The run that raised the KeyError on line 193 regarding the missing
'reply_Q' appeared to manifest because a "zombie" was present. There were
files under /tmp/dispy/node/. The bug cleared itself after an hour. Getting
into the bad state likely occurred when I was explicitly killing dispynode
via htop, which I had to do often prior to the fix on line 730.
I did notice the following lines pop the reply_Q off of the globals
dictionary:
Line 193: reply_Q = __dispy_job_globals.pop('reply_Q')
Line 239: globals().pop('reply_Q', None)
Line 1560: info.globals.pop('reply_Q', None)
Line 771 defines the reply_Q:
self.reply_Q = multiprocessing.Queue()
Line 1306 puts it into the client globals:
client.globals['reply_Q'] = self.reply_Q
Line 289 accesses it:
reply_Q = client_globals['reply_Q']
I am not sure of the architecture. Is there one reply_Q from the dispynode
to the dispyscheduler? Or are there many, one per running job?
…On Thu, Mar 21, 2019 at 8:01 PM Giridhar Pemmasani ***@***.***> wrote:
I noticed the function _dispy_job_func is unprepared for 'reply_Q' to be
absent.
That shouldn't happen (i.e., reply_Q should be in job globals). Can you
attach a sample (client) program that fails?
|
There is only one reply_Q at a dispynode (created in constructor). It is (copy of it) sent to each job in its globals. The job setup (before client computation) removes it and after the computation, the results are put in that queue. Thus, each job needs to have it available in globals (otherwise the results can't be sent back to dispynode queue server that then processes results). I don't know how killing might have affected this. Yesterday I committed improvements to job termination, along with the fix for addrinfo issue mentioned in the first post of this issue, that may help.
I am curious about your project. If it is okay, can you give some information? You can email me if you prefer. |
Giridhar,
Thank you for fixing the termination bug so quickly.
We are creating enterprise software for a large cluster where data has to
be sharded across a large number of nodes. We dispatch bots (typically as
python scripts) that must query and read data locally to package it for
final integration at the client side. Your dispy package looked great for
the messaging system. I am allowed to use the package as long as two staff
engineers understand the code well enough to repair it in case you become
unavailable to do so. I can explain more details about the project through
private email.
1. Continuing the architecture discussion: If multiple computations are
running concurrently on one dispynode, and they need to use the global
reply_Q simultaneously, how do you coordinate the usage?
2. I was testing the messaging between the dispyscheduler and the
dispynode. I start both and there are no client compute jobs. The scheduler
always successfully discovers the dispynode when the dispynode starts up.
However, it does not receive the termination signal when I exit the
dispynode. It remains in the scheduler's node list.
Workaround: I discovered that sending one compute job from a client
through the scheduler clears this signaling fault. Now the scheduler
successfully receives the termination signal from the dispynode as it
exits.
3. When I change the cpus on the dispynode, it fails to inform the
scheduler. I see the reason for this bug in the code, dispynode.py, line
730:
self.scheduler = {'ip_addr': dispy._node_ipaddr(scheduler_node) if scheduler_node else None,
                  'port': scheduler_port, 'auth': set(),
                  'addrinfo': dispy.host_addrinfo(host=scheduler_node,
                                                  ipv4_multicast=self.ipv4_udp_multicast) if scheduler_node else None}
'auth' stores a Python set object that is empty.
The 'if' statement at line 2529 has:

if self.scheduler['auth']:
    addrinfo = self.scheduler['addrinfo']
    sock = AsyncSocket(socket.socket(addrinfo.family, socket.SOCK_STREAM),
                       keyfile=self.keyfile, certfile=self.certfile)
    sock.settimeout(MsgTimeout)
    try:
        yield sock.connect((self.scheduler['ip_addr'], self.scheduler['port']))
        info = {'ip_addr': addrinfo.ext_ip_addr, 'sign': self.sign, 'cpus': cpus}
        yield sock.send_msg('NODE_CPUS:'.encode() + serialize(info))
The 'if' statement evaluates to False because the set object is empty. Thus
no message is sent to the scheduler.
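For illustration, a standalone toy (made-up values, nothing from dispy itself) showing why that branch is skipped until a scheduler has authenticated with the node:

# An empty set() is falsy in Python, so the NODE_CPUS branch above is
# never entered while no scheduler has claimed this node.
scheduler = {'ip_addr': '10.0.0.5', 'port': 51347, 'auth': set()}
if scheduler['auth']:
    print('NODE_CPUS message would be sent')
else:
    print('cpu change is silently dropped')   # this branch runs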
If we can clean up and stabilize the package, our enterprise project might
set a record for the largest data volume being serviced by the dispy
system. Your package is well architected, and you have shown enthusiasm to
make it great. Well done, Giridhar!
…On Mon, Mar 25, 2019 at 4:31 PM Giridhar Pemmasani ***@***.***> wrote:
There is only one reply_Q at a dispynode (created in constructor). It is
(copy of it) sent to each job in its globals. The job setup (before client
computation) removes it and after the computation, the results are put in
that queue. Thus, each job needs to have it available in globals (otherwise
the results can't be sent back to dispynode queue server that then
processes results). I don't know how killing might have affected this.
Yesterday I committed improvements to job termination, along with the fix
for addrinfo issue mentioned in the first post of this issue, that may
help.
I am curious about your project. If it is okay, can you give some
information? You can email me if you prefer.
|
Great!
If you are asking about corruption due to multiple jobs using the queue simultaneously, then I was not aware that putting elements needs to be protected with a lock (I thought putting elements is thread/process safe). However, I think fixing this with a lock may not be simple, as the job can be terminated while lock is held. I will address this after releasing 4.10.6 (there are already too many changes in this release), probably tomorrow.
When just scheduler and node discover each other, nodes don't inform scheduler(s). Only after a scheduler sends a computation, node is reserved for that particular scheduler (so nodes are efficiently used). Sending a job is not required, just the computation (i.e., client used JobCluster or SharedJobCluster).
This may be an issue. I will look into it after the release. |
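To make the queue discussion above concrete, here is a minimal standalone sketch (plain multiprocessing, nothing dispy-specific; all names are hypothetical) of several processes putting results into one shared queue:

import multiprocessing

def worker(job_id, reply_Q):
    # Each job puts a single result packet into the shared queue;
    # Queue.put is process-safe, though arrival order is interleaved.
    reply_Q.put({'job': job_id, 'result': job_id * job_id})

if __name__ == '__main__':
    reply_Q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(i, reply_Q))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    for _ in range(4):
        print(reply_Q.get())   # one packet per job, in completion order
|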
Giridhar,
Thank you for such a prompt reply.
1. I believe you are correct that the multiprocessing.Queue is thread safe
to put/get objects on the reply_Q. When concurrent jobs are running, the
queue objects can be intermixed from the various jobs. The replies will
have to be sent back to the scheduler associated with the job that put them
on the reply_Q. Do I have the correct idea in mind? Can a job put more than
one packet into the reply_Q?
2. I noticed that the dispynode.py code explicitly acquires and releases
the thread_lock. Example:
Line 2391: self.thread_lock.acquire()
Line 2397: self.thread_lock.release()
Line 2405: self.thread_lock.release()
Suggestion: Perhaps it would be better to use the 'with' statement to
acquire and automatically release locks (a minimal sketch follows below):
https://docs.python.org/3/library/threading.html#with-locks
Then you never have to worry about an underlying raised exception
abandoning an acquired lock (and you write less code).
3. We will need to send large python scripts instead of the compute
function. When I use the path string for the compute parameter, I noticed
that the client sends the script to the scheduler, which then sends it to the
node. Can I send the script file directly to the node with
cluster.send_file(path, node) and then invoke it to run directly on the
node? How do I get stack traces or print() output from the running script?
Thank you for helping in my understanding of the dispy system.
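A minimal, self-contained illustration of the lock suggestion above (the names here are hypothetical, not dispy's):

import threading

thread_lock = threading.Lock()
jobs = {}

def register_job(job_id, info):
    # 'with' releases the lock on normal exit and on exception alike,
    # so a raised exception cannot leave the lock held.
    with thread_lock:
        jobs[job_id] = info

register_job(1, {'state': 'running'})
print(jobs)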
…On Tue, Mar 26, 2019 at 6:47 PM Giridhar Pemmasani ***@***.***> wrote:
We are creating enterprise software for a large cluster where data has to
be sharded across a large number of nodes.
Great!
1. Continuing the architecture discussion: If multiple computations
are running concurrently on one dispynode, and they need to use the global
reply_Q simultaneously, how do you coordinate the usage?
If you are asking about corruption due to multiple jobs using the queue
simultaneously, then I was not aware that putting elements needs to be
protected with a lock (I thought putting elements is thread/process safe).
However, I think fixing this with a lock may not be simple, as the job can
be terminated while lock is held. I will address this after releasing
4.10.6 (there are already too many changes in this release), probably
tomorrow.
1. I was testing the messaging between the dispyscheduler and the
dispynode. Work around: I discovered that sending one compute job from a
client through the scheduler clears this signaling fault. Now the scheduler
successfully receives the termination signal from the dispynode as it exits.
When just scheduler and node discover each other, nodes don't inform
scheduler(s). Only after a scheduler sends a computation, node is reserved
for that particular scheduler (so nodes are efficiently used). Sending a
job is not required, just the computation (i.e., client used JobCluster
or SharedJobCluster).
1. When I change the cpus on the dispynode, it fails to inform the
scheduler.
This may be an issue. I will look into it after the release.
|
Yes, _dispy_job_func puts job result (only) in the queue (after the job is done). If computation is a program, __job_program also puts the result in the same queue.
I don't know if there is any overhead with context managers. I agree that using with may be cleaner, but if using it burns a few more cycles, especially in functions that execute often, it may be okay to forgo. Elsewhere with is used where it is convenient.
I am a bit confused about then why you are using dispyscheduler, which is meant for sharing nodes simultaneously by multiple clients. I am wondering if using JobCluster (which includes a scheduler) is better suited. And if you don't need to compute at all but need to only send scripts, pycos (https://pycos.sourceforge.io) is appropriate. With the netpycos module you can send files as and when needed.
To answer the above question about sending files with dispyscheduler, the client can't directly send files to nodes, because nodes are managed only by the scheduler. Different parts of dispy and pycos use authentication strings for communication (every message, except for broadcasting or otherwise harmless messages, includes authentication that only a valid peer knows). So with dispyscheduler, the client can only talk to dispyscheduler and only dispyscheduler can talk to nodes. However, if JobCluster is used, there is no external scheduler, so files can be sent directly to nodes.
If you describe in a bit more detail what you need, or if you attach a simple program that you use with dispyscheduler currently, I can suggest if the alternates suggested above are better. |
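For contrast with the shared-scheduler setup, a minimal JobCluster sketch (a plausible usage assuming a dispynode is reachable on the local network; the compute function is a placeholder):

import dispy

def compute(n):
    return n * n

if __name__ == '__main__':
    # JobCluster embeds the scheduler in the client process, so the
    # client talks to nodes directly (no external dispyscheduler).
    cluster = dispy.JobCluster(compute)
    jobs = [cluster.submit(i) for i in range(4)]
    for job in jobs:
        print(job())    # blocks until that job's result is available
    cluster.close()
|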
I am exploring how to run many jobs from different clients concurrently on
the same collection of dispynodes. Thus I am using the dispyscheduler. So
far the tests look promising.
When I use the compute() function, I am able to send input data files
directly to any chosen dispynode via cluster.send_file() and execute the
job there via cluster.submit_node(). The compute() function is able to
return output data files via dispy_send_file(). *Are sent data files routed
through the scheduler?* I am hoping not. The node object has the IP address
of the targeted dispynode, and I do not see any debug log messages on the
scheduler indicating transport of the data files. The compute() function
succeeds at file transport in both directions.
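As a sketch of that workflow, using only the calls named in this thread (send_file, submit_node, dispy_send_file); how the node object is obtained is elided, so treat 'node' as a placeholder:

import dispy

def compute(in_path):
    # Runs on the node: read the staged input, write a result file,
    # and ship it back with dispy_send_file (provided on the node).
    out_path = in_path + '.out'
    with open(in_path) as f, open(out_path, 'w') as g:
        g.write(f.read().upper())
    dispy_send_file(out_path)
    return out_path

if __name__ == '__main__':
    cluster = dispy.SharedJobCluster(compute, scheduler_node='cqscheduler')
    node = ...  # a DispyNode object the client has discovered
    cluster.send_file('input.dat', node)   # stage input on that node
    job = cluster.submit_node(node, 'input.dat')
    print(job())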
I then tried a similar experiment using a script instead of the compute()
function. (See attached script, motest.py; the client code is attached file
e6.py). Here I commented out the call to dispy_send_file() and the script
runs to completion. When I try to use dispy_send_file() in the script I get
the stack trace:
Traceback (most recent call last):
  File "/tmp/dispy/node/172.30.67.114/140518880070400_rw63magd/motest.py", line 13, in <module>
    dispy_send_file(fpath)
  File "/usr/local/lib/python3.6/site-packages/dispy/dispynode.py", line 134, in dispy_send_file
    sock = socket.socket(__dispy_sock_family, socket.SOCK_STREAM)
NameError: name '__dispy_sock_family' is not defined
This is understandable; the globals needed by the function are not defined.
I do need to execute a script that can send files back to the client. The
compute parameter on SharedJobCluster is allowed to be a path to a script.
*Idea:* What if I send in a small compute() function that reads a script
file and calls exec() on it? Is this a good approach for the dispy system?
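As a sketch of that idea (run_script is a hypothetical compute function, not dispy API):

def run_script(script_path):
    # Runs on the node: execute a previously staged script in a fresh
    # namespace; the script can leave its output in a known name.
    # Note: helpers such as dispy_send_file live in the compute
    # function's globals, so they could be passed in via 'env' if the
    # script needs them.
    env = {'__name__': '__main__'}
    with open(script_path) as f:
        code = compile(f.read(), script_path, 'exec')
    exec(code, env)
    return env.get('result')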
As your documentation recommends, I am trying to keep what gets pickled
through the scheduler to a minimum and using send_file() and
dispy_send_file() for the large data transport.
…On Wed, Mar 27, 2019 at 5:49 PM Giridhar Pemmasani ***@***.***> wrote:
The replies will have to be sent back to the scheduler associated with the
job that put them on the reply_Q. Do I have the correct idea in mind? Can a
job put more than one packet into the reply_Q?
Yes, _dispy_job_func puts job result (only) in the queue (after the job
is done). If computation is a program, __job_program also puts the result
in the same queue.
1. I noticed that the dispynode.py code explicitly acquires and
releases the thread_lock. Suggestion: Perhaps it would be better to use the
'with' statement to acquire and automatically release locks:
https://docs.python.org/3/library/threading.html#with-locks
I don't know if there is any overhead with context managers. I agree that
using with may be cleaner, but if using it burns few more cycles
especially functions that execute often, it may be okay to forgo. Elsewhere
with is used where it is convenient.
1. We will need to send large python scripts instead of the compute
function. When I use the path string for the compute parameter, I noticed
that client sends the script to the scheduler who then sends it to the
node. Can I send the script file directly to the node with
cluster.send_file(path, node) and then invoke it to run directly on the
node? How do I get stack traces or print() output from the running script?
I am a bit confused about then why you are using dispyscheduler, which is
meant for sharing nodes simultaneously by multiple clients. I am wondering
if using JobCluster (which includes a scheduler) is better suited. And if
you don't need to compute at all but need to only send scripts, pycos
<https://pycos.sourceforge.io> is appropriate. With netpycos module you
can send files as and when needed.
To answer above question about sending files with dispyscheduler, the
client can't directly send files to nodes, because nodes are managed only
by scheduler. Different parts of dispy and pycos use authentication strings
for communication (every message, except for broadcasting or otherwise
harmless messages, includes authentication that only valid peer knows). So
with dispyscheduler, client can only talk to dispyscheduler and only
dispyscheduler can talk to nodes. However, if JobCluster is used, there
is no external scheduler, so files can be sent directly to nodes.
If you describe in a bit more detail what you need or if you attach a
simple program that you use with dispyscheduler currently, I can suggest if
alternates suggested above are better.
|
If client calls
I don't see files attached in your message.
You can execute the script from within job computation, of course. |
#motest.py: dispy experiment
import dispy, socket, time |
#e6.py: OKD experiments
import dispy, socket, time
LOG_LEVEL = dispy.logger.DEBUG  # ERROR, WARNING, DEBUG
def get_nodes(cluster):
def show_job(job, rv):
if __name__ == '__main__': |
Sorry. The drag and drop would not accept the Python files. When I pasted in the contents, the indentation was lost. Do you have an email address to which I can send the files directly? |
My email is in |
When shutting down, node now broadcasts that it has terminated services, so any schedulers that have discovered it, but not yet using it, delete it. This addresses one of the issues mentioned in issue #181.
Dispynode cannot shutdown because of error:
2019-03-19 21:15:10 pycos - uncaught exception in !_shutdown/140335435848168:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 3671, in _schedule
    retval = task._generator.send(task._value)
  File "/usr/local/bin/dispynode.py", line 2424, in _shutdown
    sock = AsyncSocket(socket.socket(addrinfo.family, socket.SOCK_STREAM),
AttributeError: 'NoneType' object has no attribute 'family'
In function shutdown() in dispynode.py, line 2423,
addrinfo = self.scheduler['addrinfo']
is None (from the original initialization). Thus it should not be used to
get the socket family (via addrinfo.family). Instead, the default AF_INET
should be used whenever addrinfo is None.
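A standalone sketch of the suggested guard (toy values; not necessarily the committed fix):

import socket

# Stand-in for self.scheduler when no --scheduler_node was given:
scheduler = {'ip_addr': None, 'port': 51347, 'auth': set(), 'addrinfo': None}

addrinfo = scheduler['addrinfo']
# Guard before dereferencing, defaulting to IPv4 as suggested above:
family = addrinfo.family if addrinfo is not None else socket.AF_INET
print(family)   # socket.AF_INET instead of AttributeError on None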