
Cannot always get the expected return values for jobs #201

Open
yutongz2019 opened this issue Dec 2, 2019 · 3 comments

Comments

@yutongz2019

First of all, thank you very much for your help with the other issues I opened for dispy.

Now I am faced with a problem where I cannot always get the expected return values for each job. In most cases the job executes as expected and returns the expected values, but in some scenarios it just returns None.

In the failing scenarios, part of the job's function waits for certain files to appear via a while True loop. If the file exists from the start, the job runs normally. However, if the file does not exist at first, then even though it appears within the given time constraint, the job seems to stop early and only returns None. (At the end of the function, some features, such as the timestamps of each step, are inserted into a database, and there is no record for the scenarios that fail to return the expected values.)
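A file wait like the one described above can be given an explicit timeout so the job returns a distinguishable error value instead of falling through. A minimal stdlib sketch (the wait_for_file helper, the compute function, and the 10-second timeout are illustrative assumptions, not dispy API):

```python
import os
import time

def wait_for_file(path, timeout=10.0, poll_interval=0.5):
    """Poll for `path`; return True if it appears within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    return False

def compute(input_path):
    # Returning an explicit marker makes "the input file never appeared"
    # distinguishable from a job whose result was lost in transit
    # (which would also show up on the client side as None).
    if not wait_for_file(input_path, timeout=10.0):
        return 'error: input file never appeared'
    return 'ok'
```

With a bounded wait the job always returns *something*, so a None result on the client side points at the transport or scheduler rather than the computation itself.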

When the above problem occurs, the jobs scheduled after these jobs can still execute normally. But there is another situation where a scheduled job hangs forever, the following jobs can no longer execute, and the whole system is stuck and cannot be closed via cluster.close(); I have to restart the whole system to get it working again. This situation can happen at any time, with no specific pattern that I can see.

I have tried setting the log level to debug to get more information about the first problem, but everything seems OK: each job has three different lines, one for the long job ID running, another for the short job ID execution, and the last for the reply received. So I have no idea what is happening. I would really appreciate it if you could suggest possible causes for both problems.

Thank you very much for your help and guidance!

@pgiri
Owner

pgiri commented Dec 2, 2019

Check the job status: if the job finished without errors, it will be dispy.DispyJob.Finished; otherwise the job should be considered failed (e.g., it was cancelled, or its execution raised an exception), in which case the job's stderr / exception attributes may have useful information.
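The check above can be sketched as follows. DispyJob here is a minimal stand-in with only the fields being inspected; in real code you would import dispy, submit jobs via a JobCluster, and compare against dispy.DispyJob.Finished (an integer constant) instead:

```python
class DispyJob:
    # Stand-in for dispy's DispyJob class; only the fields read below.
    Finished = 'Finished'  # in real dispy this is an integer constant

    def __init__(self, status, result=None, stderr=None, exception=None):
        self.status = status
        self.result = result
        self.stderr = stderr
        self.exception = exception

def report(job):
    """Return the job's result if it finished, else its failure details."""
    if job.status == DispyJob.Finished:
        return ('ok', job.result)
    # Any other terminal status: treat the job as failed; stderr and
    # exception may explain why (cancelled, raised an exception, etc.).
    return ('failed', job.stderr, job.exception)
```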

@yutongz2019
Author

Thank you for your reply!

For the stuck jobs, I have checked the job status, IP address, stderr, and exception. The job status is 5, and the other three values are all None.

From the manual log file saved on the nodes, I found that they finished all their computation tasks, but judging from the debug output on the node side, the node does not send back the result for the job. Could you please provide some more guidance on this problem?

Thanks!

@pgiri
Owner

pgiri commented Dec 10, 2019

If the status is 5, it means the job is still running. The job attributes mentioned above are valid only after the job is done (finished, terminated, etc.).
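One way to keep a job that never leaves the running state from hanging the whole pipeline is to wait with a deadline and give up (or cancel the job) when it passes. A stdlib sketch of the polling loop (wait_with_deadline and the is_done predicate are illustrative assumptions, not dispy API; with dispy you would check the job's status attribute, or block on the job / cluster wait calls):

```python
import time

def wait_with_deadline(is_done, timeout=30.0, poll_interval=0.5):
    """Poll is_done() until it returns True or `timeout` seconds pass.

    Returns True if the job completed in time, False if the deadline
    expired -- at which point the caller can cancel the job and move on
    instead of blocking forever.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(poll_interval)
    return False
```

Bounding every wait this way means a single stuck job surfaces as a timeout you can log and recover from, rather than a system that can only be fixed by a restart.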
