
Cannot always get the expected return values for jobs #201

Open
yutongz2019 opened this issue Dec 2, 2019 · 3 comments

Comments

@yutongz2019

First of all, thank you very much for your help with the other issues I opened for dispy.

Now I am faced with a problem where I cannot always get the expected return values for each job. In most cases the job executes as expected and returns the expected values, but in some scenarios it just returns None.

In the failing scenarios, part of the job's function waits for certain files to appear via a while True loop. If the file exists from the start, the job runs normally. However, if the file does not exist at first, then even though it appears within the given time constraint, the job seems to stop early and only returns None. (At the end of the function, some features, such as the timestamps of each step, are inserted into a database, and there is no record for the scenarios that fail to return the expected values.)
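A file wait like the one described above can be given an explicit timeout so the job returns a distinguishable error value instead of falling through. A minimal stdlib sketch (the wait_for_file helper, the compute function, and the 10-second timeout are illustrative assumptions, not dispy API):

```python
import os
import time

def wait_for_file(path, timeout=10.0, poll_interval=0.5):
    """Poll for `path`; return True if it appears within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    return False

def compute(input_path):
    # Returning an explicit marker makes "the input file never appeared"
    # distinguishable from a job whose result was lost in transit
    # (which would also show up on the client side as None).
    if not wait_for_file(input_path, timeout=10.0):
        return 'error: input file never appeared'
    return 'ok'
```

With a bounded wait the job always returns *something*, so a None result on the client side points at the transport or scheduler rather than the computation itself.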

When the above problem occurs, the jobs scheduled after these jobs can still execute normally. But there is another situation where a scheduled job hangs forever, the following jobs can no longer execute, and the whole system is stuck and cannot be closed via cluster.close(); I have to restart the whole system to get it working again. This situation can happen at any time, with no specific pattern that I can see.

I have tried setting the log level to debug to get more information about the first problem, but everything seems OK: each job has three different lines, one for the long job ID running, another for the short job ID execution, and the last for the reply received. So I have no idea what is happening. I would really appreciate it if you could suggest possible causes for both problems.

Thank you very much for your help and guidance!

@pgiri
Owner

pgiri commented Dec 2, 2019

Check the job status: if the job finished without errors, it will be dispy.DispyJob.Finished; otherwise the job should be considered failed (e.g., it was cancelled, or its execution raised an exception), in which case the job's stderr / exception attributes may have useful information.
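The check above can be sketched as follows. DispyJob here is a minimal stand-in with only the fields being inspected; in real code you would import dispy, submit jobs via a JobCluster, and compare against dispy.DispyJob.Finished (an integer constant) instead:

```python
class DispyJob:
    # Stand-in for dispy's DispyJob class; only the fields read below.
    Finished = 'Finished'  # in real dispy this is an integer constant

    def __init__(self, status, result=None, stderr=None, exception=None):
        self.status = status
        self.result = result
        self.stderr = stderr
        self.exception = exception

def report(job):
    """Return the job's result if it finished, else its failure details."""
    if job.status == DispyJob.Finished:
        return ('ok', job.result)
    # Any other terminal status: treat the job as failed; stderr and
    # exception may explain why (cancelled, raised an exception, etc.).
    return ('failed', job.stderr, job.exception)
```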

@yutongz2019
Author

Thank you for your reply!

For the stuck jobs, I have checked the job status, IP address, stderr, and exception. The job status is 5, and the other three values are all None.

From the manual log file saved on the nodes, I found that they finished all their computation tasks, but judging from the debug output on the node side, the node does not send back the result for the job. Could you please provide some more guidance on this problem?

Thanks!

@pgiri
Owner

pgiri commented Dec 10, 2019

If the status is 5, it means the job is still running. The job attributes mentioned above are valid only after the job is done (finished, terminated, etc.).
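One way to keep a job that never leaves the running state from hanging the whole pipeline is to wait with a deadline and give up (or cancel the job) when it passes. A stdlib sketch of the polling loop (wait_with_deadline and the is_done predicate are illustrative assumptions, not dispy API; with dispy you would check the job's status attribute, or block on the job / cluster wait calls):

```python
import time

def wait_with_deadline(is_done, timeout=30.0, poll_interval=0.5):
    """Poll is_done() until it returns True or `timeout` seconds pass.

    Returns True if the job completed in time, False if the deadline
    expired -- at which point the caller can cancel the job and move on
    instead of blocking forever.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(poll_interval)
    return False
```

Bounding every wait this way means a single stuck job surfaces as a timeout you can log and recover from, rather than a system that can only be fixed by a restart.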
