Check status for simulation code in jf job list #227

Open

FabiPi3 opened this issue Dec 8, 2024 · 5 comments · May be fixed by #228
@FabiPi3 (Collaborator) commented Dec 8, 2024

This is basically a feature request, which I guess is not so easy to implement. But maybe someone knows an easier way to achieve what I want.

Here is the story: I mainly run a simulation code with jobflow-remote, something like VASP or Abinit. Usually a single calculation, let's say with VASP, corresponds to one jobflow job. Now, looking at the result with jf job info or jf job list, I see that the job is COMPLETED. This is good, of course, but it only means that there was no Python error. I catch any error in the simulation code via Python to ensure appropriate error handling and a proper entry in the output database. What I would also like to see directly is a hint whether the actual calculation from the simulation code was successful or not.

So one question would be how to pass this information to jobflow(-remote), since it is very code-specific information. I could imagine creating a special file success.out at runtime which contains a status string or whatever. Another option might be to somehow return a Response object with a new field simulation_success, but this might already be too specific. Then jobflow-remote would need to parse it and include it in the JobDoc, which is quite some overhead I guess.

Another option would be to generate this info only locally while running the jf job list command. But this again requires a very specific format to be kept. In principle I could provide a check function for my simulation code which takes the run_dir and the stdout/stderr and determines the run success (a hypothetical sketch follows below). But as it needs to read different files depending on rather complex logic, one would need to download all those files, which is also not a very good option I guess.
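To make this concrete, here is a hypothetical sketch of such a check function; the name check_success, the OUTCAR file, and the convergence string are VASP-flavoured assumptions, not an existing interface:

```python
from pathlib import Path


def check_success(run_dir: Path, stdout: str, stderr: str) -> bool:
    """Hypothetical per-code check: inspect the run directory to decide
    whether the underlying simulation actually converged."""
    outcar = run_dir / "OUTCAR"  # VASP-specific output file
    if not outcar.exists():
        return False
    # convergence marker VASP prints at the end of a successful relaxation
    return "reached required accuracy" in outcar.read_text()
```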

Any opinion?

FabiPi3 changed the title from "Check status for simulation code in JobDoc" to "Check status for simulation code in jf job list" on Dec 8, 2024
@gpetretto (Contributor)

Hi @FabiPi3,

In general, the idea of jobflow is that if a Job finishes correctly it is marked as COMPLETED so that the workflow can continue. On the other hand, if the calculation is not successful, the Job should be marked as FAILED (and usually the workflow stopped). Typically, if the simulation fails, a Python exception is deliberately raised. So FAILED Jobs usually do not just mean that a Python error occurred, but rather that for some reason the task to be performed in the Job was not completed. As an example, atomate2 Jobs behave in this way: if validation of the output fails in custodian, the exception is allowed to propagate and is handled by the manager.
From this point of view, I feel that any additional field like simulation_success would be redundant.
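
For illustration, this is roughly what that convention looks like inside a job; run_code and its converged attribute are hypothetical placeholders:

```python
from jobflow import job


@job
def run_simulation(settings):
    output = run_code(settings)  # hypothetical runner for VASP/Abinit
    if not output.converged:
        # an uncaught exception makes jobflow mark this Job as FAILED
        raise RuntimeError("simulation did not converge")
    return output
```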

If your purpose is to identify Jobs that did not complete correctly here are a few options that you can consider:

  • If your calculation is not successful, let an exception propagate (or raise one yourself), so that the Job will be FAILED.
  • If having a FAILED Job is problematic because you still need the Flow to continue even when the calculation did not succeed, you can consider setting on_missing_references to allow failed parents: https://github.com/materialsproject/jobflow/blob/29ff899fa3ddbebe88e9725c2cd5d54d7d1a40c5/src/jobflow/core/job.py#L64
  • If you need to add some information to the Job document, the Response has a stored_data attribute. If you set it in the response, it will be saved in JobDoc.stored_data when the Job is completed (see the sketch after this list).
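
Here is a minimal sketch combining the last two options; my_runner and the simulation_success key are illustrative assumptions, not an existing API:

```python
from jobflow import Flow, OnMissing, Response, job


@job
def run_simulation(settings):
    result, converged = my_runner(settings)  # hypothetical helper
    # stored_data is persisted to JobDoc.stored_data once the Job completes
    return Response(output=result, stored_data={"simulation_success": converged})


@job
def analyze(result):
    return result  # placeholder post-processing


sim = run_simulation({"encut": 520})
post = analyze(sim.output)
# let the analysis Job run even if the simulation Job ended up FAILED;
# its missing reference is then resolved to None
post.config.on_missing_references = OnMissing.NONE
flow = Flow([sim, post])
```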

Although none of these options will let you tell at a glance what exactly went wrong with the job just by running jf job list, do you think this would help solve your issue?

FabiPi3 linked a pull request (Dec 10, 2024) that will close this issue
@FabiPi3 (Collaborator, Author) commented Dec 10, 2024

Thanks for your answer, @gpetretto.

I have experienced that if the job fails with a raised error, there will be no entry in the output database. I guess there is no way to change this? It wouldn't make much sense to have a raised error which also returns some data, would it?

In general I want to keep my internal error handling so that I get consistent entries in the output database. Sometimes a failed calculation also gives you information you want.

I tested the stored_data attribute, and it looks quite promising. I quickly implemented an option to also show these with jf job list; maybe that would be a possibility? See #228.

@FabiPi3 (Collaborator, Author) commented Dec 17, 2024

Another small point: I was wondering whether this stored_data is actually used anywhere in jobflow or jobflow-remote; I couldn't really find anything. On the other hand, the annotation in the Response class says dict[Hashable, Any], which would mean any value is allowed. In practice I got a weird error message when trying to return an enum value, see the screenshot:

[Screenshot (2024-12-17): stack trace of the serialization error]

A string worked fine, so I am not sure where the issue lies and what it is related to. Should I open a new issue here?

@gpetretto (Contributor)

From the stack trace, this looks more like an issue in monty, in how it deals with Enum. If I remember correctly, you already made some changes to that part. Can you try just dumping a simple dictionary with an enum like in your case and check whether you get the same error? Maybe a call to jsanitize is needed on the stored_data, or on the whole response.
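
For example, something along these lines should show whether the enum alone triggers it; this assumes a reasonably recent monty where jsanitize accepts the enum_values flag:

```python
from enum import Enum

from monty.json import jsanitize


class SimStatus(Enum):
    SUCCESS = "success"


# with enum_values=True each Enum member is replaced by its plain value,
# yielding a dict that serializes cleanly
clean = jsanitize({"simulation_status": SimStatus.SUCCESS}, enum_values=True)
print(clean)  # {'simulation_status': 'success'}
```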

@FabiPi3 (Collaborator, Author) commented Dec 18, 2024

Yes, I did. After some searching, I think the issue comes from jobflow. I am subclassing their ValueEnum; see here:

https://github.com/materialsproject/jobflow/blob/ba0db5a4ae1554077114183c85c362371d3de94b/src/jobflow/utils/enum.py#L25

They apparently have an as_dict() method which returns a str, and this of course leads to an error in monty. Do you think I could get rid of this method?
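
A small reproduction of what I mean (SimStatus is just an example subclass):

```python
from jobflow.utils.enum import ValueEnum


class SimStatus(ValueEnum):
    SUCCESS = "success"


# as_dict() returns a plain string instead of a dict, which breaks the
# round-trip that monty's serialization machinery expects from as_dict()
print(SimStatus.SUCCESS.as_dict())  # -> "success"
```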
