BUG: database not updating after job is finished #912
Comments
It looks like a custodian error but I am not sure. There was this bug: materialsproject/custodian#340 |
In this example, custodian did indeed correct something. But in other calculations I tried, custodian didn't make any corrections and the database still didn't update, so it doesn't seem to be the same problem. I'm not quite sure where to look for indications of the error since the output looks normal. But if you have any ideas, I'm happy to try them! |
I had a similar issue with the job state update, but it only happened from time to time so that I lived with the |
Yes, that's also what my coworker experienced. Unfortunately, with my calculation it happens every time so detect_lostruns is not so helpful :/ |
What does the timing look like? Does the database insertion completely finish within the job's time limit, or is there only enough time to finish the VASP run but not the whole database insertion? This happened to me before. |
The vasp run usually finishes with plenty of time. The calculation takes something around 2h to finish but it has 12h available. |
@edansi, do the VASP calculation files get gzipped for the hanging calculation? |
no they don't get zipped. |
Ok, that would imply that the issue is not with database insertion, since the VASP job has not yet reached the gzipping step. I agree this could be an issue with custodian. Potentially it was not able to kill the VASP processes successfully. You could try writing a Python script to run custodian in a directory containing the INCAR, KPOINTS, POSCAR and POTCAR, and check whether it finishes successfully. E.g., essentially run the contents of this function: atomate2/src/atomate2/vasp/run.py Line 84 in 06e4a71
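A minimal sketch of such a standalone script, for illustration only; it is not the atomate2 function referenced above. It assumes custodian is installed, and the launch command (`srun vasp_std`), the choice of handlers and the helper names `missing_inputs` and `run_custodian` are all assumptions to adapt to your cluster.

```python
from pathlib import Path

# Input files that must be present before VASP can start.
REQUIRED_INPUTS = ("INCAR", "KPOINTS", "POSCAR", "POTCAR")


def missing_inputs(run_dir="."):
    """Return the names of required VASP input files absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED_INPUTS if not (run_dir / name).is_file()]


def run_custodian(vasp_cmd=("srun", "vasp_std"), max_errors=5):
    """Run a single VASP job under custodian in the current directory."""
    # Imported lazily so the input check above works without custodian installed.
    from custodian import Custodian
    from custodian.vasp.handlers import UnconvergedErrorHandler, VaspErrorHandler
    from custodian.vasp.jobs import VaspJob

    missing = missing_inputs(".")
    if missing:
        raise FileNotFoundError(f"Missing VASP inputs: {missing}")

    # Assumed launch command; replace it with whatever your cluster uses.
    job = VaspJob(vasp_cmd=list(vasp_cmd))
    # Assumed handler selection; atomate2 configures a longer list by default.
    handlers = [VaspErrorHandler(), UnconvergedErrorHandler()]
    Custodian(handlers, [job], max_errors=max_errors).run()
```

Calling `run_custodian()` from inside the calculation directory, within a normal batch job, should show whether the job script exits once VASP finishes or keeps hanging until the time limit.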
|
@utf I ran the calculation, and the VASP calculation finished. How can I see whether custodian killed the VASP processes correctly? I get all the output files, a custodian.json and a std_err.txt. After the VASP calculation finishes, the Slurm job continues running until the time limit. |
Hi @edansi, if you check your custodian.json file (or share it here), there should be a set of You can check which errors were caught and which corrective actions were taken like the following code snippet
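A minimal sketch of such a check (a reconstruction for illustration, not the snippet originally posted). It assumes custodian.json holds a JSON list of run records, each with a "corrections" list whose entries carry "handler", "errors" and "actions" keys; the function names are hypothetical and not part of custodian.

```python
import json


def load_custodian_log(path="custodian.json"):
    """Read the custodian log written in the calculation directory."""
    with open(path) as f:
        return json.load(f)


def summarize_corrections(runs):
    """Collect (handler, errors, actions) for every correction custodian logged."""
    summary = []
    for run in runs:
        for correction in run.get("corrections", []):
            handler = correction.get("handler")
            # Assumption: handlers are serialized as dicts with an "@class" key.
            if isinstance(handler, dict):
                handler = handler.get("@class", handler)
            summary.append(
                (handler, correction.get("errors", []), correction.get("actions", []))
            )
    return summary


def print_corrections(path="custodian.json"):
    """Print one line per correction, or a note if none were recorded."""
    summary = summarize_corrections(load_custodian_log(path))
    if not summary:
        print("No errors caught and no corrective actions taken.")
    for handler, errors, actions in summary:
        print(f"{handler}: errors={errors} actions={actions}")
```

An empty result means custodian recorded no errors and applied no corrections for any of the runs in the file.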
My guess is that |
@esoteric-ephemera thanks, I attached the custodian file and also the Python code to submit. The file has neither errors nor actions; doesn't this mean that no errors occurred?
|
Hey @edansi, yes, your custodian file indicates no errors were raised.
I'm confused about the "python script to submit" part: if you're adding jobs to your fireworks database, you want to launch them through fireworks. The code snippet you sent only runs a job with custodian, and doesn't handle any of the automated file writing, parsing, etc. It also looks like you were running with fireworks previously.
Are you submitting jobs to your job scheduler using the command line interface with
For debugging purposes, it might be better to completely eliminate the database insertion step / fireworks to see why the jobs aren't running. You can do that by manually submitting a job that runs this:
|
@esoteric-ephemera I misunderstood your comment before; my last answer was referring to what @utf was suggesting. Yes, in the custodian.json from my original example there's an action for the LargeSigmaHandler. But I don't think it's related to this: first, because the VASP output files look normal and finished, and second, because I have calculations where I don't get this error and it still doesn't work. I ran the job locally as you suggested with a DoubleRelaxMaker() and it worked; it also zipped everything. There were some custodian actions, but they didn't stop the job from finishing. Does this mean the problem lies somewhere else? This is the job.error file from the local run
and the corrections the custodian took in the local run are:
|
Great to hear and absolutely no worries. I suspect that the issue lies with your fireworks or jobflow I usually re-export all of the yaml config file environment variables,
This behavior is custodian working as intended, which is also a good sign |
@esoteric-ephemera setting the environment variables doesn't change anything. I realized that one of my calculations actually worked at some point, so I tried to figure out what was different. When I reran the local job while setting my k-points, I got an error from the monty package, which I solved by switching to a different version. For a short moment I thought the problem was solved, but it still doesn't fix it for all my calculations. So now I'm setting up a new environment from scratch to see if that helps. Do you have any other ideas for what I could try? |
It's hard to say what the issue is without more info. My guess is that your fireworks is fine, since the first screenshot you sent is at fw_id > 2000 (so it worked at some point), and that jobflow is the culprit. To test this, let's take atomate2 out of the equation and just use jobflow and fireworks:
Can you add this to your fw database and run it on hpc? (just on a debug or shared queue, I know it's a terrible use of compute) |
Thanks for helping me out, I really appreciate it :) I ran the code you suggested and it ends in a FIZZLED state with the following error message:
|
My bad, I forgot to mention that the function fireworks calls has to live in your
and ensure this file lives in your
Double checked that this approach works on my end. |
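The test described in the last few comments can be sketched roughly as follows. This is an illustrative reconstruction, not the code originally posted: the module name `my_test_jobs.py` and the function names are hypothetical, and it assumes jobflow and fireworks are installed and that `LaunchPad.auto_load()` can find your LaunchPad configuration.

```python
# Contents of a hypothetical module, e.g. my_test_jobs.py. It must be
# importable on both the submission host and the compute nodes, because
# fireworks looks the function up by module path when the job runs.


def add(a, b):
    """Trivial function whose completion should show up in the database."""
    return a + b


def submit_simple_flow():
    """Wrap add() in a jobflow job and add it to the FireWorks LaunchPad."""
    # Lazy imports: jobflow and fireworks are only needed when submitting.
    from fireworks import LaunchPad
    from jobflow import Flow, job
    from jobflow.managers.fireworks import flow_to_workflow

    simple_job = job(add)(1, 2)  # calling the wrapped function returns a Job
    flow = Flow([simple_job], name="simple_job test")

    # Convert the flow to a FireWorks workflow and add it to the database.
    lpad = LaunchPad.auto_load()
    lpad.add_wf(flow_to_workflow(flow))
```

If this job reaches COMPLETED when launched through the same queue setup as the VASP runs, the basic jobflow and fireworks plumbing is working and the problem is more likely specific to the VASP jobs.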
Thanks! So, if I run the simple_job, it finishes without problems and the job changes to COMPLETED in the database :/ The problem doesn't seem to be there either. When I opened the issue, the cluster where this problem occurred was the only HPC resource I could run on, but now I can finally run on the other cluster again. So it's also fine if we don't solve this problem. Sorry for the inconvenience. |
Describe the bug
When I'm running a double relaxation workflow with Vasp with SmNiO3, the first relaxation finishes but the database doesn't update the status to completed or start the second relaxation step. If I'm doing a static calculation on the same structure everything runs fine and the database updates to completed once it's finished. If I run the double relaxation on a simple BaTiO3 cubic unit cell (5 atoms) the double relaxation is updated in the database and the status changes to completed once it's finished.
There are no unusual errors in the fireworks out or error file, and the vasp.out file shows the calculation finished without problems.
I'm not sure whether the problem occurs at the atomate2 level, the fireworks level or somewhere else. But thanks already for your help!
To Reproduce
I'm using
and the GPU version of Vasp 6.3.0
I tried different settings to submit the calculations, this is the last one I tried and the shortest code. It's for a RelaxBandStructure workflow but it also fails at the initial relaxation.
here's the dict of the starting structure:
Expected behavior
I would expect the database to update to COMPLETED or FIZZLED once the vasp calculation finishes but it stays at RUNNING. It only updates once I use the command lpad detect_lostruns.
Screenshots
Fireworks out file
Fireworks error file