Implementation of SGE interface #43
Conversation
Hi @ml-evs 😃 Because I relied a lot on ChatGPT, I would be really glad about some feedback! 😁 And as none of our clusters use SGE, I wanted to at least check whether the code produces the correct submission script before I pass it on for testing. When I make a submission test using:

```python
from jobflow_remote.utils.examples import add
from jobflow_remote import submit_flow
from jobflow import Flow

job1 = add(1, 2)
job2 = add(job1.output, 2)
flow = Flow([job1, job2])

resources = {
    "cwd": "",
    "model": "smp 128",
    "job_name": "Autoplex_jf_test",
    "qout_path": "$JOB_ID.log",
    "qerr_path": "$JOB_ID.err",
    "device": "cpu",
    "soft_walltime": "05:00:00",
}

print(submit_flow(flow, worker="auto_worker", resources=resources, project="auto"))
```

I run into the following problem:
Thank you a lot for your help and feedback!
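As an aside, for eyeballing whether a generated script looks sane: below is a minimal, hypothetical sketch of how a few of the resource keys from the example above could map onto SGE `#$` directives. This is not qtoolkit's actual template; the mapping (and the function name) are assumptions for illustration only.

```python
# Hypothetical sketch: map a few resource keys onto SGE "#$" directives.
# The key names follow the example above; the mapping itself is an
# assumption, not qtoolkit's actual template.
def sge_header(resources: dict) -> str:
    lines = ["#!/bin/bash", "#$ -cwd"]
    if "job_name" in resources:
        lines.append(f"#$ -N {resources['job_name']}")
    if "qout_path" in resources:
        lines.append(f"#$ -o {resources['qout_path']}")
    if "qerr_path" in resources:
        lines.append(f"#$ -e {resources['qerr_path']}")
    if "model" in resources:  # e.g. "smp 128" -> parallel environment request
        lines.append(f"#$ -pe {resources['model']}")
    if "soft_walltime" in resources:  # s_rt is SGE's soft run-time limit
        lines.append(f"#$ -l s_rt={resources['soft_walltime']}")
    return "\n".join(lines)
```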
Hi @QuantumChemist, nice work! I'll try to find the time to give some feedback before the end of the week. Thanks!
Fantastic work @QuantumChemist! Sorry for the delay giving you some feedback. We'll have to wait for @gpetretto to have a proper look I think but this is already in great shape.
Only one request (and a few minor comments below): could you add a test for submission script generation, similar to what I did for slurm in #45? Ideally this would use the same "maximalist" qresources but will probably support different features.
I guess (or hope!) you are adding this because you would like to use jobflow-remote with SGE, are you able to test this version of QToolkit with your current jobflow setup or do we need some patches in jobflow remote too? If you are, I can carry on the work I started to test hopefully both PBS and SGE directly inside the jobflow-remote CI.
src/qtoolkit/io/sge.py (outdated):
```python
supported += [
    "njobs",
    "memory_per_thread",
    "time_limit",
    "processes",
    "processes_per_node",
    "process_placement",
    "nodes",
    "threads_per_process",
    "email_address",
    "scheduler_kwargs",
    "gpus_per_job",
]
```
Not a problem for this PR, but since #44 I've been wondering if we can programmatically enforce that the fields listed here are definitely QResources members, e.g., by specifying an abstract reference to QResources.njobs and so on, rather than unchecked strings.
I'm happy to implement it that way :)
Good point! I agree that it can be for another PR, since all the objects should be adapted.
@ml-evs @gpetretto so shall I rather leave this for another PR?
I added this now :)
Sorry, I am not sure, was something changed here?
Sorry, that was a bit confusing. I clicked through the PRs and have added this for now: https://github.com/QuantumChemist/qtoolkit/blob/573641ec772508c71f1f02101902976bf0e1ab80/tests/io/test_sge.py#L240
Hey @ml-evs! 😃 So far, I couldn't test it with jobflow-remote, although I put sge. I also have a few days off and will try to make the changes when I'm back!
Co-authored-by: Matthew Evans <[email protected]>
Thanks a lot for adding SGE support.
I have left some comments below that might need some updates.
The main issue is that if SGE really does not allow querying jobs by a list of ids, we will need to update jobflow-remote, since it currently expects to be able to use that functionality. If you can confirm that this is the case, I will start thinking about how to adapt it. It will either ask for all the jobs (which would likely be inefficient), or I would need to add an option in jobflow-remote to query the jobs based on the user name.
Unfortunately I don't have access to a cluster with SGE, so I will not be able to run tests easily. @ml-evs do you have a docker image with SGE?
Concerning your initial error in jobflow-remote, it is not obvious what could go wrong. I would first check whether it is really using the SGEIO object.
Hi @gpetretto.
It's taking a bit longer than expected to generalise the test docker container in jobflow-remote for other queues, but I'll keep you posted here too.
Thanks both.
Sorry @gpetretto, this message kind of slipped my attention and I have just seen it now! I will ask someone to try this! Thank you!
Hi @gpetretto 👋🏻 so I asked someone whether it's possible to query with a list of job-IDs, and it's not directly possible, only with a bash script as a workaround.
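For concreteness, the kind of client-side workaround being discussed could look roughly like the sketch below: list all of the user's jobs and filter on the job-id column afterwards. The function name, the exact qstat column layout, and the two-line header are assumptions; real qstat output formatting may differ.

```python
# Hedged sketch of the workaround: since (as discussed above) SGE's qstat
# apparently cannot be given a list of job ids directly, take the full
# per-user listing and filter client-side on the first (job-ID) column.
def filter_jobs_by_ids(qstat_output: str, job_ids) -> list[str]:
    """Keep qstat lines whose first column is one of job_ids."""
    wanted = {str(j) for j in job_ids}
    rows = qstat_output.strip().splitlines()[2:]  # skip the two header lines
    return [row for row in rows if row.split() and row.split()[0] in wanted]
```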
Thanks @QuantumChemist.
Thank you a lot! :)
… a FINISHED job state and added another link for SGE documentation
@gpetretto I have now implemented the last changes 😃
Hi @QuantumChemist, thanks for the updates. Still, I have seen that you have moved the
If it is not a problem with the configuration file (i.e. you can run the
Hi @gpetretto! Sorry for confusing those two things (I'm still learning a lot about Python object-oriented programming).
Thanks @QuantumChemist. It should be fine now.
Thank you, too!
I might check more closely with my colleague whether there is a problem with output writing.
Thanks, there are almost certainly still problems at my end, but it would be good to verify that it's working for you before I dive into debugging again!
I will let you know when I have checked this with my colleague. I didn't have the time yet to continue this with her.
Hi @gpetretto and @ml-evs 😃, I have some good news! The strange thing about this is that the VASP job is finishing fine and there is also no output in queue.err. There is also no other file that would indicate a segmentation fault etc. when executing VASP. We were not able to find out why exactly the job was killed. All the jobflow-remote related json files look perfectly fine from what I can tell.
Thanks for this, I might be seeing the same issue in my testing PR. Let me push a bit further with your latest changes and I'll try to get back to you!
Alright 👍🏻
Hi @QuantumChemist, thanks for testing the new implementation.
Hi @gpetretto, thank you for your feedback! I will check the things you pointed out together with my colleague!
Hi @gpetretto,
From what I have seen when we tested it together, it looked to me like there was some delay in writing the files. I can't say whether this comes from SGE or the cluster Epsilon. Do you have an idea how to handle the delay?
Hi @QuantumChemist, thanks a lot for the detailed reply.
If that is the problem, I can propose 2 solutions:
I would ask you to try first with option 1. This should confirm that the issue is indeed the FS delay and probably give you a rough estimate of the waiting time required. If this is a real use case, we can then consider implementing the option. Would that be fine? In any case, this definitely seems to be a jobflow-remote issue and not related to your SGE implementation. If you confirm that the problem is the delay, I will open an issue there to discuss it further. This should not be a reason to hold off merging this PR.
Hi @gpetretto, sorry for the late reply! My colleague has tried to add sleep 120 as a post_run step in the jfr configuration file as you suggested, and with that she was able to run atomate2 and also autoplex workflows. For some jobs she had to do
Therefore my feeling is that my implementation here actually works as intended already. We would be happy about a
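A possible refinement of the fixed `sleep 120` approach discussed above: poll for the expected output file with an upper time bound, so fast filesystems do not pay the full delay. Everything below is an illustrative sketch, not jobflow-remote API; the function name and defaults are assumptions.

```python
# Hedged sketch: instead of a fixed sleep, wait for an expected file to
# appear, up to a timeout. Illustrative only; not part of jobflow-remote.
import time
from pathlib import Path

def wait_for_file(path, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Return True once `path` exists, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if Path(path).exists():
            return True
        time.sleep(interval)
    return Path(path).exists()
```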
Sorry @gpetretto and @QuantumChemist, I missed this at the time and have been travelling without a chance to check the tests. If things are working in reality now, shall we merge this with the expectation that there might be some additional tweaks contributed back here? We can maybe delay a qtoolkit release until I can test it properly in JFR. More generally, there is the question of where these tests should run, but we can come back to that!
Thanks both for the updates. My plan was actually to merge this, implement one more update in QTK that I was holding off to avoid the trouble of merging with these changes, and make a new release of QTK. I would probably proceed with the merging at this point. @ml-evs I will wait for an update from your side before releasing a new version.
👍
Thank you for merging! :D
Will close Matgenix/jobflow-remote#159 in jobflow-remote.