
Implementation of SGE interface #43

Merged · 37 commits · Nov 15, 2024

Conversation

@QuantumChemist (Contributor)

Will close Matgenix/jobflow-remote#159 in jobflow-remote.

@QuantumChemist QuantumChemist marked this pull request as ready for review August 7, 2024 15:33
@QuantumChemist (Contributor Author) commented Aug 7, 2024

Hi @ml-evs 😃
I have finished the first version of the implementation, but it's very hacky because I relied a lot on ChatGPT, so I would be really glad about some feedback! 😁

And as none of our clusters use SGE, I wanted to at least check that the code produces the correct submission script before I pass it on for testing.
So I wanted to ask: is there something else I have to adjust in jobflow-remote?

Because when I make a submission test using:

from jobflow_remote.utils.examples import add
from jobflow_remote import submit_flow
from jobflow import Flow

job1 = add(1, 2)
job2 = add(job1.output, 2)

flow = Flow([job1, job2])

resources = {
    "cwd": "",
    "model": "smp 128",
    "job_name": "Autoplex_jf_test",
    "qout_path": "$JOB_ID.log",
    "qerr_path": "$JOB_ID.err",
    "device": "cpu",
    "soft_walltime": "05:00:00",
}

print(submit_flow(flow, worker="auto_worker", resources=resources, project="auto"))

I run into the following problem:

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│     created_on = '2024-08-07 16:45'                                                                                                                        │
│          db_id = '2632'                                                                                                                                    │
│          index = 1                                                                                                                                         │
│       metadata = {}                                                                                                                                        │
│           name = 'add'                                                                                                                                     │
│        parents = []                                                                                                                                        │
│ previous_state = 'UPLOADED'                                                                                                                                │
│       priority = 0                                                                                                                                         │
│         remote = {                                                                                                                                         │
│                      'step_attempts': 3,                                                                                                                   │
│                      'retry_time_limit': '2024-08-07 17:11',                                                                                               │
│                      'error': Traceback (most recent call last):                                                                                           │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/jobs/jobcontroller.py", line 3333, in             │
│                  lock_job_for_update                                                                                                                       │
│                      **kwargs,                                                                                                                             │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/jobs/runner.py", line 538, in advance_state       │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/jobs/runner.py", line 685, in submit              │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/remote/queue.py", line 155, in submit             │
│                      options=options,                                                                                                                      │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/remote/queue.py", line 186, in                    │
│                  write_submission_script                                                                                                                   │
│                      options=options,                                                                                                                      │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/remote/queue.py", line 104, in                    │
│                  get_submission_script                                                                                                                     │
│                      return ""                                                                                                                             │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/qtoolkit/io/base.py", line 57, in get_submission_script          │
│                      if header := self.generate_header(options):                                                                                           │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/qtoolkit/io/base.py", line 88, in generate_header                │
│                      raise ValueError(msg)                                                                                                                 │
│                  ValueError: The following keys are not present in the template: cwd, device, model, soft_walltime                                              │
│                                                                                                                                                            │
│                  }                                                                                                                                         │
│        run_dir = '/home/certural/Calcs_new/67/c5/8f/67c58fcb-d847-4fa1-8d29-9262ce0efa92_1'                                                                │
│          state = 'REMOTE_ERROR'                                                                                                                            │
│     updated_on = '2024-08-07 17:11'                                                                                                                        │
│           uuid = '67c58fcb-d847-4fa1-8d29-9262ce0efa92'                                                                                                    │
│         worker = 'bam_worker'                                                                                                                              │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Thank you a lot for your help and feedback!
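For reference, the kind of header an SGE interface would be expected to emit for the resources above looks roughly like this. This is a hand-written sketch using standard SGE qsub directives, not the actual output of this PR's code; the exact mapping from each resource key to a directive is the implementation's choice:

#!/bin/bash
#$ -cwd                 # run in the current working directory ("cwd" above)
#$ -N Autoplex_jf_test  # job name
#$ -pe smp 128          # parallel environment and slot count ("model" above)
#$ -o $JOB_ID.log       # stdout path
#$ -e $JOB_ID.err       # stderr path
#$ -l s_rt=05:00:00     # soft wall-time limit ("soft_walltime" above)

The ValueError above says exactly this mapping is failing: the keys cwd, device, model, and soft_walltime are not present in the header template that generate_header fills in.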

@ml-evs (Member) commented Aug 7, 2024

Hi @QuantumChemist, nice work! I'll try to find the time to give some feedback before the end of the week. Thanks!

@ml-evs ml-evs self-requested a review August 15, 2024 22:58
@ml-evs (Member) left a comment

Fantastic work @QuantumChemist! Sorry for the delay in giving you some feedback. We'll have to wait for @gpetretto to have a proper look, I think, but this is already in great shape.

Only one request (and a few minor comments below): could you add a test for submission script generation, similar to what I did for SLURM in #45? Ideally this would use the same "maximalist" QResources, but it will probably support different features.

I guess (or hope!) you are adding this because you would like to use jobflow-remote with SGE. Are you able to test this version of QToolkit with your current jobflow setup, or do we need some patches in jobflow-remote too? If you are, I can carry on the work I started to hopefully test both PBS and SGE directly inside the jobflow-remote CI.
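A minimal sketch of what such a test could look like, assuming the SGEIO class added in this PR, QResources fields taken from the supported list quoted below, and the get_submission_script entry point visible in the traceback above; the asserted directives are assumptions about the generated header, not confirmed output:

from qtoolkit.core.data_objects import QResources
from qtoolkit.io.sge import SGEIO

def test_get_submission_script():
    sge_io = SGEIO()
    resources = QResources(
        processes=8,                          # field from the supported list
        time_limit=3600,                      # ditto; in seconds
        email_address="[email protected]",    # ditto; hypothetical address
    )
    script = sge_io.get_submission_script(["echo 'test'"], options=resources)
    # the generated script should start with a shebang and contain SGE directives
    assert script.split("\n")[0].startswith("#!")
    assert "#$" in script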

(Two inline review comments on src/qtoolkit/io/sge.py; outdated, resolved.)
Comment on lines 513 to 525
supported += [
    "njobs",
    "memory_per_thread",
    "time_limit",
    "processes",
    "processes_per_node",
    "process_placement",
    "nodes",
    "threads_per_process",
    "email_address",
    "scheduler_kwargs",
    "gpus_per_job",
]
ml-evs (Member):

Not a problem for this PR, but since #44 I've been wondering if we can programmatically enforce that the fields listed here are definitely QResources members, e.g., by specifying an abstract reference to QResources.njobs and so on, rather than unchecked strings.
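One possible shape for such a check, sketched under the assumption that QResources is a dataclass; this is an illustration, not the approach eventually merged:

from dataclasses import fields

from qtoolkit.core.data_objects import QResources

QRESOURCES_FIELDS = {f.name for f in fields(QResources)}

def validate_supported(supported: list[str]) -> None:
    # fail fast if a scheduler IO class lists a name that QResources does not define
    unknown = set(supported) - QRESOURCES_FIELDS
    if unknown:
        raise ValueError(f"Not QResources attributes: {sorted(unknown)}")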

QuantumChemist (Contributor Author):

I'm happy to implement it that way :)

gpetretto (Contributor):

Good point! I agree that it can be for another PR, since all the objects should be adapted.

QuantumChemist (Contributor Author):

@ml-evs @gpetretto so shall I leave this for another PR, then?

QuantumChemist (Contributor Author):

I added this now :)

gpetretto (Contributor):

Sorry, I am not sure, was something changed here?

QuantumChemist (Contributor Author):

Sorry, that was a bit confusing. I clicked through the PRs and have added this for now:
https://github.com/QuantumChemist/qtoolkit/blob/573641ec772508c71f1f02101902976bf0e1ab80/tests/io/test_sge.py#L240

@ml-evs ml-evs requested a review from gpetretto August 16, 2024 10:21
@QuantumChemist (Contributor Author):

> Fantastic work @QuantumChemist! Sorry for the delay in giving you some feedback. We'll have to wait for @gpetretto to have a proper look, I think, but this is already in great shape.
>
> Only one request (and a few minor comments below): could you add a test for submission script generation, similar to what I did for SLURM in #45? Ideally this would use the same "maximalist" QResources, but it will probably support different features.
>
> I guess (or hope!) you are adding this because you would like to use jobflow-remote with SGE. Are you able to test this version of QToolkit with your current jobflow setup, or do we need some patches in jobflow-remote too? If you are, I can carry on the work I started to hopefully test both PBS and SGE directly inside the jobflow-remote CI.

Hey @ml-evs! 😃
Thank you so much for your feedback.
I'm totally for adding a test as in #45.

So far I couldn't test it with jobflow-remote: although I put sge as the scheduler_type, I still have the problem described above, and I haven't had time yet to investigate it further.

I also have a few days off and will try to make the changes when I'm back!

@gpetretto (Contributor) left a comment:

Thanks a lot for adding SGE support.
I have left some comments below that might need some updates.
The main issue is that if SGE really does not allow querying jobs by a list of ids, we will need to update jobflow-remote, since it currently expects to be able to use that functionality. If you can confirm that this is the case, I will start thinking about how to adapt it. It will either ask for all the jobs (which would likely be inefficient), or I would need to add an option in jobflow-remote to query the jobs based on the user name.
Unfortunately I don't have access to a cluster with SGE, so I will not be able to run tests easily. @ml-evs do you have a Docker image with SGE?

Concerning your initial error in jobflow-remote, it is not obvious what could be going wrong. I would first check whether it is really using the SGEIO object.

(Several inline review comments on src/qtoolkit/io/sge.py; outdated, resolved.)
@QuantumChemist (Contributor Author):
Hi @gpetretto.
Thank you a lot for having a look at my PR. I think I will need quite a while to work through everything properly.
One problem is that I don't have access to a system with SGE (as I'm not implementing this for myself), but I can ask people to test it. Therefore I just started from what I found in the aiida implementation and the SGE manuals.

@ml-evs (Member) commented Aug 21, 2024

> Hi @gpetretto. Thank you a lot for having a look at my PR. I think I will need quite a while to work through everything properly. One problem is that I don't have access to a system with SGE (as I'm not implementing this for myself), but I can ask people to test it. Therefore I just started from what I found in the aiida implementation and the SGE manuals.

It's taking a bit longer than expected to generalise the test docker container in jobflow-remote for other queues, but I'll keep you posted here too.

@gpetretto (Contributor):
Thanks both.
@QuantumChemist in the meantime, can you check with someone who has access to SGE whether it is actually possible to query with qstat using a list of job ids? I have sketched an implementation in jobflow-remote that relies on the username, and it seems feasible. If it is not possible to query by job ids, I will finalize it.

@QuantumChemist (Contributor Author):
> Thanks both. @QuantumChemist in the meantime, can you check with someone who has access to SGE whether it is actually possible to query with qstat using a list of job ids? I have sketched an implementation in jobflow-remote that relies on the username, and it seems feasible. If it is not possible to query by job ids, I will finalize it.

Sorry @gpetretto, this message kind of slipped my attention and I have only just seen it! I will ask someone to try this! Thank you!

@QuantumChemist (Contributor Author):

> Thanks both. @QuantumChemist in the meantime, can you check with someone who has access to SGE whether it is actually possible to query with qstat using a list of job ids? I have sketched an implementation in jobflow-remote that relies on the username, and it seems feasible. If it is not possible to query by job ids, I will finalize it.

Hi @gpetretto 👋🏻

So I asked someone whether it's possible to query with a list of job ids, and it's not directly possible, only with a bash loop as a workaround:

job_ids=(202354 202353); for job_id in "${job_ids[@]}"; do qstat -j $job_id; done

Querying a single job id looks like this:

[screenshot of qstat -j output for a single job id]

@gpetretto (Contributor):

Thanks @QuantumChemist.
I will then add the option based on the username to jobflow-remote. Otherwise even with this implementation, SGE would not be usable.
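For reference, per-user querying in SGE is a single standard qstat call, presumably what such a username-based option would build on:

qstat -u "$USER"   # list the queued/running jobs of the current user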

@QuantumChemist (Contributor Author):

> Thanks @QuantumChemist. I will then add the option based on the username to jobflow-remote. Otherwise, even with this implementation, SGE would not be usable.

Thank you a lot! :)

@QuantumChemist (Contributor Author):

@gpetretto I have now implemented the last changes 😃
let me know if more adjustments are needed!

@gpetretto (Contributor):

Hi @QuantumChemist, thanks for the updates. Still, I have seen that you have moved the _qresources_mapping and the other methods to be attributes of the instances rather than of the classes. I know it will not change much from a practical point of view, but can you make them class attributes, for uniformity with the other ones?

> Now when we want to switch the scheduler type to sge, we cannot restart the daemon. Do you have an idea what we could try?

If it is not a problem with the configuration file (i.e. you can run `jf project check` successfully), I would suggest looking at the logs in ~/.jfremote/PROJECT_NAME/log/runner.log to see if you can find any errors, and/or trying to run the runner with `jf runner run`. This will start the runner in the foreground. It is not really meant for standard execution, but it is a quick way to get the stack trace if the runner fails for whatever reason.
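Collected as a shell sequence (PROJECT_NAME stands for the actual project name; the tail call is just one way to inspect the log):

jf project check                                    # validate the project configuration
tail -n 50 ~/.jfremote/PROJECT_NAME/log/runner.log  # look for errors from the runner
jf runner run                                       # run the runner in the foreground to get the stack trace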

@QuantumChemist (Contributor Author):

Hi @gpetretto! Sorry for confusing those two things (I'm still learning a lot about Python object-oriented programming).
Is it OK like this?

@gpetretto (Contributor):

Thanks @QuantumChemist. It should be fine now.
Did you manage to test this successfully together with jobflow-remote? Or are you still experiencing issues?

@QuantumChemist (Contributor Author):

> Thanks @QuantumChemist. It should be fine now. Did you manage to test this successfully together with jobflow-remote? Or are you still experiencing issues?

Thank you, too!
My colleague's server (with SGE) is down this week and I have some days off until next week, so I couldn't test it yet.

@ml-evs (Member) commented Oct 11, 2024

Matgenix/jobflow-remote#160 (comment)

@QuantumChemist (Contributor Author):

> Matgenix/jobflow-remote#160 (comment)

I need to check more closely with my colleague whether there might be a problem with output writing.

@ml-evs (Member) commented Oct 11, 2024

Thanks, there are almost certainly still problems at my end, but it would be good to verify that it's working for you before I dive into debugging again!

@QuantumChemist (Contributor Author):

> Thanks, there are almost certainly still problems at my end, but it would be good to verify that it's working for you before I dive into debugging again!

I will let you know when I have checked this with my colleague. I haven't had the time yet to continue this with her.

@QuantumChemist (Contributor Author) commented Oct 24, 2024

Hi @gpetretto and @ml-evs 😃,

I have some good news!
Today I was finally able to extensively check my implementation with my colleague via video call, and in principle the SGE implementation works. We are able to successfully run an "add" workflow for testing, and everything completes fine (i.e. all job states are "COMPLETED" in the end).
But when we want to run a DoubleRelax VASP workflow, we get stuck at the following step:

[screenshot of the failing job state]

The strange thing about this is that the VASP job finishes fine and there is no output in queue.err. There is also no other file that would indicate a segmentation fault etc. when executing VASP. We were not able to find out why exactly the job was killed. All the jobflow-remote-related JSON files look perfectly fine as far as I can tell.
It's hard to say where this issue is rooted: the cluster, qtoolkit, or jobflow-remote.
Do you have any ideas? Thank you a lot!

@ml-evs (Member) commented Oct 24, 2024

> I have some good news! Today I was finally able to extensively check my implementation with my colleague via video call, and in principle the SGE implementation works. We are able to successfully run an "add" workflow for testing, and everything completes fine.

Thanks for this, I might be seeing the same issue in my testing PR. Let me push a bit further with your latest changes and I'll try to get back to you!

@QuantumChemist (Contributor Author):

> > I have some good news! Today I was finally able to extensively check my implementation with my colleague via video call, and in principle the SGE implementation works. We are able to successfully run an "add" workflow for testing, and everything completes fine.
>
> Thanks for this, I might be seeing the same issue in my testing PR. Let me push a bit further with your latest changes and I'll try to get back to you!

Alright 👍🏻

@gpetretto (Contributor) commented Oct 25, 2024

Hi @QuantumChemist, thanks for testing the new implementation.
I should say I don't have an idea why such an error would happen only when switching to atomate2 jobs. Does SGE have any command that would provide details on a terminated job? Just to know whether it is indeed possible that SGE killed the job, or not.
Just a few words to clarify the meaning of the error: jobflow-remote writes the output in a file Store in the run folder; for atomate2 you would expect to have a remote_job_data.json and an additional_store_data.json file in the folder. The rest of the response is written to the jfremote_out.json file. However, the jfremote_out.json is first written at the beginning, with just the start time of the execution of the job: https://github.com/Matgenix/jobflow-remote/blob/82457776ed3ffa5fabca74daa4ee7af24cbaab4e/src/jobflow_remote/jobs/run.py#L40. The response is written at the end: https://github.com/Matgenix/jobflow-remote/blob/82457776ed3ffa5fabca74daa4ee7af24cbaab4e/src/jobflow_remote/jobs/run.py#L82.
If the VASP calculation completed but the jfremote_out.json does not contain the response, the problem must have happened between the end of the simulation and the dumping of the file.
A few things that could be checked:

  • does the remote_job_data.json contain the TaskDoc with the output of the VASP job?
  • does the jfremote_out.json indeed contain only the start_time? Or also something else, but not the response? (a quick check is sketched after this list)
  • could it be that the final jfremote_out.json was written somewhere else, for example in the home directory? (I hardly believe so, but just to check)
  • have files like queue.err, queue.out, and the JSONs been zipped by the atomate2 job?
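A quick way to check the second point from inside the run folder; a sketch, since the exact key names beyond start_time are an assumption here:

import json

# inspect the top-level keys of jfremote_out.json in the run folder
with open("jfremote_out.json") as f:
    out = json.load(f)
print(sorted(out))  # "start_time" should be there; on success the response should be too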

@QuantumChemist (Contributor Author):

Hi @gpetretto,

Thank you for your feedback! I will check the things you pointed out together with my colleague!

@QuantumChemist (Contributor Author):

Hi @gpetretto,
My colleague has investigated the problem further with your tips, and here is what she says:

> Hi Chris,
>
> Just gone through the comments:
>
>   • SGE command to get detailed job information: qstat -explain c -j JOB_ID. I used this to track the jobs whilst running, but it showed no weird signs as far as I can tell, and once the job is completed it no longer shows any information.
>   • remote_job_data.json does seem to contain the TaskDoc he mentions, with the outputs (e.g. free energy, time elapsed, etc.) within it.
>   • I'm pretty sure the jfremote_out.json has not been written anywhere else; not sure how I'd extensively check this, but I can't see any files anywhere else.
>   • jfremote_out.json does contain the start_time and the response.
>   • The queue.err, queue.out, and JSON files have not been zipped. All the VASP outputs have been zipped. (In the successful "add" workflows the err, out, and json files were not zipped, so I assume this is a good sign.)
>
> However, I have been doing some more experimenting this evening and have managed to get both Python and VASP test runs working. I kept getting different errors today, looking like this:
>
> [screenshot of the REMOTE_ERROR output]
>
> And I think the issue here is that it was taking a very long time for jfremote_out.json to be written to the output folder after a calculation was completed, so the job would be marked as a REMOTE_ERROR, and then the out file would sometimes appear half an hour later. Running jf job retry -s REMOTE_ERROR after that seemed to fix the workflow.
> It all seems a bit inconsistent though, and maybe dependent on how heavily Epsilon is being used at the time, and so on how long it takes to write the outputs.

From what I saw when we tested it together, it looked to me like there was some delay in writing the files. I can't say whether this comes from SGE or from the cluster Epsilon. Do you have an idea how to handle the delay?

@gpetretto (Contributor):

Hi @QuantumChemist,

Thanks a lot for the detailed reply.
From what you report, it indeed seems related to a delay between the end of the SGE job and the availability of the files on the file system of the cluster front end. I suppose this might be caused by a slow/overloaded NFS, for example.
The output files like jfremote_out.json are written by jobflow inside the SGE job. Jobflow-remote's runner will just check whether the SGE job is completed and then proceed with the download of the files. So here you are hitting two different kinds of issues:

  1. if the job is very short (like in the last error), the jfremote_out.json is not even present yet, giving you that error while trying to download the files;
  2. for a more normal job, the file will be created at the beginning of the job, and since the job takes some time to execute, the file will be synchronized. But if, when the job ends, it takes some time to synchronize the updated version, the stale file will be downloaded, thus completing the "download" phase correctly. It will then fail at the "complete" step because the response is missing.

If that is the problem, I can propose 2 solutions:

  1. in the project configuration you can add a post_run for the worker, containing a sleep (see the sketch after this list). This will be executed after the jfremote_out.json has been created, so it will give the system time to synchronize it. I have no clue how long this sleep needs to be, as it should probably be based on a reasonable estimate of how long the system takes. This will waste some resources, since the sleep happens inside the SGE job. However, if this is just a few seconds and it seems acceptable, then it could be the solution.
  2. if the synchronization could instead take some time, and adding a sleep for all the jobs may end up wasting a lot of computational resources, I think we can consider adding a delay_download option to the worker configuration that will postpone the start of the download by a specified number of seconds. I have an idea about how to do that, but I will need to check that it really works fine.

I would ask you to try option 1 first. This should confirm that the issue is indeed the FS delay, and probably give you a rough estimate of the waiting time required. If this is a real use case, we can then consider implementing the option. Would that be fine?
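A minimal sketch of what such a worker entry could look like in a YAML project file; the worker name is hypothetical, other required worker fields are omitted, and the 120-second value (which the testers below ended up using) is just an example:

workers:
  sge_worker:                 # hypothetical worker name
    scheduler_type: sge
    # executed inside the SGE job, after jfremote_out.json has been written;
    # gives a slow shared file system time to synchronize
    post_run: sleep 120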

In any case, this definitely seems to be a jobflow-remote issue and not related to your SGE implementation. If you confirm that the problem is the delay, I will open an issue there to discuss it further. This should not be a reason to hold off merging this PR.
@ml-evs any update on your side about the tests?

@QuantumChemist (Contributor Author):

Hi @gpetretto,

Sorry for the late reply!

My colleague has tried adding sleep 120 as a post_run step in the jobflow-remote configuration file as you suggested, and with that she was able to run atomate2 and also autoplex workflows. For some jobs she had to do jf job retry -s REMOTE_ERROR, but that is an autoplex-related issue, and the retried jobs ran successfully.
I think this file issue and the needed delay are related to her local cluster, as when she uses her SLURM setup to submit to another remote cluster she also bumps into this issue, which she has quick-fixed the same way.

Therefore my feeling is that my implementation here actually works as intended already.

We would be happy about a delay_download option in jobflow-remote then! 😃 I can also help with that if you need it.

@ml-evs (Member) commented Nov 15, 2024

> @ml-evs any update on your side about the tests?

Sorry @gpetretto and @QuantumChemist, I missed this at the time and have been travelling without a chance to check the tests.

If things are working in reality now, shall we merge this with the expectation that there might be some additional tweaks contributed back here? We can maybe delay a qtoolkit release until I can test it properly in JFR. More generally there is the question of where these tests should run, but we can come back to that!

@gpetretto (Contributor):

Thanks both for the updates.

My plans were actually to merge this, implement one more update in QTK that I was holding off to avoid the trouble of merging with these changes, and make a new release of QTK. I would probably proceed with the merging at this point. @ml-evs I will wait for an update from your side before releasing a new version.

@ml-evs (Member) left a comment:

👍

@gpetretto gpetretto merged commit 62b683f into Matgenix:develop Nov 15, 2024
5 checks passed
@QuantumChemist (Contributor Author):

Thank you for merging! :D

Successfully merging this pull request may close these issues.

How could I use SGE for job submission?