
Refactoring, bug fixes and adding tests #18

Merged
merged 14 commits into aiidateam:main from fea/mem-allocation on Jul 17, 2024

Conversation

@unkcpz (Member) commented Jun 5, 2024

This PR has been open for a while since I use the branch to test the lightweight scheduler integration for the demo server. The PR bundles a bunch of things, including:

  • Correctly support memory setup for resources.
  • Support turning on hyperthreading with the latest version of HyperQueue.
  • Use ruff to lint.
  • Fix the submit bug for hq > 0.12 where resources were configured twice, in both the job script and the submit command.
  • Support installing hq on a remote computer over the CLI.
  • WIP: add unit tests that submit to a real hq instance, using the fixture from the HyperQueue repo.

The major change I made in terms of resource settings is that I dropped num_mpiprocs and renamed num_cores -> num_cpus and memory_Mb -> memory_mb.
The reason is that I think this kind of "meta-scheduler" for task farming should not inherit from either ParEnvJobResource (as SGE-type schedulers do) or NodeNumberJobResource. When we use HyperQueue for task farming, or as a lightweight scheduler on a local machine, we only set the number of CPUs and the amount of memory to allocate for each job. The multi-node support of HyperQueue is experimental and, as far as I can tell, will not cover our use case. But this point is worth discussing; looking forward to your opinions @giovannipizzi @mbercx
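
As an illustration, here is a minimal sketch of what requesting resources could look like with the renamed fields. The field names num_cpus and memory_mb come from this PR; the builder boilerplate, code label, and concrete values are assumptions made up for the example:

    # Hypothetical usage sketch with the renamed resource fields (values are placeholders).
    from aiida import load_profile, orm

    load_profile()

    code = orm.load_code('some-code@hq-computer')  # assumed label, for illustration only
    builder = code.get_builder()
    builder.metadata.options.resources = {
        'num_cpus': 4,       # renamed from num_cores in this PR
        'memory_mb': 2048,   # renamed from memory_Mb in this PR
    }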

Issues:

Must-have features:

  • Use NodeNumberJobResource as the parent class and provide an option for the LUMI use case, which will require the multi-node functionality of HQ.
    • How do we tell alloc to start workers in the same group? Every new multi-node run is assigned to a certain group, which means that if -N is passed to alloc, the group name should always be exclusive. We don't want HQ to end up with many unbalanced jobs scattered across different compute nodes.

@unkcpz unkcpz marked this pull request as draft June 5, 2024 08:24
@unkcpz unkcpz force-pushed the fea/mem-allocation branch 5 times, most recently from 5b56629 to 56f8345 on June 6, 2024 09:51
unkcpz added 4 commits June 7, 2024 21:47
- Change command to aiida-hq
- add aiida-hq install <computer>
- fix start server timeout problem
- pre-commit lint
@unkcpz unkcpz force-pushed the fea/mem-allocation branch 3 times, most recently from 58e77e3 to d525ff1 on June 7, 2024 20:06
@unkcpz unkcpz force-pushed the fea/mem-allocation branch 2 times, most recently from fef7c55 to 8bffb31 on June 7, 2024 20:23
@unkcpz unkcpz force-pushed the fea/mem-allocation branch from 8db9734 to 181a29b on June 7, 2024 20:27
@unkcpz unkcpz marked this pull request as ready for review July 14, 2024 07:45
@unkcpz (Member, Author) commented Jul 14, 2024

I want to merge this PR first so I can keep working on the multi-node support needed by Timo, and on using this in the demo server deployment.
It will not affect current users if the package is installed from PyPI; we can decide when to release a new version after checking that everything is ready. On my side, I have tested it and run millions of jobs on Eiger, Daint, and the demo server.
Let me know if you don't agree @mbercx.

Keeping it open for too long also means it will need to be reworked after aiidateam/aiida-core#6043.

@mbercx (Member) left a comment

Quite a large PR you have there, @unkcpz. ^^ I didn't go through the code line-by-line, but in general everything looks very good, thanks for the hard work! 🚀

@mbercx (Member) commented Jul 15, 2024

You might want to split up the changes into multiple commits, but I would understand if you don't want to go through the hassle. Also fine to squash and merge and simply have a single commit that describes all the changes.

@unkcpz (Member, Author) commented Jul 15, 2024

Thanks @mbercx!!

Also fine to squash and merge and simply have a single commit that describes all the changes.

I'll rebase to fewer commits, reword the commit messages a bit, and do a rebase merge. I did one rebase before and tried to keep every commit as independent as possible.

class HyperQueueJobResource(JobResource):
    """Class for HyperQueue job resources."""

-    _default_fields = ('num_mpiprocs', 'num_cores', 'memory_Mb')
+    _default_fields = ("num_cpus", "memory_mb")
@t-reents (Contributor) left a comment
Hi @mbercx @unkcpz

I'm not 100% sure about the change of removing the num_mpiprocs resource and only using num_cpus. I think this setup makes sense on CPU systems, e.g. requesting 24 cores and submitting with srun -n 24 ..., i.e. using 24 tasks. However, our recommended setup on LUMI uses the following resources: we have 8 GPUs and 56 CPUs per node, i.e. 7 CPUs per GPU, and one would request 8 tasks per node, i.e. one per GPU.

This being said, we would request 56 CPUs from HQ but still want to submit using srun -n 8 .... In this scenario, num_mpiprocs is not equal to num_cores.

Concerning the multi-node usage, I've already started a draft locally to test the use-case. If you don't mind, I'll open a PR once this is merged. We can use it as a starting point and work on it together.

With the multi-node use case in mind, where it will probably make sense to use NodeNumberJobResource, we also need to set a total number of mpiprocs (which will be used in the srun command) that is unequal to the number of requested CPUs. Moreover, CPUs are not even requested in that case, as --cpus and --nodes are mutually exclusive in the multi-node case. Therefore, and based on the arguments above, I'd vote for keeping the two separate resource arguments num_mpiprocs and num_cpus already in this PR.
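
To make the distinction concrete, here is a rough sketch of what the LUMI GPU-node setup described above could look like if both fields are kept. The task and CPU counts come from this comment; the memory value is a placeholder, not a recommendation:

    # Hypothetical resource specification for one LUMI GPU node (sketch only).
    resources = {
        'num_mpiprocs': 8,   # MPI tasks, one per GPU -> srun -n 8 ...
        'num_cpus': 56,      # CPUs requested from HyperQueue (7 per GPU)
        'memory_mb': 2048,   # placeholder value, for illustration only
    }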

@unkcpz (Member, Author) replied

True @t-reents, as we discussed, I'll add it back.
My plan is to keep that change (reverting back to using NodeNumberJobResource) for a separate PR, since num_cpus is compatible with the regular use case, while NodeNumberJobResource matches the experimental multi-node feature, as we know.

@unkcpz unkcpz force-pushed the fea/mem-allocation branch from 181a29b to 9e4864e on July 17, 2024 10:11
@unkcpz unkcpz merged commit 2cfabba into aiidateam:main Jul 17, 2024
3 checks passed
@unkcpz unkcpz deleted the fea/mem-allocation branch July 17, 2024 10:12