
Refactoring, bug fixes and adding tests #18

Merged
merged 14 commits into aiidateam:main from fea/mem-allocation on Jul 17, 2024

Conversation

@unkcpz (Member) commented Jun 5, 2024

This PR has been open for a while since I use the branch to test the lightweight scheduler integration for the demo server. The PR bundles a bunch of things, including:

  • Correctly support memory setup for resources.
  • Support turning on hyperthreading with the latest version of HyperQueue.
  • Use ruff to lint.
  • Fix the submit bug for hq > 0.12 where resources were configured twice, in both the job script and the submit command.
  • Support installing hq on a remote computer over the CLI.
  • WIP: add unit tests that submit to a real hq instance, using the fixture from the HyperQueue repo.

The major change I made in terms of resource settings is that I dropped num_mpiprocs and renamed num_cores -> num_cpus and memory_Mb -> memory_mb.
The reason is that I think this kind of "meta-scheduler" for task farming should not inherit from either ParEnvJobResource (as SGE-type schedulers do) or NodeNumberJobResource. When we use HyperQueue for task farming, or as a lightweight scheduler on a local machine, we only set the number of CPUs and the amount of memory to allocate for each job. The multi-node support of HyperQueue is experimental and, as far as I can tell, will not cover our use case. But this point is worth discussing; looking forward to your opinions @giovannipizzi @mbercx
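
As an illustration, here is a minimal sketch of what requesting resources could look like with the renamed fields. The field names num_cpus and memory_mb come from this PR; the builder boilerplate, code label, and concrete values are assumptions made up for the example:

    # Hypothetical usage sketch with the renamed resource fields (values are placeholders).
    from aiida import load_profile, orm

    load_profile()

    code = orm.load_code('some-code@hq-computer')  # assumed label, for illustration only
    builder = code.get_builder()
    builder.metadata.options.resources = {
        'num_cpus': 4,       # renamed from num_cores in this PR
        'memory_mb': 2048,   # renamed from memory_Mb in this PR
    }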

Issues:

Must-have features:

  • Use NodeNumberJobResource as the parent class and provide an option for the LUMI use case, which will require the multi-node functionality of HQ.
    • How do we tell alloc to start workers in the same group? Every new multi-node run is assigned to a certain group, which means that if -N is passed to alloc, the group name should always be exclusive. We don't want HQ to end up with many unbalanced jobs scattered across different compute nodes.

@unkcpz unkcpz marked this pull request as draft June 5, 2024 08:24
@unkcpz unkcpz force-pushed the fea/mem-allocation branch 5 times, most recently from 5b56629 to 56f8345 on June 6, 2024 09:51
unkcpz added 4 commits June 7, 2024 21:47
- Change command to aiida-hq
- add aiida-hq install <computer>
- fix start server timeout problem
- pre-commit lint
@unkcpz unkcpz force-pushed the fea/mem-allocation branch 3 times, most recently from 58e77e3 to d525ff1 on June 7, 2024 20:06
@unkcpz unkcpz force-pushed the fea/mem-allocation branch 2 times, most recently from fef7c55 to 8bffb31 on June 7, 2024 20:23
@unkcpz unkcpz force-pushed the fea/mem-allocation branch from 8db9734 to 181a29b on June 7, 2024 20:27
@unkcpz unkcpz marked this pull request as ready for review July 14, 2024 07:45
@unkcpz (Member, Author) commented Jul 14, 2024

I want to merge this PR first so I can keep working on the multi-node support needed by Timo, and on using this in the demo server deployment.
It will not affect current users if the package is installed from PyPI; we can decide when to release a new version after checking that everything is ready. On my side, I have tested it and run millions of jobs on Eiger, Daint, and the demo server.
Let me know if you don't agree @mbercx.

Keeping it open for too long also means it will need to be reworked after aiidateam/aiida-core#6043.

@mbercx (Member) left a comment

Quite a large PR you have there, @unkcpz. ^^ I didn't go through the code line-by-line, but in general everything looks very good, thanks for the hard work! 🚀

@mbercx (Member) commented Jul 15, 2024

You might want to split up the changes into multiple commits, but I would understand if you don't want to go through the hassle. Also fine to squash and merge and simply have a single commit that describes all the changes.

@unkcpz (Member, Author) commented Jul 15, 2024

Thanks @mbercx!!

Also fine to squash and merge and simply have a single commit that describes all the changes.

I'll rebase to fewer commits, reword the commit messages a bit, and do a rebase merge. I did one rebase before and tried to keep every commit as independent as possible.

class HyperQueueJobResource(JobResource):
    """Class for HyperQueue job resources."""

-    _default_fields = ('num_mpiprocs', 'num_cores', 'memory_Mb')
+    _default_fields = ("num_cpus", "memory_mb")
@t-reents (Contributor) left a comment
Hi @mbercx @unkcpz

I'm not 100% sure about the change of removing the num_mpiprocs resource and only using num_cpus. I think this setup makes sense on CPU systems, e.g. requesting 24 cores and submitting with srun -n 24 ..., i.e. using 24 tasks. However, our recommended setup on LUMI uses the following resources: we have 8 GPUs and 56 CPUs per node, i.e. 7 CPUs per GPU, and one would request 8 tasks per node, i.e. one per GPU.

This being said, we would request 56 CPUs from HQ but still want to submit using srun -n 8 .... In this scenario, num_mpiprocs is not equal to num_cores.

Concerning the multi-node usage, I've already started a draft locally to test the use-case. If you don't mind, I'll open a PR once this is merged. We can use it as a starting point and work on it together.

With the multi-node use case in mind, where it will probably make sense to use NodeNumberJobResource, we also need to set a total number of mpiprocs (which will be used in the srun command) that is unequal to the number of requested CPUs. Moreover, CPUs are not even requested in that case, as --cpus and --nodes are mutually exclusive in the multi-node case. Therefore, and based on the arguments above, I'd vote for keeping the two separate resource arguments num_mpiprocs and num_cpus already in this PR.
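
To make the distinction concrete, here is a rough sketch of what the LUMI GPU-node setup described above could look like if both fields are kept. The task and CPU counts come from this comment; the memory value is a placeholder, not a recommendation:

    # Hypothetical resource specification for one LUMI GPU node (sketch only).
    resources = {
        'num_mpiprocs': 8,   # MPI tasks, one per GPU -> srun -n 8 ...
        'num_cpus': 56,      # CPUs requested from HyperQueue (7 per GPU)
        'memory_mb': 2048,   # placeholder value, for illustration only
    }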

@unkcpz (Member, Author) replied

True @t-reents, as we discussed, I'll add it back.
My plan is to keep that change (reverting back to using NodeNumberJobResource) for a separate PR, since num_cpus is compatible with the regular use case, while NodeNumberJobResource matches the experimental multi-node feature, as we know.

@unkcpz unkcpz force-pushed the fea/mem-allocation branch from 181a29b to 9e4864e on July 17, 2024 10:11
@unkcpz unkcpz merged commit 2cfabba into aiidateam:main Jul 17, 2024
3 checks passed
@unkcpz unkcpz deleted the fea/mem-allocation branch July 17, 2024 10:12