Refactoring, bug fixes and adding tests #18
Conversation
The resources can be set with `#HQ` directives or through the CLI, but not both. The CLI options are removed from the `submit` command.
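To illustrate, a hypothetical job-script sketch (the directive values here are examples, not taken from this PR): resources are declared with `#HQ` directives inside the script itself, instead of CLI options on `hq submit`.

```shell
#!/bin/bash
# Hypothetical sketch: with the change above, resources live in the script
# as #HQ directives rather than as CLI flags. Values are illustrative.
#HQ --cpus=2

msg="job body runs with the resources declared above"
echo "$msg"
```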
- Change command to `aiida-hq`
- Add `aiida-hq install <computer>`
- Fix start-server timeout problem
- Pre-commit lint
I want to merge the PR first so I can keep working on the multi-node support feature needed by Timo for using this in the demo server deployment. Keeping it open too long also means it will need to be reworked after aiidateam/aiida-core#6043.
Quite a large PR you have there, @unkcpz. ^^ I didn't go through the code line-by-line, but in general everything looks very good, thanks for the hard work! 🚀
You might want to split up the changes into multiple commits, but I would understand if you don't want to go through the hassle. It's also fine to squash and merge and simply have a single commit that describes all the changes.
Thanks @mbercx!!
I'll rebase to fewer commits, reword the commit messages a bit, and do a rebase merge. I did one rebase before and tried to keep every commit as independent as possible.
```diff
 class HyperQueueJobResource(JobResource):
     """Class for HyperQueue job resources."""

-    _default_fields = ('num_mpiprocs', 'num_cores', 'memory_Mb')
+    _default_fields = ("num_cpus", "memory_mb")
```
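For context, a minimal sketch of what the renamed resource class might look like after this change. The field names follow the diff above; the defaults and validation logic are my assumptions, not the PR's actual implementation (the real class subclasses aiida-core's `JobResource`).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch only; field names follow the diff, validation is assumed.
@dataclass
class HyperQueueJobResourceSketch:
    num_cpus: int = 1                 # renamed from num_cores; num_mpiprocs dropped
    memory_mb: Optional[int] = None   # renamed from memory_Mb; None = no limit

    def __post_init__(self) -> None:
        if self.num_cpus < 1:
            raise ValueError("num_cpus must be a positive integer")
        if self.memory_mb is not None and self.memory_mb <= 0:
            raise ValueError("memory_mb must be positive when set")

res = HyperQueueJobResourceSketch(num_cpus=4, memory_mb=2048)
print(res.num_cpus, res.memory_mb)  # → 4 2048
```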
I'm not 100% sure about the change of removing the `num_mpiprocs` resource and only using `num_cpus`. I think it makes sense to use this setup on CPU systems, e.g. requesting 24 cores and submitting with `srun -n 24 ...` using 24 tasks. However, our recommended setup on LUMI uses the following resources: we have 8 GPUs and 56 CPUs, i.e. 7 CPUs per GPU. One would request 8 tasks per node, i.e. one per GPU.
This being said, we would request 56 CPUs from HQ but still want to submit using `srun -n 8 ...`. In this scenario, `num_mpiprocs` is not equal to `num_cores`.
Concerning the multi-node usage, I've already started a draft locally to test the use case. If you don't mind, I'll open a PR once this is merged. We can use it as a starting point and work on it together.
Having in mind the multi-node use case, where it will probably make sense to use `NodeNumberJobResource`, we also need to set a total number of `mpiprocs` (which will be used in the `srun` command) that is unequal to the number of requested CPUs. Moreover, CPUs are not even requested, as `--cpus` and `--nodes` are mutually exclusive in the multi-node case. Therefore, and based on the arguments at the beginning, I'd vote for already keeping the two different resource arguments `num_mpiprocs` and `num_cpus` in this PR.
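To make the LUMI argument concrete, a small sketch with illustrative numbers showing why the two values must stay independent: the CPU count requested from HQ and the task count passed to `srun` differ by the CPUs-per-GPU ratio.

```python
# Illustrative numbers matching the LUMI setup described above:
# 8 GPUs and 56 CPUs per node, one MPI task per GPU.
num_gpus = 8
num_cpus = 56             # what we request from HyperQueue
num_mpiprocs = num_gpus   # the task count that goes into `srun -n ...`

cpus_per_task = num_cpus // num_mpiprocs
print(f"request {num_cpus} CPUs from HQ, launch with srun -n {num_mpiprocs}")
print(f"{cpus_per_task} CPUs per task")  # → 7 CPUs per task
```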
True @t-reents, as we discussed, I'll add it back.
My plan is to keep that change (reverting back to using `NodeNumberJobResource`) as a separate PR, since `num_cpus` is compatible with the regular use case, while `NodeNumberJobResource` matches the experimental multi-node feature as we know.
This PR has been open since I use the branch to test the lightweight scheduler integration on the demo server. The PR bundles a bunch of things, including:
- Testing `hq` using the fixture from the hyperqueue repo.

The major change I made in terms of resource settings is that I dropped `num_mpiprocs`, renamed `num_cores` -> `num_cpus`, and renamed `memory_Mb` -> `memory_mb`. The reason is that I think this kind of "meta-scheduler" for task farming should not inherit from either `ParEnvJobResource` (as the SGE-type schedulers do) nor from `NodeNumberJobResource`. When we use HyperQueue for task farming, or as a lightweight scheduler on a local machine, we only set the number of CPUs and the amount of memory to allocate for each job. The multi-node support of HyperQueue is experimental and, from what I can expect, will not cover our use case. But this point is worth discussing; looking forward to your opinions @giovannipizzi @mbercx

Issues:
- Server start failure (`OSError: Failure`)
- Set `HQ_SERVER_DIR` explicitly, to distinguish multiple servers (see "Distinguish hq-server folder to have multiple servers for different machines sharing the same home", It4innovations/hyperqueue#719)

Must-have features:
- Use `NodeNumberJobResource` as parent and provide an option for the LUMI use case that will require the multi-node functionality of HQ.
- When `-N` is passed to alloc, the group name should always be exclusive. We don't want HQ to mess around and end up with many unbalanced jobs on different compute nodes.
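On the `HQ_SERVER_DIR` point, a quick sketch of the idea (the directory paths are just examples): each machine points its server state at its own directory, so several servers can coexist under one shared `$HOME`. The `hq` commands are left as comments since they require a HyperQueue installation.

```shell
# Sketch: isolate server state per machine via HQ_SERVER_DIR (HyperQueue's
# server-directory override; example path, not a recommendation).
export HQ_SERVER_DIR="$HOME/.hq-server-clusterA"
# hq server start &        # state is written under $HQ_SERVER_DIR
# hq worker start
echo "server dir: $HQ_SERVER_DIR"
```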