HQ as lightweight scheduler for the docker image (#795)
In the docker image, HyperQueue is pre-configured and replaces the direct scheduler of the pre-existing localhost computer, so that the number of CPUs in use is limited when multiple calculations run concurrently. The CPU and memory limits are read from cgroups and set as the computer's defaults; the QeApp can later use them to choose the default amount of resources for a calculation. The number of CPUs is set to floor(ncpus) of the container, so that a fraction of a CPU is kept free for system tasks such as keeping the interface responsive. The commit also includes a small refactoring of the computer/code setup.
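A minimal sketch (with hypothetical values, not taken from the scripts below) of how floor(ncpus) falls out of the cgroup v2 `cpu.max` file, which holds `<quota> <period>`:

```shell
# cpu.max holds "<quota> <period>"; their ratio is the CPU allocation.
# The value below is a made-up example: 150000/100000 = 1.5 CPUs.
cpu_max="150000 100000"
QUOTA=${cpu_max%% *}
PERIOD=${cpu_max##* }

if [ "$QUOTA" = "max" ]; then
    NCPUS=$(nproc)               # unconstrained container: use all host CPUs
else
    NCPUS=$(( QUOTA / PERIOD ))  # integer division == floor(), so 1.5 -> 1
fi
echo "$NCPUS"                    # half a CPU stays free for system tasks
```

With a quota of 1.5 CPUs this prints 1; the leftover half CPU is what keeps the container responsive while a calculation runs.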
Showing 8 changed files with 239 additions and 63 deletions.
New file (+21 lines), a script that sets up the HyperQueue computer in AiiDA:

```shell
#!/bin/bash

set -x

# computer
verdi computer show ${HQ_COMPUTER} || verdi computer setup \
    --non-interactive \
    --label "${HQ_COMPUTER}" \
    --description "local computer with hyperqueue scheduler" \
    --hostname "localhost" \
    --transport core.local \
    --scheduler hyperqueue \
    --work-dir /home/${NB_USER}/aiida_run/ \
    --mpirun-command "mpirun -np {num_cpus}"

verdi computer configure core.local "${HQ_COMPUTER}" \
    --non-interactive \
    --safe-interval 5.0

# disable the localhost computer that is set up in the base image
verdi computer disable localhost aiida@localhost
```
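The `show || setup` line above is an idempotency guard: the setup command only runs when the computer does not already exist. A minimal sketch of the same pattern, with a hypothetical marker file and function standing in for the real `verdi` calls:

```shell
# "check || create": the right-hand side runs only when the check fails.
# MARKER and setup_marker are illustrative, not part of the real script.
MARKER=$(mktemp -u)   # path to a file that plays the role of the computer

setup_marker() {
    echo "running setup"
    touch "$MARKER"
}

[ -e "$MARKER" ] || setup_marker   # first call: marker absent, setup runs
[ -e "$MARKER" ] || setup_marker   # second call: marker present, no-op
rm -f "$MARKER"
```

Because the guard re-checks state on every container start, the script stays safe to run repeatedly.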
New file (+59 lines), a script that starts the HQ server and worker and sets the computer's resource defaults from the cgroup limits:

```shell
#!/bin/bash

set -x

# NOTE: this cgroup folder hierarchy is based on cgroup v2.
# If the container runs on a system that only has cgroup v1, the image build
# procedure will fail. Since the image is mostly for the demo server, where we
# know the machine and OS, we can assume cgroup v2 (Kubernetes > v1.25),
# so users are not required to have the new cgroup version.
# Developers, however, should update their cgroup version to v2.
# See: https://kubernetes.io/docs/concepts/architecture/cgroups/#using-cgroupv2

# computer memory from the container runtime
MEMORY_LIMIT=$(cat /sys/fs/cgroup/memory.max)

if [ "$MEMORY_LIMIT" = "max" ]; then
    MEMORY_LIMIT=4096
    echo "No memory limit set, using 4 GiB"
else
    MEMORY_LIMIT=$(echo "scale=0; $MEMORY_LIMIT / (1024 * 1024)" | bc)
    echo "Memory limit: ${MEMORY_LIMIT} MiB"
fi

# Compute the number of CPUs allocated to the container
CPU_LIMIT=$(awk '{print $1}' /sys/fs/cgroup/cpu.max)
CPU_PERIOD=$(awk '{print $2}' /sys/fs/cgroup/cpu.max)

# cpu.max reads "max <period>" when no quota is set; guard against that
# before dividing, otherwise bc receives the literal string "max".
if [ "$CPU_LIMIT" != "max" ] && [ "$CPU_PERIOD" -ne 0 ]; then
    CPU_NUMBER=$(echo "scale=2; $CPU_LIMIT / $CPU_PERIOD" | bc)
    echo "Number of CPUs allocated: $CPU_NUMBER"

    # For the HQ worker, round down to an integer number of CPUs;
    # the remainder is left for system tasks.
    CPU_LIMIT=$(echo "scale=0; $CPU_LIMIT / $CPU_PERIOD" | bc)
else
    # No limit set (e.g. a local OCI runtime without a CPU limit): use all CPUs
    CPU_LIMIT=$(nproc)
    echo "No CPU limit set"
fi

# Start the HQ server with one worker
run-one-constantly hq server start 1>$HOME/.hq-stdout 2>$HOME/.hq-stderr &
run-one-constantly hq worker start --cpus=${CPU_LIMIT} --resource "mem=sum(${MEMORY_LIMIT})" --no-detect-resources &

# Reset the default memory_per_machine and default_mpiprocs_per_machine:
# c.set_default_mpiprocs_per_machine = ${CPU_LIMIT}
# c.set_default_memory_per_machine = ${MEMORY_LIMIT}

# As for the original localhost computer, set the job poll interval to 2.0 s.
# In addition, set the default mpiprocs and memory per machine.
# TODO: this runs every time the container starts; we need a lock file to prevent it.
job_poll_interval="2.0"
computer_name=${HQ_COMPUTER}
python -c "
from aiida import load_profile; from aiida.orm import load_computer;
load_profile();
load_computer('${computer_name}').set_minimum_job_poll_interval(${job_poll_interval})
load_computer('${computer_name}').set_default_mpiprocs_per_machine(${CPU_LIMIT})
load_computer('${computer_name}').set_default_memory_per_machine(${MEMORY_LIMIT})
"
```
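The TODO above asks for a guard so the configuration runs only on the first container start. A hedged sketch of one way to do it, assuming a marker-file path of our own choosing (the path and the guard are not part of the original scripts):

```shell
# Run the one-time configuration only if the marker file is absent.
# CONFIG_DONE is a hypothetical path; any location that persists
# across container restarts (e.g. a mounted home volume) works.
CONFIG_DONE="${HOME}/.hq-computer-configured"

if [ ! -f "$CONFIG_DONE" ]; then
    # ... run the one-time `python -c` computer configuration here ...
    touch "$CONFIG_DONE"
fi
```

On subsequent starts the marker file exists, so the configuration block is skipped entirely.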