Adding SLURM-compatibility to Benchexec #995

Merged 62 commits on Feb 20, 2024
Changes from 21 commits
Commits
49f52ed
Added slurm executor
leventeBajczi Feb 17, 2024
687cdb4
Fixed memory calculation
leventeBajczi Feb 17, 2024
a53f6d6
Fixed memory calculation #2
leventeBajczi Feb 17, 2024
7b11e4b
Implemented slurm executor
leventeBajczi Feb 17, 2024
9577b1c
Fixed timelimit
leventeBajczi Feb 17, 2024
2496c19
Fixed subprocess
leventeBajczi Feb 17, 2024
be275ea
Fixed stdout
leventeBajczi Feb 17, 2024
3f72aa8
Fixed stdout
leventeBajczi Feb 17, 2024
f2f385b
Added logging
leventeBajczi Feb 17, 2024
754ab98
Fixed memory limit
leventeBajczi Feb 17, 2024
4f0f006
Adding 6 lines of metadata to beginning of file
leventeBajczi Feb 17, 2024
ea9b778
Reformatted file
leventeBajczi Feb 17, 2024
4d5af97
Cleared up commands
leventeBajczi Feb 18, 2024
e7cdee6
Formatted file
leventeBajczi Feb 18, 2024
6a7dbfb
Moved --slurm out of main benchexec code
leventeBajczi Feb 18, 2024
00ed378
Fixed path
leventeBajczi Feb 18, 2024
4e837d3
Fixed formatting command
leventeBajczi Feb 18, 2024
f78ff8f
Using --no-home instead of specifying -B
leventeBajczi Feb 18, 2024
a0c662d
Added --contain
leventeBajczi Feb 18, 2024
394a7c4
Added -B $PWD:$HOME
leventeBajczi Feb 18, 2024
2ea64b9
Added log
leventeBajczi Feb 18, 2024
82d960a
ntasks=1
leventeBajczi Feb 18, 2024
26d4472
removed unused import
leventeBajczi Feb 18, 2024
18ffb45
Added fusemount options
leventeBajczi Feb 18, 2024
c88e5dc
Added temp files
leventeBajczi Feb 18, 2024
1fbd3e4
Reformatted
leventeBajczi Feb 18, 2024
0d7246b
Updated copyright, added readme
leventeBajczi Feb 19, 2024
9f62e8e
Updated copyright
leventeBajczi Feb 19, 2024
26d1933
Added myself to the list of contributors
leventeBajczi Feb 19, 2024
501e43b
Added contact info to top of README
leventeBajczi Feb 19, 2024
9b2f445
Updated README regarding contact
leventeBajczi Feb 20, 2024
493ce92
Update README with better description on the workflow.
leventeBajczi Feb 20, 2024
6a5d5e6
Moved contributor entry to correct place in alphabetical order
leventeBajczi Feb 20, 2024
a9f1c7b
Added links in preliminaries
leventeBajczi Feb 20, 2024
67dcd61
Updated requirement description
leventeBajczi Feb 20, 2024
3f28a45
Moved disclaimer to limitations instead
leventeBajczi Feb 20, 2024
17f02ef
REmoved confusing documentation comment
leventeBajczi Feb 20, 2024
6b6cd65
Modified help text
leventeBajczi Feb 20, 2024
2a2ffe0
Reworked param passing
leventeBajczi Feb 20, 2024
6425071
Removed starttime
leventeBajczi Feb 20, 2024
a7501c2
Moved to factory method for ProcessExitCode
leventeBajczi Feb 20, 2024
61f7fe7
Replaced bash pipes with python code, moved regexes
leventeBajczi Feb 20, 2024
de4e23c
Removed unused import
leventeBajczi Feb 20, 2024
7307acc
Fixed missing str()
leventeBajczi Feb 20, 2024
8ef0505
Added extra logging
leventeBajczi Feb 20, 2024
28da4ac
Fixed bug in exit code parsing
leventeBajczi Feb 20, 2024
86574f2
Fixed quoting issue
leventeBajczi Feb 20, 2024
46247d7
Minor fixes: typo, logging
leventeBajczi Feb 20, 2024
e6b108e
Formatting fix
leventeBajczi Feb 20, 2024
f110bcc
Updated README and requering --no-hyperthreading
leventeBajczi Feb 20, 2024
2e6e2d9
Added exception to when exit code is not parsed
leventeBajczi Feb 20, 2024
bb9f5bf
Added new line to limitations
leventeBajczi Feb 20, 2024
9b86b5f
Formatting fix
leventeBajczi Feb 20, 2024
d742e30
Minor fixes from feedback
leventeBajczi Feb 20, 2024
ae7ebcc
Formatted file
leventeBajczi Feb 20, 2024
255eb6d
Not using shell anymore
leventeBajczi Feb 20, 2024
4368b34
str() wrap around ints
leventeBajczi Feb 20, 2024
e07ce23
Added comma
leventeBajczi Feb 20, 2024
24dd37e
str() wrap around ints
leventeBajczi Feb 20, 2024
1563694
Implemented some fixes
leventeBajczi Feb 20, 2024
d9fba5d
Formatted file
leventeBajczi Feb 20, 2024
e6e9b92
better handling of bad scratchdir
leventeBajczi Feb 20, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -98,6 +98,7 @@ Maintainer: [Philipp Wendler](https://www.philippwendler.de)

Contributors:
- [Aditya Arora](https://github.com/alohamora)
- [Levente Bajczi](https://github.com/leventeBajczi)
- [Dirk Beyer](https://www.sosy-lab.org/people/beyer/)
- [Laura Bschor](https://github.com/laurabschor)
- [Thomas Bunk](https://github.com/TBunk)
@@ -122,7 +123,6 @@ Contributors:
- [Thomas Stieglmaier](https://stieglmaier.me/)
- [Martin Yankov](https://github.com/marto97)
- [Ilja Zakharov](https://github.com/IljaZakharov)
- [Levente Bajczi](https://github.com/leventeBajczi)
- and [lots of more people who integrated tools into BenchExec](https://github.com/sosy-lab/benchexec/graphs/contributors)

### Users of BenchExec
6 changes: 3 additions & 3 deletions contrib/slurm-benchmark.py
@@ -29,8 +29,8 @@

class Benchmark(benchexec.benchexec.BenchExec):
"""
An extension of BenchExec for use with CPAchecker
to execute benchmarks using SLURM, optionally via Singularity.
An extension of BenchExec to execute benchmarks using SLURM,
optionally via Singularity.
"""

def create_argument_parser(self):
@@ -54,7 +54,7 @@ def create_argument_parser(self):
dest="scratchdir",
type=str,
default="./",
help="The path to the singularity .sif file to use. Will bind $PWD to $HOME when run.",
help="The directory where temporary directories can be created for use within singularity.",
)

return parser
20 changes: 10 additions & 10 deletions contrib/slurm/README.md
@@ -13,17 +13,17 @@ SPDX-License-Identifier: Apache-2.0

This Python script extends BenchExec, a benchmarking framework, to facilitate benchmarking via SLURM, optionally using a Singularity container.

In case of problems, contact / tag in an issue: [Levente Bajczi](https://github.com/leventeBajczi)
In case of problems, please tag in an [issue](https://github.com/sosy-lab/benchexec/issues/new/choose): [Levente Bajczi](https://github.com/leventeBajczi) (@leventeBajczi).

## Preliminaries

* *SLURM* is an open-source job scheduling and workload management system used primarily in high-performance computing (HPC) environments.
* *Singularity* is a containerization platform designed for scientific and high-performance computing (HPC) workloads, providing users with a reproducible and portable environment for running applications and workflows.
* [SLURM](https://slurm.schedmd.com/documentation.html) is an open-source job scheduling and workload management system used primarily in high-performance computing (HPC) environments.
* [Singularity](https://docs.sylabs.io/guides/latest/user-guide/) is a containerization platform designed for scientific and high-performance computing (HPC) workloads, providing users with a reproducible and portable environment for running applications and workflows.

## Requirements

* Singularity (optional), tested with `singularity-ce version 4.0.1`
* SLURM, tested with `slurm 22.05.7` on `Red Hat Enterprise Linux 8.6 (Ootpa)`, kernel version `4.18.0-372.9.1.el8.x86_64`
* SLURM, tested with `slurm 22.05.7`, should work within `22.x.x`
* Singularity (optional), tested with `singularity-ce version 4.0.1`, should work within `4.x.x`

## Usage
1. Run the script with Python 3:
@@ -38,7 +38,7 @@ In case of problems, contact / tag in an issue: [Levente Bajczi](https://github.

## Overview of the Workflow

The workflow is based on `localexecution.py`, and uses the same general layout. However, instead of delegating to `runexec`, it delegates to `srun`.
This works similarly to BenchExec, however, instead of delegating each run to `runexec`, it delegates to `srun` from SLURM.

1. If the `--singularity` option is given, the script wraps the command to run in a container. This is useful for dependency management (in most HPC environments, arbitrary package installations are frowned upon). For a simple container, use the following:

@@ -60,8 +60,6 @@ The workflow is based on `localexecution.py`, and uses the same general layout.
* `--no-home`: Do not bind the home directory
* `-B {tempdir}:/overlay`: Bind the temporary directory to `/overlay` (must be writeable)
* `--fusemount "container:fuse-overlayfs -o lowerdir=/lower -o upperdir=/overlay/upper -o workdir=/overlay/work $HOME"`: mount an overlay filesystem at $HOME, where modifications go in the temp dir but files can be read from the current dir

Modifications to the script are necessary if the user on the host and the container differ (due to using $HOME). Also, files not under the current directory will not be visible.

2. Currently, the following parameters are passed to `srun` (calculated from the benchmark's parameters):
* `-t <hh:mm:ss>` CPU timelimit (generally, SLURM will round up to nearest minute)
@@ -82,5 +80,7 @@ Currently, there are the following limitations compared to local benchexec:

1. No advanced resource constraining / monitoring: only CPU time, CPU core and memory limits are handled, and only CPU time, wall time, and memory usage are monitored.
2. No exotic paths in the command are handled: only the current working directory and its children are visible in the container
3. Without singularity, no constraint is placed on the resulting files of the runs: this will populate the current directory with all the output files of all the runs.
4. For timed-out runs, where SLURM terminated the run, no CPU time values are available.
3. The user on the host and the container should not differ (due to using $HOME in the commands).
4. Without singularity, no constraint is placed on the resulting files of the runs: this will populate the current directory with all the output files of all the runs.
5. For timed-out runs, where SLURM terminated the run, no CPU time values are available.
6. The executor only works with hyperthreading disabled, due to the inability to query nodes about the number of threads per core. Assuming it's always 2 is risky, as it may not hold true universally. Consequently, because we can only request whole cores from SLURM instead of threads, we must divide the requested number of threads by the threads-per-core value, which is unknown if hyperthreading could be enabled.
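A note on limitation 6: the arithmetic it describes can be sketched as below. This is only an illustration; the helper name `cores_to_request` and its `threads_per_core` parameter are hypothetical and do not appear in this PR. It shows why the threads-per-core value must be known before a thread count can be turned into a whole-core request.

```python
import math


def cores_to_request(requested_threads: int, threads_per_core: int) -> int:
    """Return the number of whole cores that covers the requested threads.

    Hypothetical helper for illustration only; the executor in this PR
    avoids the question by requiring hyperthreading to be disabled.
    """
    if requested_threads < 1 or threads_per_core < 1:
        raise ValueError("both arguments must be positive integers")
    return math.ceil(requested_threads / threads_per_core)


# The same request maps to different core counts depending on the node,
# so guessing threads_per_core would over- or under-allocate.
for tpc in (1, 2, 4):
    print(f"threads_per_core={tpc}: request {cores_to_request(8, tpc)} cores")
```

With hyperthreading disabled the divisor is 1, so the requested core count can be passed through unchanged.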
98 changes: 58 additions & 40 deletions contrib/slurm/slurmexecutor.py
@@ -18,7 +18,7 @@
import threading
import time

from benchexec import benchexec, BenchExecException, tooladapter
from benchexec import BenchExecException, tooladapter
from benchexec.util import ProcessExitCode

sys.dont_write_bytecode = True # prevent creation of .pyc files
@@ -40,9 +40,10 @@ def get_system_info():


def execute_benchmark(benchmark, output_handler):
num_of_cores = benchmark.rlimits.cpu_cores
mem_limit = benchmark.rlimits.memory
run_sets_executed = 0

assert (
not benchmark.config.use_hyperthreading
), "SLURM can only work properly without hyperthreading enabled. See README.md for details."

for runSet in benchmark.run_sets:
if STOPPED_BY_INTERRUPT:
@@ -57,13 +58,10 @@ def execute_benchmark(benchmark, output_handler):
)

else:
run_sets_executed += 1
_execute_run_set(
runSet,
benchmark,
output_handler,
num_of_cores,
mem_limit,
)

output_handler.output_after_benchmark(STOPPED_BY_INTERRUPT)
@@ -73,8 +71,6 @@ def _execute_run_set(
runSet,
benchmark,
output_handler,
num_of_cores,
mem_limit,
):
# get times before runSet
walltime_before = time.monotonic()
@@ -98,9 +94,7 @@ def run_finished():
for i in range(min(benchmark.num_of_threads, unfinished_runs)):
if STOPPED_BY_INTERRUPT:
break
WORKER_THREADS.append(
_Worker(benchmark, num_of_cores, mem_limit, output_handler, run_finished)
)
WORKER_THREADS.append(_Worker(benchmark, output_handler, run_finished))

# wait until workers are finished (all tasks done or STOPPED_BY_INTERRUPT)
for worker in WORKER_THREADS:
@@ -110,13 +104,11 @@ def run_finished():
# get times after runSet
walltime_after = time.monotonic()
usedWallTime = walltime_after - walltime_before
usedCpuTime = 1000 # TODO

if STOPPED_BY_INTERRUPT:
output_handler.set_error("interrupted", runSet)
output_handler.output_after_run_set(
runSet,
cputime=usedCpuTime,
walltime=usedWallTime,
)

@@ -133,14 +125,10 @@ class _Worker(threading.Thread):

working_queue = queue.Queue()

def __init__(
self, benchmark, my_cpus, my_memory_nodes, output_handler, run_finished_callback
):
def __init__(self, benchmark, output_handler, run_finished_callback):
threading.Thread.__init__(self) # constuctor of superclass
self.run_finished_callback = run_finished_callback
self.benchmark = benchmark
self.my_cpus = my_cpus
self.my_memory_nodes = my_memory_nodes
self.output_handler = output_handler
self.setDaemon(True)

@@ -182,15 +170,10 @@ def execute(self, run):
for i in range(6):
f.write(os.linesep)

timelimit = self.benchmark.rlimits.cputime

run_result = run_slurm(
benchmark,
args,
run.log_file,
timelimit,
self.my_cpus,
benchmark.rlimits.memory,
)

except KeyboardInterrupt:
@@ -212,7 +195,14 @@ def execute(self, run):
return None


def run_slurm(benchmark, args, log_file, timelimit, cpus, memory):
jobid_pattern = re.compile(r"job (\d*) queued")


def run_slurm(benchmark, args, log_file):
timelimit = benchmark.rlimits.cputime
cpus = benchmark.rlimits.cpu_cores
memory = benchmark.rlimits.memory

srun_timelimit_h = int(timelimit / 3600)
srun_timelimit_m = int((timelimit % 3600) / 60)
srun_timelimit_s = int(timelimit % 60)
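A note on the time-limit split above: the same hours/minutes/seconds decomposition can be written with `divmod`. The sketch below is an assumption about how the three values are combined; the helper name `format_srun_timelimit` is hypothetical, and the zero-padded `hh:mm:ss` form follows the `-t <hh:mm:ss>` description in the README rather than code visible in this hunk.

```python
def format_srun_timelimit(seconds: int) -> str:
    """Format a CPU time limit in seconds as hh:mm:ss (hypothetical helper)."""
    if seconds < 0:
        raise ValueError("time limit must not be negative")
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_srun_timelimit(900))   # 00:15:00
print(format_srun_timelimit(3725))  # 01:02:05
```

Keeping the conversion in one small function would also make it easy to unit-test independently of `srun`.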
@@ -230,9 +220,9 @@ def run_slurm(benchmark, args, log_file, timelimit, cpus, memory):
tool_command = " ".join(args)
singularity_command = (
f"singularity exec "
f"-B $PWD:/lower --no-home "
f"-B {tempdir}:/overlay "
f'--fusemount "container:fuse-overlayfs -o lowerdir=/lower -o upperdir=/overlay/upper -o workdir=/overlay/work $HOME" '
f'-B "$PWD":/lower --no-home '
f'-B "{tempdir}":/overlay '
f'--fusemount "container:fuse-overlayfs -o lowerdir=/lower -o upperdir=/overlay/upper -o workdir=/overlay/work ""$HOME""" '
f"{benchmark.config.singularity} {tool_command}"
if benchmark.config.singularity
else tool_command
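A note on the quoting in this hunk: one alternative to hand-quoting inside an f-string is to build the Singularity invocation as an argument list and let `shlex.join` render it when a single string is needed. The sketch below is not the PR's implementation. The function name `build_singularity_argv` and its parameters are hypothetical, it resolves the working and home directories in Python instead of relying on `$PWD` and `$HOME` expansion, and it uses only the Singularity options already listed in the README (`-B`, `--no-home`, `--fusemount`). Paths containing `:` and home directories containing spaces are not handled.

```python
import os
import shlex


def build_singularity_argv(image, tempdir, tool_args, lower_dir=None, home=None):
    """Build `singularity exec ...` as an argv list (illustrative sketch)."""
    lower_dir = lower_dir or os.getcwd()
    home = home or os.path.expanduser("~")
    # The fuse-overlayfs command is passed to --fusemount as ONE argument.
    fusemount = (
        "container:fuse-overlayfs "
        "-o lowerdir=/lower -o upperdir=/overlay/upper -o workdir=/overlay/work "
        f"{home}"
    )
    return [
        "singularity", "exec",
        "-B", f"{lower_dir}:/lower",
        "--no-home",
        "-B", f"{tempdir}:/overlay",
        "--fusemount", fusemount,
        image,
        *tool_args,
    ]


argv = build_singularity_argv(
    "tool.sif", "/tmp/scratch", ["echo", "a b"], lower_dir="/work dir", home="/home/user"
)
print(shlex.join(argv))  # each element is quoted only where the shell requires it
```

An argv list of this kind could be appended directly to an `srun` argument list, which would avoid the intermediate `bash -c` string altogether.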
@@ -243,17 +233,36 @@ def run_slurm(benchmark, args, log_file, timelimit, cpus, memory):
f"-c {cpus} "
f"-o {log_file} "
f"--mem-per-cpu {mem_per_cpu} "
f"--threads-per-core=1 "
f"--threads-per-core=1 " # --use_hyperthreading=False is always given here
f"--ntasks=1 "
f"{singularity_command}"
)
jobid_command = (
f"{srun_command} 2>&1 | grep -o 'job [0-9]* queued' | grep -o '[0-9]*'"
srun_result = subprocess.run(
["bash", "-c", srun_command],
shell=False,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
)
logging.debug(
"srun: returncode: %d, output: %s",
srun_result.returncode,
srun_result.stdout,
)
seff_command = f"seff $({jobid_command})"
jobid_match = jobid_pattern.search(str(srun_result.stdout))
if jobid_match:
jobid = int(jobid_match.group(1))
else:
logging.debug("Jobid not found in stderr, aborting")
stop()
return -1

seff_command = f"seff {jobid}"
logging.debug("Command to run: %s", seff_command)
result = subprocess.run(
["bash", "-c", seff_command], shell=False, stdout=subprocess.PIPE
["bash", "-c", seff_command],
shell=False,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
)

# Runexec would populate the first 6 lines with metadata
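A note on the job-ID extraction in this hunk: it can be exercised in isolation as sketched below. Two details differ from the diff and are assumptions of this sketch. The captured bytes are decoded instead of wrapped in `str()` (which would yield the `b'...'` representation), and the pattern uses `\d+` rather than `\d*` so that an empty match can never reach `int()`. The helper name `extract_jobid` and the sample output are illustrative only.

```python
import re

# Same phrase as jobid_pattern in the diff, but requiring at least one digit.
JOBID_PATTERN = re.compile(r"job (\d+) queued")


def extract_jobid(srun_output: bytes):
    """Return the job ID found in captured srun output, or None (sketch)."""
    text = srun_output.decode("utf-8", errors="replace")
    match = JOBID_PATTERN.search(text)
    return int(match.group(1)) if match else None


# Illustrative sample, not a captured log.
sample = b"srun: job 12345 queued and waiting for resources\n"
print(extract_jobid(sample))                 # 12345
print(extract_jobid(b"srun: error: ...\n"))  # None
```

Returning `None` leaves the decision about aborting to the caller, as the `else` branch in the diff does.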
@@ -266,24 +275,29 @@ def run_slurm(benchmark, args, log_file, timelimit, cpus, memory):
exit_code, cpu_time, wall_time, memory_usage = parse_seff(str(result.stdout))

return {
"starttime": benchexec.util.read_local_time(),
"walltime": wall_time,
"cputime": cpu_time,
"memory": memory_usage,
"exitcode": ProcessExitCode(raw=exit_code, value=exit_code, signal=None),
"exitcode": ProcessExitCode.create(value=exit_code),
}


exit_code_pattern = re.compile(r"exit code (\d+)")
cpu_time_pattern = re.compile(r"CPU Utilized: (\d+):(\d+):(\d+)")
wall_time_pattern = re.compile(r"Job Wall-clock time: (\d+):(\d+):(\d+)")
memory_pattern = re.compile(r"Memory Utilized: (\d+\.\d+) MB")


def parse_seff(result):
exit_code_pattern = re.compile(r"State: COMPLETED \(exit code (\d+)\)")
cpu_time_pattern = re.compile(r"CPU Utilized: (\d+):(\d+):(\d+)")
wall_time_pattern = re.compile(r"Job Wall-clock time: (\d+):(\d+):(\d+)")
memory_pattern = re.compile(r"Memory Utilized: (\d+\.\d+) MB")
logging.debug(f"Got output from seff: {result}")
exit_code_match = exit_code_pattern.search(result)
cpu_time_match = cpu_time_pattern.search(result)
wall_time_match = wall_time_pattern.search(result)
memory_match = memory_pattern.search(result)
exit_code = int(exit_code_match.group(1)) if exit_code_match else None
if exit_code_match:
exit_code = int(exit_code_match.group(1))
else:
raise Exception(f"Exit code not matched in output: {result}")
cpu_time = None
if cpu_time_match:
hours, minutes, seconds = map(int, cpu_time_match.groups())
@@ -294,4 +308,8 @@ def parse_seff(result):
wall_time = hours * 3600 + minutes * 60 + seconds
memory_usage = float(memory_match.group(1)) * 1000000 if memory_match else None

logging.debug(
f"Exit code: {exit_code}, memory usage: {memory_usage}, walltime: {wall_time}, cpu time: {cpu_time}"
)

return exit_code, cpu_time, wall_time, memory_usage
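A note on `parse_seff`: the parsing can be tried without a cluster using the four patterns introduced in this diff. The sketch below is a standalone approximation, not the PR's function. The names `parse_seff_text` and `_hms_to_seconds` are hypothetical, it returns a dict instead of a tuple, and `SAMPLE` is a hand-written text shaped to match the patterns rather than a captured `seff` log.

```python
import re

# Patterns copied from the diff.
EXIT_CODE = re.compile(r"exit code (\d+)")
CPU_TIME = re.compile(r"CPU Utilized: (\d+):(\d+):(\d+)")
WALL_TIME = re.compile(r"Job Wall-clock time: (\d+):(\d+):(\d+)")
MEMORY = re.compile(r"Memory Utilized: (\d+\.\d+) MB")


def _hms_to_seconds(match):
    """Convert an hh:mm:ss regex match to seconds, or None if no match."""
    if match is None:
        return None
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


def parse_seff_text(text: str) -> dict:
    """Extract exit code, CPU time, wall time and memory from seff-like text."""
    exit_match = EXIT_CODE.search(text)
    if exit_match is None:
        raise ValueError(f"Exit code not matched in output: {text!r}")
    memory_match = MEMORY.search(text)
    return {
        "exitcode": int(exit_match.group(1)),
        "cputime": _hms_to_seconds(CPU_TIME.search(text)),
        "walltime": _hms_to_seconds(WALL_TIME.search(text)),
        # MB -> bytes with a factor of 10**6, as in the diff.
        "memory": float(memory_match.group(1)) * 1_000_000 if memory_match else None,
    }


SAMPLE = """State: COMPLETED (exit code 0)
CPU Utilized: 00:00:10
Job Wall-clock time: 00:00:12
Memory Utilized: 1.23 MB
"""
print(parse_seff_text(SAMPLE))
```

One consequence of the memory pattern as written: a value reported in another unit, or without a decimal point, does not match, so `memory` comes back as `None` in those cases.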