Adding SLURM compatibility to BenchExec #995

Merged: 62 commits from the slurm branch into sosy-lab:main on Feb 20, 2024

Conversation

@leventeBajczi (Contributor)

I've added experimental support for the SLURM workload manager for HPC as a new contrib/ entry. I don't know whether this contribution is interesting to other users of BenchExec, but Hungarian researchers can request access to Komondor (https://hpc.kifu.hu/hu/komondor) for research projects, and it uses SLURM to schedule jobs across its nodes.

As an overview of what I've implemented (I based these modifications on the AWS integration):

  1. A new benchmarking entry point contrib/slurm-benchmark.py, which enables SLURM compatibility via the flag --slurm (a rough sketch of the pattern follows after this list)
  2. A new executor contrib/slurm/slurmexecutor.py, which takes a benchmark and runs it using srun, possibly in a Singularity container (for dependency management)
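
To illustrate the entry-point pattern: a minimal sketch based on the structure of the AWS integration, with illustrative names rather than the actual contrib code:

```python
# Hypothetical sketch; contrib/slurm-benchmark.py may differ in details.
import benchexec.benchexec


class SlurmBenchmark(benchexec.benchexec.BenchExec):
    """BenchExec wrapper that can delegate run execution to SLURM."""

    def create_argument_parser(self):
        parser = super().create_argument_parser()
        # --slurm switches from local execution to the SLURM executor.
        parser.add_argument(
            "--slurm",
            action="store_true",
            help="execute runs via the SLURM workload manager",
        )
        return parser

    def load_executor(self):
        if self.config.slurm:
            # Assumes contrib/slurm/ is on sys.path, as for the other
            # contrib executors.
            from slurm import slurmexecutor as executor
        else:
            executor = super().load_executor()
        return executor


if __name__ == "__main__":
    benchexec.benchexec.main(SlurmBenchmark())
```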

Currently, most of it works (the exception is keeping result files other than the log, but that should not be hard to implement from here), but I feel it is very easy for the integration to break. I'm wondering whether it is worth putting in some extra effort to make this more production-ready. I'm eagerly awaiting feedback; I'd be happy to work on it further if there is interest.

@PhilippWendler (Member)

I am pretty sure that there are also other people who would like to use it, so this is welcome!

I know about Slurm, but I do not have an installation here, so I can't test or maintain this. But we have the contrib/ directory precisely for sharing such additional components outside of the supported core of BenchExec, and I would like to merge this here. Ideally, you would be willing to somewhat maintain it in the future?

The general architecture of this PR matches what we do in similar cases and is certainly what I would have recommended.

So besides what you think should be added, I would focus mostly on some documentation.

Can you explain a little bit more about the effect of using Slurm for users who are familiar with BenchExec? I assume, for example, that things like core allocations and measurements are handled purely by Slurm, so users wouldn't get what they know from BenchExec, but instead what Slurm provides?

Are there any particular requirements beyond a standard Slurm installation or do users need to do something specific?

Ideally these questions together with an overview of supported/unsupported features of BenchExec would be discussed in a readme.

@leventeBajczi (Contributor, Author)

Sure, I'd be happy to maintain it! I'll prioritize writing documentation first.

Everything resource-related is handled by SLURM: both the constraints and the measurements. SLURM seems to use cgroups to account for resource usage, so hopefully it's reliable, but I do have some problems right now with the resolution of these parameters - for example, time limits are only handled to the nearest minute on the Komondor instance. Of course, BenchExec will still flag such runs as timeouts (CPU time > time limit), but this represents quite a big overhead of additional computation.
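
To make the resolution problem concrete (illustrative numbers, not the actual executor code): srun/sbatch time limits have one-minute resolution with second values rounded up, so a fine-grained BenchExec limit has to be over-approximated, and the difference is potentially wasted computation:

```python
import math

def slurm_time_limit_minutes(limit_s):
    # SLURM time limits have one-minute resolution; second values are
    # rounded up, so this is the smallest enforceable over-approximation.
    return math.ceil(limit_s / 60)

limit_s = 15                                 # BenchExec time limit
minutes = slurm_time_limit_minutes(limit_s)  # -> 1, i.e. 60 s of walltime
# The run may keep computing for up to 60 - 15 = 45 s beyond the intended
# limit before SLURM kills it; BenchExec then still reports a timeout,
# because the measured CPU time exceeds the 15 s limit.
```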

Besides testing them with our own tool (Theta), do you have any suggestions on what/how to test?

@PhilippWendler (Member)

I don't have any particular testing suggestions. If you want to test more tools, you could relatively easily download the SV-COMP participants and try them? All data required for this is online, but I wouldn't require this as a condition for merging this PR.

@leventeBajczi (Contributor, Author)

leventeBajczi commented Feb 20, 2024

I think I addressed all of your comments, thanks for the thorough review! I ran some tests just now to see if I've broken anything, and everything seems to work.
I've left two of the threads unresolved where I still need some input from you, but otherwise everything seems to be in order. (GitHub had failed to load your answers earlier; I can see them now.)

@leventeBajczi
Copy link
Contributor Author

I've added some tweaks, ran some tests, and everything seems to be in order. If you agree, please go ahead with the merge. Thanks!

(Five review threads on contrib/slurm/slurmexecutor.py, outdated and resolved.)
@leventeBajczi (Contributor, Author)

I've reworked the commands to not use shells at all; please re-check. Hopefully this will be more robust.
Also, I checked: Singularity, like Docker and other containerization tools, treats all parameters after the container name as parameters to pass to the container, so Singularity runs will work well. I'm less sure about non-Singularity runs, but in my testing they seem to work just fine, and the command correctly receives even the parameters that overlap with those of srun.
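
Roughly, the shell-free invocation now looks like this (hypothetical flags and file names, not the exact executor code):

```python
import subprocess

# Hypothetical values; the actual executor derives these from the benchmark.
srun_cmd = ["srun", "--time", "1", "--cpus-per-task", "2"]
container = ["singularity", "exec", "tool-deps.sif"]  # optional wrapper
tool_cmd = ["./mytool", "--witness", "some file.graphml"]

# One flat argument list, executed without a shell: no quoting or injection
# issues, and Singularity forwards everything after the image name verbatim
# as the command to run inside the container.
result = subprocess.run(
    srun_cmd + container + tool_cmd, capture_output=True, text=True
)
print(result.stdout)
```

The same list concatenation works for runs without a container by leaving the middle list empty.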

@PhilippWendler (Member) left a comment

Thanks! The argument handling should be safe now, which is a great improvement.

Last round of minor stuff, and then only the question about the desired Python compatibility is left.

(Four further review threads on contrib/slurm/slurmexecutor.py, resolved.)
@PhilippWendler (Member)

Thanks again for the submission, and in particular for your responsiveness and fast integration of suggestions!

@PhilippWendler merged commit 71c766d into sosy-lab:main on Feb 20, 2024; 3 of 4 checks passed.
@leventeBajczi deleted the slurm branch on February 20, 2024 at 12:45.