Adding SLURM-compatibility to Benchexec #995
Conversation
I am pretty sure that there are also other people who would like to use it, so this is welcome! I know about Slurm, but I do not have an installation here, so I can't test or maintain this. The general architecture of this PR matches what we do in similar cases and is certainly what I would have recommended. So besides what you think should be added, I would focus mostly on documentation. Can you explain a little more about the effect of using Slurm for users who are familiar with BenchExec? I assume, for example, that things like core allocations and measurements are handled purely by Slurm, so users wouldn't get what they know from BenchExec, but instead what Slurm provides? Are there any particular requirements beyond a standard Slurm installation, or do users need to do something specific? Ideally these questions, together with an overview of supported/unsupported features of BenchExec, would be discussed in a readme.
Sure, I'd be happy to maintain it! I'll prioritize writing the documentation first. Everything resource-related is handled by SLURM: the constraints as well as the measurements. SLURM seems to use cgroups to account for resource usage, so hopefully it's reliable, but I do have some problems right now with the resolution of these parameters; for example, time limits are rounded to the nearest minute on the Komondor instance. Of course, BenchExec will flag such runs as timeouts (CPU time > time limit), but this represents quite a big overhead of additional computation. Besides testing with our own tool (Theta), do you have any suggestions on what/how to test?
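(For illustration, a minimal sketch of how the minute granularity could be worked around; the helper below is a hypothetical suggestion, not part of this PR. The idea is to round the BenchExec limit up when passing it to Slurm so Slurm never kills a run early, and let BenchExec detect the precise timeout from the measured CPU time afterwards.)

```python
import math

def slurm_time_arg(timelimit_s: int) -> str:
    """Round a time limit in seconds up to whole minutes for srun --time,
    so Slurm's minute granularity never cuts a run short.
    Hypothetical helper for illustration only; not part of this PR."""
    return str(math.ceil(timelimit_s / 60))  # srun --time accepts plain minutes

# e.g. a 90 s BenchExec limit becomes a 2-minute Slurm limit:
assert slurm_time_arg(90) == "2"
```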
I don't have any particular testing suggestions. If you want to test more tools, you could relatively easily download the SV-COMP participants and try them; all the data required for this is online. But I wouldn't require this as a condition for merging this PR.
I think I addressed all of your comments, thanks for the thorough review! I ran some tests just now to see if I've broken anything, and everything seems to work.
I've added some tweaks, ran some tests, and everything seems to be in order. If you agree, please go ahead with the merge. Thanks!
I've reworked the commands to not use shells at all; please re-check. Hopefully this will be more robust.
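(As a minimal sketch of what avoiding the shell means here, assuming Python's standard subprocess module; the srun flags are illustrative, not the exact ones from the PR:)

```python
import subprocess

# Passing the command as an argument list means no shell ever parses it,
# so arguments containing spaces, quotes, or metacharacters reach the
# tool verbatim. "echo" stands in for the benchmarked tool here.
cmd = ["srun", "--ntasks=1", "echo", "arg with spaces; $HOME stays literal"]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```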
Thanks! The argument handling should be safe now, which is a great improvement.
Last round of minor stuff, and then only the question about the desired Python compatibility is left.
Thanks again for the submission, and in particular for your responsiveness and fast integration of suggestions!
I've added experimental support for the SLURM workload manager for HPC as a new `contrib/` entry. I have no idea whether this contribution is interesting for other users of BenchExec, but Hungarian researchers can request access to Komondor (https://hpc.kifu.hu/hu/komondor) for research projects, and it uses SLURM to schedule jobs across its nodes.

As an overview of what I've implemented (I based these modifications on the AWS integration):

- `contrib/slurm-benchmark.py`, which enables SLURM compatibility via the `--slurm` flag
- `contrib/slurm/slurmexecutor.py`, which takes a benchmark and runs it using `srun`, possibly in a `singularity` container (for dependency management); see the sketch below

Currently, most of this works (apart from keeping result files other than the log, but that is not hard to implement from here), but I feel it is very easy for the integration to break. I'm wondering whether it is worth putting in some extra effort to make this more production-ready. I'm eagerly waiting for feedback on this; I'd be happy to work on it further if there is interest.
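(For readers unfamiliar with these tools, a hedged sketch of what such a dispatched run could look like; the flags, the image name `tool-deps.sif`, and the command are illustrative assumptions, not the PR's actual invocation:)

```python
import subprocess

# Illustrative only: srun places the run on a compute node under the
# resource limits below, and singularity supplies the tool's dependencies.
# The flags, the image name "tool-deps.sif", and the command are assumptions.
cmd = [
    "srun",
    "--ntasks=1",
    "--cpus-per-task=2",  # cores allocated and enforced by Slurm
    "--mem=4G",           # memory limit enforced by Slurm's cgroups
    "--time=15",          # wall-time limit in minutes
    "singularity", "exec", "tool-deps.sif",
    "echo", "hello from a compute node",
]
subprocess.run(cmd, check=True)
```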