
Add configuration file for Cedar (#168) #173

Open · bouthilx wants to merge 36 commits into SMART-Lab:master from bouthilx:cedar_config

Conversation

@bouthilx (Collaborator) commented Oct 16, 2017

Wait for #166 before merging
Fix issue #168

aalitaiga and others added 30 commits July 31, 2017 15:36
Why:

For each option, add_sbatch_option would add the option in both the form --[OPTION_NAME] and the form [OPTION_NAME].
Many conversions will be needed, not only for resources, so it is better to make this clean.
Slurm has no queues, so the PBS option -q is invalid and non-convertible.
$PBS_JOBID was used to set the job's stdout/stderr paths as well as inside the commands; these uses are replaced with $SLURM_JOB_ID.

Also, workers were accessing os.environ['PBS_JOBID'], so we added a fallback fetch of SLURM_JOB_ID for the case where PBS_JOBID is undefined.
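A minimal sketch of that fallback fetch (variable names are illustrative, not the actual worker code):

```python
import os

# Prefer PBS_JOBID (Torque/PBS); fall back to SLURM_JOB_ID on Slurm clusters.
job_id = os.environ.get("PBS_JOBID") or os.environ.get("SLURM_JOB_ID")
if job_id is None:
    raise RuntimeError("Neither PBS_JOBID nor SLURM_JOB_ID is defined.")
```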
Slurm cannot be passed environment variables defined locally on the command line, the way PBS_FILENAME is. To work around this, we add a definition in the prolog, making PBS_FILENAME available to all commands and to the epilog.

NOTE: We leave the PBS_FILENAME definition on the command line too, so that any user using $PBS_FILENAME inside a custom pbsFlag can still do so.
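For illustration, injecting the definition could look like this (pbs_filename and prolog are hypothetical names, not taken from the actual diff):

```python
# Hypothetical names: pbs_filename holds the path normally passed on the
# command line; prolog is the list of shell lines run before the commands.
pbs_filename = "/path/to/generated/job.pbs"
prolog = []

# sbatch does not forward variables defined locally on the command line,
# so export PBS_FILENAME in the prolog to make it visible to all commands
# and to the epilog.
prolog.insert(0, 'export PBS_FILENAME="{0}"'.format(pbs_filename))
```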
The PBS option -V is not converted properly to SBATCH's --export=ALL. We remove it and replace it with --export=ALL in the sbatch options.
Slurm does not set an environment variable equivalent to PBS_WALLTIME. To avoid confusion, all PBS_WALLTIME variables are renamed to SBATCH_TIMELIMIT (the environment variable one would use to set --time with sbatch). As SBATCH_TIMELIMIT is not set automatically, we add it to the prolog to make it available to all commands and to the epilog.

NOTE: PBS_WALLTIME is set in seconds, but we only have HH:MM:SS-like strings at the time of building the PBS file. We therefore added a walltime_to_seconds helper function to convert HH:MM:SS-like strings into seconds, so that SBATCH_TIMELIMIT is set in seconds like PBS_WALLTIME.
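A sketch of such a helper, assuming only the formats described in this thread (the actual walltime_to_seconds in the PR may differ):

```python
def walltime_to_seconds(walltime):
    # Convert an "HH:MM:SS" or "DD:HH:MM:SS" string into a total number of seconds.
    parts = [int(part) for part in walltime.split(":")]
    if len(parts) == 3:
        parts = [0] + parts  # no days component
    if len(parts) != 4:
        raise ValueError("Unsupported walltime format: {0}".format(walltime))
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

assert walltime_to_seconds("02:30:00") == 9000
```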
It is possible to query the system for available commands using distutils.spawn.find_executable(command_name).

On clusters where more than one launcher is available, the launcher is still selected based on string matching. For instance, get_launcher("helios") will always return msub, no matter what is available on the system.
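For instance, launcher detection based on find_executable could look like this (the candidate order is an assumption for illustration):

```python
from distutils.spawn import find_executable

def detect_launcher(candidates=("sbatch", "msub", "qsub")):
    # Return the first launcher actually installed on the system, or None.
    for command_name in candidates:
        if find_executable(command_name) is not None:
            return command_name
    return None
```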
It is difficult to debug resuming while important processes are taking place inside the PBS script automatically built by SmartDispatch.

We add a verbose option to the smart-dispatch script and add debugging prints to the epilog.
JobGenerators are selected by job_generator_factory based on the cluster's name. We use a more flexible, duck-typing approach for Slurm clusters: if the cluster name is unknown, i.e. it matches none of the if-clauses in the factory, we look at which launchers are available on the system. If sbatch is found, a SlurmJobGenerator is built; otherwise, a plain JobGenerator.
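A hedged sketch of that fallback (class names come from the commit message above; the factory's real signature is not shown here, so arguments are elided and the classes are stubbed):

```python
from distutils.spawn import find_executable

class JobGenerator(object):
    pass

class SlurmJobGenerator(JobGenerator):
    pass

def job_generator_factory(cluster_name, *args, **kwargs):
    # ... cluster-name specific if-clauses (guillimin, helios, ...) go here ...
    # Duck-typing fallback for unknown cluster names: pick the generator
    # according to which launcher is actually installed on the system.
    if find_executable("sbatch") is not None:
        return SlurmJobGenerator(*args, **kwargs)
    return JobGenerator(*args, **kwargs)
```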
The command `sacctmgr` fails on some computers (namely mila01), but the current behavior gives the impression that sbatch is simply not available.

Printing the stderr makes it more obvious that sbatch should be available but that something is broken behind sacctmgr. This message only appears when using the -vv option, however.
Adding a script that runs automatic verifications to assert the validity of the current code.

The verifications are not fully automatic unit tests: they automatically check that the process executed successfully, but the administrator still needs to verify manually, by reading the logs, that the requested resources were provided.

Verifications can easily be combined, complex ones building on top of simpler ones.

Here is a list of all the verifications currently implemented for Slurm clusters:

 1. very_simple_task                              (1  CPU)
 2. verify_simple_task_with_one_gpu               (1  CPU 1 GPU)
 3. verify_simple_task_with_many_gpus             (1  CPU X GPU)
 4. verify_many_task                              (X  CPU)
 5. verify_many_task_with_many_cores              (XY CPU)
 6. verify_many_task_with_one_gpu                 (X CPU X GPU)
 7. verify_many_task_with_many_gpus               (X CPU Y GPU)
 8. verify_simple_task_with_autoresume_unneeded   (1 CPU)
 9. verify_simple_task_with_autoresume_needed     (1 CPU)
10. verify_many_task_with_autoresume_needed       (X CPU)
bouthilx and others added 4 commits October 16, 2017 10:27
My initial thought was that get_launcher should raise an error when no launcher is found on the system, since then no job can be launched. I realized that this would break the --doNotLaunch option, which users may want to use on a system with no launcher just to create the files.
The tests were failing because the account was not specified.
The tests were failing because the account was not specified
@coveralls

Coverage Status

Coverage increased (+1.2%) to 95.725% when pulling 2a380f6 on bouthilx:cedar_config into 9133e15 on SMART-Lab:master.

@nouiz (Collaborator) commented Oct 17, 2017

Does this completely fix #168?

@bouthilx (Collaborator, Author) commented Oct 17, 2017

@nouiz It's a branch forked from adrien_slurm, the one PR #166 is based upon. Yes, it should. The question is rather: are those configurations the ones we should define? We don't have strict queues like Torque clusters do, so it's up to us to decide how we define/divide queues.

There was a missing parenthesis which was causing a bad conversion of "DD:HH:MM:SS" strings to seconds.

The unit test was also missing the same parenthesis, so it did not catch the bug. I added a unit test to make sure such an error cannot occur again.
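To illustrate the class of bug (the exact expression in the diff may differ), a misplaced parenthesis in a "DD:HH:MM:SS" conversion silently under-weights the days term:

```python
days, hours, minutes, seconds = 1, 2, 3, 4

# Buggy grouping: days are only multiplied by 24, never carried into seconds.
buggy = (days * 24 + hours * 60 + minutes) * 60 + seconds    # 8824, wrong

# Correct grouping: days -> hours -> minutes -> seconds, step by step.
fixed = ((days * 24 + hours) * 60 + minutes) * 60 + seconds  # 93784, right

assert fixed == 1 * 86400 + 2 * 3600 + 3 * 60 + 4
```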
@coveralls

Coverage Status

Coverage increased (+1.2%) to 95.738% when pulling 02823d8 on bouthilx:cedar_config into 9133e15 on SMART-Lab:master.

@bouthilx bouthilx changed the title Add configuration file for Cedar Add configuration file for Cedar (#168) Oct 18, 2017
@hantek commented Apr 12, 2018

Will this PR be merged in the future?

@mgermain (Member) commented Apr 12, 2018 via email

@hantek commented Apr 12, 2018 via email

@bouthilx (Collaborator, Author)
@hantek SmartDispatch development is on hold for the moment, as Mathieu pointed out. Given that new clusters and future ones all use Slurm, and that lab users seem comfortable with Slurm itself, we are not sure we want to maintain a dispatcher that wraps it. Furthermore, this PR is mostly a hack on top of SmartDispatch's PBS support. We aren't sure it's worth investing time in finalizing the PR and maintaining this hack.

If you want, you can try my branch https://github.com/bouthilx/smartdispatch/tree/slurm; it combines the cedar and graham configs and should be Python 2.7/3.5 compatible.

@hantek commented Apr 13, 2018

Thanks for the clarification. So is there a tutorial on how to use Slurm on clusters like Cedar?

@bouthilx (Collaborator, Author)
Not that I'm aware of. That would be a good thing to have. We definitely need to improve the documentation for cluster usage and also add more tutorials.
