Do you still want to write sbatch
files yourself?
spython
is meant to improve the user experience of running Python on Jean Zay's Slurm system.
It is at an alpha stage of development (i.e. developed for a single user and now open to be tested and improved by many).
Jean Zay is one of the coolest things that has happened to France in terms of computing (optimized, well managed, abundant, open). Its support team is the best example of a French institution: extremely polite, eager to help you with everything, and quick to answer (unfortunately not how French bureaucracy looks on average).
However, there are two main problems with using Jean Zay: it feels like coding in C instead of Python, and it is counter-productive. Most of this comes from complex rules that need to be read and understood (similar to law, but more pragmatic), and from the fact that most operations boil down to copy-pasting sbatch scripts that are hard to read, digesting usage regulations, and learning how to allocate modules, priorities and compute nodes.
Here is where this script comes in. spython
is a simple script that does all of that for you and lifts interaction with sbatch
to a higher level of commands, implementing the Jean-Zay documentation. For example, you can directly specify the number of GPUs you want to use, the memory of each GPU in GB, or the total CPU RAM, with arguments such as --ngpu 2
--gb 16
--ram l
(if arguments are missing, it configures itself to sensible values), select the time limit in hours with -t 19
, or quickly place the job on the --debug
priority. It even has a function called --live
where the user experience of running a program imitates that of directly running Python!* Output logs and scripts are automatically saved under a log directory in $STORE
and tagged according to the --tag
argument.
*without ctrl-c for practical purposes.
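For instance, combining the arguments above, an invocation could look like the following (the tag is illustrative, and the script / allocation-arguments syntax with the : separator is detailed below):
spython train.py <ARGS> : --ngpu 2 --gb 32 -t 19 --tag MYEXP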
We hope that the Jean-Zay team & community converge on high-level scripts or libraries that enable quick and waste-free integration. With such scripts, high-level practices can be implemented directly at the program level (instead of only in documentation) so that Jean-Zay is used much more optimally.
As this package grows out of the experience of a single user, it may be limited to their explored use-cases. However, you can easily incorporate your own use-case into this package by making pull requests. If this package becomes useful, documentation and a proper Python package installation will be the next steps.
To use spython
you first need to install it on your system, which takes no more than 10 seconds. It carries no Python dependencies and works with the native Jean-Zay Python.
Currently the project requires manual installation, which can be done in 2 simple steps:
wget -O ~/jean-zay.py https://raw.githubusercontent.com/ysig/spython/main/jean-zay/jean-zay.py # installs the main-script
wget -O - https://raw.githubusercontent.com/ysig/spython/main/jean-zay/.bashrc >> ~/.bashrc # installs macros
After reopening a terminal (or re-sourcing your ~/.bashrc) you will have access to the CLI commands:
spython
saccelerate
sl
- (hopefully more to be contributed in the future).
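To verify the installation you can ask your shell how each command resolves (assuming the appended ~/.bashrc lines define them as functions or aliases):
type spython saccelerate sl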
For all our examples we will suppose that what we really want is to run a train.py
script.
Simply append a set of Jean Zay allocation-specific arguments to your python command, after a : separator:
spython train.py <ARGS> : --ngpu 2 --ncpu 10 --module-load pytorch-gpu/py3/1.11.0 --tag EXP --gb 32 --env diffusers --post-script jean-zay/server.sh --email <your-email> --name test
From this command we first notice that we allocate 2 GPUs and 10 CPUs per task.
We then module load
a certain package (but multiple can be chained) and conda activate diffusers
after the module is loaded.
We also copy-paste a script, jean-zay/server.sh
, to run after the environment is loaded, and send emails about the status of the experiment to a specific address, using test
as the job name.
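For reference, a post-script is just a plain shell snippet whose contents are copied into the generated sbatch file after the environment setup; judging from the generated script shown further below, the jean-zay/server.sh used here contains roughly:
module load singularity
singularity exec --bind $SCRATCH:/data $SINGULARITY_ALLOWED_DIR/tileserver.sif node /usr/src/app/ /data/planet.mbtiles &> /dev/null &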
Running the allocation arguments directly, without a python script:
spython --ngpu 2 --ncpu 10 --module-load pytorch-gpu/py3/1.11.0 --tag EXP --gb 32 --post-script jean-zay/server.sh
will simply print an srun command that you can run to obtain an equivalent interactive environment:
srun --pty --job-name=test --constraint=v100-32g --nodes=1 --gres=gpu:2 --cpus-per-task=10 --ntasks=2 --ntasks-per-node=2 --time=72:00:00 --qos=qos_gpu-t4 -A hkt@v100 bash
but omitting any in-environment commands such as running server.sh
or module load.
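Inside that interactive shell you would then reproduce the omitted steps by hand, for example:
module load pytorch-gpu/py3/1.11.0   # the module(s) passed via --module-load
conda activate diffusers             # if you also passed --env diffusers
bash jean-zay/server.sh              # the --post-script
python train.py <ARGS>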
By default, all our logs are located in the $SUB
folder.
To quickly access them we provide an sl
macro that takes 3 arguments:
sl <tag> <k=-1> [--script]
where:
- running sl alone prints all available tags
- adding k (which defaults to -1) selects a log by its index in a chronological list
- adding --script prints the generated sbatch script instead of the log.
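For instance, with the EXP tag used above, a typical session could look like this (the calls follow the signature above; output omitted):
sl                   # print all available tags
sl EXP               # print the most recent log for tag EXP (k defaults to -1)
sl EXP -2            # print the second most recent log for EXP
sl EXP -1 --script   # print the generated sbatch script instead of the log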
For a broader view of the functionality, the detailed --help
output of the script follows:
usage: jean-zay.py [-h] [--gb {16,32,40,80}] [--ram {l,m,h}] [--debug]
[--ngpu NGPU] [--ncpu NCPU] [--hours HOURS]
[--minutes MINUTES] [--name NAME]
[--module-load MODULE_LOAD [MODULE_LOAD ...]]
[--ntasks NTASKS] [--ntasks-per-node NTASKS_PER_NODE]
[--submission-dir SUBMISSION_DIR] [--error-file ERROR_FILE]
[--output-file OUTPUT_FILE] [--script-file SCRIPT_FILE]
[--conda_path CONDA_PATH] [--command COMMAND]
[--email EMAIL] [--env ENV] [--preload] [--prepost]
[--account ACCOUNT] [--tag TAG] [--path PATH] [--live]
[--post-script POST_SCRIPT]
optional arguments:
-h, --help show this help message and exit
--gb {16,32,40,80} Number of GBs GPU to reserve (default: None)
--ram {l,m,h} Memory Mode (default: None)
--debug Debug (default: False)
--ngpu NGPU, -g NGPU Num GPUs (default: 1)
--ncpu NCPU, -c NCPU Num CPUs (default: 10)
--hours HOURS, -t HOURS
Max Time (hrs) (default: 72)
--minutes MINUTES, -m MINUTES
Max Time (mins) (default: 0)
--name NAME, -n NAME Job Name (default: None)
--module-load MODULE_LOAD [MODULE_LOAD ...], -ml MODULE_LOAD [MODULE_LOAD ...]
--ntasks NTASKS Number of MP tasks (default: None)
--ntasks-per-node NTASKS_PER_NODE
Number of tasks per node (default: None)
--submission-dir SUBMISSION_DIR
Directory where submission files and outputs are
stored (default: None)
--error-file ERROR_FILE, -e ERROR_FILE
Error file (default: log.txt)
--output-file OUTPUT_FILE, -o OUTPUT_FILE
Output file (default: log.txt)
--script-file SCRIPT_FILE, -s SCRIPT_FILE
Script file (default: script.txt)
--conda_path CONDA_PATH
Path of conda (default: None)
--command COMMAND Main command to execute (default: python)
--email EMAIL Email of user (default: None)
--env ENV Environment (default: None)
--preload Preload - if not set modules wil be purged (default:
True)
--prepost Set on a prepost node (default: False)
--account ACCOUNT Manually set account name (default: None)
--tag TAG Set an experiment tag (default: None)
--path PATH Set explicit path from which to start the experiment
(default: /gpfs7kw/linkhome/rech/genlgm01/uoi78rt)
--live Debug (default: False)
--post-script POST_SCRIPT
A script to be executed before the main command
(default: None)
As a behind-the-scenes example, we will show what happens with the currently supported saccelerate
command.
The user simply runs:
saccelerate <ARGS> : --ngpu 2 --ncpu 10 --module-load pytorch-gpu/py3/1.11.0 --tag <TAG> --gb 32 --post-script jean-zay/server.sh --debug --live
but behind the scenes the following sbatch script gets generated.
For example, under
/gpfsstore/rech/hkt/$USER/submissions/EXP/20230427-153034/
you will find two files: log.txt
, which collects all standard output, and script.txt
, which contains the generated sbatch script:
#!/bin/bash
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:block
#SBATCH --job-name=$USER
#SBATCH --constraint=v100-32g
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=10
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00
#SBATCH --qos=qos_gpu-dev
#SBATCH --output=/gpfsstore/rech/hkt/$USER/submissions/<TAG>/20230425-222102/log.txt
#SBATCH --error=/gpfsstore/rech/hkt/$USER/submissions/<TAG>/20230425-222102/log.txt
#SBATCH -A hkt@v100
source /linkhome/rech/genlgm01/$USER/.bashrc
module purge
module load pytorch-gpu/py3/1.11.0
set -x
cd /gpfsdswork/projects/rech/hkt/<user-id>/<PROJECT-PATH>
module load singularity
singularity exec --bind $SCRATCH:/data $SINGULARITY_ALLOWED_DIR/tileserver.sif node /usr/src/app/ /data/planet.mbtiles &> /dev/null &
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=6000
srun accelerate launch --num_processes 2 --multi_gpu --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT train.py
Do you still want to write sbatch
files yourself?