- Clone this repository in the remote machine
- Set up Miniconda
- Create a new environment:
conda create --name hello_cluster_env python=3.9
- Activate the new environment and install the requirements:
conda activate hello_cluster_env
pip install -r requirements.txt
- You should now be able to run the code as follows:
python main.py --<cmd-line-arguments> <values>
IMPORTANT: Remember, you are not supposed to run your Python scripts this way on the clusters. Always submit scripts as jobs (see next section). Never use the login nodes for any kind of compute (not even, say, to run TensorBoard). Step 5 is only for local installations on your computer.
On the KI-SLURM (Meta) cluster, running compute-intensive processes on login nodes gets your account penalized: your CPU usage is limited for a while.
There are two ways to run scripts on the cluster:
- Submit a job using `sbatch`
- Run your code in a SLURM interactive session using `srun`
Let's begin by looking at `sbatch`. First, you need to know which partitions you have access to. You can use `sinfo` to find this information. A sample output might look like this:
PARTITION | AVAIL | TIMELIMIT | NODES | STATE | NODELIST |
---|---|---|---|---|---|
partitionXYZ | up | 2-00:00:00 | 1 | idle | xyzgpu[0-6] |
partitionABC | up | 01:00:00 | 1 | idle | xyzgpu[10, 20] |
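
If you would rather pick a partition programmatically, the `sinfo` output can be parsed. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and it assumes `sinfo` is called with an explicit `|`-separated format string (`%P`, `%a`, `%l`, `%D`, `%t`, `%N`) so that the columns match the sample table above.

```python
import subprocess

# Columns requested from sinfo: partition, availability, time limit,
# node count, state, node list. "|" is used as an explicit separator.
SINFO_FORMAT = "%P|%a|%l|%D|%t|%N"
FIELDS = ("partition", "avail", "timelimit", "nodes", "state", "nodelist")


def parse_sinfo(text):
    """Turn sinfo output (one '|'-separated row per line) into dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        values = [v.strip() for v in line.split("|")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        row = dict(zip(FIELDS, values))
        # sinfo marks the default partition with a trailing "*"
        row["partition"] = row["partition"].rstrip("*")
        rows.append(row)
    return rows


def idle_partitions(rows):
    """Names of partitions that are up and report idle nodes."""
    names = []
    for row in rows:
        is_free = row["avail"] == "up" and row["state"] == "idle"
        if is_free and row["partition"] not in names:
            names.append(row["partition"])
    return names


def query_sinfo():
    """Run sinfo without a header line; needs the SLURM client tools."""
    result = subprocess.run(
        ["sinfo", "-h", "-o", SINFO_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_sinfo(result.stdout)
```

On the cluster, `idle_partitions(query_sinfo())` would list the partitions that currently have idle nodes. Whether you are allowed to submit to each of them still depends on your account.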
See `./scripts/meta/run.sh` for an example of a job script. To submit a job:
- Edit `run.sh`:
  - Add the partition you want to run the job on
  - Adjust the path to your miniconda installation
- Create the directories required for the logs
sbatch scripts/meta/run.sh
squeue # See all the jobs in the queue
squeue -u user # See only user's jobs
scancel -u my_user # Cancel all your jobs
scancel <jobid> # Cancel a specific job
sfree
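
The submission steps above (choose a partition, make sure the log directories exist, call `sbatch`) can also be scripted. This is only a sketch under assumptions: the helper names, the `logs` directory, and the job name are hypothetical, and it passes the options on the command line, where they take precedence over the `#SBATCH` lines inside the job script.

```python
import subprocess
from pathlib import Path


def build_sbatch_command(script, partition, log_dir, job_name="hello_cluster"):
    """Compose the sbatch call as an argument list (nothing is executed here)."""
    log_dir = Path(log_dir)
    return [
        "sbatch",
        f"--partition={partition}",
        f"--job-name={job_name}",
        # In these file names, %x expands to the job name and %j to the job ID.
        f"--output={log_dir / '%x.%j.out'}",
        f"--error={log_dir / '%x.%j.err'}",
        str(script),
    ]


def submit(script, partition, log_dir="logs"):
    """Create the log directory, then submit the job script with sbatch."""
    # The log directory has to exist before the job starts writing to it.
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_sbatch_command(script, partition, log_dir)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # sbatch normally answers with a line such as "Submitted batch job <jobid>".
    return result.stdout.strip()
```

For example, `submit("scripts/meta/run.sh", "<your_partition>")` would submit the example job script. Editing `run.sh` directly, as described above, achieves the same thing without the extra helper.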
You can run an interactive session using `srun`. You can specify the parameters of the job using the same switches seen in `run.sh`. You only need to add `--pty bash` to start a bash session.
srun --partition <your_partition> --mem 6GB --job-name HelloClusterInteractiveSession --pty bash
You will see that you are now logged into a compute node. From here, you may run python scripts as usual:
python main.py --device cuda
Remember that you should only do this from a compute node that you acquired using `srun`, never from a login node.
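
To guard against starting a run on a login node by accident, a script can check whether it is inside a SLURM job. The sketch below is an optional addition, not something `main.py` is known to do: the function names are hypothetical, and it relies on SLURM setting the `SLURM_JOB_ID` environment variable for jobs started with `sbatch` or `srun`.

```python
import os
import sys


def inside_slurm_job(env=None):
    """Return True if SLURM_JOB_ID is set in the given environment."""
    env = os.environ if env is None else env
    return bool(env.get("SLURM_JOB_ID"))


def require_compute_node(env=None):
    """Exit with a message instead of computing outside a SLURM job."""
    if not inside_slurm_job(env):
        sys.exit(
            "Not inside a SLURM job: use sbatch or srun "
            "instead of running on a login node."
        )
```

Calling `require_compute_node()` at the top of a script makes it stop immediately outside a job. Note that this check would also stop runs on your own computer, so it only makes sense for code that runs exclusively on the cluster.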
We will use VSCode and Simon Schrodi's scripts to debug the code that is running on the cluster. There are two parts to setting this up:
- Install the Remote - SSH extension on VSCode
- Clone Simon's repository and configure the debugging setup
For developing code that sits on remote systems, it is convenient to use VSCode with the Remote - SSH extension.
- Bring up the Extensions view (Ctrl+Shift+X / Cmd+Shift+X, or View > Extensions)
- Install the Remote - SSH extension (Extension ID: ms-vscode-remote.remote-ssh)
- View > Command Palette > Remote-SSH: Connect to Host... > + Add New SSH Host > ssh <your_user>@kislogin2.xx.xx.xxxxxx
- Once you're connected to the remote, you should be able to navigate to the directory of your repo in the Explorer (Ctrl+Shift+E / Cmd+Shift+E / View > Explorer)
- Clone Simon's repository on your remote machine, inside this repository's directory, i.e., inside `/path/to/HelloCluster/`.
- Follow the instructions in his repo for the one-time setup.
Your configured `config.conf` should look something like this:
WORKDIR /path/to/HelloCluster
PORT 4242
LAUNCH_JSON .vscode/launch.json
CONDA_SOURCE /path/to/miniconda3/bin/activate
CONDA_ENV hello_cluster_env
Your `.vscode/launch.json` should look like this (comments and `pathMappings` removed):
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Remote Attach",
            "type": "python",
            "request": "attach",
            "connect": {
                "host": "localhost",
                "port": 5678
            }
        }
    ]
}
Don't worry about the mismatch between the port numbers in `launch.json` and `config.conf`. That will be fixed by `init.sh` in the next part.
- Start an interactive session:
srun -p <partition_name> --pty bash
- Initialize `launch.json` with the details from `config.conf`:
bash vscode_remote_debugging/init.sh
- Start the debugging session:
bash ./scripts/meta/debug.sh
- The code waits until the client is attached before it runs. To attach: View > Run > Python: Remote Attach
- Debug as if the code is running on your local machine!
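
For intuition about the "waits until the client is attached" step: VSCode's Python "attach" configuration talks to a debug server on the Python side. A minimal sketch of such a server, assuming the third-party `debugpy` package is installed in the environment, is shown below. It is an illustration only; what `debug.sh` actually runs, and how the connection reaches the compute node, is determined by Simon's scripts and is not shown here. The helper name `wait_for_debugger` is hypothetical.

```python
def wait_for_debugger(port, host="localhost"):
    """Open a debug server and block until a client (e.g. VSCode) attaches."""
    if not isinstance(port, int) or not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port!r}")

    # Imported lazily so the script can still be loaded without debugpy.
    import debugpy

    debugpy.listen((host, port))  # start listening for the debug client
    print(f"Waiting for debugger on {host}:{port} ...", flush=True)
    debugpy.wait_for_client()  # returns once the client has attached
```

A script would call `wait_for_debugger(port)` before its main logic, with `port` equal to the value that ends up in `launch.json` after `init.sh` has run.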