This directory contains a number of scripts that make it a little bit easier to start remote training jobs on Google Cloud and download their outputs. The following instructions should help you get up and running.
Note that there are many cloud computing platforms which may perform equally well. These scripts are just a convenience for those that happen to be using gcloud.
The first thing to do is download the gcloud command-line tool and sign up for a Google Cloud account. Google has reasonably good documentation on how to do this. If you wish to run on GPU-accelerated machines, you will need to request a non-zero quota of simultaneous GPUs. Lists of available GPUs and the regions that they are available in can be found here.
In order to actually run any training, you will need to provision virtual machines on which to run. Instances can easily be reused. They come with persistent disks, so you can stop and start them and retain all of your training data. To provision a new virtual machine, or instance, run e.g.
gcloud compute instances create $INSTANCE --image-family pytorch-latest-gpu --image-project deeplearning-platform-release --boot-disk-size=200GB --accelerator type=nvidia-tesla-p100 --machine-type=n1-standard-2 --maintenance-policy TERMINATE
Make sure you ssh
into the instance (gcloud compute ssh $INSTANCE
) and install the nvidia drivers. You should be prompted to install them when logging in.
The gcloud/start-remote-job
script will use a virtual environment in ~/safelife-venv
if it
exists, or try to create it if it doesn't. You can replace that virtual environment with one with
other parameters and state if necessary to resolve python and torch / CUDA issues on your system if
they arise; the gcloud/make-venv.sh
script is the place to edit if you need to do this across
many cloud isntances.
Jobs run on gcloud instances, so first make sure that your instance is running. Use gcloud compute instances list
to view the status of all of your instances, and gcloud compute instances start $INSTANCE
to start a particular instance. Once the instance is up and running you can start a job using e.g.
gcloud/start-remote-job $INSTANCE $JOB_NAME --wandb
That will copy over the benchmarks directory into the appropriately named folder and set up port forwarding for tensorboard onto the specified local port. It will then run ./start-training $JOB_NAME
on the remote instance inside of a tmux
session. To (gracefully) close the connection, hit ctrl-b, d
.
Once a job is running, the following directories will be created on the remote instance:
~/JOB_NAME/ [copy of the root project folder]
~/JOB_NAME/data/ [all the training data gets stored here]
~/current_job [link to whichever job has most recently been run]
Note that there is no per-job directory for dependencies or python virtual environments. It's assumed that each instance is primarily just running SafeLife, and that the dependencies are fairly static between training runs and code updates, so the installed dependencies are global for all jobs on a given instance.
-
Start a remote notebook:
gcloud/ssh $INSTANCE -L 8887:localhost:8888 jupyter notebook
Then you should be able to visit the notebook at
https://localhost:8887/?token=xxx
. The token should be printed during notebook startup. -
Sync job back to localhost:
gcloud/rsync -r $INSTANCE:~/current_job/ ./jobs
or
gcloud/sync-current $INSTANCE
This will copy the entire job directory on the remote machine into a local "jobs" folder.
-
Reconnect to a running remote job:
gcloud/resume $INSTANCE