This page contains instructions for setting up training on Microsoft Azure through either Azure Container Instances or Virtual Machines. Non-"headless" training has not yet been tested, so support for it is unverified.
A pre-configured virtual machine image is available in the Azure Marketplace and is almost ready for training out of the box. You can start by deploying the Data Science Virtual Machine for Linux (Ubuntu) into your Azure subscription. Once your VM is deployed, SSH into it and run the following command to complete dependency installation:
pip3 install docopt
Note that, if you choose to deploy the image to an N-Series GPU optimized VM, training will, by default, run on the GPU. If you choose any other type of VM, training will run on the CPU.
Setting up your own instance requires a number of package installations. Please view the documentation for doing so here.
- Move the ml-agents sub-folder of this ml-agents repo to the remote Azure instance, and set it as the working directory.
- Install the required packages with:
pip3 install .
To verify that all steps worked correctly:
- In the Unity Editor, load a project containing an ML-Agents environment (you can use one of the example environments if you have not created your own).
- Open the Build Settings window (menu: File > Build Settings).
- Select Linux as the Target Platform, and x86_64 as the target architecture.
- Check Headless Mode.
- Click Build to build the Unity environment executable.
- Upload the resulting files to your Azure instance.
- Test the instance setup from Python using:
from mlagents.envs import UnityEnvironment
env = UnityEnvironment(<your_env>)
Where <your_env> corresponds to the path to your environment executable.
You should receive a message confirming that the environment was loaded successfully.
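The check above can be wrapped in a small script that fails early when the uploaded build is missing. This is only a minimal sketch, not part of the toolkit: find_build and smoke_test are hypothetical helper names, and the sketch assumes that the UnityEnvironment constructor accepts a file_name argument and that the environment object provides close(), as in the mlagents.envs package imported above.

```python
import glob
import os


def find_build(env_path):
    """Return files matching the build path, with or without an extension.

    Hypothetical helper: Linux builds are often named <name>.x86_64,
    so the path is treated as a prefix.
    """
    expanded = os.path.expanduser(env_path)
    matches = glob.glob(glob.escape(expanded) + "*")
    return sorted(p for p in matches if os.path.isfile(p))


def smoke_test(env_path):
    """Load the environment once, then close it again."""
    if not find_build(env_path):
        raise FileNotFoundError("No build found at " + env_path)
    # Imported lazily so the path check works without ML-Agents installed.
    from mlagents.envs import UnityEnvironment

    env = UnityEnvironment(file_name=os.path.expanduser(env_path))
    try:
        print("Environment loaded")
    finally:
        env.close()  # always release the Unity process and its port
```

Calling smoke_test with the path to your uploaded executable should either print the confirmation or raise an error that points at the missing file.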
To run your training on the VM:
- Move your built Unity application to your Virtual Machine.
- Set the directory where the ML-Agents Toolkit was installed as your working directory.
- Run the following command:
mlagents-learn <trainer_config> --env=<your_app> --run-id=<run_id> --train
Where <your_app> is the path to your app (e.g. ~/unity-volume/3DBallHeadless) and <run_id> is an identifier for your training run.
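If you prefer to launch training from a script (for example, to start several runs with different identifiers), the command above can be assembled in Python. The following is a sketch under stated assumptions: build_command and run_training are hypothetical helper names, and only the options shown in the command above are used.

```python
import os
import subprocess


def build_command(trainer_config, env_path, run_id):
    """Assemble the mlagents-learn invocation as an argument list."""
    return [
        "mlagents-learn",
        trainer_config,
        # No shell is involved, so "~" must be expanded explicitly.
        "--env=" + os.path.expanduser(env_path),
        "--run-id=" + run_id,
        "--train",
    ]


def run_training(trainer_config, env_path, run_id):
    """Run training; raises CalledProcessError if the command fails."""
    subprocess.run(build_command(trainer_config, env_path, run_id), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.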
If you chose to run on an N-Series VM with GPU support, you can verify that the GPU is being used by running nvidia-smi from the command line.
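The same check can be scripted, which is convenient before starting a long run. This sketch only reports whether nvidia-smi is installed and exits successfully; it does not prove that the training process itself is using the GPU. The function name gpu_visible is hypothetical.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is on the PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False  # expected on VM sizes without a GPU
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```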
Once you have started training, you can use TensorBoard to observe the training.
- Start by opening the appropriate port for web traffic to connect to your VM.
  - Note that you don't need to create a new Network Security Group; instead, go to the Networking tab under Settings for your VM.
  - As an example, you could open the port with the following Inbound Rule settings:
    - Source: Any
    - Source Port Ranges: *
    - Destination: Any
    - Destination Port Ranges: 6006
    - Protocol: Any
    - Action: Allow
    - Priority: (Leave as default)
- Unless you started the training as a background process, connect to your VM from another terminal instance.
- Run the following command from your terminal:
tensorboard --logdir=summaries --host 0.0.0.0
- You should now be able to open a browser and navigate to <Your_VM_IP_Address>:6006 to view the TensorBoard report.
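If the page does not load, it can help to check from your own machine whether the port opened in the inbound rule (6006, TensorBoard's default) accepts TCP connections at all. The sketch below uses only the Python standard library; port_is_open is a hypothetical helper name, and the host you pass in is your VM's IP address.

```python
import socket


def port_is_open(host, port=6006, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result suggests that either the inbound rule is not in effect or TensorBoard is not listening on 0.0.0.0 on the VM.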
Azure Container Instances allow you to spin up a container, on demand, that will run your training and then be shut down. This ensures you aren't leaving a billable VM running when it isn't needed. You can read more about the ML-Agents Toolkit's support for Docker containers here. Using ACI enables you to offload training of your models without needing to install Python and TensorFlow on your own computer. You can find instructions, including a pre-deployed image in DockerHub for you to use, available here.