See our developing doc for build prerequisites.
To build all of the templates and output Kubernetes resources, run the following:
scripts/gen-tests.sh
This command will output Kubernetes CronJob resources into the k8s/ directory.
Note: Googlers and contributors working out of this repository don't need to manually deploy generated Kubernetes resources with kubectl, since we have triggers set up to do that automatically.
To list all of the correctly configured tests, run:
$ ./scripts/list-tests.sh
+ jsonnet -J . -S tests/list_tests.jsonnet
flax.latest-resnet-imagenet-conv-v3-32
flax.latest-resnet-imagenet-func-v2-8
flax.latest-vit-imagenette-conv-v3-32
flax.latest-vit-imagenette-conv-v4-32
...
This can be helpful for checking that your newly added test is configured correctly, or for extracting the correct name to run a one-shot test.
To manually run one-shot tests, first connect to a cluster and then run the following:
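As a minimal sketch of that check, the snippet below looks for an exact test name in a list. The sample names are copied from the listing above and stand in for the real output of ./scripts/list-tests.sh; the is_listed helper is hypothetical and not part of the repository.

```shell
# Sample names copied from the listing above; in practice this would be
# the output of ./scripts/list-tests.sh.
listed_tests='flax.latest-resnet-imagenet-conv-v3-32
flax.latest-resnet-imagenet-func-v2-8
flax.latest-vit-imagenette-conv-v3-32
flax.latest-vit-imagenette-conv-v4-32'

# Hypothetical helper: succeeds only if the exact name is in the list.
# -F treats the name as a literal string, -x requires a whole-line match.
is_listed() {
  printf '%s\n' "$listed_tests" | grep -Fxq -- "$1"
}

if is_listed "flax.latest-vit-imagenette-conv-v3-32"; then
  echo "configured"
else
  echo "missing"
fi
```

Matching the whole line avoids false positives from names that merely share a prefix with the test you added.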
export TEST_NAME=tf.nightly-dlrm-criteo-conv-v100-x1
jsonnet tests/oneshot.jsonnet -J . -S --tla-str test=$TEST_NAME | kubectl create -f -
The above command will generate a job id such as job.batch/tf.nightly-dlrm-criteo-conv-v100-x1-gz8ww. To find the details of the test, search in GoogleCloud->Kubernetes->workload in the project xl-ml-test with the job id tf.nightly-dlrm-criteo-conv-v100-x1-gz8ww.
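The job id used for the search is the part after the "job.batch/" prefix. A minimal sketch of extracting it with shell parameter expansion, using the example value from the text above (the variable names are illustrative only):

```shell
# Example reference as printed by `kubectl create` (copied from the text above).
job_ref="job.batch/tf.nightly-dlrm-criteo-conv-v100-x1-gz8ww"

# Remove the shortest prefix ending in "/" to get the bare job id.
job_id="${job_ref#*/}"
echo "$job_id"
```

Assuming you are still connected to the cluster, the same id should also work with standard kubectl commands such as kubectl describe job "$job_id".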
For convenience, the steps of connecting to a cluster and running a one-shot test have been combined into a single script as follows:
export TEST_NAME=tf.nightly-dlrm-criteo-conv-v100-x1
./scripts/run-oneshot.sh -t $TEST_NAME
Other flags:
-d | --dryrun: if set, the test does not run and the script only prints the commands.
-h | --help: prints the help screen.
In case you want to run multiple tests, you might find it convenient to combine the above scripts as follows:
./scripts/list-tests.sh | grep "tf" | grep "nightly" | grep "mnist" | while read -r test; do ./scripts/run-oneshot.sh -t "$test"; done
Please be mindful of the resources in the project before running this.
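One way to stay mindful of resources is to preview which tests the filter selects before launching anything. The sketch below runs the same kind of grep pipeline over a sample list (names copied from the list-tests.sh output shown earlier, standing in for the real script output) and only echoes the commands instead of executing them.

```shell
# Sample names standing in for the output of ./scripts/list-tests.sh.
sample_tests='flax.latest-resnet-imagenet-conv-v3-32
flax.latest-resnet-imagenet-func-v2-8
flax.latest-vit-imagenette-conv-v3-32
flax.latest-vit-imagenette-conv-v4-32'

# Filter the names, then print (not run) the command for each match.
preview="$(printf '%s\n' "$sample_tests" | grep "flax" | grep "vit" | while read -r test; do
  echo "./scripts/run-oneshot.sh -t $test"
done)"
echo "$preview"
```

Once the preview lists only the tests you expect, replace the echo with the real invocation. The -d flag described above may serve a similar purpose for a single test.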
If you want to run a group of tests, e.g. all pt-nightly tests, you can do so using the schedule_tests.sh script. You will need to set XLML_TEST_TYPE based on the root of the test, e.g.
XLML_TEST_TYPE=pt-nightly ./scripts/schedule_tests.sh
This should only be done when absolutely necessary, e.g. during release testing.
To create a new test, start by copying a similar file from the same ML framework and version. Update the training commands as necessary, and add that file to the targets.jsonnet in the same directory.
See here for details on configuring alerts and recording the training metrics of your test.
Before you send your code for review, we recommend that you run a one-shot test using the command above to ensure that the test works as expected. If you're not sure what the generated name of your test will be, try running multifile.jsonnet to see what the file names of the generated tests are.