[DOC-35] Add tests that run the examples on the cluster #209

Open: wants to merge 7 commits into master

@lebrice (Contributor) commented on Aug 1, 2023

Jira link: https://mila-iqia.atlassian.net/browse/DOC-35

  • The tests can be run manually with pytest from the Mila cluster. They are not yet meant to be part of this repo's CI.
  • The tests use submitit to launch the examples. A few tweaks were required to make this work.
  • The tests use pytest_regressions to compare the (filtered) job outputs over time.
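
To illustrate what "filtered" job outputs could mean in practice, here is a minimal sketch of masking run-specific values before a regression comparison. It is not the code from this PR: the patterns, the placeholder names, the node-name format (`cn-a001`) and the `filter_job_output` helper are all assumptions made for the example.

```python
import re

# Hypothetical substitutions: each pattern masks a value that changes from one
# run to the next, so that two otherwise identical job outputs compare equal.
SUBSTITUTIONS = [
    # Timestamps such as "2023-08-01 12:34:56"
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b"), "<TIMESTAMP>"),
    # Job identifiers such as "Job ID: 3141592" (the label is kept, the number is masked)
    (re.compile(r"\b(job ?id[:= ]+)\d+", re.IGNORECASE), r"\1<JOB_ID>"),
    # Compute node names, assuming a "cn-a001"-style naming scheme
    (re.compile(r"\bcn-[a-z]\d{3}\b"), "<NODE>"),
    # Temporary directories, which usually embed the job id
    (re.compile(r"/tmp/[\w./-]+"), "<TMP_PATH>"),
]


def filter_job_output(text: str) -> str:
    """Replace run-specific values in a job's output with stable placeholders."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text


run_1 = (
    "2023-08-01 12:34:56 Job ID: 3141592 running on cn-a001\n"
    "Saving to /tmp/slurm.3141592/ckpt.pt\n"
    "accuracy: 0.75"
)
run_2 = (
    "2023-09-14 08:00:01 Job ID: 2718281 running on cn-b042\n"
    "Saving to /tmp/slurm.2718281/ckpt.pt\n"
    "accuracy: 0.75"
)

# Only the volatile parts differ, so the filtered outputs are identical.
print(filter_job_output(run_1))
print(filter_job_output(run_1) == filter_job_output(run_2))
```

In a test, the filtered string would then be handed to the `file_regression` fixture from pytest_regressions (for example `file_regression.check(filter_job_output(output))`), which stores it on the first run and compares against the stored file on later runs. Values that are expected to stay stable, such as the `accuracy` line above, are left unmasked so that a change in them still fails the test.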

@lebrice changed the title from "[DOC-35] Add tests that run the examples using submitit (WIP, draft)" to "[DOC-35] Add tests that run the examples on the cluster (WIP, draft)" on Aug 1, 2023
@lebrice force-pushed the lebrice/test_examples branch from 555a9a6 to 9a05d48 on August 7, 2023 19:27
@lebrice marked this pull request as ready for review on September 14, 2023 17:11
@lebrice changed the title from "[DOC-35] Add tests that run the examples on the cluster (WIP, draft)" to "[DOC-35] Add tests that run the examples on the cluster" on Sep 14, 2023
@lebrice requested a review from obilaniu as a code owner on November 7, 2023 18:25
Signed-off-by: Fabrice Normandin <[email protected]>

Make all job scripts executable

Signed-off-by: Fabrice Normandin <[email protected]>

Move common stuff to a `run_example` function

Signed-off-by: Fabrice Normandin <[email protected]>

Add regex substitutions before comparing outputs

Signed-off-by: Fabrice Normandin <[email protected]>

Make the Pytorch-based examples reproducible

Signed-off-by: Fabrice Normandin <[email protected]>

Reduce the number of GPUs per node from 4 to 2

Signed-off-by: Fabrice Normandin <[email protected]>

Unified test for pytorch-based examples

Signed-off-by: Fabrice Normandin <[email protected]>

Add a `make_env.sh` sbatch script in pytorch setup

Signed-off-by: Fabrice Normandin <[email protected]>

Simplify the `test_examples.py` file

Signed-off-by: Fabrice Normandin <[email protected]>

Update the regression files for the examples

Signed-off-by: Fabrice Normandin <[email protected]>

Add regression file for multi-node example

Signed-off-by: Fabrice Normandin <[email protected]>

Add the `pip install orion` line to Orion example

Signed-off-by: Fabrice Normandin <[email protected]>

Add a test for the checkpointing example

Signed-off-by: Fabrice Normandin <[email protected]>

Add the regression files for checkpointing example

Signed-off-by: Fabrice Normandin <[email protected]>

Fix regression file for the ckpt example test

Signed-off-by: Fabrice Normandin <[email protected]>

Split test code into testutils and test file

Signed-off-by: Fabrice Normandin <[email protected]>

Start to add test for "HPO with Orion" example

Signed-off-by: Fabrice Normandin <[email protected]>

Remove potentially buggy asserts

Signed-off-by: Fabrice Normandin <[email protected]>

Make a conda env for the Orion example

Signed-off-by: Fabrice Normandin <[email protected]>
@lebrice force-pushed the lebrice/test_examples branch from 1d5ddb9 to 2b204d2 on November 29, 2023 19:01
Signed-off-by: Fabrice Normandin <[email protected]>