Simplify the algorithm tests, setup incremental testing (#35)

* Add parametrize_when_used mark to simplify tests
* Add a new, simpler test suite (that works!)
* Rename the algorithm tests class (wip)
* Further simplify the typing in the example
* Remove the older (uglier) test suite for algos
* Remove the unused classification test suite
* Add missing config
* Add the badges in the README
* Remove outdated Protocol
* Set JAX_PLATFORMS=cpu when no GPU is found
* Debugging weird xpass/xfails
* (ugly commit) remove unused code, add doctests
* Fix some issues in tests
* Add missing __init__.py
* Fix more issues in tests
* Add a docstring in TestJaxExample
* Fix weird XPASS in tests
* Add batch size fix from Lightning-Hydra-Template
* [ugly] Add regression files to check if CI works
* Add nice docstrings for env_vars.py
* Revert "[ugly] Add regression files to check if CI works"
  This reverts commit ad1e630.
* Use --gen-missing flag in CI for now
* Slightly simplify main.py objective calculation
* Remove broken test for code blocks in docstrings
* Fix test for jax on CPU
* Simplify main_test.py
* Save regression files in subfolder based on device
* Change README and fix link
* Trim down docs generation script, minor doc fixes
* Skip regression check when files are missing
* Reduce amount of warnings generated in tests
* Remove unused code in project.utils.utils.py
* Fix command-line flag used to skip checks
* Tweak README.md
* Simplify the actions-runner-job.sh
* Add missing flag in build.yml
* Fix the `example.yaml` config in example group
* Add todos for generating reference docs
* Simplify example.py
* Add a small docstring to project.configs
* Remove overrides in top-level config, fix 'name'
* Use hydra_zen.instantiate by default (no pydantic)
* Add useful callbacks to defaults
* Add/tweak config files
* Simplify the network / layers
* Don't dynamically create algo configs
* Add tensorboard logger config from hydra-template
* Rename `optimizer` arg to `optimizer_config`
* Fix test_defaults
* Update tensor-regression dependency
* Fix missing python in actions-runner-job.sh

---------

Signed-off-by: Fabrice Normandin <[email protected]>
mila-iqia · Aug 7, 2024 · 264b5a1
1 parent e777ca5
commit 264b5a1

Showing 54 changed files with 1,280 additions and 2,225 deletions.
diff --git a/.github/actions-runner-job.sh b/.github/actions-runner-job.sh
@@ -11,9 +11,9 @@
 
 set -euo pipefail
 
+# todo: load modules here? or in the job steps?
 # module --quiet purge
-# module load cuda/12.2.2
-
+# module load cuda/12.0
 
 
 archive="actions-runner-linux-x64-2.317.0.tar.gz"
@@ -27,54 +27,34 @@ ln --symbolic --force $SCRATCH/$archive $SLURM_TMPDIR/$archive
 
 cd $SLURM_TMPDIR
 
+# Check the archive integrity.
 echo "9e883d210df8c6028aff475475a457d380353f9d01877d51cc01a17b2a91161d  $archive" | shasum -a 256 -c
 
 # Extract the installer
 tar xzf ./actions-runner-linux-x64-2.317.0.tar.gz
 
-# NOTE: Could use this to get a token programmatically!
-# https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-a-registration-token-for-an-organization
-
-# cluster=${SLURM_CLUSTER_NAME:-local}
-cluster=${SLURM_CLUSTER_NAME:-`hostname`}
-
+# Use the GitHub API to get a registration token for a self-hosted runner.
+# This requires you to be an admin of the repository and to have the $SH_TOKEN secret set to your
+# github token.
 # https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-a-registration-token-for-a-repository
-# curl -L \
-#   -X POST \
-#   -H "Accept: application/vnd.github+json" \
-#   -H "Authorization: Bearer <YOUR-TOKEN>" \
-#   -H "X-GitHub-Api-Version: 2022-11-28" \
-#   https://api.github.com/repos/OWNER/REPO/actions/runners/registration-token
-
 # Example output:
 # {
 #   "token": "XXXXX",
 #   "expires_at": "2020-01-22T12:13:35.123-08:00"
 # }
-
-
-if ! command -v jq &> /dev/null; then
-    echo "the jq command doesn't seem to be installed."
-
-    if ! test -f ~/.local/bin/jq; then
-        echo "jq is not found at ~/.local/bin/jq, downloading it."
-        # TODO: this assumes that ~/.local/bin is in $PATH, I'm not 100% sure that this is standard.
-        mkdir -p ~/.local/bin
-        wget https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 -O ~/.local/bin/jq
-        chmod +x ~/.local/bin/jq
-    fi
-fi
-
 source ~/.bash_aliases
+module load python/3.10
 
 TOKEN=`curl -L \
   -X POST \
   -H "Accept: application/vnd.github+json" \
-  -H "Authorization: Bearer ${SH_TOKEN:?The SH_TOKEN env variable is not set}" \
+  -H "Authorization: Bearer $SH_TOKEN" \
   -H "X-GitHub-Api-Version: 2022-11-28" \
-  https://api.github.com/repos/mila-iqia/ResearchTemplate/actions/runners/registration-token | ~/.local/bin/jq -r .token`
+  https://api.github.com/repos/mila-iqia/ResearchTemplate/actions/runners/registration-token | \
+  python -c "import sys, json; print(json.load(sys.stdin)['token'])"`
 
-# Create the runner and configure it programmatically
+# Create the runner and configure it programmatically with the token we just got from the GitHub API.
+cluster=$SLURM_CLUSTER_NAME
 ./config.sh --url https://github.com/mila-iqia/ResearchTemplate --token $TOKEN \
     --unattended --replace --name $cluster --labels $cluster $SLURM_JOB_ID --ephemeral
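
The `python -c` one-liner in this hunk takes over from `jq` for reading the `token` field of the API response. Below is a minimal sketch of that parsing step as a standalone function. The name `extract_registration_token` is hypothetical, and the sample payload copies the example output from the script's comments rather than a real response.

```python
import json


def extract_registration_token(response_body: str) -> str:
    """Return the `token` field of a registration-token JSON response."""
    payload = json.loads(response_body)
    # A missing "token" key raises KeyError, as it would in the one-liner.
    return payload["token"]


# Sample payload taken from the "Example output" comment in the script.
sample = '{"token": "XXXXX", "expires_at": "2020-01-22T12:13:35.123-08:00"}'
print(extract_registration_token(sample))
```

Parsing with the standard library avoids the download-and-install step that the removed `jq` block needed, at the cost of requiring a Python interpreter on the node (hence the `module load python/3.10` line).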
 

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
@@ -54,11 +54,11 @@ jobs:
     - name: Test with pytest (very fast)
       env:
         JAX_PLATFORMS: cpu
-      run: pdm run pytest -v --shorter-than=1.0 --cov=project --cov-report=xml --cov-append
+      run: pdm run pytest -v --shorter-than=1.0 --cov=project --cov-report=xml --cov-append --skip-if-files-missing
     - name: Test with pytest (fast)
       env:
         JAX_PLATFORMS: cpu
-      run: pdm run pytest -v --cov=project --cov-report=xml --cov-append
+      run: pdm run pytest -v --cov=project --cov-report=xml --cov-append --skip-if-files-missing
 
     - name: Store coverage report as an artifact
       uses: actions/upload-artifact@v4
@@ -84,8 +84,7 @@ jobs:
       run: pdm config install.cache true && pdm install
 
     - name: Test with pytest
-      run: pdm run pytest -v --cov=project --cov-report=xml --cov-append
-
+      run: pdm run pytest -v --cov=project --cov-report=xml --cov-append --skip-if-files-missing
     # TODO: this is taking too long to run, and is failing consistently. Need to debug this before making it part of the CI again.
     # - name: Test with pytest (only slow tests)
     #   run: pdm run pytest -v -m slow --slow --cov=project --cov-report=xml --cov-append
@@ -142,7 +141,7 @@ jobs:
       run: pdm install
 
     - name: Test with pytest
-      run: pdm run pytest -v --cov=project --cov-report=xml --cov-append
+      run: pdm run pytest -v --cov=project --cov-report=xml --cov-append --gen-missing
 
     # TODO: Re-enable this later
     # - name: Test with pytest (only slow tests)

diff --git a/README.md b/README.md
@@ -1,8 +1,65 @@
 # Research Project Template
 
-![Build](https://github.com/mila-iqia/ResearchTemplate/workflows/build.yml/badge.svg)
+[![Build](https://github.com/mila-iqia/ResearchTemplate/actions/workflows/build.yml/badge.svg?branch=master)](https://github.com/mila-iqia/ResearchTemplate/actions/workflows/build.yml)
 [![codecov](https://codecov.io/gh/mila-iqia/ResearchTemplate/graph/badge.svg?token=I2DYLK8NTD)](https://codecov.io/gh/mila-iqia/ResearchTemplate)
+[![hydra](https://img.shields.io/badge/Config-Hydra_1.3-89b8cd)](https://hydra.cc/)
+[![license](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)](https://github.com/mila-iqia/ResearchTemplate#license)
 
-Please note: This is a **Work-in-Progress**. The goal is to make a first release by the end of summer 2024.
+Please note: This is a Work-in-Progress. The goal is to make a first release by the end of summer 2024.
 
-For now, feel free to take a look at the [documentation page](https://mila-iqia.github.io/ResearchTemplate/) if you want more information about this project.
+This is a template repository for a research project in machine learning. It is meant to be a starting point for new ML researchers that run jobs on SLURM clusters.
+The main target audience is [Mila](https://mila.quebec/en) researchers and students, but this should still be useful to anyone that uses PyTorch-Lightning with Hydra.
+
+For more context, see [this introduction to the project](https://mila-iqia.github.io/ResearchTemplate/overview/intro).
+
+## Overview
+
+This project makes use of the following libraries:
+
+- [Hydra](https://hydra.cc/) is used to configure the project. It allows you to define configuration files and override them from the command line.
+- [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) is used as the training framework. It provides a high-level interface to organize ML research code.
+    - 🔥 Please note: You can also use [Jax](https://jax.readthedocs.io/en/latest/) with this repo, as is shown in the [Jax example](https://mila-iqia.github.io/ResearchTemplate/examples/jax) 🔥
+- [Weights & Biases](https://wandb.ai) is used to log metrics and visualize results.
+- [pytest](https://docs.pytest.org/en/stable/) is used for testing.
+
+## Why use this template?
+
+Why should you use this template (instead of another)?
+
+Here are some of the advantages to using this template compared to [some of the other templates out there](https://mila-iqia.github.io/ResearchTemplate/related):
+
+- ❗Support for both Jax and Torch with PyTorch-Lightning ❗
+- Easy development inside a [Development Container](https://code.visualstudio.com/docs/remote/containers) with [VsCode](https://code.visualstudio.com/)
+- Tailor-made for ML researchers that run their jobs on SLURM clusters (with default configurations for the [Mila](https://docs.mila.quebec) and [DRAC](https://docs.alliancecan.ca) clusters.)
+- Rich typing and documentation of all parts of the source code using Python 3.12's new type annotation syntax
+- A comprehensive suite of automated tests for all algorithms, datasets and networks that are easy to reuse and extend
+- Automatically creates Yaml Schemas for your Hydra config files (as soon as #7 is merged)
+
+## Usage
+
+To see all available options:
+
+```bash
+python project/main.py --help
+```
+
+For a detailed list of examples, see the [examples page](https://mila-iqia.github.io/ResearchTemplate/examples/examples).
+
+<!-- * `mkdocs new [dir-name]` - Create a new project.
+* `mkdocs serve` - Start the live-reloading docs server.
+* `mkdocs build` - Build the documentation site.
+* `mkdocs -h` - Print help message and exit. -->
+
+## Project layout
+
+```
+pyproject.toml   # Project metadata and dependencies
+project/
+    main.py      # main entry-point
+    algorithms/  # learning algorithms
+    datamodules/ # datasets, processing and loading
+    networks/    # Neural networks used by algorithms
+    configs/     # configuration files
+docs/            # documentation
+conftest.py      # Test fixtures and utilities
+```
diff --git a/conftest.py b/conftest.py
@@ -1,6 +1,11 @@
+import os
 from pathlib import Path
 
 import pytest
+import torch
+
+if not torch.cuda.is_available():
+    os.environ["JAX_PLATFORMS"] = "cpu"
 
 
 def pytest_addoption(parser: pytest.Parser):
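
The guard added at the top of `conftest.py` forces JAX onto the CPU when PyTorch reports no CUDA device. The sketch below isolates that logic so it can be exercised without `torch` installed. The helper name `restrict_jax_to_cpu` and the injectable `env` mapping are assumptions made for illustration and are not part of the commit.

```python
import os
from collections.abc import MutableMapping


def restrict_jax_to_cpu(
    cuda_available: bool, env: MutableMapping[str, str] = os.environ
) -> None:
    """Set JAX_PLATFORMS=cpu in `env` when no CUDA device is available."""
    if not cuda_available:
        env["JAX_PLATFORMS"] = "cpu"


# Use a plain dict here so the real process environment is left untouched.
fake_env: dict[str, str] = {}
restrict_jax_to_cpu(cuda_available=False, env=fake_env)
print(fake_env)
```

A setting like this presumably only takes effect if it is applied before JAX initializes its backends, which would explain why the commit places the check at import time in `conftest.py`.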

diff --git a/docs/docs_test.py b/docs/docs_test.py
@@ -0,0 +1,35 @@
+import pathlib
+
+import pytest
+from mktestdocs import check_md_file
+
+# This retrieves all methods/properties that have a docstring.
+# todo: Brittle. We'd like something like griffe, that gets all functions / classes / etc in our module.
+# members = get_codeblock_members(*[v for k, v in vars(project).items() if k != "__all__"])
+
+
+def get_pretty_id(obj):
+    if hasattr(obj, "__qualname__"):
+        return obj.__qualname__
+    if hasattr(obj, "__name__"):
+        return obj.__name__
+    return str(obj)
+
+
+# todo: do we want to run the tests here? or do we just test the doc pages?
+# @pytest.mark.parametrize(
+#     "obj",
+#     list(itertools.chain(map(getmembers, [project, project.configs, project.algorithms]))),
+#     ids=get_pretty_id,
+# )
+# def test_member(obj):
+#     check_docstring(obj)
+
+
+docs_folder = pathlib.Path(__file__).parent
+
+
+# Note the use of `str`, makes for pretty output
+@pytest.mark.parametrize("fpath", docs_folder.rglob("*.md"), ids=str)
+def test_documentation_file(fpath):
+    check_md_file(fpath=fpath)
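
As a quick illustration of the fallback order in `get_pretty_id` (qualified name first, then plain name, then `str`), the sketch below repeats the function from this file and calls it on three kinds of objects. The choice of sample objects is arbitrary.

```python
import os


def get_pretty_id(obj):
    # Same fallback order as in docs_test.py.
    if hasattr(obj, "__qualname__"):
        return obj.__qualname__
    if hasattr(obj, "__name__"):
        return obj.__name__
    return str(obj)


# A method has a qualified name, a module only has __name__, and an int has neither.
for sample in (str.upper, os, 3):
    print(get_pretty_id(sample))
```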
diff --git a/docs/examples/examples.md b/docs/examples/examples.md
@@ -12,7 +12,7 @@ TODOs:
 ## Simple run
 
 ```bash
-python project/main.py algorithm=example_algo datamodule=mnist network=fcnet
+python project/main.py algorithm=example datamodule=mnist network=fcnet
 ```
 
 ## Running a Hyper-Parameter sweep on a SLURM cluster

diff --git a/docs/generate_reference_docs.py b/docs/generate_reference_docs.py
@@ -3,80 +3,78 @@
 
 
 import textwrap
+from logging import getLogger as get_logger
 from pathlib import Path
 
 import mkdocs_gen_files
-import mkdocs_gen_files.nav
 
 from project.utils.env_vars import REPO_ROOTDIR
 
-module = "project"
-modules = [
-    "project/main.py",
-    "project/experiment.py",
-]
-submodules = [
-    "project.algorithms",
-    "project.configs",
-    "project.datamodules",
-    "project.networks",
-    "project.utils",
-]
+logger = get_logger(__name__)
 
 
-def _get_import_path(module_path: Path) -> str:
-    """Returns the path to use to import a given (internal) module."""
-    return ".".join(module_path.relative_to(REPO_ROOTDIR).with_suffix("").parts)
+def main():
+    add_doc_for_module(REPO_ROOTDIR / "project")
 
 
-def main():
-    nav = mkdocs_gen_files.nav.Nav()
+def add_doc_for_module(module_path: Path) -> None:
+    """Creates a markdown file in the "reference" section for this module and its submodules
+    recursively.
 
-    add_doc_for_module(REPO_ROOTDIR / "project", nav)
+    ## TODOs:
+    - [ ] We don't currently see the docs from the docstrings of __init__.py files.
+    - [ ] Might be nice to show the config files also?
+    """
 
-    # with mkdocs_gen_files.open("reference/SUMMARY.md", "w") as nav_file:
-    #     # assert False, "\n".join(nav.build_literate_nav())
-    #     nav_file.writelines(nav.build_literate_nav())
+    assert module_path.is_dir()  # and (module_path / "__init__.py").exists(), module_path
 
+    # module_import_path = _get_import_path(module_path)
+    # doc_file = module_path.relative_to(REPO_ROOTDIR).with_suffix(".md")
+    # write_doc_file = "reference" / doc_file
+    # with mkdocs_gen_files.editor.FilesEditor.current().open(str(write_doc_file), "w") as f:
+    #     print(
+    #         textwrap.dedent(f"""\
+    #         ::: {module_import_path}
 
-def add_doc_for_module(module_path: Path, nav: mkdocs_gen_files.nav.Nav) -> None:
-    """TODO."""
+    #         """),
+    #         file=f,
+    #     )
 
-    assert module_path.is_dir() and (module_path / "__init__.py").exists(), module_path
+    def is_module(p: Path) -> bool:
+        return (
+            p.suffix == ".py" and not p.name.startswith("__") and not p.name.endswith("_test.py")
+        )
 
-    children = list(
-        p
-        for p in module_path.glob("*.py")
-        if not p.name.startswith("__") and not p.name.endswith("_test.py")
-    )
+    children = list(p for p in module_path.glob("*.py") if is_module(p))
     for child_module_path in children:
         child_module_import_path = _get_import_path(child_module_path)
         doc_file = child_module_path.relative_to(REPO_ROOTDIR).with_suffix(".md")
-        write_doc_file = f"reference/{doc_file}"
+        write_doc_file = "reference" / doc_file
 
-        nav[tuple(child_module_import_path.split("."))] = f"{doc_file}"
-
-        with mkdocs_gen_files.open(write_doc_file, "w") as f:
+        with mkdocs_gen_files.editor.FilesEditor.current().open(str(write_doc_file), "w") as f:
             print(
                 textwrap.dedent(f"""\
                 ::: {child_module_import_path}
                 """),
                 file=f,
             )
-        docs_dir = REPO_ROOTDIR / "docs"
-        module_path_relative_to_docs_dir = child_module_path.relative_to(docs_dir, walk_up=True)
-        mkdocs_gen_files.set_edit_path(write_doc_file, str(module_path_relative_to_docs_dir))
 
     submodules = list(
         p
         for p in module_path.iterdir()
         if p.is_dir()
-        and (p / "__init__.py").exists()
+        and ((p / "__init__.py").exists() or len(list(p.glob("*.py"))) > 0)
         and not p.name.endswith("_test")
         and not p.name.startswith((".", "__"))
     )
     for submodule in submodules:
-        add_doc_for_module(submodule, nav)
+        logger.info(f"Creating doc for {submodule}")
+        add_doc_for_module(submodule)
+
+
+def _get_import_path(module_path: Path) -> str:
+    """Returns the path to use to import a given (internal) module."""
+    return ".".join(module_path.relative_to(REPO_ROOTDIR).with_suffix("").parts)
 
 
 if __name__ in ["__main__", "<run_path>"]: