Fix reproducibility issues in tests (#63)
* Fix reproducibility issues in tests

Signed-off-by: Fabrice Normandin <[email protected]>

* Fix creation of `lightning_logs` dir in tests

Signed-off-by: Fabrice Normandin <[email protected]>

* Re-add regression files to git index

Signed-off-by: Fabrice Normandin <[email protected]>

* Try to fix issues with hf_example_test.py

Signed-off-by: Fabrice Normandin <[email protected]>

* re-enable --slow tests in last CI step

* use rye run, not pdm run

* Don't skip if files are missing

Signed-off-by: Fabrice Normandin <[email protected]>

* Run full regression tests on dev machine

Signed-off-by: Fabrice Normandin <[email protected]>

* Remove GPU name from regression files

Signed-off-by: Fabrice Normandin <[email protected]>

* Remove code to select tests based on duration

Signed-off-by: Fabrice Normandin <[email protected]>

* Tweak incremental testing annotation

Signed-off-by: Fabrice Normandin <[email protected]>

* Tweak the `rye sync` in local integration tests

Signed-off-by: Fabrice Normandin <[email protected]>

* Try disabling rye cache in local_integration_tests

Signed-off-by: Fabrice Normandin <[email protected]>

* Update tensor_regression and precision in files

Signed-off-by: Fabrice Normandin <[email protected]>

* Revert "Try disabling rye cache in local_integration_tests"

This reverts commit 8fd85a4.

* Revert "Tweak the `rye sync` in local integration tests"

This reverts commit 54c1d55.

* Fix hash function used in regression tests

Signed-off-by: Fabrice Normandin <[email protected]>

* Don't include the tensor hash in regression files

Signed-off-by: Fabrice Normandin <[email protected]>

* Remove hashes from existing regression files

Signed-off-by: Fabrice Normandin <[email protected]>

* Show installed packages in slurm integration tests

Signed-off-by: Fabrice Normandin <[email protected]>

* Add an xfail on a specific regression test

Signed-off-by: Fabrice Normandin <[email protected]>

---------

Signed-off-by: Fabrice Normandin <[email protected]>
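
Several of the commits above remove the tensor hash from the regression files. The message does not say why. One plausible reason, stated here as an assumption, is that a hash of the raw bytes changes with any last-bit floating-point difference, while rounded summary statistics do not. A minimal standard-library sketch of that effect follows; `tensor_hash` and `summary` are hypothetical helpers written for this illustration, not functions from `tensor_regression`.

```python
import hashlib
import struct


def tensor_hash(values: list[float]) -> str:
    """Hypothetical helper: hash the raw float64 bytes of the values."""
    raw = struct.pack(f"{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()


def summary(values: list[float], digits: int = 3) -> dict[str, str]:
    """Hypothetical helper: min/max/mean/sum formatted with a few significant digits."""
    total = sum(values)
    stats = {
        "min": min(values),
        "max": max(values),
        "mean": total / len(values),
        "sum": total,
    }
    return {name: f"{value:.{digits}e}" for name, value in stats.items()}


reference = [0.1, -1.5, 2.3, 0.7]
# Same data with a tiny perturbation, standing in for run-to-run numerical noise.
noisy = [v + 1e-12 for v in reference]

hashes_match = tensor_hash(reference) == tensor_hash(noisy)
summaries_match = summary(reference) == summary(noisy)
print(f"hashes match: {hashes_match}, summaries match: {summaries_match}")
```

With these inputs the hashes differ while the rounded summaries are identical, which is the behaviour a regression file needs in order to be stable across machines.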
lebrice authored Oct 9, 2024
1 parent 612647f commit cf197f8
Showing 65 changed files with 8,445 additions and 179 deletions.
26 changes: 13 additions & 13 deletions .github/workflows/build.yml
@@ -70,11 +70,7 @@ jobs:
         run: rye pin ${{ matrix.python-version }}
       - name: Install dependencies
         run: rye sync --no-lock
-      - name: Test with pytest (very fast)
-        env:
-          JAX_PLATFORMS: cpu
-        run: rye run pytest -v --shorter-than=1.0 --cov=project --cov-report=xml --cov-append --skip-if-files-missing
-      - name: Test with pytest (fast)
+      - name: Test with pytest
         env:
           JAX_PLATFORMS: cpu
         run: rye run pytest -v --cov=project --cov-report=xml --cov-append --skip-if-files-missing
@@ -108,10 +104,10 @@ jobs:
         run: rye sync --no-lock
 
       - name: Test with pytest
-        run: rye run pytest -v --cov=project --cov-report=xml --cov-append --skip-if-files-missing
-      # TODO: this is taking too long to run, and is failing consistently. Need to debug this before making it part of the CI again.
-      # - name: Test with pytest (only slow tests)
-      # run: pdm run pytest -v -m slow --slow --cov=project --cov-report=xml --cov-append
+        run: rye run pytest -v --cov=project --cov-report=xml --cov-append
+
+      - name: Test with pytest (only slow tests)
+        run: rye run pytest -v -m slow --slow --cov=project --cov-report=xml --cov-append
 
       - name: Store coverage report as an artifact
         uses: actions/upload-artifact@v4
@@ -159,13 +155,17 @@ jobs:
       - name: Set up the repo using the setup script
         run: scripts/mila_setup.sh
       - name: Install dependencies
-        run: rye sync --no-lock
+        run: rye sync --all-features --no-lock
+      - name: Show installed packages
+        run: rye list
       - name: Test with pytest
-        run: rye run pytest -v --cov=project --cov-report=xml --cov-append --skip-if-files-missing
+        run: rye run pytest -v --cov=project --cov-report=xml --cov-append
 
-      # TODO: Re-enable this later
+      # TODO: Disabling full regression tests on the cluster for now, because the worker is often
+      # interrupted and we want to avoid using the unkillable partition to not interrupt other's
+      # work. The full regression tests are still run on the local_integration_tests job.
       # - name: Test with pytest (only slow tests)
-      # run: pdm run pytest -v -m slow --slow --cov=project --cov-report=xml --cov-append
+      # run: rye run pytest -v -m slow --slow --cov=project --cov-report=xml --cov-append
 
       - name: Store coverage report as an artifact
         uses: actions/upload-artifact@v4
@@ -0,0 +1,94 @@
+batch.0:
+  device: cpu
+  max: '2.126e+00'
+  mean: '-6.179e-03'
+  min: '-1.989e+00'
+  shape:
+  - 128
+  - 3
+  - 32
+  - 32
+  sum: '-2.43e+03'
+batch.1:
+  device: cpu
+  max: 9
+  mean: '4.555e+00'
+  min: 0
+  shape:
+  - 128
+  sum: 583
+grads.network.0.1.bias:
+  device: cpu
+  max: '6.107e-03'
+  mean: '1.775e-04'
+  min: '-5.292e-03'
+  shape:
+  - 128
+  sum: '2.272e-02'
+grads.network.0.1.weight:
+  device: cpu
+  max: '1.307e-02'
+  mean: '4.693e-05'
+  min: '-1.141e-02'
+  shape:
+  - 128
+  - 3072
+  sum: '1.845e+01'
+grads.network.1.0.bias:
+  device: cpu
+  max: '1.041e-02'
+  mean: '6.975e-04'
+  min: '-8.782e-03'
+  shape:
+  - 128
+  sum: '8.928e-02'
+grads.network.1.0.weight:
+  device: cpu
+  max: '1.584e-02'
+  mean: '1.481e-04'
+  min: '-1.507e-02'
+  shape:
+  - 128
+  - 128
+  sum: '2.426e+00'
+grads.network.2.0.bias:
+  device: cpu
+  max: '3.282e-02'
+  mean: '-1.956e-09'
+  min: '-2.134e-02'
+  shape:
+  - 10
+  sum: '-1.956e-08'
+grads.network.2.0.weight:
+  device: cpu
+  max: '2.200e-02'
+  mean: '-2.874e-10'
+  min: '-5.831e-02'
+  shape:
+  - 10
+  - 128
+  sum: '-3.679e-07'
+outputs.logits:
+  device: cpu
+  max: '7.036e-01'
+  mean: '-8.651e-03'
+  min: '-8.180e-01'
+  shape:
+  - 128
+  - 10
+  sum: '-1.107e+01'
+outputs.loss:
+  device: cpu
+  max: '2.316e+00'
+  mean: '2.316e+00'
+  min: '2.316e+00'
+  shape: []
+  sum: '2.316e+00'
+outputs.y:
+  device: cpu
+  max: 9
+  mean: '4.555e+00'
+  min: 0
+  shape:
+  - 128
+  sum: 583
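
Each entry in the regression file above stores a `shape`, a `sum`, and a `mean`, so the three can be cross-checked: the mean should be approximately the sum divided by the number of elements. The sketch below is a reader-side sanity check, not part of the project's test suite; `is_consistent` is a hypothetical helper, and the loose tolerance is an assumption that allows for the values being rounded to a few significant digits before they were written.

```python
import math


def is_consistent(shape: list[int], total: float, mean: float, rel_tol: float = 1e-3) -> bool:
    """Check that `mean` is approximately `total` divided by the element count."""
    numel = math.prod(shape)  # math.prod of an empty shape is 1, i.e. a scalar
    return math.isclose(total / numel, mean, rel_tol=rel_tol)


# Values copied from the `batch.1`, `outputs.loss` and `batch.0` entries above.
labels_ok = is_consistent(shape=[128], total=583, mean=4.555)
loss_ok = is_consistent(shape=[], total=2.316, mean=2.316)
# For `batch.0` both numbers are rounded, so only approximate agreement is expected.
images_ok = is_consistent(shape=[128, 3, 32, 32], total=-2.43e3, mean=-6.179e-3)
print(labels_ok, loss_ok, images_ok)
```

For example, `batch.1` has 128 elements and a sum of 583, and 583 / 128 = 4.5546875, which agrees with the stored mean of `4.555e+00` after rounding.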
@@ -0,0 +1,94 @@
+batch.0:
+  device: cpu
+  max: '2.821e+00'
+  mean: '4.822e-01'
+  min: '-4.242e-01'
+  shape:
+  - 128
+  - 1
+  - 28
+  - 28
+  sum: '4.839e+04'
+batch.1:
+  device: cpu
+  max: 9
+  mean: '4.555e+00'
+  min: 0
+  shape:
+  - 128
+  sum: 583
+grads.network.0.1.bias:
+  device: cpu
+  max: '6.875e-03'
+  mean: '2.096e-04'
+  min: '-8.370e-03'
+  shape:
+  - 128
+  sum: '2.683e-02'
+grads.network.0.1.weight:
+  device: cpu
+  max: '1.948e-02'
+  mean: '2.916e-04'
+  min: '-2.213e-02'
+  shape:
+  - 128
+  - 784
+  sum: '2.926e+01'
+grads.network.1.0.bias:
+  device: cpu
+  max: '1.109e-02'
+  mean: '2.213e-04'
+  min: '-1.267e-02'
+  shape:
+  - 128
+  sum: '2.832e-02'
+grads.network.1.0.weight:
+  device: cpu
+  max: '2.374e-02'
+  mean: '9.326e-05'
+  min: '-2.32e-02'
+  shape:
+  - 128
+  - 128
+  sum: '1.528e+00'
+grads.network.2.0.bias:
+  device: cpu
+  max: '3.847e-02'
+  mean: '-3.353e-09'
+  min: '-4.706e-02'
+  shape:
+  - 10
+  sum: '-3.353e-08'
+grads.network.2.0.weight:
+  device: cpu
+  max: '5.741e-02'
+  mean: '-4.195e-10'
+  min: '-6.431e-02'
+  shape:
+  - 10
+  - 128
+  sum: '-5.369e-07'
+outputs.logits:
+  device: cpu
+  max: '9.872e-01'
+  mean: '-1.288e-02'
+  min: '-7.225e-01'
+  shape:
+  - 128
+  - 10
+  sum: '-1.648e+01'
+outputs.loss:
+  device: cpu
+  max: '2.311e+00'
+  mean: '2.311e+00'
+  min: '2.311e+00'
+  shape: []
+  sum: '2.311e+00'
+outputs.y:
+  device: cpu
+  max: 9
+  mean: '4.555e+00'
+  min: 0
+  shape:
+  - 128
+  sum: 583
@@ -0,0 +1,94 @@
+batch.0:
+  device: cpu
+  max: '2.821e+00'
+  mean: '1.432e-02'
+  min: '-4.242e-01'
+  shape:
+  - 128
+  - 1
+  - 28
+  - 28
+  sum: '1.437e+03'
+batch.1:
+  device: cpu
+  max: 9
+  mean: '4.242e+00'
+  min: 0
+  shape:
+  - 128
+  sum: 543
+grads.network.0.1.bias:
+  device: cpu
+  max: '1.075e-02'
+  mean: '2.421e-04'
+  min: '-7.844e-03'
+  shape:
+  - 128
+  sum: '3.099e-02'
+grads.network.0.1.weight:
+  device: cpu
+  max: '2.006e-02'
+  mean: '5.258e-05'
+  min: '-1.844e-02'
+  shape:
+  - 128
+  - 784
+  sum: '5.277e+00'
+grads.network.1.0.bias:
+  device: cpu
+  max: '1.169e-02'
+  mean: '4.285e-04'
+  min: '-1.152e-02'
+  shape:
+  - 128
+  sum: '5.485e-02'
+grads.network.1.0.weight:
+  device: cpu
+  max: '1.753e-02'
+  mean: '1.016e-04'
+  min: '-2.219e-02'
+  shape:
+  - 128
+  - 128
+  sum: '1.665e+00'
+grads.network.2.0.bias:
+  device: cpu
+  max: '3.969e-02'
+  mean: '-1.304e-09'
+  min: '-7.979e-02'
+  shape:
+  - 10
+  sum: '-1.304e-08'
+grads.network.2.0.weight:
+  device: cpu
+  max: '3.221e-02'
+  mean: '-1.306e-10'
+  min: '-6.755e-02'
+  shape:
+  - 10
+  - 128
+  sum: '-1.672e-07'
+outputs.logits:
+  device: cpu
+  max: '7.029e-01'
+  mean: '-3.564e-02'
+  min: '-7.781e-01'
+  shape:
+  - 128
+  - 10
+  sum: '-4.562e+01'
+outputs.loss:
+  device: cpu
+  max: '2.304e+00'
+  mean: '2.304e+00'
+  min: '2.304e+00'
+  shape: []
+  sum: '2.304e+00'
+outputs.y:
+  device: cpu
+  max: 9
+  mean: '4.242e+00'
+  min: 0
+  shape:
+  - 128
+  sum: 543
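
A regression file like the one above is only useful if a later run can be compared against it with some tolerance. The following is a minimal sketch of such a comparison, assuming a relative tolerance of `1e-3`; the `mismatches` helper and both sets of "fresh" values are made up for illustration and are not taken from `tensor_regression` or from any real run.

```python
import math

# Stored statistics, copied from the `outputs.logits` entry above (strings, as in the file).
stored = {"max": "7.029e-01", "mean": "-3.564e-02", "min": "-7.781e-01", "sum": "-4.562e+01"}


def mismatches(stored: dict[str, str], fresh: dict[str, float], rel_tol: float = 1e-3) -> list[str]:
    """Return the names of statistics whose fresh value is not close to the stored one."""
    return [
        name
        for name, text in stored.items()
        if not math.isclose(float(text), fresh[name], rel_tol=rel_tol)
    ]


# Made-up values: the first set differs only below the stored precision,
# the second has a clearly different mean and sum.
same_run = {"max": 0.70291, "mean": -0.035638, "min": -0.77812, "sum": -45.617}
changed_run = {"max": 0.70291, "mean": -0.041, "min": -0.77812, "sum": -52.5}

no_diff = mismatches(stored, same_run)
diff = mismatches(stored, changed_run)
print(no_diff, diff)
```

Comparing a handful of rounded statistics trades sensitivity for stability: small numerical noise passes, while a change that shifts the mean or the sum is reported by name.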