It's possible to run the Coiled Runtime benchmarks on A/B comparisons, highlighting performance differences between different released versions of dask, distributed, or any of the dependencies and/or between different dask configs.
To run an A/B test:
Branch from main, on the coiled repo itself. Preferably, call the branch something meaningful, e.g. `AB/jobstealing`.
You must create the branch on the Coiled repo (`coiled/benchmarks`); CI workflows will not work on a fork (`yourname/benchmarks`).
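For example, a minimal sequence (branch name illustrative; this assumes you have push rights to `coiled/benchmarks`):

```bash
git clone https://github.com/coiled/benchmarks.git
cd benchmarks
git checkout -b AB/jobstealing     # branch directly off main
git push -u origin AB/jobstealing  # the branch must live on coiled/benchmarks, not a fork
```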
Open the `AB_environments/` directory and rename/create files as needed.
Each A/B runtime is made of exactly three files:

- `AB_<name>.conda.yaml` (a conda environment file)
- `AB_<name>.dask.yaml` (a dask configuration file)
- `AB_<name>.cluster.yaml` (`coiled.Cluster` kwargs)
You may create as many A/B runtime configs as you want in a single `coiled-runtime` branch.
You can use the utility `make_envs.py <name>, [name], ...` to automate file creation.
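For example, a hypothetical invocation (the working directory and exact argument format are assumptions; check the script itself):

```bash
# Assumed to be run from inside the AB_environments/ directory; each name
# should produce a .conda.yaml, .dask.yaml and .cluster.yaml triplet.
python make_envs.py AB_baseline AB_no_steal
```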
The conda environment file can contain whatever you want, as long as it can run the tests; e.g.
```yaml
channels:
  - conda-forge
dependencies:
  - python =3.9
  - <copy-paste from ci/environment.yaml, minus bits you want to change>
  # Changes from the default environment start here
  - dask ==2023.4.1
  - distributed ==2023.4.1
```
Instead of published packages, you could also use arbitrary git hashes of arbitrary forks, e.g.
```yaml
  - pip:
    - git+https://github.com/dask/dask@b85bf5be72b02342222c8a0452596539fce19bce
    - git+https://github.com/yourname/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```
You may also ignore the default environment and go for a barebones one. The bare minimum you need to install is `dask`, `distributed`, `coiled`, and `s3fs`.
This will however skip some tests, e.g. zarr and ML-related ones, and it will also expose you to less controlled behaviour, e.g. dependent on which versions of numpy and pandas are pulled in:
```yaml
channels:
  - conda-forge
dependencies:
  - python =3.9
  - dask ==2023.4.1
  - distributed ==2023.4.1
  - coiled
  - s3fs
```
The second file in each triplet is a dask config file. If you don't want to change the config, you must create an empty file. E.g.:
```yaml
distributed:
  scheduler:
    work-stealing: False
```
The third and final file defines creation options for the dask cluster. It must be formatted as follows:
```yaml
default:
  <kwarg>: <value>
  ...
<cluster name>:
  <kwarg>: <value>
  ...
```
`<cluster name>` must be:

- `small_cluster`, for all tests decorated with `small_cluster`
- `parquet_cluster`, for all tests in `test_parquet.py`
- others: please refer to `cluster_kwargs.yaml`

The `default` section applies to all `<cluster name>` sections, unless explicitly overridden.
Anything that's omitted defaults to the contents of `cluster_kwargs.yaml`. Leave this file blank if you're happy with the defaults.
For example:

```yaml
small_cluster:
  n_workers: 10
  worker_vm_types: [m6i.large]  # 2 CPU, 8 GiB
```
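And a sketch of how a `default:` section combines with per-cluster overrides (the values are illustrative, not recommendations):

```yaml
default:
  worker_vm_types: [m6i.large]   # applies to every cluster section below
small_cluster:
  n_workers: 10                  # inherits worker_vm_types from default
parquet_cluster:
  n_workers: 20
  worker_vm_types: [m6i.xlarge]  # explicitly overrides the default
```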
If you create any files in `AB_environments/`, you must create the baseline environment:

- `AB_baseline.conda.yaml`
- `AB_baseline.dask.yaml`
- `AB_baseline.cluster.yaml`
Open `AB_environments/config.yaml` and set the `repeat` setting to a number higher than 0; this enables the A/B tests.
Setting a low number of repeated runs is faster and cheaper, but will result in higher variance. Setting it to 5 is a good value to get statistically significant results.
`repeat` must remain set to 0 in the main branch, thus completely disabling A/B tests, in order to avoid unnecessary runs.
In the same file, you can also set the `test_null_hypothesis` flag to true to automatically create a verbatim copy of AB_baseline and then compare the two in the A/B tests. Set it to false to save some money if you are already confident that the `repeat` setting is high enough.
The file offers a `targets` list. These can be test directories, individual test files, or individual tests that you wish to run.
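For example (the paths and test names below are illustrative; individual tests are assumed to use pytest's `::` node-id syntax):

```yaml
targets:
  - tests/benchmarks                       # a whole directory
  - tests/benchmarks/test_parquet.py       # a single file
  - tests/benchmarks/test_h2o.py::test_q1  # a single test (hypothetical name)
```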
The file offers a `markers` string expression, to be passed to the `-m` pytest parameter if present. See `setup.cfg` for the available ones.
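For example (the marker names here are assumptions; `setup.cfg` has the real list):

```yaml
# Standard pytest -m syntax; and/or/not combinations work as usual.
markers: "shuffle and not slow"
```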
`h2o_datasets` is a list of datasets to run through in `tests/benchmarks/test_h2o.py`. Refer to the file for the possible choices.
Finally, the `max_parallel` setting lets you tweak maximum test parallelism, both in GitHub Actions and in pytest-xdist. Reducing parallelism is useful when testing on very large clusters (e.g. to avoid having 20 clusters with 1000 workers each at the same time).
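For instance, when benchmarking very large clusters you might dial everything down (the key names match the full example further below; the comments state the assumed meaning of each key):

```yaml
max_parallel:
  ci_jobs: 1                 # assumed: at most one GitHub Actions job at a time
  pytest_workers_per_job: 1  # assumed: no pytest-xdist parallelism within a job
```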
Nothing prevents you from changing the tests themselves; for example, you may be interested in some specific test, but with double the regular size, half the chunk size, etc.
Example: you want to test the impact of disabling work stealing on the latest version of dask. You'll create at least 4 files:
`AB_environments/AB_baseline.conda.yaml`:

```yaml
channels:
  - conda-forge
dependencies:
  - python =3.9
  - coiled
  - dask
  - distributed
  - s3fs
```
`AB_environments/AB_baseline.dask.yaml`: (empty file)

`AB_environments/AB_baseline.cluster.yaml`: (empty file)

`AB_environments/AB_no_steal.conda.yaml`: (same as baseline)

`AB_environments/AB_no_steal.dask.yaml`:

```yaml
distributed:
  scheduler:
    work-stealing: False
```

`AB_environments/AB_no_steal.cluster.yaml`: (empty file)

`AB_environments/config.yaml`:
```yaml
repeat: 5
test_null_hypothesis: true
targets:
  - tests/benchmarks
h2o_datasets:
  - 5 GB (parquet+pyarrow)
max_parallel:
  ci_jobs: 5
  pytest_workers_per_job: 4
```
- `git push`. Note: you should not open a Pull Request.
- Open [the GitHub Actions tab](https://github.com/coiled/benchmarks/actions/workflows/ab_tests.yml) and wait for the run to complete.
- Open the run from the link above. In the Summary tab, scroll down and download the `static-dashboard` artifact. Note: artifacts will appear only after the run is complete.
- Decompress `static-dashboard.zip` and open `index.html` in your browser.
- Remember to delete the branch once you're done.
Environment build fails with a message such as:

```
coiled 0.2.27 requires distributed>=2.23.0, but you have distributed 2.8.0+1709.ge0932ec2 which is incompatible.
```
Your conda environment points to a fork of dask/dask or dask/distributed, but its owner did not synchronize the tags. To fix, the owner of the fork must run:
```bash
$ git remote -v
origin   https://github.com/yourname/distributed.git (fetch)
origin   https://github.com/yourname/distributed.git (push)
upstream https://github.com/dask/distributed.git (fetch)
upstream https://github.com/dask/distributed.git (push)

$ git fetch upstream --tags  # Or whatever alias the dask org was added as above
$ git push origin --tags     # Or whatever alias the fork was added as above
```
As a handy copy-paste to run from the root dir of this repository:
```bash
pushd ../dask && git fetch upstream --tags && git push origin --tags && popd
pushd ../distributed && git fetch upstream --tags && git push origin --tags && popd
pushd ../dask-expr && git fetch upstream --tags && git push origin --tags && popd
```
The conda environment fails to build, citing incompatibilities with openssl.
Double check that you didn't accidentally type `- python ==3.9`, which means 3.9.0, instead of `- python =3.9`, which means the latest available patch version of 3.9.
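Side by side (only one of the two should appear in a real file):

```yaml
dependencies:
  - python ==3.9  # wrong: pins exactly Python 3.9.0
  - python =3.9   # right: latest available 3.9.x patch release
```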
You get very obscure failures in the workflows, which you can't seem to replicate.
Double check that you don't have the same packages listed both as a conda package and under the special `- pip:` tag. Installing a package with conda and then upgrading it with pip typically works, but it's been observed not to in some cases (e.g. xgboost).
Specifically, `dask` and `distributed` can be installed with conda and then upgraded with pip, but they must both be upgraded. Note that specifying the same version of a package with pip won't upgrade it.
This is bad:
```yaml
dependencies:
  - pip:
    - git+https://github.com/dask/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```
dask-2023.3.2 and distributed-2023.3.2 will be installed from conda anyway, e.g. by coiled.
This is bad:
```yaml
dependencies:
  - dask ==2023.3.2
  - distributed ==2023.3.2
  - pip:
    - dask ==2023.3.2
    - git+https://github.com/dask/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```
The dask version is the same in pip and conda, so conda wins.
This is good:
```yaml
dependencies:
  - dask ==2023.3.2
  - distributed ==2023.3.2
  - pip:
    - dask ==2023.4.1
    - git+https://github.com/dask/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```
The conda versions of dask and distributed are properly uninstalled by pip after they have served as dependencies for the other conda packages.