This document will help you run the TPC-H benchmarks in this directory.
Clone this repository
git clone [email protected]:coiled/benchmarks
cd benchmarks
Follow the environment creation steps in the root directory, namely:
mamba env create -n tpch -f ci/environment.yml
conda activate tpch
pip-compile ci/requirements-2nightly.in # Or `ci/requirements-2tpch-non-dask.in` if you want Spark/DuckDB/Polars
pip install -r ci/requirements-2nightly.txt
pytest --benchmark tests/tpch/test_dask.py
By default we run Scale 100 (about 100 GB) on the cloud with Coiled. You can
configure this by changing the values for `_local` and `_scale` in the
`conftest.py` file in this directory (they're at the top).
If you want to run locally, you'll need to generate data. Run the following from the root directory of this repository.
python tests/tpch/generate_data.py --scale 10
When running on the cloud you can run many tests simultaneously. We recommend using pytest-xdist for this with the keywords:
- `-n 4`: run four parallel jobs
- `--dist loadscope`: split apart by module
pytest --benchmark -n 4 --dist loadscope tests/tpch
Timing outputs are dropped into `benchmark.db` in the root of this repository.
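To check what a run actually recorded, you can peek inside the file before charting anything. The sketch below assumes `benchmark.db` is a SQLite database (this README does not state the format) and makes no assumption about table names: it simply lists whatever tables exist along with their row counts. The helper name `list_tables` is made up for this example.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path="benchmark.db"):
    """Return {table_name: row_count} for a SQLite file.

    Assumes the file is SQLite; table names are discovered, not hard-coded.
    """
    path = Path(db_path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; run the benchmarks first")
    con = sqlite3.connect(str(path))
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        # Quote each name so unusual table names still work.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()
```

Run from the root of the repository, `list_tables()` with no arguments reads `benchmark.db` in the current directory.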
You can generate charts analyzing results using either the notebook
`visualize.ipynb` in this directory (recommended) or the `generate-plot.py`
script in this directory. These require `ibis` and `altair` (not installed
above). These are both meant to be run from the root directory of this repository.
These pull out the most recent records for each query/library pairing. If
you're changing scales and want to ensure clean results, you may want to nuke
your `benchmark.db` file between experiments (it's ok, it'll regenerate
automatically).
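If you switch scales often, clearing the file can be scripted. This is a minimal sketch using only the standard library; the helper name `reset_benchmark_db` is hypothetical, and the default path assumes you run it from the root of the repository.

```python
from pathlib import Path


def reset_benchmark_db(db_path="benchmark.db"):
    """Delete the results file if present; return True if a file was removed."""
    path = Path(db_path)
    existed = path.exists()
    # missing_ok requires Python 3.8+; it makes the call safe to repeat.
    path.unlink(missing_ok=True)
    return existed
```

Calling `reset_benchmark_db()` before a run at a new scale gives the charts a clean set of records to work from.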