TPC-H Benchmarks

This document will help you run the TPC-H benchmarks in this directory.

Setup

Clone this repository

git clone git@github.com:coiled/benchmarks
cd benchmarks

Follow the environment creation steps described in the root directory, namely:

mamba env create -n tpch -f ci/environment.yml
conda activate tpch
pip-compile ci/requirements-2nightly.in  # Or `ci/requirements-2tpch-non-dask.in` if you want Spark/DuckDB/Polars
pip install -r ci/requirements-2nightly.txt

Run Dask Benchmarks

pytest --benchmark tests/tpch/test_dask.py

Configure

By default we run at scale 100 (about 100 GB) on the cloud with Coiled. To change this, edit the values of _local and _scale at the top of the conftest.py file in this directory.

Local Data Generation

If you want to run locally, you'll need to generate data first. Run the following from the root directory of this repository; the --scale argument sets the scale of the generated data.

python tests/tpch/generate_data.py --scale 10

Run Many Tests

When running on the cloud you can run many tests simultaneously. We recommend using pytest-xdist for this, with the following options:

  • -n 4: run four parallel jobs
  • --dist loadscope: distribute tests module by module, so each module's tests stay on one worker

pytest --benchmark -n 4 --dist loadscope tests/tpch

Generate Plots

Timing results are written to benchmark.db in the root of this repository. You can generate charts analyzing the results using either the notebook visualize.ipynb in this directory (recommended) or the generate-plot.py script in this directory. Both require ibis and altair, which the setup steps above do not install.

These are both meant to be run from the root directory of this repository.

Both the notebook and the script pull out the most recent records for each query/library pairing. If you're changing scales and want to ensure clean results, you may want to nuke your benchmark.db file between experiments (it's ok, it'll regenerate automatically).
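If you'd rather keep old results than delete them, one option is to move the file aside before the next experiment. The sketch below is not part of this repository. The helper names summarize_db and archive_db and the label argument are hypothetical. It also assumes benchmark.db is a SQLite file; if it isn't, summarize_db will raise sqlite3.DatabaseError, while archive_db still works.

```python
import shutil
import sqlite3
from contextlib import closing
from datetime import datetime
from pathlib import Path


def summarize_db(path="benchmark.db"):
    """Return {table_name: row_count} for each table in a SQLite file."""
    path = Path(path)
    if not path.exists():
        # sqlite3.connect would otherwise silently create an empty file.
        raise FileNotFoundError(path)
    with closing(sqlite3.connect(str(path))) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


def archive_db(path="benchmark.db", label="scale100"):
    """Move the database aside; return the new path, or None if no file exists."""
    path = Path(path)
    if not path.exists():
        return None
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = path.with_name(f"{path.stem}-{label}-{stamp}{path.suffix}")
    shutil.move(str(path), str(target))
    return target
```

Run from the root of the repository, archive_db(label="scale10") would rename benchmark.db to something like benchmark-scale10-<timestamp>.db, leaving the next benchmark run to start from a fresh file.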