
Feature/transforms speedup #186

Merged 32 commits on Feb 15, 2024
Commits (32)
e310134
Modified benchmark run in Windows to use direct function calls instea…
SamRWest Feb 14, 2024
43af588
testing regression code
SamRWest Feb 14, 2024
57bbb4d
limit max_workers on windows to reduce peak ram
SamRWest Feb 14, 2024
13f35ee
more process count limiting
SamRWest Feb 14, 2024
1c8e11f
move max_workers to utils.
SamRWest Feb 14, 2024
552af5b
Added more detail to benchmarking section in readme
SamRWest Feb 14, 2024
52ae6b5
added debug flag for running benchmarks as function calls so breakpoi…
SamRWest Feb 14, 2024
4572d0e
If benchmark fails in CI, try again without the --dd flag, to see if …
SamRWest Feb 14, 2024
beb05ad
Split into separate CI tasks with/without GAMS license
SamRWest Feb 14, 2024
f24ca91
update ci syntax
SamRWest Feb 14, 2024
319fa1a
fixed secret check
SamRWest Feb 14, 2024
8daecc4
add working dir
SamRWest Feb 14, 2024
1b8b0c9
Pass CI if no-GAMS regression tests pass
SamRWest Feb 14, 2024
bde5e42
Merge branch 'main' into feature/benchmarks_on_windows
SamRWest Feb 14, 2024
dd6fc95
Addressed review comments
SamRWest Feb 14, 2024
e5c9b1f
Merge remote-tracking branch 'github-etsap/feature/benchmarks_on_wind…
SamRWest Feb 14, 2024
00fed2b
vectorised loops in transforms.generate_commodity_groups()
SamRWest Feb 14, 2024
fc5ee0b
add tqdm package for progress bars
SamRWest Feb 14, 2024
e4d5bda
added percentage changes to benchmark results
SamRWest Feb 14, 2024
a9cae65
Added unit tests, test data, pytest lib/config and CI step
SamRWest Feb 14, 2024
25c48ee
removed commented calls, fixed typos
SamRWest Feb 15, 2024
7f399e5
Merge branch 'main' into feature/benchmarks_on_windows
SamRWest Feb 14, 2024
178e32e
Merge remote-tracking branch 'github-etsap/feature/benchmarks_on_wind…
SamRWest Feb 14, 2024
cd9d830
vectorised loops in transforms.generate_commodity_groups()
SamRWest Feb 14, 2024
e805543
add tqdm package for progress bars
SamRWest Feb 14, 2024
6078f4f
added percentage changes to benchmark results
SamRWest Feb 14, 2024
20becfa
Added unit tests, test data, pytest lib/config and CI step
SamRWest Feb 14, 2024
cb82f83
removed commented calls, fixed typos
SamRWest Feb 15, 2024
6127c04
Merge remote-tracking branch 'csiro-github/feature/transforms_speedup…
SamRWest Feb 15, 2024
e29c8b0
renamed input param
SamRWest Feb 15, 2024
fdb19a8
Merge branch 'main-github-etsap' into feature/transforms_speedup
SamRWest Feb 15, 2024
83ec9c7
addressed final comments in #184
SamRWest Feb 15, 2024
36 changes: 31 additions & 5 deletions .github/workflows/ci.yml
@@ -3,9 +3,9 @@ name: CI
on:
# Triggers the workflow on push or pull request events but only for the main branch
push:
branches: [main]
branches: [ main ]
pull_request:
branches: [main]
branches: [ main ]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
@@ -34,6 +34,12 @@ jobs:
pre-commit install
pre-commit run --all-files

- name: Run unit tests
working-directory: xl2times
run: |
source .venv/bin/activate
pytest

# ---------- Prepare ETSAP Demo models

- uses: actions/checkout@v3
@@ -69,6 +75,9 @@ jobs:
# ---------- Install GAMS

- name: Install GAMS
env:
GAMS_LICENSE: ${{ secrets.GAMS_LICENSE }}
if: ${{ env.GAMS_LICENSE != '' }}
run: |
curl https://d37drm4t2jghv5.cloudfront.net/distributions/44.1.0/linux/linux_x64_64_sfx.exe -o linux_x64_64_sfx.exe
chmod +x linux_x64_64_sfx.exe
@@ -81,17 +90,18 @@
mkdir -p $HOME/.local/share/GAMS
echo "$GAMS_LICENSE" > $HOME/.local/share/GAMS/gamslice.txt
ls -l $HOME/.local/share/GAMS/
env:
GAMS_LICENSE: ${{ secrets.GAMS_LICENSE }}


# ---------- Run tool, check for regressions

- name: Run tool on all benchmarks
env:
GAMS_LICENSE: ${{ secrets.GAMS_LICENSE }}
if: ${{ env.GAMS_LICENSE != '' }}
working-directory: xl2times
# Use tee to also save the output to out.txt so that the summary table can be
# printed again in the next step.
# Save the return code to retcode.txt so that the next step can fail the action
# if run_benchmarks.py failed.
run: |
source .venv/bin/activate
export PATH=$PATH:$GITHUB_WORKSPACE/GAMS/gams44.1_linux_x64_64_sfx
@@ -101,6 +111,22 @@
| tee out.txt; \
echo ${PIPESTATUS[0]} > retcode.txt)
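The `tee`/`PIPESTATUS` trick in the step above can be sketched in Python as well (illustrative only, not part of the PR): capture a command's output, echo it to the console, and persist both the output and the exit code so a later step can reprint the summary or fail the job.

```python
import pathlib
import subprocess
import sys

# Run a command (here a trivial stand-in for run_benchmarks.py), capturing output.
proc = subprocess.run(
    [sys.executable, "-c", "print('benchmark summary')"],
    capture_output=True,
    text=True,
)
print(proc.stdout, end="")  # echo to the console, like tee
pathlib.Path("out.txt").write_text(proc.stdout)        # saved for the "Print summary" step
pathlib.Path("retcode.txt").write_text(str(proc.returncode))  # saved so a later step can fail the job
```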

- name: Run CSV-only regression tests (no GAMS license)
env:
GAMS_LICENSE: ${{ secrets.GAMS_LICENSE }}
if: ${{ env.GAMS_LICENSE == '' }}
working-directory: xl2times
# Run without --dd flag if GAMS license secret doesn't exist.
# Useful for testing for (CSV) regressions in forks before creating PRs.
run: |
source .venv/bin/activate
export PATH=$PATH:$GITHUB_WORKSPACE/GAMS/gams44.1_linux_x64_64_sfx
(python utils/run_benchmarks.py benchmarks.yml \
--times_dir $GITHUB_WORKSPACE/TIMES_model \
--verbose \
| tee out.txt; \
echo ${PIPESTATUS[0]} > retcode.txt)

- name: Print summary
working-directory: xl2times
run: |
5 changes: 4 additions & 1 deletion .gitignore
@@ -13,7 +13,10 @@ ground_truth/*
*.pyproj.*
speedscope.json
*.pkl
.venv/
.venv*/
benchmarks/
.idea/
.python-version
docs/_build/
docs/api/
.coverage
43 changes: 41 additions & 2 deletions README.md
@@ -72,6 +72,45 @@ git commit --no-verify

See our GitHub Actions CI config `.github/workflows/ci.yml` and the utility script `utils/run_benchmarks.py` for examples of running the tool on the DemoS models.

In summary, use the commands below to clone the benchmarks data into your local `benchmarks` dir.
Note that this assumes you have access to all of these repositories (some are private and
you'll have to request access). If not, comment out the inaccessible benchmarks in `benchmarks.yml` before running.

```bash
mkdir benchmarks
# Get VEDA example models and reference DD files
# XLSX files are in a private repo for licensing reasons; please request access or replace them with your own licensed VEDA example files.
git clone [email protected]:olejandro/demos-xlsx.git benchmarks/xlsx/
git clone [email protected]:olejandro/demos-dd.git benchmarks/dd/

# Get Ireland model and reference DD files
git clone [email protected]:esma-cgep/tim.git benchmarks/xlsx/Ireland
git clone [email protected]:esma-cgep/tim-gams.git benchmarks/dd/Ireland
```
Then to run the benchmarks:
```bash
# Run only a single benchmark by name (see benchmarks.yml for the list of names)
python utils/run_benchmarks.py benchmarks.yml --verbose --run DemoS_001-all | tee out.txt

# Run all benchmarks (without GAMS run, just comparing CSV data)
python utils/run_benchmarks.py benchmarks.yml --verbose | tee out.txt


# Run benchmarks with regression tests vs main branch
git checkout -b feature/your_new_changes
# ... make your code changes here ...
git commit -a -m "your commit message" # code must be committed for the comparison against the `main` branch to run
python utils/run_benchmarks.py benchmarks.yml --verbose | tee out.txt
```
At this point, if you haven't broken anything, you should see something like:
```
Change in runtime: +2.97s
Change in correct rows: +0
Change in additional rows: +0
No regressions. You're awesome!
```
If you see a large increase in runtime, a decrease in correct rows, or an increase in additional rows, then you've introduced a regression and will need to investigate and fix it.
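The pass/fail logic above can be sketched as a small Python check (a hypothetical helper for illustration, not `run_benchmarks.py`'s actual implementation): compare the new run's metrics against `main`'s, allowing some slack on runtime.

```python
# Hedged sketch: flag a regression when runtime blows out, correct rows drop,
# or spurious additional rows appear, relative to the main-branch baseline.
def check_regression(base: dict, new: dict, runtime_slack: float = 1.1) -> list[str]:
    problems = []
    if new["runtime"] > base["runtime"] * runtime_slack:
        problems.append(f"runtime regression: {new['runtime'] - base['runtime']:+.2f}s")
    if new["correct_rows"] < base["correct_rows"]:
        problems.append("fewer correct rows than main")
    if new["additional_rows"] > base["additional_rows"]:
        problems.append("more additional (spurious) rows than main")
    return problems

base = {"runtime": 100.0, "correct_rows": 5000, "additional_rows": 10}
new = {"runtime": 102.97, "correct_rows": 5000, "additional_rows": 10}
print(check_regression(base, new))  # small runtime change, same rows -> []
```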

### Debugging Regressions

If your change is causing regressions on one of the benchmarks, a useful way to debug and find the difference is to run the tool in verbose mode and compare the intermediate tables. For example, if your branch has regressions on Demo 1:
@@ -94,8 +133,8 @@ python -m pip install --upgrade build
python -m pip install --upgrade twine
rm -rf dist
python -m build
python -m twine upload dist/*
```


## Contributing

34 changes: 24 additions & 10 deletions pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools>=61.0.0", "wheel"]
build-backend = "setuptools.build_meta"

[tool.setuptools]
packages = ["xl2times"]
packages = ["xl2times", "utils"]

[project]
name = "xl2times"
@@ -14,20 +14,28 @@ requires-python = ">=3.10"
license = { file = "LICENSE" }
keywords = []
classifiers = [
"Development Status :: 4 - Beta",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Development Status :: 4 - Beta",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
]
dependencies = [
"GitPython >= 3.1.31, < 3.2",
"more-itertools",
"openpyxl >= 3.0, < 3.1",
"pandas >= 2.1",
"GitPython >= 3.1.31, < 3.2",
"more-itertools",
"openpyxl >= 3.0, < 3.1",
"pandas >= 2.1",
"pyarrow",
"tqdm",
]

[project.optional-dependencies]
dev = ["black", "pre-commit", "tabulate"]
dev = [
"black",
"pre-commit",
"tabulate",
"pytest",
"pytest-cov"
]

[project.urls]
Documentation = "https://github.com/etsap-TIMES/xl2times#readme"
@@ -36,3 +44,9 @@ Source = "https://github.com/etsap-TIMES/xl2times"

[project.scripts]
xl2times = "xl2times.__main__:main"

[tool.pytest.ini_options]
# don't print runtime warnings
filterwarnings = ["ignore::DeprecationWarning", "ignore::UserWarning", "ignore::FutureWarning"]
# show output, print test coverage report
addopts = '-s --durations=0 --durations-min=5.0 --tb=native --cov-report term --cov-report html --cov=xl2times --cov=utils'
Binary file added tests/data/austimes_pcg_test_data.parquet
Binary file added tests/data/comm_groups_austimes_test_data.parquet
67 changes: 67 additions & 0 deletions tests/test_transforms.py
@@ -0,0 +1,67 @@
from datetime import datetime

import pandas as pd

from xl2times import transforms
from xl2times.transforms import (
_process_comm_groups_vectorised,
_count_comm_group_vectorised,
)

pd.set_option(
"display.max_rows",
20,
"display.max_columns",
20,
"display.width",
300,
"display.max_colwidth",
75,
"display.precision",
3,
)


class TestTransforms:
def test_generate_commodity_groups(self):
"""
Tests that the _count_comm_group_vectorised function works as expected.
Full austimes run:
Vectorised version took 0.021999 seconds
looped version took 966.653371 seconds
43958x speedup
"""
# data extracted immediately before the original for loops
comm_groups = pd.read_parquet(
"tests/data/comm_groups_austimes_test_data.parquet"
).drop(columns=["commoditygroup"])

# filter data so test runs faster
comm_groups = comm_groups.query("region in ['ACT', 'NSW']")

comm_groups2 = comm_groups.copy()
_count_comm_group_vectorised(comm_groups2)
assert comm_groups2.drop(columns=["commoditygroup"]).equals(comm_groups)
assert comm_groups2.shape == (comm_groups.shape[0], comm_groups.shape[1] + 1)

def test_default_pcg_vectorised(self):
"""Tests the default primary commodity group identification logic runs correctly.
Full austimes run:
Looped version took 1107.66 seconds
Vectorised version took 62.85 seconds
"""

# data extracted immediately before the original for loops
comm_groups = pd.read_parquet("tests/data/austimes_pcg_test_data.parquet")

comm_groups = comm_groups[(comm_groups["region"].isin(["ACT", "NT"]))]
comm_groups2 = _process_comm_groups_vectorised(
comm_groups.copy(), transforms.csets_ordered_for_pcg
)
assert comm_groups2 is not None and not comm_groups2.empty
assert comm_groups2.shape == (comm_groups.shape[0], comm_groups.shape[1] + 1)
assert comm_groups2.drop(columns=["DefaultVedaPCG"]).equals(comm_groups)


if __name__ == "__main__":
TestTransforms().test_default_pcg_vectorised()
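The kind of vectorisation these tests exercise can be sketched as follows (an illustrative example only: the column names and grouping keys are stand-ins, not the exact logic of `_count_comm_group_vectorised`). A single `groupby(...).transform("size")` replaces a Python-level loop over rows and groups, which is where speedups of the order reported in the docstrings come from.

```python
import pandas as pd

# Toy frame shaped like the commodity-groups test data (columns are assumptions).
df = pd.DataFrame(
    {
        "region": ["ACT", "ACT", "ACT", "NSW"],
        "csets": ["NRG", "NRG", "MAT", "NRG"],
    }
)
# One vectorised transform computes the per-(region, csets) group size for every row.
df["commoditygroup"] = df.groupby(["region", "csets"])["csets"].transform("size")
print(df["commoditygroup"].tolist())  # [2, 2, 1, 1]
```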
Empty file added utils/__init__.py
Empty file.
8 changes: 6 additions & 2 deletions utils/dd_to_csv.py
@@ -216,13 +216,17 @@ def convert_dd_to_tabular(
return


if __name__ == "__main__":
def main(arg_list: None | list[str] = None):
args_parser = argparse.ArgumentParser()
args_parser.add_argument(
"input_dir", type=str, help="Input directory containing .dd files."
)
args_parser.add_argument(
"output_dir", type=str, help="Output directory to save the .csv files in."
)
args = args_parser.parse_args()
args = args_parser.parse_args(arg_list)
convert_dd_to_tabular(args.input_dir, args.output_dir, generate_headers_by_attr())


if __name__ == "__main__":
main()