Cache extracted EmbeddedXlTables based on xlsx file hashes #196
Conversation
CI is failing because there is no cache present yet, and the regression tests run first on the PR branch and second on the main branch -- the main branch uses the cache created by the PR branch, so it is much faster! When I get time, I'm planning to disable time-based regressions as a CI failure and instead have CI comment on the PR with the difference in runtime if it is significant (time measurements aren't very accurate anyway, since we run the benchmarks in parallel). That way we can see the difference in time and decide whether to ignore or investigate as appropriate.
Cool, thanks! Do I just merge it then?
Not yet, I need to fix the cache in the GitHub Actions.
env:
  PY_VERSION: "3.11"
  REF_TIMES_model: "b488fb07f0899ee8b7e710c230b1a9414fa06f7d"
  REF_demos-xlsx: "f956db07a253d4f5c60e108791ab7bb2b8136690"
  REF_demos-dd: "2848a8a8e2fdcf0cdf7f83eefbdd563b0bb74e86"
  REF_tim: "e820d8002adc6b1526a3bffcc439219b28d0eed5"
  REF_tim-gams: "703f6a4e1d0bedd95c3ebdae534496f3a7e1b7cc"
Here we are fixing the versions to be used for tests on the CI, right?
Great!
It looks like I was mistaken about the performance implications. Here are the total runtimes of running all the benchmarks on my laptop:
So the act of hashing and creating the cache in the first run adds 59s (7%) to the runtime, but once we have the cache, subsequent runs are 152s (19%) faster. I'm going to merge this in, assuming that the tradeoff is useful for now, but I'm happy to discuss whether we want caching to be the default behaviour and to iterate on this further.
Thanks @siddharth-krishna! I believe this will definitely be useful, also when using the tool for scenario analysis. I guess, later on, we could develop this further to skip some of the transforms depending on what's changed? @SamRWest how long does it take to read AusTIMES?
In the case of the TIMES-US Model (TUSM), reading the files takes about 6 minutes (ca. 25% of the total processing time). Although
Nice - you beat me to it! I was considering adding basically this feature to speed up my AusTIMES runs :) Thanks! It might be worth trying caching with parquet files instead of pickles in future; they're usually lightning fast to read/write compared to pickles (although 0.8s is probably fast enough for me :) ), and the file sizes are much smaller. The downside is that you'd have to save one file per table, though. Another potential issue is that the
Speeds for AusTIMES are:
Great to hear this is useful! Thanks for the suggestions, I've moved the discussion to an issue so we don't lose track of it: #199
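As a rough sketch of the parquet suggestion above (an illustration only, not code from this project; it assumes the extracted tables can be represented as pandas DataFrames keyed by name, and that pyarrow or fastparquet is installed), one-file-per-table caching could look like this:

```python
import hashlib
from pathlib import Path

import pandas as pd

CACHE_DIR = Path("xl2times/.cache")  # cache location mentioned in the PR description


def write_parquet_cache(tables: dict[str, pd.DataFrame], xlsx_path: Path) -> Path:
    """Write one parquet file per table into a directory keyed by the xlsx file's hash."""
    file_hash = hashlib.sha256(xlsx_path.read_bytes()).hexdigest()
    out_dir = CACHE_DIR / f"{xlsx_path.name}_{file_hash}"
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, df in tables.items():
        # requires pyarrow or fastparquet; one file per table is the downside noted above
        df.to_parquet(out_dir / f"{name}.parquet")
    return out_dir


def read_parquet_cache(cache_dir: Path) -> dict[str, pd.DataFrame]:
    """Read all cached tables back in; one read per file is the cost of this layout."""
    return {p.stem: pd.read_parquet(p) for p in sorted(cache_dir.glob("*.parquet"))}
```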
This PR adds a feature to cache the extracted EmbeddedXlTables from input XLSX files by default (and removes the `--use_pkl` flag). This is because on some bigger benchmarks `openpyxl` takes ages to extract the tables, and the xl2times developer workflow rarely involves changing the XLSX files.

The cache is stored in `xl2times/.cache/` and saves files based on the hash of the XLSX file. It also checks the filename (and could be extended to check the modification time) to avoid hash collisions. Another future improvement could be to enforce a maximum size for the cache directory.

The change in runtime appears to be about 50% on CI (note that for this PR's CI, the cache is created when the tool is first run on the PR branch, and then used when it is run on `main`). See the comment in the conversation above for more accurate performance implications.

There's also a new flag, `--no_cache`, to ignore the cache directory and read the files from scratch.

@SamRWest just tagging you so you are aware of this (potentially confusing) behaviour change.
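For readers unfamiliar with the approach, here is a minimal sketch of what hash-keyed caching like this can look like. It is an illustration under stated assumptions, not the code from this PR: the function name `extract_tables`, the exact cache-file naming, and the pickle format are all placeholders.

```python
import hashlib
import pickle
from pathlib import Path

import openpyxl

CACHE_DIR = Path("xl2times/.cache")  # cache location mentioned in the PR description


def extract_tables(xlsx_path: Path) -> list:
    # Stand-in for the slow openpyxl-based extraction; the real xl2times code
    # returns EmbeddedXlTables rather than sheet names.
    wb = openpyxl.load_workbook(xlsx_path, data_only=True)
    return [ws.title for ws in wb.worksheets]


def extract_tables_cached(xlsx_path: Path, no_cache: bool = False) -> list:
    """Return extracted tables, reusing a hash-keyed cache unless no_cache is set."""
    file_hash = hashlib.sha256(xlsx_path.read_bytes()).hexdigest()
    # Include the filename in the cache key to guard against hash collisions,
    # as the PR description mentions.
    cache_file = CACHE_DIR / f"{xlsx_path.name}_{file_hash}.pkl"

    if not no_cache and cache_file.exists():
        with open(cache_file, "rb") as f:
            return pickle.load(f)  # cache hit: skip the expensive extraction

    tables = extract_tables(xlsx_path)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "wb") as f:
        pickle.dump(tables, f)
    return tables
```

A `--no_cache` CLI flag would then simply call this with `no_cache=True`, forcing a fresh read of the XLSX files.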