Cache extracted EmbeddedXlTables based on xlsx file hashes #196

Merged: 14 commits into main from sidk/cache-xlsx on Feb 26, 2024

Conversation

@siddharth-krishna (Collaborator) commented Feb 23, 2024

This PR adds a feature to cache the extracted EmbeddedXlTables from input XLSX files by default (and removes the --use_pkl flag). This is because on some of the bigger benchmarks openpyxl takes a long time to extract the tables, and the xl2times developer workflow rarely involves changing the XLSX files.

The cache is stored in xl2times/.cache/ and files are saved under the hash of the XLSX file's contents. The filename is also checked (and this could be extended to check the modification time) to guard against hash collisions. Another future improvement could be to enforce a maximum size for the cache directory.
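
A minimal sketch of the idea in Python (hypothetical helper names; the real implementation in xl2times may differ):

import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(__file__).parent / ".cache"  # cache lives inside the xl2times module

def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def extract_tables_cached(xlsx_path: Path) -> list:
    """Load extracted tables from the cache if present, else extract and cache them."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Include the filename in the cache key to guard against hash collisions.
    cache_file = CACHE_DIR / f"{hash_file(xlsx_path)}_{xlsx_path.name}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    tables = extract_tables(xlsx_path)  # hypothetical: the slow openpyxl-based extraction
    cache_file.write_bytes(pickle.dumps(tables))
    return tables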

Change in runtime appears to be about 50% on CI (note that for this PR's CI, the cache is created when the tool is first run on the PR branch, and then the cache is used when it is run on main). See the comment below for more accurate performance numbers.

There's also a new flag, --no_cache, to ignore the cache directory and read the files from scratch.
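
A sketch of how such a flag is typically wired up with argparse (not necessarily the PR's exact code):

import argparse

parser = argparse.ArgumentParser("xl2times")
parser.add_argument(
    "--no_cache",
    action="store_true",
    help="Ignore the cache directory and read the XLSX files from scratch",
)
args = parser.parse_args()
# extract_tables_cached (sketched above) would skip the cache lookup when args.no_cache is set.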

@SamRWest just tagging you so you are aware of this (potentially confusing) behaviour change.

@siddharth-krishna (Collaborator, Author)

CI is failing because there is no cache present yet, and the regression tests run first on the PR branch and then on the main branch -- the main branch run uses the cache created by the PR branch run, so it is much faster!

When I get time, I'm planning to disable time-based regressions as a CI failure, and instead have CI comment on the PR with the difference in time if it is significant (the time measurements are not very accurate anyway, as we run the benchmarks in parallel). That way we can see the difference in time and decide whether to ignore or investigate it, as appropriate.

@olejandro (Member)

Cool, thanks! Do I just merge it then?

@siddharth-krishna (Collaborator, Author)

Not yet, I first need to fix the cache in the GitHub Actions workflow.

Comment on lines +17 to +24
env:
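  # Pinned Python version and commit SHAs of the reference model/data repos used by the CI regression tests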
  PY_VERSION: "3.11"
  REF_TIMES_model: "b488fb07f0899ee8b7e710c230b1a9414fa06f7d"
  REF_demos-xlsx: "f956db07a253d4f5c60e108791ab7bb2b8136690"
  REF_demos-dd: "2848a8a8e2fdcf0cdf7f83eefbdd563b0bb74e86"
  REF_tim: "e820d8002adc6b1526a3bffcc439219b28d0eed5"
  REF_tim-gams: "703f6a4e1d0bedd95c3ebdae534496f3a7e1b7cc"

A Member commented:
Here we are fixing the versions to be used for tests on the CI, right?
Great!

@siddharth-krishna (Collaborator, Author)

It looks like I was mistaken about the performance implications. Here are the total runtimes of running all the benchmarks on my laptop:

main            812s
PR-first-run    871s
PR-second-run   658s

So the act of hashing and creating the cache in the first run adds 59s (7%) to the runtime, but once we have the cache, subsequent runs are 152s (19%) faster. I'm going to merge this in, assuming that tradeoff is acceptable for now, but I'm happy to discuss whether we want caching to be the default behaviour and to iterate on this further.

@siddharth-krishna siddharth-krishna merged commit e1d1c8f into main Feb 26, 2024
1 check passed
@siddharth-krishna siddharth-krishna deleted the sidk/cache-xlsx branch February 26, 2024 16:05
@olejandro (Member)

Thanks @siddharth-krishna! I believe this will definitely be useful, also when using the tool for scenario analysis. I guess, later on, we could develop this further to skip some of the transforms depending on what's changed?

@SamRWest how long does it take to read AusTIMES?

@olejandro (Member)

In the case of the TIMES-US Model (TUSM), reading the files takes about 6 minutes (ca. 25% of the total processing time). Although process_wildcards takes most of the time (63%), caching still gives the overall processing time a nice boost!

@SamRWest (Collaborator) commented Feb 26, 2024

Nice - you beat me to it! I was considering adding basically this feature to speed up my AusTIMES runs :) Thanks!

It might be worth trying caching with parquet files instead of pickles in the future; they're usually lightning fast to read/write compared to pickles (although 0.8s is probably fast enough for me :) ), and the file sizes are much smaller. The downside is that you'd have to save one file per table, though.
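
For instance, per-table parquet caching might look roughly like this (a sketch assuming each table's data is a pandas DataFrame; to_parquet needs pyarrow or fastparquet installed):

import pandas as pd
from pathlib import Path

def save_tables(tables: dict[str, pd.DataFrame], cache_dir: Path) -> None:
    # Write one parquet file per table, named by a unique table key.
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, df in tables.items():
        df.to_parquet(cache_dir / f"{name}.parquet")

def load_tables(cache_dir: Path) -> dict[str, pd.DataFrame]:
    # Read all cached tables back into DataFrames, keyed by file stem.
    return {p.stem: pd.read_parquet(p) for p in cache_dir.glob("*.parquet")}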

Another potential issue is that the .cache/ dir gets created inside the xl2times module (i.e. beside xl2times/__init__.py). This seems (to me) a bit unusual. Would it be better to create it in the user's home dir (e.g. pathlib.Path.home() / '.xl2times/cache')? That seems to be where most apps create their caches now; at least I seem to have lots of caches there.
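
A short sketch of that suggestion (hypothetical location, per the comment above):

from pathlib import Path

# Per-user cache directory instead of one inside the installed module.
cache_dir = Path.home() / ".xl2times" / "cache"
cache_dir.mkdir(parents=True, exist_ok=True)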

Speeds for AusTIMES are:
With caching, first run:
Extracted (potentially cached) 555 tables, 34360 rows in 0:01:32.051419
With caching, second run:
Extracted (potentially cached) 555 tables, 34360 rows in 0:00:00.842077
Nice speedup :)

@siddharth-krishna siddharth-krishna mentioned this pull request Feb 27, 2024
@siddharth-krishna (Collaborator, Author)

Great to hear this is useful! Thanks for the suggestions, I've moved the discussion to an issue so we don't lose track of it: #199
