Cache extracted EmbeddedXlTables based on xlsx file hashes #196

Merged: 14 commits into main from sidk/cache-xlsx on Feb 26, 2024

Conversation

@siddharth-krishna (Collaborator) commented Feb 23, 2024

This PR adds a feature to cache the extracted EmbeddedXlTables from input XLSX files by default (and removes the --use_pkl flag). This is because on some of the bigger benchmarks openpyxl takes a long time to extract the tables, and the xl2times developer workflow rarely involves changing the XLSX files.

The cache is stored in xl2times/.cache/ and files are saved under the hash of the XLSX file's contents. The filename is also checked (and this could be extended to check the modification time) to guard against hash collisions. Another future improvement could be to enforce a maximum size for the cache directory.
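
A minimal sketch of the idea in Python (hypothetical helper names; the real implementation in xl2times may differ):

import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(__file__).parent / ".cache"  # cache lives inside the xl2times module

def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def extract_tables_cached(xlsx_path: Path) -> list:
    """Load extracted tables from the cache if present, else extract and cache them."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Include the filename in the cache key to guard against hash collisions.
    cache_file = CACHE_DIR / f"{hash_file(xlsx_path)}_{xlsx_path.name}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    tables = extract_tables(xlsx_path)  # hypothetical: the slow openpyxl-based extraction
    cache_file.write_bytes(pickle.dumps(tables))
    return tables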

Change in runtime appears to be about 50% on CI (note that for this PR's CI, the cache is created when the tool is first run on the PR branch, and then the cache is used when it is run on main). See the comment below for more accurate performance numbers.

There's also a new flag, --no_cache, to ignore the cache directory and read the files from scratch.
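
A sketch of how such a flag is typically wired up with argparse (not necessarily the PR's exact code):

import argparse

parser = argparse.ArgumentParser("xl2times")
parser.add_argument(
    "--no_cache",
    action="store_true",
    help="Ignore the cache directory and read the XLSX files from scratch",
)
args = parser.parse_args()
# extract_tables_cached (sketched above) would skip the cache lookup when args.no_cache is set.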

@SamRWest just tagging you so you are aware of this (potentially confusing) behaviour change.

@siddharth-krishna (Collaborator, Author)

CI is failing because there is no cache present yet, and the regression tests run first on the PR branch and then on the main branch -- the main branch run uses the cache created by the PR branch run, so it is much faster!

When I get time, I'm planning to disable time-based regressions as a CI failure, and instead have CI comment on the PR with the difference in time if it is significant (the time measurements are not very accurate anyway, as we run the benchmarks in parallel). That way we can see the difference in time and decide whether to ignore or investigate it, as appropriate.

@olejandro (Member)

Cool, thanks! Do I just merge it then?

@siddharth-krishna (Collaborator, Author)

Not yet, I first need to fix the cache in the GitHub Actions workflow.

Comment on lines +17 to +24
env:
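  # Pinned Python version and commit SHAs of the reference model/data repos used by the CI regression tests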
  PY_VERSION: "3.11"
  REF_TIMES_model: "b488fb07f0899ee8b7e710c230b1a9414fa06f7d"
  REF_demos-xlsx: "f956db07a253d4f5c60e108791ab7bb2b8136690"
  REF_demos-dd: "2848a8a8e2fdcf0cdf7f83eefbdd563b0bb74e86"
  REF_tim: "e820d8002adc6b1526a3bffcc439219b28d0eed5"
  REF_tim-gams: "703f6a4e1d0bedd95c3ebdae534496f3a7e1b7cc"

A Member commented:
Here we are fixing the versions to be used for tests on the CI, right?
Great!

@siddharth-krishna (Collaborator, Author)

It looks like I was mistaken about the performance implications. Here are the total runtimes of running all the benchmarks on my laptop:

main            812s
PR-first-run    871s
PR-second-run   658s

So the act of hashing and creating the cache in the first run adds 59s (7%) to the runtime, but once we have the cache, subsequent runs are 152s (19%) faster. I'm going to merge this in, assuming that tradeoff is acceptable for now, but I'm happy to discuss whether we want caching to be the default behaviour and to iterate on this further.

@siddharth-krishna siddharth-krishna merged commit e1d1c8f into main Feb 26, 2024
1 check passed
@siddharth-krishna siddharth-krishna deleted the sidk/cache-xlsx branch February 26, 2024 16:05
@olejandro (Member)

Thanks @siddharth-krishna! I believe this will definitely be useful, also when using the tool for scenario analysis. I guess, later on, we could develop this further to skip some of the transforms depending on what's changed?

@SamRWest how long does it take to read AusTIMES?

@olejandro (Member)

In the case of the TIMES-US Model (TUSM), reading the files takes about 6 minutes (ca. 25% of the total processing time). Although process_wildcards takes most of the time (63%), caching still gives the overall processing time a nice boost!

@SamRWest (Collaborator) commented Feb 26, 2024

Nice - you beat me to it! I was considering adding basically this feature to speed up my AusTIMES runs :) Thanks!

It might be worth trying caching with parquet files instead of pickles in the future; they're usually lightning fast to read/write compared to pickles (although 0.8s is probably fast enough for me :) ), and the file sizes are much smaller. The downside is that you'd have to save one file per table, though.
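
For instance, per-table parquet caching might look roughly like this (a sketch assuming each table's data is a pandas DataFrame; to_parquet needs pyarrow or fastparquet installed):

import pandas as pd
from pathlib import Path

def save_tables(tables: dict[str, pd.DataFrame], cache_dir: Path) -> None:
    # Write one parquet file per table, named by a unique table key.
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, df in tables.items():
        df.to_parquet(cache_dir / f"{name}.parquet")

def load_tables(cache_dir: Path) -> dict[str, pd.DataFrame]:
    # Read all cached tables back into DataFrames, keyed by file stem.
    return {p.stem: pd.read_parquet(p) for p in cache_dir.glob("*.parquet")}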

Another potential issue is that the .cache/ dir gets created inside the xl2times module (i.e. beside xl2times/__init__.py). This seems (to me) a bit unusual. Would it be better to create it in the user's home dir (e.g. pathlib.Path.home() / '.xl2times/cache')? That seems to be where most apps create their caches now; at least I seem to have lots of caches there.
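
A short sketch of that suggestion (hypothetical location, per the comment above):

from pathlib import Path

# Per-user cache directory instead of one inside the installed module.
cache_dir = Path.home() / ".xl2times" / "cache"
cache_dir.mkdir(parents=True, exist_ok=True)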

Speeds for AusTIMES are:
With caching, first run:
Extracted (potentially cached) 555 tables, 34360 rows in 0:01:32.051419
With caching, second run:
Extracted (potentially cached) 555 tables, 34360 rows in 0:00:00.842077
Nice speedup :)

@siddharth-krishna siddharth-krishna mentioned this pull request Feb 27, 2024
@siddharth-krishna (Collaborator, Author)

Great to hear this is useful! Thanks for the suggestions, I've moved the discussion to an issue so we don't lose track of it: #199
