faster apply transform tables #2 #215

SamRWest · 2024-03-13T04:19:52Z

Second attempt at speeding up apply_transform_tables. This is just the minimum from #213 needed to apply the late explode method.

Overall Ireland benchmark speedup (local machine):
Explode=False: Ran Ireland in 10.94s. 90.0% (36629 correct, 3415 additional).
Explode=True: Ran Ireland in 14.51s. 90.0% (36614 correct, 3415 additional).

SamRWest · 2024-03-13T04:48:17Z

Heaps faster, but still getting regressions on Ireland :(

Any tips on how to debug regressions like this @olejandro ?

I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?

olejandro · 2024-03-13T06:05:16Z

Cool! :-) Let me check...

@siddharth-krishna is the one behind the main-vs-branch diff approach in the readme, so he'd be the right person to comment on it.

I normally just dump the dataframe with the affected data from the transform and try to find out what's wrong.

olejandro · 2024-03-13T06:21:17Z

xl2times/transforms.py

+        commodity_groups["commodity"] = commodity_groups["commodity"].explode(
+            ignore_index=True
+        )


Suggested change

-        commodity_groups["commodity"] = commodity_groups["commodity"].explode(
-            ignore_index=True
-        )
+        commodity_groups = commodity_groups.explode("commodity", ignore_index=True)

This should address the regression on COM_GMAP

/headsmack
Good spotting, thanks :) All fixed!
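
For anyone reading this thread later, here is a minimal sketch of why the original line misbehaved. It uses made-up data, not the real `commodity_groups` table, and the column values are purely illustrative. `Series.explode(ignore_index=True)` returns a longer Series with a fresh 0..n-1 index, and assigning that back to the column aligns on the frame's existing index, so surplus rows are silently dropped and values can land on the wrong rows. `DataFrame.explode` instead repeats the other columns for every list element.

```python
import pandas as pd

# Toy stand-in for commodity_groups (names and values are illustrative only).
groups = pd.DataFrame({"group": ["g1", "g2"], "commodity": [["a", "b"], ["c"]]})

# Column-wise explode: the result has 3 rows, but the assignment aligns on the
# frame's existing index (0 and 1), so "c" is lost and g2 is paired with "b".
buggy = groups.copy()
buggy["commodity"] = buggy["commodity"].explode(ignore_index=True)

# Frame-wise explode: one output row per list element, other columns repeated.
fixed = groups.explode("commodity", ignore_index=True)

print(buggy)
print(fixed)
```

The frame-wise call keeps each commodity attached to its own group, which is presumably what a COM_GMAP-style mapping needs.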

siddharth-krishna · 2024-03-13T06:37:14Z

> I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?

Yeah, this approach doesn't scale well because it dumps every table to the output after every transform, which results in a gigantic log file for a real model like Ireland.

I used a slightly different approach to debug a nondeterminism bug we had on Ireland a while ago. The idea was to save all the intermediate tables to pickle files (one file after each transform) in the first run. In the second run, if the pickle files already existed, it checked whether the intermediate tables from that run matched the ones in the files. If there was any difference, it dropped you into an ipdb console where you could debug and figure out exactly what was different. You can probably adapt this approach to debug regressions: create the pickle files in a first run on the main branch, then do the second run, which does the comparisons, on your PR branch. Here's the code I used (might need a bit of merging to apply onto the latest main):

a265bde#diff-e894e8354d6ecd5ffe34b10e5e1df17cfeef656e707476eb460d76931c1907e7R173-R180

If this is helpful we can try to clean up the code and add it to the main branch for later use as well. Good luck!
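
A rough sketch of the save-then-compare idea described above, not the code from the linked commit. The helper name `save_or_compare`, the one-file-per-step layout, and the toy tables are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def save_or_compare(step: str, tables: dict[str, pd.DataFrame], state_dir: Path) -> list[str]:
    """Hypothetical helper: snapshot `tables` on the first run, compare on later runs.

    Returns the names of tables that differ from the snapshot (empty on the first run).
    """
    snapshot = state_dir / f"{step}.pkl"
    if not snapshot.exists():
        with snapshot.open("wb") as f:
            pickle.dump(tables, f)
        return []
    with snapshot.open("rb") as f:
        saved = pickle.load(f)
    changed = [
        name
        for name in sorted(saved.keys() | tables.keys())
        if name not in saved or name not in tables or not saved[name].equals(tables[name])
    ]
    # In an interactive session, a `breakpoint()` here when `changed` is non-empty
    # would give a debugger prompt with `saved` and `tables` both in scope.
    return changed


with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp)
    before = {"COM_GMAP": pd.DataFrame({"group": ["g1", "g1"], "commodity": ["a", "b"]})}
    after = {"COM_GMAP": pd.DataFrame({"group": ["g1"], "commodity": ["a"]})}
    first_run = save_or_compare("apply_transform_tables", before, state_dir)  # writes snapshot
    same_run = save_or_compare("apply_transform_tables", before, state_dir)
    regressed_run = save_or_compare("apply_transform_tables", after, state_dir)

print(first_run, same_run, regressed_run)
```

For a regression hunt, the snapshot run would happen on main and the comparison run on the PR branch, with `state_dir` pointing at a directory that survives the branch switch.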

SamRWest · 2024-03-13T22:23:57Z

Thanks, I ended up going a similar route: I've added some utils to save the state (model + tables) and then diff the main vs branch state. Pretty easy, and it found the symptom quickly, though @olejandro had already spotted the cause for me :) Utils are here if you're interested.
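
As an illustration of what a row-level diff between a main-branch table and a PR-branch table can look like (this is not the utils code mentioned above; `diff_rows` and the toy tables are hypothetical), an outer merge with `indicator=True` flags rows that exist on only one side:

```python
import pandas as pd


def diff_rows(main: pd.DataFrame, branch: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows present in only one of two same-schema tables."""
    # With no `on=` argument, merge joins on all shared columns; the `_merge`
    # column records whether each row came from main, branch, or both.
    merged = main.merge(branch, how="outer", indicator=True)
    return merged[merged["_merge"] != "both"].reset_index(drop=True)


main_table = pd.DataFrame({"group": ["g1", "g1", "g2"], "commodity": ["a", "b", "c"]})
branch_table = pd.DataFrame({"group": ["g1", "g2"], "commodity": ["a", "b"]})

diff = diff_rows(main_table, branch_table)
print(diff)
```

Here `left_only` marks rows found only on main and `right_only` rows found only on the branch. Tables containing duplicate rows would need extra care, since a merge multiplies matching duplicates.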

SamRWest added 3 commits March 13, 2024 14:12

a147f5c  minimal changes to implement late wildcard expansion of for apply_transform_tables speedup

e269829  fixed regression by exploding model.commodity_groups before removing duplicates.

ccaf53b  fix tests
         add explode transform test

olejandro reviewed Mar 13, 2024

SamRWest added 2 commits March 14, 2024 08:34

b8213d6  added pipeline save/diff tools

e4e57ff  fixed ireland regressions
         added state save/diff tools to utils.

olejandro approved these changes Mar 13, 2024

SamRWest marked this pull request as ready for review March 13, 2024 22:25

SamRWest requested a review from siddharth-krishna March 13, 2024 22:25

SamRWest mentioned this pull request Mar 13, 2024

Add the ruff linter to the pre-commit check #214

Merged

f0db36b  Merge branch 'main' into samw/faster_apply_transform_tables_#2
         # Conflicts: # xl2times/transforms.py

SamRWest merged commit 7d0d739 into main Mar 13, 2024
1 check passed

SamRWest deleted the samw/faster_apply_transform_tables_#2 branch March 13, 2024 23:12

SamRWest mentioned this pull request Mar 13, 2024

Faster apply_transform_tables() #213

Closed
