-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
faster apply transform tables #2 #215
Conversation
…nsform_tables speedup
Heaps faster, but still getting regressions on Ireland :( Any tips on how to debug regressions like this @olejandro ? I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..? |
Cool! :-) Let me check... @siddharth-krishna is the one behind the main-vs-branch diff approach in the readme, so he'd be the right person to comment on it. I normally just dump the dataframe with the affected data from the transform and try to find out what's wrong. |
xl2times/transforms.py
Outdated
commodity_groups["commodity"] = commodity_groups["commodity"].explode( | ||
ignore_index=True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commodity_groups["commodity"] = commodity_groups["commodity"].explode( | |
ignore_index=True | |
) | |
commodity_groups = commodity_groups.explode("commodity", ignore_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should address the regression on COM_GMAP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/headsmack
Good spotting, thanks :) All fixed!
Yeah, this approach doesn't scale well because it dumps every table to the output after every transform, which results in a gigantic log file for a real model like Ireland. I had used a slightly different approach to debug a nondeterminism bug we had on Ireland a while ago. The idea was to save all the intermediate tables to a pickle file (one file after each transform) in the first run, and in the second run (if the pickle files already existed) to check if the intermediate tables from the run were the same as the ones in the file. If there was any difference, it drops the user into an a265bde#diff-e894e8354d6ecd5ffe34b10e5e1df17cfeef656e707476eb460d76931c1907e7R173-R180 If this is helpful we can try to clean up the code and add it to the main branch for later use as well. Good luck! |
added state save/diff tools to utils.
Thanks, I ended up a similar route - I've added some utils to save the state (model+tables) and then diff the main vs branch state. Pretty easy, and it spotted the symptom pretty quickly. Though @olejandro has already spotted the cause for me :) Utils are here if you're interested. |
# Conflicts: # xl2times/transforms.py
Second attempt at speeding up apply_transform_tables. This is just the minimum from #213 needed to apply the late explode method.
Overall Ireland benchmark speedup (local machine):
Explode=False: Ran Ireland in 10.94s. 90.0% (36629 correct, 3415 additional).
Explode=True: Ran Ireland in 14.51s. 90.0% (36614 correct, 3415 additional).