Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster apply transform tables #2 #215

Merged
merged 6 commits into from
Mar 13, 2024

Conversation

SamRWest
Copy link
Collaborator

Second attempt at speeding up apply_transform_tables. This is just the minimum from #213 needed to apply the late explode method.

Overall Ireland benchmark speedup (local machine):
Explode=False: Ran Ireland in 10.94s. 90.0% (36629 correct, 3415 additional).
Explode=True: Ran Ireland in 14.51s. 90.0% (36614 correct, 3415 additional).

@SamRWest
Copy link
Collaborator Author

Heaps faster, but still getting regressions on Ireland :(

Any tips on how to debug regressions like this @olejandro ?

I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?

@olejandro
Copy link
Member

Cool! :-) Let me check...

@siddharth-krishna is the one behind the main-vs-branch diff approach in the readme, so he'd be the right person to comment on it.

I normally just dump the dataframe with the affected data from the transform and try to find out what's wrong.

Comment on lines 2427 to 2429
commodity_groups["commodity"] = commodity_groups["commodity"].explode(
ignore_index=True
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
commodity_groups["commodity"] = commodity_groups["commodity"].explode(
ignore_index=True
)
commodity_groups = commodity_groups.explode("commodity", ignore_index=True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should address the regression on COM_GMAP

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/headsmack
Good spotting, thanks :) All fixed!

@siddharth-krishna
Copy link
Collaborator

I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?

Yeah, this approach doesn't scale well because it dumps every table to the output after every transform, which results in a gigantic log file for a real model like Ireland.

I had used a slightly different approach to debug a nondeterminism bug we had on Ireland a while ago. The idea was to save all the intermediate tables to a pickle file (one file after each transform) in the first run, and in the second run (if the pickle files already existed) to check if the intermediate tables from the run were the same as the ones in the file. If there was any difference, it drops the user into an ipydb console where you can debug and figure out exactly what was different. You can probably adapt this approach to debug regressions: create the pickle files in the first run on the main branch, and do the second run that does the comparisons on your PR branch. Here's the code I used (might need a bit of merging to apply onto the latest main):

a265bde#diff-e894e8354d6ecd5ffe34b10e5e1df17cfeef656e707476eb460d76931c1907e7R173-R180

If this is helpful we can try to clean up the code and add it to the main branch for later use as well. Good luck!

@SamRWest
Copy link
Collaborator Author

I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?

Yeah, this approach doesn't scale well because it dumps every table to the output after every transform, which results in a gigantic log file for a real model like Ireland.

I had used a slightly different approach to debug a nondeterminism bug we had on Ireland a while ago. The idea was to save all the intermediate tables to a pickle file (one file after each transform) in the first run, and in the second run (if the pickle files already existed) to check if the intermediate tables from the run were the same as the ones in the file. If there was any difference, it drops the user into an ipydb console where you can debug and figure out exactly what was different. You can probably adapt this approach to debug regressions: create the pickle files in the first run on the main branch, and do the second run that does the comparisons on your PR branch. Here's the code I used (might need a bit of merging to apply onto the latest main):

a265bde#diff-e894e8354d6ecd5ffe34b10e5e1df17cfeef656e707476eb460d76931c1907e7R173-R180

If this is helpful we can try to clean up the code and add it to the main branch for later use as well. Good luck!

Thanks, I ended up a similar route - I've added some utils to save the state (model+tables) and then diff the main vs branch state. Pretty easy, and it spotted the symptom pretty quickly. Though @olejandro has already spotted the cause for me :) Utils are here if you're interested.

@SamRWest SamRWest marked this pull request as ready for review March 13, 2024 22:25
@SamRWest SamRWest merged commit 7d0d739 into main Mar 13, 2024
1 check passed
@SamRWest SamRWest deleted the samw/faster_apply_transform_tables_#2 branch March 13, 2024 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants