[SNOW-1636729] Improve join/align performance by avoid unnecessary coalesce #2165
@sfc-gh-yzou thanks for the fix! I am leaving some comments and questions.
@@ -1187,15 +1217,8 @@ def align(
# NULL NULL 2 NULL 4 e 2
coalesce_key_config = None
inherit_join_index = InheritJoinIndex.FROM_LEFT
# When it is `outer` align, we need to coalesce the align columns. However, if the
@sfc-gh-nkumar this optimization is now performed in general for all joins and aligns.
# For 'inner' and 'left' join we use left join keys and for 'right' join we
# use right join keys.
# For 'left' and 'coalesce' align we use left join keys.
if how in ("asof", "outer"):
@sfc-gh-nkumar Now that I am looking at this part of the code, it feels a little bit weird: we decide how the column coalesce happens based on the join type rather than on the coalesce config. That is good in the sense that it reduces the extra logic the caller needs to check, but it is confusing given that we also have a coalesce config parameter there.
Right, IIRC the current logic is to ignore the coalesce config for join types where coalesce is not required. We could probably assert if a coalesce config is provided for join types that do not require coalesce. But I don't feel strongly either way.
Thanks for addressing my comments! LGTM
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1636729
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Concatenating series from the same dataframe requires joins today, even though in theory no join is needed because they come from the same dataframe.
Those joins exist because concat calls join_on_index to join two series at a time. Today a join is optimized only if it is performed on the row position column, but join_on_index creates a new column using coalesce even if the two columns to coalesce are the same column. As a result, in the next join the two columns to join on can no longer be recognized, and the optimization is lost.
In this PR, we try to maximize the optimization opportunity by keeping the original column whenever possible. For example, if the two columns to coalesce are actually the same column, no actual coalesce is required.
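The effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code in this PR: the names `ROW_POSITION`, `pick_join_key`, and `count_real_joins` are hypothetical, columns are modeled as plain strings, and a join is treated as "optimized away" exactly when both sides join on the same row position column.

```python
import itertools
from typing import Iterator

# Hypothetical identifier for the row position column shared by all
# series that come from the same dataframe.
ROW_POSITION = "row_position"


def pick_join_key(
    left_key: str, right_key: str, skip_redundant: bool, fresh: Iterator[int]
) -> str:
    """Return the key column of the join result.

    If both keys are the same column and skip_redundant is set, the
    original column is kept. Otherwise a new coalesced column is created,
    which later joins cannot match against the row position column.
    """
    if skip_redundant and left_key == right_key:
        return left_key
    return f"coalesced_{next(fresh)}"


def count_real_joins(num_series: int, skip_redundant: bool) -> int:
    """Count joins that cannot be optimized away when concatenating
    num_series series two at a time (toy model)."""
    fresh = itertools.count()
    current_key = ROW_POSITION
    real_joins = 0
    for _ in range(num_series - 1):
        right_key = ROW_POSITION
        # In this model a join is elided only when both sides join on
        # the very same column.
        if current_key != right_key:
            real_joins += 1
        current_key = pick_join_key(current_key, right_key, skip_redundant, fresh)
    return real_joins
```

In this model, `count_real_joins(3, skip_redundant=False)` returns 1, because the first join replaces the key with a fresh coalesced column and the second join no longer matches. With `skip_redundant=True` it returns 0, because the original column is kept across every join.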