Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNOW-1636729] Improve join/align performance by avoid unnecessary coalesce #2165

Merged
merged 15 commits into from
Aug 30, 2024

Conversation

sfc-gh-yzou
Copy link
Collaborator

@sfc-gh-yzou sfc-gh-yzou commented Aug 26, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1636729

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.
    Concatenating series from the same dataframe requires joins today, but theoretically there is no join needed since they come from the same dataframe.

The reason those join are there is because concat calls join_on_index to join two series at a time, the join is only optimized today if the join is performed on the row position column, but join_on_index tries to creates a new column using coalesce even if the two columns to coalesce is the same column. Therefore, in the next join, the two columns to join on can not be recognized anymore.

In this pr, we tries to maximize the optimization opportunity by keeping the original column whenever possible, for example, if the two columns to coalesce is actually the same column, no actual coalesce is required.

@sfc-gh-yzou sfc-gh-yzou marked this pull request as ready for review August 27, 2024 20:21
@sfc-gh-yzou sfc-gh-yzou requested a review from a team as a code owner August 27, 2024 20:21
Copy link
Contributor

@sfc-gh-mvashishtha sfc-gh-mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-yzou thanks for the fix! I am leaving some comments and questions.

tests/integ/modin/test_concat.py Show resolved Hide resolved
tests/integ/modin/binary/test_binary_op.py Outdated Show resolved Hide resolved
@sfc-gh-yzou sfc-gh-yzou force-pushed the yzou-SNOW-1636729-join-align-improve branch from 0d262da to f34f045 Compare August 28, 2024 23:44
@sfc-gh-yzou sfc-gh-yzou added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Aug 29, 2024
@sfc-gh-yzou sfc-gh-yzou marked this pull request as draft August 29, 2024 01:25
@sfc-gh-yzou sfc-gh-yzou force-pushed the yzou-SNOW-1636729-join-align-improve branch from f34f045 to be93362 Compare August 29, 2024 01:26
@@ -1187,15 +1217,8 @@ def align(
# NULL NULL 2 NULL 4 e 2
coalesce_key_config = None
inherit_join_index = InheritJoinIndex.FROM_LEFT
# When it is `outer` align, we need to coalesce the align columns. However, if the
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-nkumar this optimization is now performed in general for all join and align

@sfc-gh-yzou sfc-gh-yzou marked this pull request as ready for review August 29, 2024 18:33
# For 'inner' and 'left' join we use left join keys and for 'right' join we
# use right join keys.
# For 'left' and 'coalesce' align we use left join keys.
if how in ("asof", "outer"):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-nkumar When i start looking at this part of code now, it feels little bit wired, it seems we are deciding how the column coalesce happens, we do not look into the coalesce configure, but the join type. It is good in the sense that it tries to reduce the extra logic caller need to check, but it is kind of confusing in the sense that we also have an coalesce configure parameter there

Copy link
Contributor

@sfc-gh-nkumar sfc-gh-nkumar Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, IIRC current logic is to ignore the coalesce config for join types where coalesce is not required. We can probably assert if coalesce config is provided for join types which do not require assert. But I don't feel strongly either way.

# For 'inner' and 'left' join we use left join keys and for 'right' join we
# use right join keys.
# For 'left' and 'coalesce' align we use left join keys.
if how in ("asof", "outer"):
Copy link
Contributor

@sfc-gh-nkumar sfc-gh-nkumar Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, IIRC current logic is to ignore the coalesce config for join types where coalesce is not required. We can probably assert if coalesce config is provided for join types which do not require assert. But I don't feel strongly either way.

Copy link
Contributor

@sfc-gh-mvashishtha sfc-gh-mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments! LGTM

@sfc-gh-yzou sfc-gh-yzou force-pushed the yzou-SNOW-1636729-join-align-improve branch from 115c6df to 9b8d72c Compare August 30, 2024 02:25
@sfc-gh-yzou sfc-gh-yzou merged commit 0346b87 into main Aug 30, 2024
35 checks passed
@sfc-gh-yzou sfc-gh-yzou deleted the yzou-SNOW-1636729-join-align-improve branch August 30, 2024 16:38
@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants