
SNOW-841405 Fix df copy and enable basic diamond shaped joins for simplifier #1003

Merged
merged 5 commits into main from fix-df-copy-and-diamond-shaped-joins-for-simplifier
Sep 11, 2023

Conversation

sfc-gh-stan
Collaborator

@sfc-gh-stan sfc-gh-stan commented Aug 14, 2023

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Also fixes SNOW-831813: Unsupported subquery type cannot be evaluated of leftsemi join #875 and SNOW-837419: Self-join on Snowpark 1.4.0 causes an invalid_identifier error #886

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    Fixes DataFrame.copy for the simplifier path by making sure attributes are copied from the subquery so that NamedExpression.expr_id is regenerated. As a result, df1.filter(col("a") > 1).a has a different expression ID than df1.a, which enables basic diamond-shaped joins like

df2 = df1.filter(col("a") > 1).sort("b")
df1.join(df2, df1.a == df2.a)
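Conceptually, the fix makes a copied column expression carry a fresh ID. A minimal pure-Python sketch of that idea (illustrative names only, not Snowpark's actual internals):

```python
import itertools
from dataclasses import dataclass, field

_id_counter = itertools.count()

@dataclass
class NamedExpr:
    # Illustrative stand-in for a named expression: each instance gets a
    # unique expr_id so the analyzer can tell two references apart.
    name: str
    expr_id: int = field(default_factory=lambda: next(_id_counter))

    def copy(self) -> "NamedExpr":
        # Regenerate expr_id on copy, mirroring the fix described above:
        # the copied expression must not share an ID with the original.
        return NamedExpr(self.name)

a1 = NamedExpr("A")
a2 = a1.copy()
assert a1.name == a2.name
assert a1.expr_id != a2.expr_id  # distinct IDs make the two sides joinable
```

With distinct IDs, the two sides of `df1.join(df2, df1.a == df2.a)` refer to distinguishable expressions even though both columns are named "A".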

@sfc-gh-stan sfc-gh-stan requested a review from a team as a code owner August 14, 2023 15:43
@sfc-gh-stan sfc-gh-stan changed the title SNOW-841405 Fix df copy and enable diamond shaped joins for simplifier SNOW-841405 Fix df copy and enable basic diamond shaped joins for simplifier Aug 14, 2023
Comment on lines +1151 to +1152
assert df1.a._expression.expr_id != df2.a._expression.expr_id

Contributor

@sfc-gh-aalam sfc-gh-aalam Aug 14, 2023


Why are we testing that the expr_ids are not the same? Does it affect DataFrames in weird ways if they are the same? Can we instead test the downstream side effects of the expr_ids being equal?

Collaborator Author


If they are the same, we won't be able to perform the join.

for v in resolved_children.values():

in the analyzer counts duplicate expression IDs among the resolved children and deletes the <expr_id, alias name> entries from the analyzer's map to prevent ambiguous references. But then the analyzer translates df1.a to "A", while the actual name of df1.a is suffixed, so the generated query fails.
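A small pure-Python model of that failure mode (the map, names, and helpers here are illustrative, not the analyzer's real code): when a duplicate expr_id forces its alias entry to be dropped, resolution falls back to the bare name, which no longer matches the suffixed column in the generated SQL.

```python
from collections import Counter

# Illustrative alias map: expression ID -> suffixed column name that the
# generated SQL actually uses (names are made up for this sketch).
alias_map = {1: "A_left_x1", 2: "B_left_x1"}

def prune_duplicates(alias_map, resolved_children_expr_ids):
    # Mirror the described behavior: if the same expr_id appears in more
    # than one resolved child, drop its entry to avoid ambiguous references.
    for expr_id, n in Counter(resolved_children_expr_ids).items():
        if n > 1:
            alias_map.pop(expr_id, None)

def resolve(alias_map, expr_id, fallback_name):
    # With no alias entry, resolution falls back to the plain name "A",
    # but the query only contains the suffixed column, so it would fail.
    return alias_map.get(expr_id, fallback_name)

# If df1 and df2 share expr_id 1 (i.e., the copy did not regenerate IDs):
prune_duplicates(alias_map, [1, 1, 2])
assert resolve(alias_map, 1, "A") == "A"          # mapping lost, bare name
assert resolve(alias_map, 2, "B") == "B_left_x1"  # unique IDs still resolve
```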

Collaborator

@sfc-gh-sfan sfc-gh-sfan left a comment


LGTM % some curious questions. Please also address Afroz's comment on the test :)

Comment on lines +182 to +185
self.df_aliased_col_name_to_real_col_name: DefaultDict[
str, Dict[str, str]
] = defaultdict(dict)
Collaborator


Curious what triggered this change?

Collaborator Author


This is actually a bug: I intended it to be what the type hint suggests, a defaultdict of dictionaries. This PR fixes DataFrame.copy to work for the simplifier path, which is how the bug was caught.
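For context, the corrected annotation describes a defaultdict whose missing keys materialize as empty inner dicts. A quick illustration (the keys and values here are made up; the attribute name matches the diff):

```python
from collections import defaultdict
from typing import DefaultDict, Dict

# Matches the annotation in the diff: outer keys are DataFrame aliases,
# inner dicts map an aliased column name to its real (suffixed) name.
df_aliased_col_name_to_real_col_name: DefaultDict[str, Dict[str, str]] = defaultdict(dict)

# Writing through a missing alias transparently creates the inner dict:
df_aliased_col_name_to_real_col_name["df1"]["A"] = "A_suffixed"
assert df_aliased_col_name_to_real_col_name["df1"] == {"A": "A_suffixed"}

# Reading a missing alias yields an empty dict instead of raising KeyError:
assert df_aliased_col_name_to_real_col_name["df2"] == {}
```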

@sfc-gh-yixie
Collaborator

Could you use ./tests/perf/perf_runner.py to test the performance and memory consumption before and after this change when the number of columns is 10, 100, 1000?

@sfc-gh-stan
Collaborator Author

Performance test results

Main

(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 100 -c 20 -m -s            
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=100, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.9 MiB    117.9 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.9 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    120.6 MiB     -0.2 MiB         101       for i in range(ncalls):
    44    120.6 MiB      2.5 MiB         100           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    120.6 MiB     -0.0 MiB           1       return df

Client side time elapsed:  4.480346918106079
Time elapsed per API call:  0.04480346918106079
SQL execution time elapsed:  0.12990498542785645
Total execution time elapsed:  4.6102519035339355
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 200 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=200, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.7 MiB    117.7 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.7 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    123.9 MiB     -5.4 MiB         201       for i in range(ncalls):
    44    123.9 MiB      0.8 MiB         200           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    123.9 MiB      0.0 MiB           1       return df

Client side time elapsed:  17.073092937469482
Time elapsed per API call:  0.08536546468734742
SQL execution time elapsed:  0.14729690551757812
Total execution time elapsed:  17.22038984298706
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 400 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=400, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    118.6 MiB    118.6 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    118.7 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    134.0 MiB     -7.6 MiB         401       for i in range(ncalls):
    44    134.0 MiB      7.8 MiB         400           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    134.0 MiB      0.0 MiB           1       return df

Client side time elapsed:  68.5808379650116
Time elapsed per API call:  0.171452094912529
SQL execution time elapsed:  0.711745023727417
Total execution time elapsed:  69.29258298873901
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 800 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=800, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    118.0 MiB    118.0 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    118.0 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    142.5 MiB     -4.9 MiB         801       for i in range(ncalls):
    44    142.5 MiB     19.7 MiB         800           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    142.5 MiB      0.0 MiB           1       return df

Client side time elapsed:  278.7850019931793
Time elapsed per API call:  0.34848125249147416
SQL execution time elapsed:  0.8526611328125
Total execution time elapsed:  279.6376631259918

fix-df-copy-and-diamond-shaped-joins-for-simplifier

(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 100 -c 20 -m -s             
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=100, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    119.1 MiB    119.1 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    119.1 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    121.6 MiB     -0.4 MiB         101       for i in range(ncalls):
    44    121.6 MiB      2.1 MiB         100           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    121.6 MiB      0.0 MiB           1       return df

Client side time elapsed:  4.90127420425415
Time elapsed per API call:  0.0490127420425415
SQL execution time elapsed:  0.17809391021728516
Total execution time elapsed:  5.0793681144714355
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 200 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=200, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    118.6 MiB    118.6 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    118.6 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    125.1 MiB     -4.1 MiB         201       for i in range(ncalls):
    44    125.1 MiB      2.5 MiB         200           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    125.1 MiB      0.0 MiB           1       return df

Client side time elapsed:  18.77964496612549
Time elapsed per API call:  0.09389822483062744
SQL execution time elapsed:  0.26385974884033203
Total execution time elapsed:  19.04350471496582
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 400 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=400, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.8 MiB    117.8 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.8 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    134.3 MiB      0.0 MiB         401       for i in range(ncalls):
    44    134.3 MiB     16.5 MiB         400           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    134.3 MiB      0.0 MiB           1       return df

Client side time elapsed:  72.3506338596344
Time elapsed per API call:  0.180876584649086
SQL execution time elapsed:  0.540585994720459
Total execution time elapsed:  72.89121985435486
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 800 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=800, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.4 MiB    117.4 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.4 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    140.7 MiB     -0.0 MiB         801       for i in range(ncalls):
    44    140.7 MiB     23.3 MiB         800           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    140.7 MiB      0.0 MiB           1       return df

Client side time elapsed:  293.10107493400574
Time elapsed per API call:  0.36637634366750715
SQL execution time elapsed:  0.8822462558746338
Total execution time elapsed:  293.98332118988037

@sfc-gh-stan sfc-gh-stan force-pushed the fix-df-copy-and-diamond-shaped-joins-for-simplifier branch from a7ea86b to 40cf9e4 Compare September 8, 2023 23:35
@sfc-gh-stan sfc-gh-stan enabled auto-merge (squash) September 11, 2023 14:14
@sfc-gh-stan sfc-gh-stan merged commit d212d69 into main Sep 11, 2023
50 checks passed
@sfc-gh-stan sfc-gh-stan deleted the fix-df-copy-and-diamond-shaped-joins-for-simplifier branch September 11, 2023 14:53
@github-actions github-actions bot locked and limited conversation to collaborators Sep 11, 2023
Successfully merging this pull request may close these issues.

SNOW-831813: Unsupported subquery type cannot be evaluated of leftsemi join