
SNOW-841405 Fix df copy and enable basic diamond shaped joins for simplifier #1003

Merged
merged 5 commits into main from fix-df-copy-and-diamond-shaped-joins-for-simplifier
Sep 11, 2023

Conversation

sfc-gh-stan
Collaborator

@sfc-gh-stan sfc-gh-stan commented Aug 14, 2023

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Also fixes SNOW-831813: Unsupported subquery type cannot be evaluated of leftsemi join #875 and SNOW-837419: Self-join on Snowpark 1.4.0 causes an invalid_identifier error #886

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    Fixes DataFrame.copy for the simplifier path by making sure attributes are copied from the subquery so that NamedExpression.expr_id is regenerated. As a result, df1.filter(col("a") > 1).a has a different expression ID than df1.a, which enables basic diamond-shaped joins like

df2 = df1.filter(col("a") > 1).sort("b")
df1.join(df2, df1.a == df2.a)
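Conceptually, the fix makes a copied column expression carry a fresh ID. A minimal pure-Python sketch of that idea (illustrative names only, not Snowpark's actual internals):

```python
import itertools
from dataclasses import dataclass, field

_id_counter = itertools.count()

@dataclass
class NamedExpr:
    # Illustrative stand-in for a named expression: each instance gets a
    # unique expr_id so the analyzer can tell two references apart.
    name: str
    expr_id: int = field(default_factory=lambda: next(_id_counter))

    def copy(self) -> "NamedExpr":
        # Regenerate expr_id on copy, mirroring the fix described above:
        # the copied expression must not share an ID with the original.
        return NamedExpr(self.name)

a1 = NamedExpr("A")
a2 = a1.copy()
assert a1.name == a2.name
assert a1.expr_id != a2.expr_id  # distinct IDs make the two sides joinable
```

With distinct IDs, the two sides of `df1.join(df2, df1.a == df2.a)` refer to distinguishable expressions even though both columns are named "A".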

@sfc-gh-stan sfc-gh-stan requested a review from a team as a code owner August 14, 2023 15:43
@sfc-gh-stan sfc-gh-stan changed the title SNOW-841405 Fix df copy and enable diamond shaped joins for simplifier SNOW-841405 Fix df copy and enable basic diamond shaped joins for simplifier Aug 14, 2023
Comment on lines +1151 to +1152
assert df1.a._expression.expr_id != df2.a._expression.expr_id

Contributor

@sfc-gh-aalam sfc-gh-aalam Aug 14, 2023


Why are we testing that the expr_ids are not the same? Does it affect DataFrames in weird ways if they are the same? Can we instead test the downstream side effects of the expr_ids being equal?

Collaborator Author


If they are the same, we won't be able to perform the join.

for v in resolved_children.values():

in the analyzer counts duplicate expression IDs among the resolved children and deletes the <expr_id, alias name> entries from the analyzer's map to prevent ambiguous references. But then the analyzer translates df1.a to "A", while the actual name of df1.a is suffixed, so the generated query fails.
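A small pure-Python model of that failure mode (the map, names, and helpers here are illustrative, not the analyzer's real code): when a duplicate expr_id forces its alias entry to be dropped, resolution falls back to the bare name, which no longer matches the suffixed column in the generated SQL.

```python
from collections import Counter

# Illustrative alias map: expression ID -> suffixed column name that the
# generated SQL actually uses (names are made up for this sketch).
alias_map = {1: "A_left_x1", 2: "B_left_x1"}

def prune_duplicates(alias_map, resolved_children_expr_ids):
    # Mirror the described behavior: if the same expr_id appears in more
    # than one resolved child, drop its entry to avoid ambiguous references.
    for expr_id, n in Counter(resolved_children_expr_ids).items():
        if n > 1:
            alias_map.pop(expr_id, None)

def resolve(alias_map, expr_id, fallback_name):
    # With no alias entry, resolution falls back to the plain name "A",
    # but the query only contains the suffixed column, so it would fail.
    return alias_map.get(expr_id, fallback_name)

# If df1 and df2 share expr_id 1 (i.e., the copy did not regenerate IDs):
prune_duplicates(alias_map, [1, 1, 2])
assert resolve(alias_map, 1, "A") == "A"          # mapping lost, bare name
assert resolve(alias_map, 2, "B") == "B_left_x1"  # unique IDs still resolve
```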

Collaborator

@sfc-gh-sfan sfc-gh-sfan left a comment


LGTM % some curious questions. Please also address Afroz's comment on the test :)

Comment on lines +182 to +185
self.df_aliased_col_name_to_real_col_name: DefaultDict[
str, Dict[str, str]
] = defaultdict(dict)
Collaborator


Curious what triggered this change?

Collaborator Author


This is actually a bug: I intended it to be what the type hint suggests, a defaultdict of dictionaries. This PR fixes DataFrame.copy to work for the simplifier path, which is how the bug was caught.
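For context, the corrected annotation describes a defaultdict whose missing keys materialize as empty inner dicts. A quick illustration (the keys and values here are made up; the attribute name matches the diff):

```python
from collections import defaultdict
from typing import DefaultDict, Dict

# Matches the annotation in the diff: outer keys are DataFrame aliases,
# inner dicts map an aliased column name to its real (suffixed) name.
df_aliased_col_name_to_real_col_name: DefaultDict[str, Dict[str, str]] = defaultdict(dict)

# Writing through a missing alias transparently creates the inner dict:
df_aliased_col_name_to_real_col_name["df1"]["A"] = "A_suffixed"
assert df_aliased_col_name_to_real_col_name["df1"] == {"A": "A_suffixed"}

# Reading a missing alias yields an empty dict instead of raising KeyError:
assert df_aliased_col_name_to_real_col_name["df2"] == {}
```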

@sfc-gh-yixie
Collaborator

Could you use ./tests/perf/perf_runner.py to test the performance and memory consumption before and after this change when the number of columns is 10, 100, 1000?

@sfc-gh-stan
Collaborator Author

Performance test results

Main

(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 100 -c 20 -m -s            
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=100, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.9 MiB    117.9 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.9 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    120.6 MiB     -0.2 MiB         101       for i in range(ncalls):
    44    120.6 MiB      2.5 MiB         100           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    120.6 MiB     -0.0 MiB           1       return df

Client side time elapsed:  4.480346918106079
Time elapsed per API call:  0.04480346918106079
SQL execution time elapsed:  0.12990498542785645
Total execution time elapsed:  4.6102519035339355
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 200 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=200, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.7 MiB    117.7 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.7 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    123.9 MiB     -5.4 MiB         201       for i in range(ncalls):
    44    123.9 MiB      0.8 MiB         200           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    123.9 MiB      0.0 MiB           1       return df

Client side time elapsed:  17.073092937469482
Time elapsed per API call:  0.08536546468734742
SQL execution time elapsed:  0.14729690551757812
Total execution time elapsed:  17.22038984298706
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 400 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=400, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    118.6 MiB    118.6 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    118.7 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    134.0 MiB     -7.6 MiB         401       for i in range(ncalls):
    44    134.0 MiB      7.8 MiB         400           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    134.0 MiB      0.0 MiB           1       return df

Client side time elapsed:  68.5808379650116
Time elapsed per API call:  0.171452094912529
SQL execution time elapsed:  0.711745023727417
Total execution time elapsed:  69.29258298873901
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 800 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=800, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    118.0 MiB    118.0 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    118.0 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    142.5 MiB     -4.9 MiB         801       for i in range(ncalls):
    44    142.5 MiB     19.7 MiB         800           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    142.5 MiB      0.0 MiB           1       return df

Client side time elapsed:  278.7850019931793
Time elapsed per API call:  0.34848125249147416
SQL execution time elapsed:  0.8526611328125
Total execution time elapsed:  279.6376631259918

fix-df-copy-and-diamond-shaped-joins-for-simplifier

(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 100 -c 20 -m -s             
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=100, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    119.1 MiB    119.1 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    119.1 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    121.6 MiB     -0.4 MiB         101       for i in range(ncalls):
    44    121.6 MiB      2.1 MiB         100           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    121.6 MiB      0.0 MiB           1       return df

Client side time elapsed:  4.90127420425415
Time elapsed per API call:  0.0490127420425415
SQL execution time elapsed:  0.17809391021728516
Total execution time elapsed:  5.0793681144714355
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 200 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=200, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    118.6 MiB    118.6 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    118.6 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    125.1 MiB     -4.1 MiB         201       for i in range(ncalls):
    44    125.1 MiB      2.5 MiB         200           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    125.1 MiB      0.0 MiB           1       return df

Client side time elapsed:  18.77964496612549
Time elapsed per API call:  0.09389822483062744
SQL execution time elapsed:  0.26385974884033203
Total execution time elapsed:  19.04350471496582
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 400 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=400, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.8 MiB    117.8 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.8 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    134.3 MiB      0.0 MiB         401       for i in range(ncalls):
    44    134.3 MiB     16.5 MiB         400           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    134.3 MiB      0.0 MiB           1       return df

Client side time elapsed:  72.3506338596344
Time elapsed per API call:  0.180876584649086
SQL execution time elapsed:  0.540585994720459
Total execution time elapsed:  72.89121985435486
(devenv) stan@XNJJ67DMFL perf % python perf_runner.py with_column 800 -c 20 -m -s
Snowpark Python API Performance Test
Parameters:  Namespace(api='with_column', columns=20, memory=True, ncalls=800, simplify=True)
Filename: perf_runner.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    117.4 MiB    117.4 MiB           1   def with_column(session: Session, ncalls: int) -> DataFrame:
    42    117.4 MiB      0.0 MiB           1       df = session.sql("select 1 as a")
    43    140.7 MiB     -0.0 MiB         801       for i in range(ncalls):
    44    140.7 MiB     23.3 MiB         800           df = df.with_column(f"{'a' * 50}{i}", (col("a") + 1))
    45    140.7 MiB      0.0 MiB           1       return df

Client side time elapsed:  293.10107493400574
Time elapsed per API call:  0.36637634366750715
SQL execution time elapsed:  0.8822462558746338
Total execution time elapsed:  293.98332118988037

@sfc-gh-stan sfc-gh-stan force-pushed the fix-df-copy-and-diamond-shaped-joins-for-simplifier branch from a7ea86b to 40cf9e4 Compare September 8, 2023 23:35
@sfc-gh-stan sfc-gh-stan enabled auto-merge (squash) September 11, 2023 14:14
@sfc-gh-stan sfc-gh-stan merged commit d212d69 into main Sep 11, 2023
50 checks passed
@sfc-gh-stan sfc-gh-stan deleted the fix-df-copy-and-diamond-shaped-joins-for-simplifier branch September 11, 2023 14:53
@github-actions github-actions bot locked and limited conversation to collaborators Sep 11, 2023
Successfully merging this pull request may close these issues.

SNOW-831813: Unsupported subquery type cannot be evaluated of leftsemi join