[Local Testing] SNOW-783554 Support selecting window expressions #1049

Merged
merged 11 commits into from
Oct 18, 2023

sfc-gh-stan
Collaborator

@sfc-gh-stan sfc-gh-stan commented Sep 13, 2023

  • Make non-rank-related functions with an unspecified window frame use a cumulative window by default
  • Support aggregating window expressions
  • Support more rank-related functions and refactor to allow patching for rank-related functions (this will be addressed in a separate PR).
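
As a rough illustration of the first bullet, the sketch below uses plain pandas (not the mock implementation from this PR) to show what a cumulative frame means for an aggregate such as MIN when the window has an ORDER BY but no explicit frame: each row sees its partition from the first ordered row up to itself. The DataFrame, the column names, and the use of groupby/cummin are assumptions made for the example. Note that cummin works row by row and does not model how tied ORDER BY values are treated.

```python
import pandas as pd

# Hypothetical data: partition by "a", order by "b", aggregate "v".
df = pd.DataFrame({"a": [1, 1, 1, 2], "b": [2, 1, 3, 1], "v": [30, 20, 10, 40]})

# Cumulative frame: within each partition, take the running minimum of "v"
# in "b" order. The result keeps the original index, so assignment realigns
# it with the unsorted rows.
ordered = df.sort_values(["a", "b"])
df["cumulative_min_v"] = ordered.groupby("a")["v"].cummin()

# For contrast: a frame covering the whole partition gives one value per group.
df["partition_min_v"] = df.groupby("a")["v"].transform("min")

print(df)
```

The two result columns differ for partition a=1, which is the behavior change the first bullet describes for functions that previously had no default frame.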

@sfc-gh-stan sfc-gh-stan marked this pull request as ready for review September 14, 2023 19:23
@sfc-gh-stan sfc-gh-stan requested a review from a team as a code owner September 14, 2023 19:23
@sfc-gh-stan sfc-gh-stan requested review from sfc-gh-mkeller and sfc-gh-sfan and removed request for a team September 14, 2023 19:23
@sfc-gh-stan sfc-gh-stan force-pushed the local/support-window-functions branch from be640d9 to d07b157 Compare September 19, 2023 17:26
@@ -137,7 +139,7 @@ def mock_avg(column: ColumnEmulator) -> ColumnEmulator:
 cnt += 1
 # round to 5 according to snowflake spec
 ret = (
-ColumnEmulator(data=[round((ret / cnt), 5)])
+ColumnEmulator(data=[round((ret / cnt), 3)])
Contributor

There are other places in this module where we round to 5 (I did this previously),
so do we also need to update the other mock implementations?

Collaborator Author

Yes, I will address these all together in a separate PR.

@sfc-gh-stan sfc-gh-stan force-pushed the local/support-window-functions branch from d07b157 to 40a2e7e Compare September 25, 2023 13:38
@sfc-gh-yixie
Collaborator

I tried this

from snowflake.snowpark import Session, Window
from snowflake.snowpark.functions import min as min_
from snowflake.snowpark.mock.connection import MockServerConnection

session = Session(MockServerConnection())

df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
df.select(min_("b").over(Window.partition_by("a").order_by("b"))).show()
df.select(min_("b").over(Window.partition_by("a").order_by("b").range_between(0, 1))).show()

Two problems:

  1. The first show() outputs the column name min("B") OVER (PARTITION BY "A" ORDER BY "B" A.... I guess Snowflake uses all upper case.
  2. The second show() doesn't work: AttributeError: 'numpy.ndarray' object has no attribute 'loc'.

@sfc-gh-stan sfc-gh-stan force-pushed the local/support-window-functions branch from 4934593 to 56b2034 Compare October 5, 2023 19:11
@sfc-gh-stan sfc-gh-stan force-pushed the local/support-window-functions branch from 56b2034 to 71ca36c Compare October 11, 2023 17:38
@@ -285,7 +308,7 @@ def analyze(
self.session._conn._telemetry_client.send_function_usage_telemetry(
expr.api_call_source, TelemetryField.FUNC_CAT_USAGE.value
)
func_name = expr.name.upper() if parse_local_name else expr.name
Collaborator Author

@sfc-gh-stan sfc-gh-stan Oct 11, 2023

This change always converts function names to uppercase. cc: @sfc-gh-yixie

@@ -270,6 +270,8 @@ def add_date_and_number(


class ColumnEmulator(pd.Series):
_metadata = ["sf_type"]
Collaborator Author

Adding this to preserve sf_type after index operations like reset_index or sort_index.

Contributor

I don't see this being used anywhere?

Collaborator Author

This is needed by pandas internally to preserve added property in subclasses: https://pandas.pydata.org/docs/development/extending.html#define-original-properties
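
For context, here is a minimal sketch of the pandas mechanism referenced here, following the pattern in the linked pandas docs. TypedSeries is a hypothetical stand-in for ColumnEmulator, and the _constructor override is part of the documented subclassing pattern, assumed here and not taken from this PR:

```python
import pandas as pd


class TypedSeries(pd.Series):
    # Names listed in _metadata are copied to results by pandas' __finalize__.
    _metadata = ["sf_type"]

    @property
    def _constructor(self):
        # Make pandas operations return TypedSeries instead of a plain Series.
        return TypedSeries


s = TypedSeries([3, 1, 2], index=[2, 0, 1])
s.sf_type = "LongType"  # illustrative value only

sorted_s = s.sort_index()
reset_s = s.reset_index(drop=True)

# Both results are still TypedSeries and still carry sf_type.
print(type(sorted_s).__name__, sorted_s.sf_type)
print(type(reset_s).__name__, reset_s.sf_type)
```

Without the _metadata entry, the attribute would not be carried over to the objects returned by sort_index or reset_index.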

Contributor

shall we add a comment here?

@@ -251,20 +265,30 @@ def test_range_between_should_accept_at_most_one_order_by_expression_when_unboun
 df.select(
 min_("key").over(window.range_between(Window.unboundedPreceding, 1))
 ).collect()
-assert "Cumulative window frame unsupported for function MIN" in str(ex_info)
+if not local_testing_mode:
+assert "Cumulative window frame unsupported for function MIN" in str(
Collaborator Author

The error message doesn't make sense. This error is caused by window.range_between(Window.unboundedPreceding, 1): you cannot specify range between for non-cumulative window frames. As evidence, changing the expression to window.range_between(Window.unboundedPreceding, 0) runs regardless of the number of order expressions.

@@ -283,19 +307,28 @@ def test_range_between_should_accept_numeric_values_only_when_bounded(session):
 df.select(
 min_("value").over(window.range_between(Window.unboundedPreceding, 1))
 ).collect()
-assert "Cumulative window frame unsupported for function MIN" in str(ex_info)
+if not local_testing_mode:
Collaborator Author

Same issue as above: the error message is confusing and the actual failure is not caused by numeric values.

Collaborator Author

We should rewrite these two tests for both live connection and local testing.

@sfc-gh-stan
Collaborator Author

Both problems reported by @sfc-gh-yixie above (the column name casing in the first show() and the AttributeError raised with range_between) are fixed.

@sfc-gh-stan sfc-gh-stan requested review from sfc-gh-yixie, sfc-gh-aling and sfc-gh-jdu and removed request for sfc-gh-mkeller October 11, 2023 17:50
src/snowflake/snowpark/mock/plan.py
) # dtype=object prevents implicit converting None to Nan
res_col.index = res_index
return res_col.sort_index()
elif isinstance(window_function, FirstValue):
Contributor

nit: it feels like the FirstValue and LastValue calculations share a lot of logic,
except for w.iloc[0] vs w.iloc[len(w) - 1] and for cur_idx in range(len(w)): vs for cur_idx in range(len(w) - 1, -1, -1):.
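
One possible shape for that shared logic, as a sketch only. The helper name edge_value and its last/ignore_nulls parameters are hypothetical, and it assumes the two loops scan the window for the first non-null value in their respective directions, which this excerpt does not show:

```python
import pandas as pd


def edge_value(w: pd.Series, last: bool = False, ignore_nulls: bool = False):
    """Return the first (or last) value of window w, optionally skipping nulls."""
    if len(w) == 0:
        return None
    # The only difference between the two functions is the scan direction.
    positions = range(len(w) - 1, -1, -1) if last else range(len(w))
    if not ignore_nulls:
        return w.iloc[positions[0]]
    for cur_idx in positions:
        value = w.iloc[cur_idx]
        if not pd.isna(value):
            return value
    return None


# dtype=object keeps None from being converted to NaN.
w = pd.Series([None, 2, 3, None], dtype=object)
print(edge_value(w, ignore_nulls=True))             # first non-null value
print(edge_value(w, last=True, ignore_nulls=True))  # last non-null value
```

With a helper like this, the FirstValue and LastValue branches would differ only in the last argument they pass.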

Collaborator Author

This is very true. I am planning to refactor this part in a separate PR so all window function definitions can be found in mock/functions.py.

Contributor

Let's create a JIRA, or log it in an existing one, so that we don't forget?

@sfc-gh-stan sfc-gh-stan merged commit 2ad4e3e into dev/local-testing Oct 18, 2023
8 of 39 checks passed
@sfc-gh-stan sfc-gh-stan deleted the local/support-window-functions branch October 18, 2023 22:22
@github-actions github-actions bot locked and limited conversation to collaborators Oct 18, 2023