SNOW-1654730: Refactor aggregation_utils to reduce duplication and clarify interfaces. #2270

Merged


Contributor

@sfc-gh-mvashishtha sfc-gh-mvashishtha commented Sep 11, 2024

Fixes SNOW-1654730

Changes to the interface of aggregation_utils

  • get_snowflake_agg_func now returns a NamedTuple, SnowflakeAggFunc, which contains a Snowpark aggregation method along with the bool preserves_snowpark_pandas_types. Formerly, get_snowflake_agg_func returned only a Snowpark aggregation, and the caller had to consult separate dictionaries to deduce how the aggregation affects Snowpark pandas types.
  • Prepend an underscore to the names of several objects in aggregation_utils that are only used internally, e.g. _array_agg_keepna and _PANDAS_AGGREGATION_TO_SNOWPARK_PANDAS_AGGREGATION.
  • Formerly, to get an axis=1 aggregation, query compiler methods would call generate_rowwise_aggregation_function. Now get_snowflake_agg_func is the common interface for getting a SnowflakeAggFunc, even for axis=1, and generate_rowwise_aggregation_function becomes an internal method named _generate_rowwise_aggregation_function.
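
To illustrate the new return shape described in the list above, here is a minimal sketch that does not import Snowpark. Only the names SnowflakeAggFunc and preserves_snowpark_pandas_types come from this PR; the field name snowpark_aggregation, the helper get_agg_func_sketch, the example table, and the use of plain Python callables in place of Snowpark aggregations are assumptions made for illustration.

```python
from typing import Any, Callable, NamedTuple, Optional


class SnowflakeAggFunc(NamedTuple):
    # Assumed field name; a plain callable stands in for a Snowpark aggregation.
    snowpark_aggregation: Callable[..., Any]
    # Whether the result keeps Snowpark pandas types; name taken from the PR text.
    preserves_snowpark_pandas_types: bool


# Hypothetical table for this sketch only; the real module's contents differ.
_EXAMPLE_AGGS = {
    "max": SnowflakeAggFunc(max, True),
    "count": SnowflakeAggFunc(len, False),
}


def get_agg_func_sketch(name: str) -> Optional[SnowflakeAggFunc]:
    """Return the aggregation and its type-preservation flag together.

    Hypothetical stand-in for get_snowflake_agg_func: the caller gets both
    pieces from one lookup instead of consulting separate dictionaries.
    """
    return _EXAMPLE_AGGS.get(name)


agg = get_agg_func_sketch("max")
if agg is not None:
    value = agg.snowpark_aggregation([1, 3, 2])
    keeps_types = agg.preserves_snowpark_pandas_types
```

Returning both values in one tuple means a caller cannot look up the aggregation and forget to check how it affects types.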

Changes to the internals of aggregation_utils

  • Before this commit, five different maps described how to translate pandas aggregations to Snowpark: SNOWFLAKE_BUILTIN_AGG_FUNC_MAP mapped Snowpark pandas aggregations to the axis=0 aggregation; GROUPBY_AGG_PRESERVES_SNOWPARK_PANDAS_TYPE and GROUPBY_AGG_WITH_NONE_SNOWPARK_PANDAS_TYPES told whether the axis=0 aggregations would preserve Snowpark pandas types; SNOWFLAKE_COLUMNS_AGG_FUNC_MAP told how to aggregate on axis=1 when skipna=True; and SNOWFLAKE_COLUMNS_KEEPNA_AGG_FUNC_MAP told how to aggregate on axis=1 when skipna=False. These maps repeated the mapping of pairs like "sum" and np.sum to the same aggregation function. This commit keeps a single mapping, _PANDAS_AGGREGATION_TO_SNOWPARK_PANDAS_AGGREGATION, from pandas aggregations to instances of the internal tuple _SnowparkPandasAggregation. Each _SnowparkPandasAggregation includes preserves_snowpark_pandas_type and, optionally, the aggregation functions for three cases: axis=0; axis=1 with skipna=False; and axis=1 with skipna=True.

New feature

  • As a consequence of the refactoring, groupby().var() no longer raises NotImplementedError. Because the operation is invalid in pandas, we now correctly raise TypeError instead.

sfc-gh-mvashishtha added the NO-PANDAS-CHANGEDOC-UPDATES (This PR does not update Snowpark pandas docs) label Sep 11, 2024
sfc-gh-mvashishtha force-pushed the mvashishtha/SNOW-1654730/refactor-aggregation-utils branch from d1e9cb0 to 5486008 September 11, 2024 16:45
sfc-gh-mvashishtha force-pushed the mvashishtha/SNOW-1654730/refactor-aggregation-utils branch from 5486008 to 2fc1f15 September 11, 2024 16:46
sfc-gh-mvashishtha marked this pull request as ready for review September 11, 2024 19:02
sfc-gh-mvashishtha requested a review from a team as a code owner September 11, 2024 19:02
sfc-gh-mvashishtha added the NO-CHANGELOG-UPDATES (This pull request does not need to update CHANGELOG.md) label Sep 11, 2024
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Contributor

@sfc-gh-nkrishna sfc-gh-nkrishna left a comment

Nice refactor @sfc-gh-mvashishtha, just left one suggestion for an additional test.

Collaborator

@sfc-gh-azhan sfc-gh-azhan left a comment

One suggestion left, and it looks good to me!

…pandas aggregation.

Signed-off-by: sfc-gh-mvashishtha <[email protected]>
@sfc-gh-mvashishtha sfc-gh-mvashishtha enabled auto-merge (squash) September 12, 2024 21:29
@sfc-gh-mvashishtha sfc-gh-mvashishtha merged commit f2f4b33 into main Sep 12, 2024
35 checks passed
@sfc-gh-mvashishtha sfc-gh-mvashishtha deleted the mvashishtha/SNOW-1654730/refactor-aggregation-utils branch September 12, 2024 22:14
@github-actions github-actions bot locked and limited conversation to collaborators Sep 12, 2024
Labels
NO-CHANGELOG-UPDATES (This pull request does not need to update CHANGELOG.md), NO-PANDAS-CHANGEDOC-UPDATES (This PR does not update Snowpark pandas docs), snowpark-pandas
3 participants