SNOW-1489371: Implement GroupBy.value_counts (#1986)
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1489371

2. Fill out the following pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.

3. Please describe how your code solves the related issue.

   This PR adds support for GroupBy.value_counts, accepting all parameters except `bins`, which we do not support for DataFrame/Series.value_counts.

   Upstream modin defaults to pandas for both DataFrameGroupBy/SeriesGroupBy.value_counts, so some of these changes should eventually be upstreamed.

   pandas behaves differently from what its documentation might lead you to expect; this PR tries to align with the existing behavior as much as possible. The discrepancies are documented in this pandas issue: pandas-dev/pandas#59307

   1. When `normalize=True`, pandas sorts by the pre-normalization counts, leading to counterintuitive results. This only matters when `groupby` is called with `sort=False` and `value_counts` with `sort=True`. See test cases for an example.
   2. pandas does not always respect the original order of data, depending on the configuration of sort flags in `groupby` and the `value_counts` call itself.
The behaviors are as follows (copied from query compiler comments):

```
# pandas currently provides the following behaviors based on the different sort flags.
# These behaviors are not entirely consistent with documentation; see this issue for discussion:
# pandas-dev/pandas#59307
#
# Example data (using pandas 2.2.1 behavior):
# >>> df = pd.DataFrame({"X": ["B", "A", "A", "B", "B", "B"], "Y": [4, 1, 3, -2, -1, -1]})
#
# 1. groupby(sort=True).value_counts(sort=True)
#   Sort on non-grouping columns, then sort on frequencies, then sort on grouping columns.
# >>> df.groupby("X", sort=True).value_counts(sort=True)
# X  Y
# A   1    1
#     3    1
# B  -1    2
#    -2    1
#     4    1
# Name: count, dtype: int64
#
# 2. groupby(sort=True).value_counts(sort=False)
#   Sort on non-grouping columns, then sort on grouping columns.
# >>> df.groupby("X", sort=True).value_counts(sort=False)
# X  Y
# A   1    1
#     3    1
# B  -2    1
#    -1    2
#     4    1
# Name: count, dtype: int64
#
# 3. groupby(sort=False).value_counts(sort=True)
#   Sort on frequencies.
# >>> df.groupby("X", sort=False).value_counts(sort=True)
# X  Y
# B  -1    2
#     4    1
# A   1    1
#     3    1
# B  -2    1
# Name: count, dtype: int64
#
# 4. groupby(sort=False).value_counts(sort=False)
#   Sort on nothing (entries match the order of the original frame).
# >>> df.groupby("X", sort=False).value_counts(sort=False)
# X  Y
# B   4    1
# A   1    1
#     3    1
# B  -2    1
#    -1    2
# Name: count, dtype: int64
#
# Lastly, when `normalize` is set with groupby(sort=False).value_counts(sort=True, normalize=True),
# pandas will sort by the pre-normalization counts rather than the resulting proportions. As this
# is an uncommon edge case that we cannot handle efficiently using existing QC methods, we just
# update our testing code to account for it.
# See comment
```

---------

Co-authored-by: Andong Zhan <[email protected]>
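The four orderings described in the comment can be reproduced with a chain of stable sorts. The sketch below is a pure-Python illustration of that ordering logic only. It is not the query compiler code from this commit, and `grouped_value_counts`, its parameters, and `ROWS` are hypothetical names introduced here. It assumes a single grouping column and a single value column.

```python
from collections import Counter

# Example data from the commit message, as (X, Y) pairs in frame order.
ROWS = [("B", 4), ("A", 1), ("A", 3), ("B", -2), ("B", -1), ("B", -1)]


def grouped_value_counts(rows, groupby_sort=True, value_counts_sort=True,
                         normalize=False):
    """Hypothetical helper emulating the documented pandas 2.2.1 ordering.

    Returns a list of ((group, value), count_or_proportion) tuples.
    """
    # Counter keeps first-seen order, standing in for the original frame order.
    items = list(Counter(rows).items())

    # Python's list.sort is stable, so each later sort only reorders
    # entries that differ on its own key.
    if groupby_sort:
        items.sort(key=lambda kv: kv[0][1])             # non-grouping column
    if value_counts_sort:
        items.sort(key=lambda kv: kv[1], reverse=True)  # raw frequencies
    if groupby_sort:
        items.sort(key=lambda kv: kv[0][0])             # grouping column

    if normalize:
        # Proportions are computed after sorting, so the order still
        # reflects the pre-normalization counts.
        totals = Counter(group for group, _ in rows)
        items = [(key, count / totals[key[0]]) for key, count in items]
    return items


if __name__ == "__main__":
    for gb_sort in (True, False):
        for vc_sort in (True, False):
            print(gb_sort, vc_sort, grouped_value_counts(ROWS, gb_sort, vc_sort))
```

Calling `grouped_value_counts(ROWS, groupby_sort=False, value_counts_sort=True, normalize=True)` shows the edge case from the last paragraph of the comment: a proportion of 0.25 for `("B", 4)` is listed ahead of 0.5 for `("A", 1)`, because the ordering was fixed by the raw counts.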
1 parent 19538ef, commit 20837fc
Showing 8 changed files with 573 additions and 10 deletions.