SNOW-1489371: Implement GroupBy.value_counts #1986

Merged
sfc-gh-joshi merged 21 commits into main from joshi-SNOW-1489371-groupby-value_count
Aug 30, 2024

Conversation

sfc-gh-joshi
Contributor

@sfc-gh-joshi sfc-gh-joshi commented Jul 25, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1489371

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

This PR adds support for GroupBy.value_counts, accepting all parameters except bins, which we also do not support for DataFrame/Series.value_counts.
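For reference, a minimal sketch of the parameters in question, run against native pandas rather than Snowpark pandas. The frame is the same example data used in the query compiler comments, and the variable name `frame_counts` is introduced here for illustration only.

```python
import pandas as pd

# Example data borrowed from the query compiler comments in this PR.
df = pd.DataFrame(
    {"X": ["B", "A", "A", "B", "B", "B"], "Y": [4, 1, 3, -2, -1, -1]}
)

# DataFrameGroupBy.value_counts accepts subset, normalize, sort,
# ascending and dropna. Here we count Y values within each X group,
# ordering ascending by frequency and keeping NaN rows (none are present).
frame_counts = df.groupby("X").value_counts(
    subset=["Y"], ascending=True, dropna=False
)
print(frame_counts)
```

The `bins` parameter exists only on the SeriesGroupBy variant in native pandas, which is the parameter this PR leaves unsupported.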

Upstream modin defaults to pandas for both DataFrameGroupBy/SeriesGroupBy.value_counts, so some of these changes should be eventually upstreamed.

pandas has different behavior than what might be expected from documentation; this PR tries to align with existing behavior as much as possible. This is documented in this pandas issue: pandas-dev/pandas#59307

  1. When normalize=True, pandas sorts by the pre-normalization counts, leading to counterintuitive results. This only matters when groupby is called with sort=False and value_counts with sort=True. See test cases for an example.
  2. pandas does not always respect the original order of data, depending on the configuration of sort flags in groupby and the value_counts call itself. The behaviors are as follows (copied from query compiler comments):
# pandas currently provides the following behaviors based on the different sort flags.
# These behaviors are not entirely consistent with documentation; see this issue for discussion:
# https://github.com/pandas-dev/pandas/issues/59307
#
# Example data (using pandas 2.2.1 behavior):
# >>> df = pd.DataFrame({"X": ["B", "A", "A", "B", "B", "B"], "Y": [4, 1, 3, -2, -1, -1]})
#
# 1. groupby(sort=True).value_counts(sort=True)
#   Sort on non-grouping columns, then sort on frequencies, then sort on grouping columns.
# >>> df.groupby("X", sort=True).value_counts(sort=True)
# X  Y
# A   1    1
#     3    1
# B  -1    2
#    -2    1
#     4    1
# Name: count, dtype: int64
#
# 2. groupby(sort=True).value_counts(sort=False)
#   Sort on non-grouping columns, then sort on grouping columns.
# >>> df.groupby("X", sort=True).value_counts(sort=False)
# X  Y
# A   1    1
#     3    1
# B  -2    1
#    -1    2
#     4    1
# Name: count, dtype: int64
#
# 3. groupby(sort=False).value_counts(sort=True)
#   Sort on frequencies.
# >>> df.groupby("X", sort=False).value_counts(sort=True)
# X  Y
# B  -1    2
#     4    1
# A   1    1
#     3    1
# B  -2    1
# Name: count, dtype: int64
#
# 4. groupby(sort=False).value_counts(sort=False)
#   Sort on nothing (entries match the order of the original frame).
# >>> df.groupby("X", sort=False).value_counts(sort=False)
# X  Y
# B   4    1
# A   1    1
#     3    1
# B  -2    1
#    -1    2
# Name: count, dtype: int64
#
# Lastly, when `normalize` is set with groupby(sort=False).value_counts(sort=True, normalize=True),
# pandas will sort by the pre-normalization counts rather than the resulting proportions. This is
# an uncommon edge case that we cannot handle efficiently using existing QC methods, so we just
# update our testing code to account for it.
# See comment 
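
The four flag combinations and the `normalize` edge case described above can be reproduced in native pandas with a short script. This is a sketch for experimentation, not code from the PR: the names `results`, `as_dicts`, `same_counts`, `plain`, and `normalized` are introduced here, and the row order printed may vary with the installed pandas version (the comments above were written against pandas 2.2.1). Only the order-independent facts are checked.

```python
import itertools

import pandas as pd

# Example frame taken from the query compiler comments above.
df = pd.DataFrame(
    {"X": ["B", "A", "A", "B", "B", "B"], "Y": [4, 1, 3, -2, -1, -1]}
)

# Run every combination of the groupby and value_counts sort flags.
results = {}
for groupby_sort, value_counts_sort in itertools.product([True, False], repeat=2):
    counts = df.groupby("X", sort=groupby_sort).value_counts(sort=value_counts_sort)
    results[(groupby_sort, value_counts_sort)] = counts
    print(f"groupby(sort={groupby_sort}).value_counts(sort={value_counts_sort})")
    print(counts, end="\n\n")

# The flags only affect row order, so the (X, Y) -> count mapping
# should be identical in all four cases.
as_dicts = [series.to_dict() for series in results.values()]
same_counts = all(d == as_dicts[0] for d in as_dicts)
print("same counts in all four cases:", same_counts)

# Edge case: with normalize=True, compare the row order of the
# proportions against the row order of the raw counts.
plain = df.groupby("X", sort=False).value_counts(sort=True)
normalized = df.groupby("X", sort=False).value_counts(sort=True, normalize=True)
print(normalized)
print(
    "row order matches the un-normalized result:",
    list(normalized.index) == list(plain.index),
)
```

If the final line prints `True`, the proportions were ordered by the pre-normalization counts, which is the behavior the comments describe and the test suite works around.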

@sfc-gh-joshi sfc-gh-joshi marked this pull request as ready for review July 26, 2024 17:00
@sfc-gh-joshi sfc-gh-joshi requested a review from a team as a code owner July 26, 2024 17:00
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1489371-groupby-value_count branch from 52086f6 to 2d23b99 Compare July 26, 2024 17:27
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1489371-groupby-value_count branch 2 times, most recently from 7514121 to d719298 Compare August 8, 2024 23:47
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1489371-groupby-value_count branch from d719298 to bc5962b Compare August 28, 2024 00:26
@sfc-gh-joshi sfc-gh-joshi changed the title SNOW-1489371: Implement GroupBy.value_count SNOW-1489371: Implement GroupBy.value_counts Aug 28, 2024
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1489371-groupby-value_count branch 2 times, most recently from 8a57df9 to caadf9f Compare August 29, 2024 22:23
Contributor

@sfc-gh-vbudati sfc-gh-vbudati left a comment


Great work, Jonathan! This seems like a monstrous task, thank you for having such detailed documentation about it. I left a couple comments and I think it's good to merge after.

@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1489371-groupby-value_count branch from 1336993 to e48dfe3 Compare August 30, 2024 20:03
@sfc-gh-joshi sfc-gh-joshi enabled auto-merge (squash) August 30, 2024 20:14
@sfc-gh-joshi sfc-gh-joshi merged commit 20837fc into main Aug 30, 2024
35 checks passed
@sfc-gh-joshi sfc-gh-joshi deleted the joshi-SNOW-1489371-groupby-value_count branch August 30, 2024 21:23
@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024
3 participants