Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1119855: Push base unimplemented methods to query compiler when possible (1/?) #2059

Merged
merged 8 commits into from
Aug 19, 2024

Conversation

sfc-gh-joshi
Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi commented Aug 8, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1119855 (first of several PRs)

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

This PR prepares for the imminent removal of base.py from the Snowpark pandas codebase by pushing as many unimplemented methods down to the query compiler as possible. Some unimplemented methods must still be overridden because Modin uses other implemented query compiler/frontend methods in their implementation; these methods have been moved to src/snowflake/snowpark/modin/plugin/extensions/base_overrides.py.

The following methods are affected:

The below BasePandasDataset methods either have existing corresponding query compiler stubs that can be used directly, or default to pandas at the query compiler layer, which Snowpark pandas specifies to raise a NotImplementedError.

Pushed to the query compiler
  • align
  • at_time (via between_time)
  • between_time
  • clip
  • combine (via binary_op)
  • combine_first (via binary_op)
  • explode
  • infer_objects
  • kurt
  • kurtosis (via kurt)
  • mode
  • sem
  • tz_convert (via tz_convert)
  • tz_localize (via tz_localize)
  • __sizeof__

The below methods do not have corresponding query compiler methods, and either (1) would work in Snowpark pandas but haven't been tested/need too many queries, (2) use frontend/other query compiler methods that we cannot override, or (3) manipulates index/columns in ways that we may not support.

Moved to base_overrides.py
  • asof: modin uses dropna + reindex
  • bool: method is deprecated in pandas but pretty trivial
  • droplevel: modin does manipulations on column/index objects
  • ewm: modin defaults at frontend, would materialize data
  • filter: modin uses loc + manipulations on column/index
  • pipe: just kind of weird, probably would materialize data
  • pop: modin uses loc + del
  • reorder_levels: modin uses properties on index object
  • set_flags: modin defaults at frontend, would materialize data
  • swapaxes: modin uses transpose + index/column manipulations
  • swaplevel: modin uses native index swaplevel
  • transform: modin uses QC agg method
  • truncate: modin manipulates index objects
  • __finalize__: modin defaults at forntend, would materialize data
  • These methods are also moved to base_overrides.py.

    I/O methods that default on the frontend and would materialize data
  • to_clipboard
  • to_excel
  • to_hdf
  • to_json
  • to_latex
  • to_markdown
  • to_pickle
  • to_string
  • to_sql
  • to_timestamp
  • to_xarray
  • Other
  • __array_wrap__: this was removed in pandas 2.0
  • reindex_like: at some point, this method was moved from base.py into dataframe.py/series.py upstream; we continue to mark these as unimplemented in our vendored copies of dataframe/series for now
  • map: this was also split into dataframe.py/series.py; since we partially support the method for series (but not currently DF) we mark this method as unimplemented only in dataframe.py
  • transform: we partially support the method for DF (but not for series), so we add an override in dataframe_overrides.py with an implementation and an unimplemented stub in series_overrides.py
  • @sfc-gh-joshi sfc-gh-joshi requested a review from a team as a code owner August 8, 2024 23:08
    @sfc-gh-joshi sfc-gh-joshi added NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs labels Aug 8, 2024
    @sfc-gh-joshi sfc-gh-joshi force-pushed the joshi/push-base-unimplemented branch 2 times, most recently from ea5bb70 to 649610f Compare August 12, 2024 22:21
    @sfc-gh-azhan
    Copy link
    Collaborator

    You can move base.py out of the code coverage check given it will be removed soon.

    @sfc-gh-joshi
    Copy link
    Contributor Author

    You can move base.py out of the code coverage check given it will be removed soon.

    If I modify tox.ini (which seems to be where files get excluded) it'll trigger a review request on the Python team, so I'll just add manual # pragma: no covers for now.

    Copy link
    Contributor

    @sfc-gh-rdurrani sfc-gh-rdurrani left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM, great work Jonathan!

    @sfc-gh-joshi sfc-gh-joshi force-pushed the joshi/push-base-unimplemented branch from 598bd9c to 18dfe8a Compare August 19, 2024 22:39
    @sfc-gh-joshi sfc-gh-joshi merged commit 6cd4a97 into main Aug 19, 2024
    35 checks passed
    @sfc-gh-joshi sfc-gh-joshi deleted the joshi/push-base-unimplemented branch August 19, 2024 23:14
    @github-actions github-actions bot locked and limited conversation to collaborators Aug 19, 2024
    Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
    Labels
    NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants