Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1119855: Remove modin/pandas/base.py (2/2) #2167

Merged
merged 14 commits into from
Aug 30, 2024
Merged

Conversation

sfc-gh-joshi
Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi commented Aug 26, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1119855

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

This followup to #2059 removes our vendored copy of base.py altogether from the codebase. Many methods are still overridden in snowflake/snowpark/modin/plugin/extensions/base_overrides.py, with the reason for overriding each given inline in the comments. These methods are as follows:

list
  • agg/aggregate
  • _agg_helper
  • various aggregations (count, max, min, mean, median, std, sum, var)
  • _binary_op
  • _dropna
  • fillna
  • isin
  • quantile
  • _to_series_list
  • shift
  • skew
  • resample
  • expanding
  • rolling
  • indexer properties (iloc, loc, iat, at)
  • __getitem__
  • sort_values
  • where
  • mask
  • to_csv
  • sample
  • pct_change
  • astype
  • drop
  • __len__
  • set_axis
  • describe
  • diff
  • tail
  • idxmax
  • idxmin
  • unary operators (abs, __invert__, __neg__)
  • rename_axis
  • __array__ufunc__
  • reindex
  • _get_index
  • _set_index

Some of these differences can be upstreamed fairly easily, and we will work to upstream them once the updated modin build process for Snowflake becomes clearer. Some methods will require significantly more work to reconcile.

@sfc-gh-joshi sfc-gh-joshi requested a review from a team as a code owner August 26, 2024 21:41
@sfc-gh-joshi sfc-gh-joshi added NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs labels Aug 26, 2024
Copy link
Contributor

@sfc-gh-rdurrani sfc-gh-rdurrani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall - left a minor nit, and a minor question (just to confirm my understanding). Would approve, but I think we may need to add testing/annotations in order to get the coverage test passing!

src/snowflake/snowpark/modin/pandas/dataframe.py Outdated Show resolved Hide resolved
@@ -108,7 +133,7 @@
],
apilink="pandas.Series",
)
class Series(BasePandasDataset):
class Series(BasePandasDataset, metaclass=TelemetryMeta):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems we will have to overwrite the class definition regardless, otherwise the telemetry will not work. Or is modin providing any telemetry infra for use? i recall @sfc-gh-azhan and @sfc-gh-nkrishna was doing something for telemetry. I don't think we need to block on this pr for this, let's follow up offline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modin has its own telemetry mechanisms, but there's not really a good way for us to hook into it to send data back to Snowflake.

As far as I can tell, the TelemetryMeta class basically wraps every method/property of the class in a flavor of snowpark_pandas_telemetry_decorator, so once we fully use the modin frontend, we could monkeypatch those methods at runtime to have this behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recall that you mentioned that we need to overwrite the class definition for Series, so that it can take snowpandas index as an input, if we also put metaclass=TelemetryMeta when overwrite the class definition, would it work?
if we do it as monkey patch later, how would it look like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I referred to was overriding the __init__ method of Series, not the actual class declaration itself. If we were to override the class declaration, it wouldn't be much different from what we're currently doing.

My thinking is that when we remove Series/DF, we do something like this:

for attr_name, attr_value in Series.__dict__:
    register_series_override(attr_name)(snowpark_pandas_telemetry_method(attr_value))

which would overwrite the existing method with a telemetry-enabled version.



def register_base_override(method_name: str):
def decorator(base_method: Any):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this function try to do here ? can yo add some comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream modin doesn't have an equivalent of register_dataframe_override for BasePandasDataset, so this method tries to simulate that behavior. I added a comment to that effect.

def decorator(base_method: Any):
if callable(base_method) and (
not method_name.startswith("_")
or (method_name in TELEMETRY_PRIVATE_METHODS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, now since the base class is not declared with metaclass=TelemetryMeta, does that mean we will miss telemetry the comes from the base class? if that is the case, this could be a regression for telemetry cc @sfc-gh-azhan

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-joshi Let's add a test to make sure we have telemetry for those cases. E.g., series.mean which does not have a series mean method but only use the base's mean method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. mean is actually overridden here, so it gets our telemetry annotation, but I've added a test for copy, which is inherited straight from BasePandasDataset. I had to add a little extra logic to TelemetryMeta to make it work, and I'll revisit it when I remove dataframe/series.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iloc is overridden within this file, and it already has an existing test in test_telemetry.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-joshi Does that mean for a method that is just defined in base, if they are not overridden here with the register_base_override annotation, we will not have the telemetry. Till the end, we will align most implementation in this file with modin, and do not have much override here, what happens then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they are not overridden here with the register_base_override annotation, we will not have the telemetry.

Yes, though for now, I've pushed a change to TelemetryMeta to force Series/DataFrame to apply the telemetry annotation to methods they inherit from BasePandasDataset.

Till the end, we will align most implementation in this file with modin, and do not have much override here, what happens then?

As I mentioned above: one possible approach would be to use the plugin system to iterate over all of Series/DF's methods, and replace them with a telemetry-annotated version.


Returns
-------
BasePandasDataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally that seems in base but now it is duplicated in both series and dataframe, can this be in base_overrides to avoid duplication?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream modin doesn't define __array_function__ on any modin.pandas class, so we categorize it as an "extension" (new method in Snowpark pandas) rather than an "override" (new implementation of existing method). The other such extension methods we define here (to_snowflake, to_snowpark, to_pandas) are also identical for DF and Series, so I'm keeping with the pattern here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you look at to_snowflake, to_snowpark, and to_pandas, the fact they are defined in both series and daframe are for documentation purpose, they need slightly different description and example. The implementation is pretty much all reused.
However, the _array_function here seems a pure implementation duplication here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can define a __array_function__ in a new file called base_extension.py or extension_utils.py and reuse the code right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you look at to_snowflake, to_snowpark, and to_pandas, the fact they are defined in both series and daframe are for documentation purpose, they need slightly different description and example. The implementation is pretty much all reused.

Makes sense, I'll separate it out.

we can define a array_function in a new file called base_extension.py or extension_utils.py and reuse the code right?

Added base_extensions.py.



DOC_OVERRIDE_XFAIL_REASON = (
"test docstring overrides currently cannot override real docstring overrides until "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does that mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my comment above, on _init_doc_module:

We use base.py from upstream modin, so we need to initialize its doc module
However, since using the environment variable causes an importlib.reload call,
we need to manually call _inherit_docstrings (https://github.com/modin-project/modin/issues/7138)

As a workaround for upstream modin bugs, we use our own _inherit_docstrings instead of the upstream
function. We accordingly need to clear the docstring dictionary in testing because
we manually called the annotation on initializing snowflake.snowpark.modin.pandas.
snowflake.snowpark.modin.utils._attributes_with_docstrings_replaced.clear()
TODO: once modin 0.31.0 is available, use the actual modin DocModule class

Basically, our internal _inherit_docstrings class keeps track of which (class, method) tuples have already had their methods replaced. Because we already performed an _inherit_docstrings call on BasePandasDataset.astype (or whatever other docstring tests in this file are trying to replace), it will see that the docstring was already previously overridden and fail to do the replacement. I tried messing around with clearing the dictionary tracking docstring replacements, but that caused issues with other docstring tests getting run in the same environment.

This previously was not a problem because we did not perform direct calls to _inherit_docstrings, but since we now use the version of BasePandasDataset provided to us by upstream Modin, we need to do this call upon module initialization until we start using a version of Modin that doesn't call importlib.reload in its DocModule class.

I actually had the Modin version that has the fix for this incorrect in the comment below: it should be 0.31.0, not 0.30.1.

def decorator(base_method: Any):
if callable(base_method) and (
not method_name.startswith("_")
or (method_name in TELEMETRY_PRIVATE_METHODS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-joshi Let's add a test to make sure we have telemetry for those cases. E.g., series.mean which does not have a series mean method but only use the base's mean method.

@sfc-gh-joshi
Copy link
Contributor Author

LGTM overall - left a minor nit, and a minor question (just to confirm my understanding). Would approve, but I think we may need to add testing/annotations in order to get the coverage test passing!

@sfc-gh-rdurrani sounds good, I'll do a pass on coverage after this CI run.


Returns
-------
BasePandasDataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you look at to_snowflake, to_snowpark, and to_pandas, the fact they are defined in both series and daframe are for documentation purpose, they need slightly different description and example. The implementation is pretty much all reused.
However, the _array_function here seems a pure implementation duplication here

@@ -108,7 +133,7 @@
],
apilink="pandas.Series",
)
class Series(BasePandasDataset):
class Series(BasePandasDataset, metaclass=TelemetryMeta):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recall that you mentioned that we need to overwrite the class definition for Series, so that it can take snowpandas index as an input, if we also put metaclass=TelemetryMeta when overwrite the class definition, would it work?
if we do it as monkey patch later, how would it look like?

attr_dict = dict(attrs.items())
# If BasePandasDataset, defined exclusively by upstream modin, is a parent of this class,
# then apply the telemetry decorator to it.
# https://stackoverflow.com/a/71105206
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that something newly added, i am not quite getting what are we trying to handle here, do you have an example about what kind of telemetry case we are handling here? is that for new property added to base_extension?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just added this.

Apparently, the attrs argument passed to the metaclass only contains attributes that are directly defined on the class. For example, we define abs in series.py, so abs shows up in the old implementation of this metaclass, and gets the telemetry annotation added.

Methods that are declared on upstream Modin's BasePandasDataset and NOT redefined on Series/DF, such as copy, aren't visited by this method in our old implementation. As such, methods like copy wouldn't get the telemetry decorator. The new changes I added basically forcibly add telemetry to methods inherited from BasePandasDataset.

def _get_columns(self) -> native_pd.Index:
return self._modin_frame.data_columns_index

columns: native_pd.Index = property(_get_columns, set_columns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what would be the correct way to do this in the future, i think the same problem also applies to index. in general, i would think having get_columns, and set_columns two functions implemented at query compiler layer is very clear, and the frontend layer should call those method correctly

def decorator(base_method: Any):
if callable(base_method) and (
not method_name.startswith("_")
or (method_name in TELEMETRY_PRIVATE_METHODS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-joshi Does that mean for a method that is just defined in base, if they are not overridden here with the register_base_override annotation, we will not have the telemetry. Till the end, we will align most implementation in this file with modin, and do not have much override here, what happens then?

@sfc-gh-joshi
Copy link
Contributor Author

@sfc-gh-yzou @sfc-gh-azhan I made some changes to telemetry to more closely aligned with what's necessary after we remove series.py/dataframe.py as well. In the previous version of the PR, I added some overrides in TelemetryMeta (used in our vendored Series and DataFrame classes to check methods defined on the parent BasePandasDataset as well, but this was causing issues with extension functions, and would have to change once we stop vendoring those frontend methods as well.

Instead, I've added a pass in snowflake/snowpark/modin/pandas/__init__.py that runs after extensions functions are initialized, which examines all non-extension functions defined by BasePandasDataset and replaces them with telemetry-annotated versions via the DF/Series override system. We will be able to continue using this method (with some slight changes) once we remove our vendored copies of dataframe.py and series.py.

Copy link
Collaborator

@sfc-gh-azhan sfc-gh-azhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sfc-gh-joshi sfc-gh-joshi merged commit 25c1006 into main Aug 30, 2024
35 checks passed
@sfc-gh-joshi sfc-gh-joshi deleted the joshi/rip-base branch August 30, 2024 19:01
@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants