-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNOW-1119855: Remove modin/pandas/base.py (2/2) #2167
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall - left a minor nit, and a minor question (just to confirm my understanding). Would approve, but I think we may need to add testing/annotations in order to get the coverage test passing!
src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
Show resolved
Hide resolved
@@ -108,7 +133,7 @@ | |||
], | |||
apilink="pandas.Series", | |||
) | |||
class Series(BasePandasDataset): | |||
class Series(BasePandasDataset, metaclass=TelemetryMeta): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems we will have to overwrite the class definition regardless, otherwise the telemetry will not work. Or is modin providing any telemetry infra for use? i recall @sfc-gh-azhan and @sfc-gh-nkrishna was doing something for telemetry. I don't think we need to block on this pr for this, let's follow up offline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modin has its own telemetry mechanisms, but there's not really a good way for us to hook into it to send data back to Snowflake.
As far as I can tell, the TelemetryMeta
class basically wraps every method/property of the class in a flavor of snowpark_pandas_telemetry_decorator
, so once we fully use the modin frontend, we could monkeypatch those methods at runtime to have this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i recall that you mentioned that we need to overwrite the class definition for Series, so that it can take snowpandas index as an input, if we also put metaclass=TelemetryMeta when overwrite the class definition, would it work?
if we do it as monkey patch later, how would it look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what I referred to was overriding the __init__
method of Series
, not the actual class declaration itself. If we were to override the class declaration, it wouldn't be much different from what we're currently doing.
My thinking is that when we remove Series/DF, we do something like this:
for attr_name, attr_value in Series.__dict__:
register_series_override(attr_name)(snowpark_pandas_telemetry_method(attr_value))
which would overwrite the existing method with a telemetry-enabled version.
src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
Show resolved
Hide resolved
|
||
|
||
def register_base_override(method_name: str): | ||
def decorator(base_method: Any): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this function try to do here ? can yo add some comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream modin doesn't have an equivalent of register_dataframe_override
for BasePandasDataset, so this method tries to simulate that behavior. I added a comment to that effect.
def decorator(base_method: Any): | ||
if callable(base_method) and ( | ||
not method_name.startswith("_") | ||
or (method_name in TELEMETRY_PRIVATE_METHODS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, now since the base class is not declared with metaclass=TelemetryMeta, does that mean we will miss telemetry the comes from the base class? if that is the case, this could be a regression for telemetry cc @sfc-gh-azhan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfc-gh-joshi Let's add a test to make sure we have telemetry for those cases. E.g., series.mean
which does not have a series mean method but only use the base's mean method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. mean
is actually overridden here, so it gets our telemetry annotation, but I've added a test for copy
, which is inherited straight from BasePandasDataset
. I had to add a little extra logic to TelemetryMeta
to make it work, and I'll revisit it when I remove dataframe/series.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iloc
is overridden within this file, and it already has an existing test in test_telemetry
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfc-gh-joshi Does that mean for a method that is just defined in base, if they are not overridden here with the register_base_override annotation, we will not have the telemetry. Till the end, we will align most implementation in this file with modin, and do not have much override here, what happens then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they are not overridden here with the register_base_override annotation, we will not have the telemetry.
Yes, though for now, I've pushed a change to TelemetryMeta
to force Series/DataFrame to apply the telemetry annotation to methods they inherit from BasePandasDataset.
Till the end, we will align most implementation in this file with modin, and do not have much override here, what happens then?
As I mentioned above: one possible approach would be to use the plugin system to iterate over all of Series/DF's methods, and replace them with a telemetry-annotated version.
|
||
Returns | ||
------- | ||
BasePandasDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originally that seems in base but now it is duplicated in both series and dataframe, can this be in base_overrides to avoid duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream modin doesn't define __array_function__
on any modin.pandas
class, so we categorize it as an "extension" (new method in Snowpark pandas) rather than an "override" (new implementation of existing method). The other such extension methods we define here (to_snowflake
, to_snowpark
, to_pandas
) are also identical for DF and Series, so I'm keeping with the pattern here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you look at to_snowflake, to_snowpark, and to_pandas, the fact they are defined in both series and daframe are for documentation purpose, they need slightly different description and example. The implementation is pretty much all reused.
However, the _array_function here seems a pure implementation duplication here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can define a __array_function__
in a new file called base_extension.py
or extension_utils.py
and reuse the code right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you look at to_snowflake, to_snowpark, and to_pandas, the fact they are defined in both series and daframe are for documentation purpose, they need slightly different description and example. The implementation is pretty much all reused.
Makes sense, I'll separate it out.
we can define a array_function in a new file called base_extension.py or extension_utils.py and reuse the code right?
Added base_extensions.py
.
|
||
|
||
DOC_OVERRIDE_XFAIL_REASON = ( | ||
"test docstring overrides currently cannot override real docstring overrides until " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does that mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my comment above, on _init_doc_module
:
We use base.py from upstream modin, so we need to initialize its doc module
However, since using the environment variable causes an importlib.reload call,
we need to manually call _inherit_docstrings (https://github.com/modin-project/modin/issues/7138)
As a workaround for upstream modin bugs, we use our own _inherit_docstrings instead of the upstream
function. We accordingly need to clear the docstring dictionary in testing because
we manually called the annotation on initializing snowflake.snowpark.modin.pandas.
snowflake.snowpark.modin.utils._attributes_with_docstrings_replaced.clear()
TODO: once modin 0.31.0 is available, use the actual modin DocModule class
Basically, our internal _inherit_docstrings
class keeps track of which (class, method)
tuples have already had their methods replaced. Because we already performed an _inherit_docstrings
call on BasePandasDataset.astype
(or whatever other docstring tests in this file are trying to replace), it will see that the docstring was already previously overridden and fail to do the replacement. I tried messing around with clearing the dictionary tracking docstring replacements, but that caused issues with other docstring tests getting run in the same environment.
This previously was not a problem because we did not perform direct calls to _inherit_docstrings
, but since we now use the version of BasePandasDataset
provided to us by upstream Modin, we need to do this call upon module initialization until we start using a version of Modin that doesn't call importlib.reload
in its DocModule
class.
I actually had the Modin version that has the fix for this incorrect in the comment below: it should be 0.31.0, not 0.30.1.
def decorator(base_method: Any): | ||
if callable(base_method) and ( | ||
not method_name.startswith("_") | ||
or (method_name in TELEMETRY_PRIVATE_METHODS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfc-gh-joshi Let's add a test to make sure we have telemetry for those cases. E.g., series.mean
which does not have a series mean method but only use the base's mean method.
9bc7743
to
4bcc47d
Compare
@sfc-gh-rdurrani sounds good, I'll do a pass on coverage after this CI run. |
|
||
Returns | ||
------- | ||
BasePandasDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you look at to_snowflake, to_snowpark, and to_pandas, the fact they are defined in both series and daframe are for documentation purpose, they need slightly different description and example. The implementation is pretty much all reused.
However, the _array_function here seems a pure implementation duplication here
@@ -108,7 +133,7 @@ | |||
], | |||
apilink="pandas.Series", | |||
) | |||
class Series(BasePandasDataset): | |||
class Series(BasePandasDataset, metaclass=TelemetryMeta): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i recall that you mentioned that we need to overwrite the class definition for Series, so that it can take snowpandas index as an input, if we also put metaclass=TelemetryMeta when overwrite the class definition, would it work?
if we do it as monkey patch later, how would it look like?
attr_dict = dict(attrs.items()) | ||
# If BasePandasDataset, defined exclusively by upstream modin, is a parent of this class, | ||
# then apply the telemetry decorator to it. | ||
# https://stackoverflow.com/a/71105206 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that something newly added, i am not quite getting what are we trying to handle here, do you have an example about what kind of telemetry case we are handling here? is that for new property added to base_extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I just added this.
Apparently, the attrs
argument passed to the metaclass only contains attributes that are directly defined on the class. For example, we define abs
in series.py
, so abs
shows up in the old implementation of this metaclass, and gets the telemetry annotation added.
Methods that are declared on upstream Modin's BasePandasDataset
and NOT redefined on Series/DF, such as copy
, aren't visited by this method in our old implementation. As such, methods like copy
wouldn't get the telemetry decorator. The new changes I added basically forcibly add telemetry to methods inherited from BasePandasDataset
.
src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
Show resolved
Hide resolved
src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
Show resolved
Hide resolved
def _get_columns(self) -> native_pd.Index: | ||
return self._modin_frame.data_columns_index | ||
|
||
columns: native_pd.Index = property(_get_columns, set_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so what would be the correct way to do this in the future, i think the same problem also applies to index. in general, i would think having get_columns, and set_columns two functions implemented at query compiler layer is very clear, and the frontend layer should call those method correctly
src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
Show resolved
Hide resolved
def decorator(base_method: Any): | ||
if callable(base_method) and ( | ||
not method_name.startswith("_") | ||
or (method_name in TELEMETRY_PRIVATE_METHODS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfc-gh-joshi Does that mean for a method that is just defined in base, if they are not overridden here with the register_base_override annotation, we will not have the telemetry. Till the end, we will align most implementation in this file with modin, and do not have much override here, what happens then?
df93035
to
74594d9
Compare
@sfc-gh-yzou @sfc-gh-azhan I made some changes to telemetry to more closely aligned with what's necessary after we remove series.py/dataframe.py as well. In the previous version of the PR, I added some overrides in Instead, I've added a pass in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
aeaf47f
to
0956d1c
Compare
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1119855
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
This followup to #2059 removes our vendored copy of
base.py
altogether from the codebase. Many methods are still overridden insnowflake/snowpark/modin/plugin/extensions/base_overrides.py
, with the reason for overriding each given inline in the comments. These methods are as follows:list
Some of these differences can be upstreamed fairly easily, and we will work to upstream them once the updated modin build process for Snowflake becomes clearer. Some methods will require significantly more work to reconcile.