Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1661142 Fix index name behavior #2274

Merged
merged 13 commits into from
Sep 17, 2024

Conversation

sfc-gh-vbudati
Copy link
Contributor

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
    Fixes SNOW-1661142

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.
    Fixed a bug where updating an index's name updates the parent's index name when it is not supposed to. This is done by verifying that the query_compiler recorded during the index's creation matches that of its parent object when the parent object must be updated.

>>> df = pd.DataFrame(
...     {
...         "A": [0, 1, 2, 3, 4, 4],
...         "B": ['a', 'b', 'c', 'd', 'e', 'f'],
...     },
...     index = pd.Index([1, 2, 3, 4, 5, 6], name = "test"),
... )
>>> index = df.index
>>> df
      A  B
test      
1     0  a
2     1  b
3     2  c
4     3  d
5     4  e
6     4  f
>>> index.name = "test2"
>>> 
>>> df
       A  B
test2      
1      0  a
2      1  b
3      2  c
4      3  d
5      4  e
6      4  f
>>> df.dropna(inplace=True)
>>> index.name = "test3"
>>> df
       A  B
test2        # <--- name should not update      
1      0  a
2      1  b
3      2  c
4      3  d
5      4  e
6      4  f

For the full discussion, see thread: https://docs.google.com/document/d/1vdllzNgeUHMiffFNpm9SD1HOYUk8lkMVp14HQDoqr7s/edit?disco=AAABVbKjFJ0

@sfc-gh-vbudati sfc-gh-vbudati added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Sep 11, 2024
@sfc-gh-yzou sfc-gh-yzou requested review from sfc-gh-azhan and sfc-gh-yzou and removed request for sfc-gh-lspiegelberg and sfc-gh-nkumar September 12, 2024 17:11
@@ -163,6 +163,7 @@ def __new__(
index._query_compiler = query_compiler
# `_parent` keeps track of any Series or DataFrame that this Index is a part of.
index._parent = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about get a class like IndexParent, and it records the necessary parent information, the constructor can take the df and record the information property, and it can also provide an interface like no_change_on_parent (you can use other name if you have other ones in mind) to help decide wether a replacement should happen, this could help wrap all parent logic at one place and simplify the code handling in the future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, i have a quick question, shouldn't datetime_index an extension of index class, and it can call super().init to init the common part, right? how come we are initialization the same thing at different place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__init__ in Index.py is used only for docstring generation, the actual initialization is happening in __new__. I'm not sure why it was designed this way.
But I created a new class like you suggested to abstract away the common logic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the new is implemented because the returned object could be different based on the type. i don't think we have to fix that in this pr, just curious why we didn't call super class for the common part? cc @sfc-gh-azhan do you know?

if self._parent is not None:
# Update the name of the parent's index only if the parent's current query compiler
# matches the recorded query compiler (_parent_qc).
if self._parent is not None and self._parent_qc is self._parent._query_compiler:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also now after updating, you should make sure the parent_qc is updated to the new one, you can also wrap that logic in the IndexParent class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@@ -2262,7 +2262,7 @@ def _get_index(self):
return self._query_compiler.index

idx = Index(query_compiler=self._query_compiler)
idx._set_parent(self)
idx._parent.set_parent(self)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update _set_parent instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually do we need that one line wrapper? the current place seems the only place calls it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted it to _set_parent() since that seems like a "safer" way to set the parent details without exposing too much

@@ -71,6 +71,41 @@
}


class IndexParent:
def __init__(self) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always init with values not None? if index._parent is None, then no parent, if it is not, it should mean parent exists.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the init can always take a df as input, index._parent can remain optional, which makes the relationship more clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done! The init should now take parent and be the only way to set the parent details.

Copy link
Collaborator

@sfc-gh-yzou sfc-gh-yzou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-vbudati Left some comment about docs and comments, please take care of it before merging!

CHANGELOG.md Outdated
@@ -12,6 +12,10 @@

- Added support for `TimedeltaIndex.mean` method.

#### Bug Fixes

- Fixed a bug where an `Index` object created from a `Series`/`DataFrame` incorrectly updates the `Series`/`DataFrame`'s index name when it is not supposed to.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"index name when it is not supposed to." -> index name after an inplace updates have been applied to the original series/dataframe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

def check_and_update_parent_qc_index_names(self, names: list) -> None:
"""
Update the Index and its parent's index names if the parent's current query compiler matches
the recorded query compiler (`_parent_qc`).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here that "if the query compiler associated with parent is not the same as the original recorded one that means an inplace updates have been applied to the parent"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@@ -726,10 +760,10 @@ def name(self, value: Hashable) -> None:
if not is_hashable(value):
raise TypeError(f"{type(self).__name__}.name must be a hashable type")
self._query_compiler = self._query_compiler.set_index_names([value])
# Update the name of the parent's index only if the parent's current query compiler
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update the doc for the name function to make it clear that the inplace replacement only works when no inplace update has been applied to parent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@sfc-gh-vbudati sfc-gh-vbudati merged commit e93cd68 into main Sep 17, 2024
35 checks passed
@sfc-gh-vbudati sfc-gh-vbudati deleted the vbudati/SNOW-1661142-fix-index-name-behavior branch September 17, 2024 01:09
@github-actions github-actions bot locked and limited conversation to collaborators Sep 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants