feat: Add index_of() function to Series and Expr #19894
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #19894 +/- ##
==========================================
+ Coverage 78.95% 79.00% +0.04%
==========================================
Files 1564 1566 +2
Lines 220882 221035 +153
Branches 2510 2510
==========================================
+ Hits 174407 174619 +212
+ Misses 45900 45842 -58
+ Partials 575 574 -1 |
I think I've figured out how to use row encoding, so now I just need to write lots and lots of tests and make sure it actually works beyond the trivial case I've already tested. |
Unfortunately categorical and enum don't work (they also don't work for …). E.g. for Categorical:
>>> import polars as pl
>>> pl.Series(["a", "b", "a"], dtype=pl.Categorical).index_of("a")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 4771, in index_of
return F.select(F.lit(self).index_of(element)).item()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/itamarst/devel/polars/py-polars/polars/functions/lazy.py", line 1913, in select
return pl.DataFrame().select(*exprs, **named_exprs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/itamarst/devel/polars/py-polars/polars/dataframe/frame.py", line 9113, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/itamarst/devel/polars/py-polars/polars/lazyframe/frame.py", line 2029, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: got invalid or ambiguous dtypes: '[cat, str]' in expression 'index_of'
Consider explicitly casting your input types to resolve potential ambiguity.
Resolved plan until failure:
---> FAILED HERE RESOLVING 'select' <---
SELECT [Series.index_of([String(a)])] FROM
DF []; PROJECT */0 COLUMNS; SELECTION: None |
My guess is that you are treating a categorical as a string when it goes into the row encoding. If you want to compare the row encoding of a series with the row encoding of another series they need to have been encoded with the exact same dtype (i.e. so the same RevMap as well) otherwise the output is undefined. If search_sorted doesn't do that either, that is a bug and I can look into it. |
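The point above about needing the exact same encoding on both sides can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Polars row encoding: `encode`, `index_of_encoded`, and the two dict mappings are hypothetical stand-ins, with each dict playing the role of a RevMap.

```python
# Toy model (hypothetical, not Polars internals): a categorical is stored as
# integer codes, and a dict from category string to code stands in for a RevMap.

def encode(values, mapping):
    """Encode a list of strings as integer codes using the given mapping."""
    return [mapping[v] for v in values]


def index_of_encoded(encoded_haystack, encoded_needle):
    """Return the index of the first code equal to the needle, or None."""
    for i, code in enumerate(encoded_haystack):
        if code == encoded_needle:
            return i
    return None


haystack = ["a", "b", "a"]
series_mapping = {"a": 0, "b": 1}  # mapping the series was encoded with
other_mapping = {"b": 0, "a": 1}   # a different mapping over the same categories

encoded = encode(haystack, series_mapping)

# Needle encoded with the series' own mapping: finds "b" at its real position.
same = index_of_encoded(encoded, series_mapping["b"])

# Needle encoded with a different mapping: "b" becomes code 0, which the
# series uses for "a", so the search silently reports the wrong position.
mismatched = index_of_encoded(encoded, other_mapping["b"])

print(encoded, same, mismatched)
```

In this model the mismatched lookup does not fail loudly; it returns a plausible but wrong index, which is why the needle has to be cast to the series' dtype before the encoded values are compared.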
@coastalwhite |
@coastalwhite and the question is how/where do I convert to an enum/categorical; my attempts have failed so far. |
Thank you for the new casting logic! I've updated to use it, and addressed the other two comments. |
Alright, looks great @itamarst. Thanks. I believe we only need docs entries on the python side (so that they end up in the ref guide), then it is good to go. |
Do we really need the tiny user-guide page? It's pretty much the same as the docstrings, so I feel like it's enough to have the docstrings.
OK, I figured out how to add |
Alright, can you rebase? I believe that should resolve CI. |
Done. |
Alright, thanks @itamarst, looks good! |
Fixes #5503
Categoricals don't work yet; see #20171 and #20318.